Filesystem Interface (legacy)

Warning

This section documents the deprecated filesystem layer. You should use the new filesystem layer instead.

Hadoop File System (HDFS)

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

import pyarrow as pa

# host, port, user, ticket_cache_path and path are placeholders for your cluster
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
with fs.open(path, 'rb') as f:
    data = f.read()  # Do something with f

By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on the following environment variables:

  • HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.

  • JAVA_HOME: the location of your Java SDK installation.

  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.

  • CLASSPATH: must contain the Hadoop jars. You can set these using:

export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
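Before connecting, it can help to check which of these variables are visible to the Python process. The following is a minimal sketch, not part of PyArrow: the helper name describe_libhdfs_env and its return format are hypothetical, and it assumes that an explicit ARROW_LIBHDFS_DIR takes precedence over the default $HADOOP_HOME/lib/native location.

```python
import os
import posixpath


def describe_libhdfs_env(env=None):
    """Summarize the libhdfs-related variables in env (hypothetical helper)."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")

    # Assumption: an explicit ARROW_LIBHDFS_DIR wins over the default
    # $HADOOP_HOME/lib/native location described above.
    libhdfs_dir = env.get("ARROW_LIBHDFS_DIR")
    if libhdfs_dir is None and hadoop_home:
        libhdfs_dir = posixpath.join(hadoop_home, "lib", "native")

    # Report which of the two installation roots are not set.
    unset = [name for name in ("HADOOP_HOME", "JAVA_HOME") if not env.get(name)]

    return {
        "libhdfs_dir": libhdfs_dir,
        "classpath_set": bool(env.get("CLASSPATH")),
        "unset": unset,
    }


if __name__ == "__main__":
    print(describe_libhdfs_env())
```

The helper only inspects the environment. It does not load libhdfs or verify that the reported directory actually contains libhdfs.so.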

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                    driver='libhdfs3')

HDFS API

hdfs.connect([host, port, user, …])

Connect to an HDFS cluster.

HadoopFileSystem.cat(path)

Return contents of file as a bytes object.

HadoopFileSystem.chmod(self, path, mode)

Change file permissions

HadoopFileSystem.chown(self, path[, owner, …])

Change file ownership

HadoopFileSystem.delete(path[, recursive])

Delete the indicated file or directory.

HadoopFileSystem.df(self)

Return free space on disk, like the UNIX df command

HadoopFileSystem.disk_usage(path)

Compute bytes used by all contents under indicated path in file tree.

HadoopFileSystem.download(self, path, stream)

Download file at HDFS path to file-like object

HadoopFileSystem.exists(path)

Return True if path exists.

HadoopFileSystem.get_capacity(self)

Get reported total capacity of file system

HadoopFileSystem.get_space_used(self)

Get space used on file system

HadoopFileSystem.info(self, path)

Return detailed HDFS information for path

HadoopFileSystem.ls(path[, detail])

Retrieve directory contents and metadata, if requested.

HadoopFileSystem.mkdir(path, **kwargs)

Create directory in HDFS.

HadoopFileSystem.open(self, path[, mode, …])

Open HDFS file for reading or writing

HadoopFileSystem.rename(path, new_path)

Rename file, like UNIX mv command.

HadoopFileSystem.rm(path[, recursive])

Alias for FileSystem.delete.

HadoopFileSystem.upload(self, path, stream)

Upload file-like object to HDFS path

HdfsFile
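As an illustration of how several of the methods listed above fit together, here is a minimal sketch of an upload, check, download and delete round trip. The function name roundtrip_bytes is hypothetical. It assumes fs is an object exposing upload, exists, download and delete with the signatures shown in the listing, such as the object returned by pa.hdfs.connect.

```python
import io


def roundtrip_bytes(fs, hdfs_path, payload):
    """Upload payload to hdfs_path, read it back, then delete the file.

    fs is assumed to behave like the HadoopFileSystem described above;
    this helper is an illustration, not part of PyArrow.
    """
    # upload() reads from a file-like object and writes to the HDFS path.
    fs.upload(hdfs_path, io.BytesIO(payload))
    if not fs.exists(hdfs_path):
        raise IOError("upload did not create " + hdfs_path)

    # download() writes the contents of the HDFS file into a file-like object.
    sink = io.BytesIO()
    fs.download(hdfs_path, sink)

    # Remove the temporary file again.
    fs.delete(hdfs_path)
    return sink.getvalue()


# Usage against a live cluster (host, port and the path are placeholders):
#   import pyarrow as pa
#   fs = pa.hdfs.connect(host, port)
#   roundtrip_bytes(fs, '/tmp/example.bin', b'hello')
```

Passing the filesystem in as an argument keeps the helper independent of how the connection was made, so the same code works with either driver.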