File System Interfaces

In this section, we discuss filesystem-like interfaces in PyArrow.

Hadoop File System (HDFS)

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

import pyarrow as pa

# host, port, user, ticket_cache_path and path are placeholders for your
# cluster's connection details
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)

with fs.open(path, 'rb') as f:
    # Do something with f
    ...

By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on some environment variables.

  • HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.
  • JAVA_HOME: the location of your Java SDK installation.
  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.
  • CLASSPATH: must contain the Hadoop jars. You can set these using:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

If CLASSPATH is not set, it will be populated automatically, provided the hadoop executable is on your system path or HADOOP_HOME is set.
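As a minimal sketch of how those environment variables fit together, the helper below (the function name is ours, not part of PyArrow) gathers the variables libhdfs relies on and derives CLASSPATH the same way the shell one-liner above does when it is missing:

```python
import os
import subprocess


def resolve_libhdfs_env():
    """Collect the environment variables libhdfs needs into a dict.

    Hypothetical helper: it only inspects the environment, mirroring the
    variable list above; it does not modify anything PyArrow reads.
    """
    env = {
        "HADOOP_HOME": os.environ.get("HADOOP_HOME"),
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "ARROW_LIBHDFS_DIR": os.environ.get("ARROW_LIBHDFS_DIR"),
        "CLASSPATH": os.environ.get("CLASSPATH"),
    }
    # If CLASSPATH is unset but HADOOP_HOME is known, derive it the same
    # way as `$HADOOP_HOME/bin/hdfs classpath --glob` in the shell.
    if not env["CLASSPATH"] and env["HADOOP_HOME"]:
        hdfs = os.path.join(env["HADOOP_HOME"], "bin", "hdfs")
        if os.path.exists(hdfs):
            env["CLASSPATH"] = subprocess.check_output(
                [hdfs, "classpath", "--glob"]).decode().strip()
    return env
```

You would call this (or the shell export above) before pa.hdfs.connect(), since libhdfs reads these variables when it is loaded.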

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                     driver='libhdfs3')

HDFS API

hdfs.connect([host, port, user, …]) Connect to an HDFS cluster
HadoopFileSystem.cat(path) Return contents of file as a bytes object
HadoopFileSystem.chmod(self, path, mode) Change file permissions
HadoopFileSystem.chown(self, path[, owner, …]) Change file owner and group
HadoopFileSystem.delete(path[, recursive]) Delete the indicated file or directory
HadoopFileSystem.df(self) Return free space on disk, like the UNIX df command
HadoopFileSystem.disk_usage(path) Compute bytes used by all contents under indicated path in file tree
HadoopFileSystem.download(self, path, stream) Download the contents of an HDFS file to a stream
HadoopFileSystem.exists(path) Return True if the path exists
HadoopFileSystem.get_capacity(self) Get reported total capacity of file system
HadoopFileSystem.get_space_used(self) Get space used on file system
HadoopFileSystem.info(self, path) Return detailed HDFS information for path
HadoopFileSystem.ls(path[, detail]) Retrieve directory contents and metadata, if requested
HadoopFileSystem.mkdir(path, **kwargs) Create directory in HDFS
HadoopFileSystem.open(self, path[, mode, …]) Open HDFS file for reading or writing
HadoopFileSystem.rename(path, new_path) Rename file, like UNIX mv command
HadoopFileSystem.rm(path[, recursive]) Alias for FileSystem.delete
HadoopFileSystem.upload(self, path, stream) Upload file-like object to HDFS path
HdfsFile File handle returned by HadoopFileSystem.open
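These methods compose naturally. As a sketch (the helper name is ours, not part of PyArrow), uploading a local file while creating the target directory on demand might look like:

```python
import os


def copy_local_to_hdfs(fs, local_path, hdfs_dir):
    """Upload a local file into hdfs_dir, creating the directory if needed.

    Hypothetical helper: `fs` is any object exposing the exists/mkdir/upload
    methods listed above, such as the result of pa.hdfs.connect().
    Returns the destination path on the filesystem.
    """
    if not fs.exists(hdfs_dir):
        fs.mkdir(hdfs_dir)
    target = hdfs_dir.rstrip('/') + '/' + os.path.basename(local_path)
    with open(local_path, 'rb') as f:
        # upload() reads from the file-like object and writes it to HDFS
        fs.upload(target, f)
    return target
```

Because the helper only relies on the duck-typed filesystem API, it can be exercised against a stub filesystem in tests and against a real cluster in production.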