File System Interfaces

In this section, we discuss filesystem-like interfaces in PyArrow.

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

    import pyarrow as pa
    fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
    with fs.open(path, 'rb') as f:
        # Do something with f, for example read its contents
        data = f.read()


By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on some environment variables.

• HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.
• JAVA_HOME: the location of your Java SDK installation.
• ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.
• CLASSPATH: must contain the Hadoop jars. You can set these using:

    export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`


If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
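The lookup rule described in the list above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the actual loader code in Arrow; the helper names `libhdfs_search_dir` and `unset_variables` are hypothetical, and the environment is passed in as a plain dict so the sketch runs without a Hadoop installation.

```python
import os


def libhdfs_search_dir(env):
    """Return the directory where libhdfs.so is expected, or None.

    Follows the rule stated above: ARROW_LIBHDFS_DIR wins if it is set,
    otherwise fall back to $HADOOP_HOME/lib/native.
    """
    explicit = env.get("ARROW_LIBHDFS_DIR")
    if explicit:
        return explicit
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        return os.path.join(hadoop_home, "lib", "native")
    return None


def unset_variables(env):
    """List which of HADOOP_HOME and JAVA_HOME are missing from env."""
    return [name for name in ("HADOOP_HOME", "JAVA_HOME") if not env.get(name)]


# Example with a made-up environment (the paths are placeholders).
example_env = {"HADOOP_HOME": "/opt/hadoop"}
search_dir = libhdfs_search_dir(example_env)
missing = unset_variables(example_env)
```

Passing `os.environ` instead of the example dict gives a quick check of the current process environment before calling `pa.hdfs.connect`.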

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

    fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                         driver='libhdfs3')


HDFS API

• hdfs.connect([host, port, user, …]): Connect to an HDFS cluster.
• HadoopFileSystem.cat(path): Return contents of file as a bytes object.
• HadoopFileSystem.chmod(self, path, mode): Change file permissions.
• HadoopFileSystem.chown(self, path[, owner, …]): Change file ownership.
• HadoopFileSystem.delete(path[, recursive]): Delete the indicated file or directory.
• HadoopFileSystem.df(self): Return free space on disk, like the UNIX df command.
• HadoopFileSystem.disk_usage(path): Compute bytes used by all contents under indicated path in file tree.
• HadoopFileSystem.download(self, path, stream)
• HadoopFileSystem.exists(self, path): Returns True if the path is known to the cluster, False if it is not (or there is an RPC error).
• HadoopFileSystem.get_capacity(self): Get reported total capacity of file system.
• HadoopFileSystem.get_space_used(self): Get space used on file system.
• HadoopFileSystem.info(self, path): Return detailed HDFS information for path.
• HadoopFileSystem.ls(path[, detail]): Retrieve directory contents and metadata, if requested.
• HadoopFileSystem.mkdir(path, **kwargs): Create directory in HDFS.
• HadoopFileSystem.open(self, path[, mode, …]): Open HDFS file for reading or writing.
• HadoopFileSystem.rename(path, new_path): Rename file, like UNIX mv command.
• HadoopFileSystem.rm(path[, recursive]): Alias for FileSystem.delete.
• HadoopFileSystem.upload(self, path, stream): Upload file-like object to HDFS path.
• HdfsFile
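To show how a few of these methods fit together, here is a minimal sketch of an upload, read-back, and delete round trip. It uses only `upload`, `cat`, `exists`, and `delete` with the argument order shown in the listing. The `InMemoryFS` class and the `round_trip` helper are hypothetical stand-ins introduced so the example runs without a cluster; they are not part of PyArrow.

```python
import io


class InMemoryFS:
    """Hypothetical stand-in that mimics a few HadoopFileSystem methods."""

    def __init__(self):
        self._files = {}

    def upload(self, path, stream):
        # Store the full contents of the file-like object under path.
        self._files[path] = stream.read()

    def cat(self, path):
        # Return the stored contents as a bytes object.
        return self._files[path]

    def exists(self, path):
        return path in self._files

    def delete(self, path, recursive=False):
        # recursive is accepted for signature parity and ignored here.
        self._files.pop(path, None)


def round_trip(fs, path, payload):
    """Upload payload to path, read it back, then delete the file."""
    fs.upload(path, io.BytesIO(payload))
    try:
        return fs.cat(path)
    finally:
        fs.delete(path)


fs = InMemoryFS()
result = round_trip(fs, "/tmp/example.bin", b"hello")
```

Because `round_trip` only relies on the method names from the listing, an object returned by `pa.hdfs.connect(...)` could presumably be passed in place of the stand-in, assuming the connected user may write to the chosen path.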