Filesystem Interface (legacy)

Note

This section documents the deprecated filesystem layer. It is highly recommended to use the new filesystem layer (the pyarrow.fs subpackage) instead.

Hadoop File System (HDFS)

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

import pyarrow as pa

# host, port, user, path, and ticket_cache_path are placeholders for
# your cluster's settings
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
with fs.open(path, 'rb') as f:
    contents = f.read()  # f is a regular file-like object

By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on the following environment variables:

  • HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.

  • JAVA_HOME: the location of your Java SDK installation.

  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.

  • CLASSPATH: must contain the Hadoop jars. You can set these using:

export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`

If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
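Putting the variables above together, a connection script might first configure its environment before calling pyarrow.hdfs.connect. This is only a sketch: the paths below are placeholder examples for a typical Linux install, not defaults that PyArrow detects, so adjust them to match your own installation.

```python
import os

# Placeholder installation paths -- substitute your own.
os.environ.setdefault("HADOOP_HOME", "/opt/hadoop")
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")

# Point ARROW_LIBHDFS_DIR at the directory containing libhdfs.so;
# $HADOOP_HOME/lib/native is the conventional location.
os.environ.setdefault(
    "ARROW_LIBHDFS_DIR",
    os.path.join(os.environ["HADOOP_HOME"], "lib", "native"),
)
```

With the environment prepared, `pa.hdfs.connect(...)` can then locate libhdfs and the Hadoop jars at runtime.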

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path,
                     driver='libhdfs3')

HDFS API

hdfs.connect([host, port, user, …])

Connect to an HDFS cluster.

HadoopFileSystem.cat(path)

Return contents of file as a bytes object

HadoopFileSystem.chmod(self, path, mode)

Change file permissions

HadoopFileSystem.chown(self, path[, owner, …])

Change file owner and group

HadoopFileSystem.delete(path[, recursive])

Delete the indicated file or directory

HadoopFileSystem.df(self)

Return free space on disk, like the UNIX df command

HadoopFileSystem.disk_usage(path)

Compute bytes used by all contents under indicated path in file tree

HadoopFileSystem.download(self, path, stream)

Download an HDFS file to a local file-like stream

HadoopFileSystem.exists(self, path)

Return True if the path is known to the cluster, False if it is not (or if there is an RPC error)

HadoopFileSystem.get_capacity(self)

Get reported total capacity of file system

HadoopFileSystem.get_space_used(self)

Get space used on file system

HadoopFileSystem.info(self, path)

Return detailed HDFS information for path

HadoopFileSystem.ls(path[, detail])

Retrieve directory contents and metadata, if requested.

HadoopFileSystem.mkdir(path, **kwargs)

Create directory in HDFS

HadoopFileSystem.open(self, path[, mode, …])

Open HDFS file for reading or writing

HadoopFileSystem.rename(path, new_path)

Rename file, like UNIX mv command

HadoopFileSystem.rm(path[, recursive])

Alias for FileSystem.delete

HadoopFileSystem.upload(self, path, stream)

Upload file-like object to HDFS path
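As an illustration of how the methods above combine, here is a hedged sketch of a helper that uploads a local file to HDFS only if the destination does not already exist, creating its parent directory first. The helper `upload_if_missing` is hypothetical (not part of the PyArrow API); it relies only on the exists, mkdir, and upload methods listed above, so any connected HadoopFileSystem instance can be passed as `fs`.

```python
import posixpath


def upload_if_missing(fs, local_path, hdfs_path):
    """Upload local_path to hdfs_path unless it already exists.

    fs is expected to expose the exists/mkdir/upload methods from the
    table above (e.g. a connected HadoopFileSystem). Returns True if an
    upload happened, False if the destination was already present.
    """
    if fs.exists(hdfs_path):
        return False
    # Ensure the parent directory exists; HDFS paths use '/' separators.
    fs.mkdir(posixpath.dirname(hdfs_path))
    with open(local_path, 'rb') as stream:
        fs.upload(hdfs_path, stream)
    return True
```

Because the helper only touches the generic filesystem methods, it can be unit-tested against any object implementing that small interface before being pointed at a real cluster.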

HdfsFile