Filesystem Interfaces

In this section, we discuss filesystem-like interfaces in PyArrow.

Hadoop File System (HDFS)

PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:

import pyarrow as pa
# host, port, user, and ticket_cache_path are placeholders for your cluster's settings
hdfs = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path)

By default, pyarrow.HdfsClient uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link or library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on the following environment variables:

  • HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/
  • JAVA_HOME: the location of your Java SDK installation.
  • ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs if it is installed somewhere other than $HADOOP_HOME/lib/native.
  • CLASSPATH: must contain the Hadoop jars. You can set it using:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
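
Because a missing variable only shows up as a failure when the library is loaded at connection time, it can help to check the environment first. The following is a minimal sketch, not part of PyArrow: the helper names missing_hdfs_env and libhdfs_dir are hypothetical, and the lookup order (ARROW_LIBHDFS_DIR first, then $HADOOP_HOME/lib/native) is an assumption based on the descriptions above.

```python
import os

# Variables described in the list above; ARROW_LIBHDFS_DIR is optional.
REQUIRED_VARS = ("HADOOP_HOME", "JAVA_HOME", "CLASSPATH")


def missing_hdfs_env(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def libhdfs_dir(env):
    """Return the directory expected to hold libhdfs, or None if unknown.

    Assumption: ARROW_LIBHDFS_DIR, when set, takes precedence over
    $HADOOP_HOME/lib/native.
    """
    explicit = env.get("ARROW_LIBHDFS_DIR")
    if explicit:
        return explicit
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        return os.path.join(hadoop_home, "lib", "native")
    return None
```

Calling missing_hdfs_env(os.environ) before connecting returns the names that still need to be set, and libhdfs_dir(os.environ) shows where this sketch would expect the library to be found.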

You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:

hdfs3 = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path,