Filesystem Interface

PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types.

The filesystem interface provides input and output streams as well as directory operations, and exposes a simplified view of the underlying data storage. Data paths are represented as abstract paths, which are /-separated, even on Windows, and shouldn’t include special path components such as "." and "..". Symbolic links, if supported by the underlying storage, are automatically dereferenced. Only basic metadata about file entries, such as the file size and modification time, is made available.

Types

The core interface is represented by the base class FileSystem. Concrete subclasses are available for various kinds of storage: local filesystem access, HDFS and Amazon S3-compatible storage.

Example

Assuming your S3 credentials are correctly configured (for example by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables), here is how you can read contents from an S3 bucket:

>>> from pyarrow import fs
>>> s3 = fs.S3FileSystem(region='eu-west-3')

# List all contents in a bucket, recursively
>>> s3.get_target_stats(fs.FileSelector('my-test-bucket', recursive=True))
[<FileStats for 'my-test-bucket/File1': type=FileType.File, size=10>,
 <FileStats for 'my-test-bucket/File5': type=FileType.File, size=10>,
 <FileStats for 'my-test-bucket/Dir1': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir2': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/EmptyDir': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir1/File2': type=FileType.File, size=11>,
 <FileStats for 'my-test-bucket/Dir1/Subdir': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir2/Subdir': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir2/Subdir/File3': type=FileType.File, size=10>]

# Open a file for reading and download its contents
>>> f = s3.open_input_stream('my-test-bucket/Dir1/File2')
>>> f.readall()
b'some data'