Filesystem Interface
PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types.
The filesystem interface provides input and output streams as well as directory operations. A simplified view of the underlying data storage is exposed. Data paths are represented as abstract paths, which are /-separated, even on Windows, and shouldn't include special path components such as "." and "..". Symbolic links, if supported by the underlying storage, are automatically dereferenced. Only basic metadata about file entries, such as the file size and modification time, is made available.
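Because the interface is identical for every implementation, code written against it is storage-agnostic. As a minimal sketch, here is the same style of usage against the local filesystem (the /tmp/example.txt path is a placeholder, and the outputs shown are illustrative):

>>> from pyarrow import fs

>>> local = fs.LocalFileSystem()

# Write a small file, then inspect its metadata through the generic API
>>> with local.open_output_stream('/tmp/example.txt') as f:
...     f.write(b'hello')
5
>>> stats, = local.get_target_stats(['/tmp/example.txt'])
>>> stats.size
5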
Types
The core interface is represented by the base class FileSystem. Concrete subclasses are available for various kinds of storage: local filesystem access, HDFS, and Amazon S3-compatible storage.
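For instance, each of these can be constructed directly from the pyarrow.fs module (a minimal sketch; the HDFS host and port are placeholders, and the exact HadoopFileSystem constructor arguments may vary with your PyArrow version and cluster configuration):

>>> from pyarrow import fs

# Local filesystem: needs no configuration
>>> local = fs.LocalFileSystem()

# HDFS: assumes a reachable namenode (placeholder host and port)
>>> hdfs = fs.HadoopFileSystem('namenode', port=8020)

# S3-compatible storage: credentials are typically picked up from the environment
>>> s3 = fs.S3FileSystem(region='eu-west-3')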
Example
Assuming your S3 credentials are correctly configured (for example by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables), here is how you can read contents from an S3 bucket:
>>> from pyarrow import fs
>>> s3 = fs.S3FileSystem(region='eu-west-3')
# List all contents in a bucket, recursively
>>> s3.get_target_stats(fs.FileSelector('my-test-bucket', recursive=True))
[<FileStats for 'my-test-bucket/File1': type=FileType.File, size=10>,
 <FileStats for 'my-test-bucket/File5': type=FileType.File, size=10>,
 <FileStats for 'my-test-bucket/Dir1': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir2': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/EmptyDir': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir1/File2': type=FileType.File, size=11>,
 <FileStats for 'my-test-bucket/Dir1/Subdir': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir2/Subdir': type=FileType.Directory>,
 <FileStats for 'my-test-bucket/Dir2/Subdir/File3': type=FileType.File, size=10>]
# Open a file for reading and download its contents
>>> f = s3.open_input_stream('my-test-bucket/Dir1/File2')
>>> f.readall()
b'some data'
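Writing goes through the same generic interface; here is a minimal sketch that uploads data back to the bucket (my-test-bucket/Dir1/File4 is a placeholder path that must be writable, and the output shown is illustrative):

# Open a file for writing and upload some contents
>>> with s3.open_output_stream('my-test-bucket/Dir1/File4') as f:
...     f.write(b'new data')
8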