Filesystems

Interface

enum class arrow::fs::FileType : int8_t

FileSystem entry type.

Values:

enumerator NotFound

Entry is not found.

enumerator Unknown

Entry exists but its type is unknown.

This can designate a special file such as a Unix socket or character device, or Windows NUL / CON / …

enumerator File

Entry is a regular file.

enumerator Directory

Entry is a directory.

struct FileInfo : public arrow::util::EqualityComparable<FileInfo>

FileSystem entry info.

Public Functions

inline FileType type() const

The file type.

inline const std::string &path() const

The full file path in the filesystem.

std::string base_name() const

The file base name (the component after the last directory separator).

inline int64_t size() const

The size in bytes, if available.

Only regular files are guaranteed to have a size.

std::string extension() const

The file extension (excluding the dot).
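
As an illustration of what base_name() and extension() describe, the following standalone sketch derives both from a /-separated path string. It does not call Arrow: the helper names BaseNameOf and ExtensionOf are hypothetical, and the library may treat edge cases (such as names starting with a dot) differently.

```cpp
#include <string>

// Hypothetical helper: the component after the last '/' separator.
std::string BaseNameOf(const std::string& path) {
  const std::string::size_type sep = path.find_last_of('/');
  return sep == std::string::npos ? path : path.substr(sep + 1);
}

// Hypothetical helper: the text after the last '.' of the base name,
// without the dot itself. Returns "" when the base name has no dot.
std::string ExtensionOf(const std::string& path) {
  const std::string base = BaseNameOf(path);
  const std::string::size_type dot = base.find_last_of('.');
  return dot == std::string::npos ? std::string() : base.substr(dot + 1);
}
```

With these helpers, a path such as bucket/dir/data.parquet yields the base name data.parquet and the extension parquet.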

inline TimePoint mtime() const

The time of last modification, if available.

struct ByPath

Function object implementing less-than comparison and hashing by path, to support sorting infos, using them as keys, and other interactions with the STL.
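
The sketch below shows how a single function object can serve both as a less-than comparator and as a hasher, which is the role described for ByPath. It is a standalone model: Entry, EntryByPath and EntryPathEqual are hypothetical stand-ins, not Arrow types.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for FileInfo: only the path matters here.
struct Entry {
  std::string path;
};

// One function object, two call operators: binary for ordering, unary for hashing.
struct EntryByPath {
  bool operator()(const Entry& l, const Entry& r) const { return l.path < r.path; }
  std::size_t operator()(const Entry& e) const { return std::hash<std::string>()(e.path); }
};

// Equality by path, needed by hash containers alongside the hasher.
struct EntryPathEqual {
  bool operator()(const Entry& l, const Entry& r) const { return l.path == r.path; }
};

// Sorting picks the binary (less-than) overload.
std::vector<Entry> SortedByPath(std::vector<Entry> entries) {
  std::sort(entries.begin(), entries.end(), EntryByPath());
  return entries;
}

// Hash containers pick the unary (hash) overload.
std::size_t CountDistinctPaths(const std::vector<Entry>& entries) {
  std::unordered_set<Entry, EntryByPath, EntryPathEqual> seen(entries.begin(), entries.end());
  return seen.size();
}
```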

struct FileSelector

File selector for filesystem APIs.

Public Members

std::string base_dir

The directory in which to select files.

If the path exists but doesn’t point to a directory, this should be an error.

bool allow_not_found

The behavior if base_dir isn’t found in the filesystem.

If false, an error is returned. If true, an empty selection is returned.

bool recursive

Whether to recurse into subdirectories.

int32_t max_recursion

The maximum number of subdirectories to recurse into.
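
To make the interplay of base_dir, recursive and max_recursion concrete, here is a standalone sketch that applies a selector to a flat list of paths. It does not use Arrow: SelectorSketch and SelectPaths are hypothetical, and the depth convention (entries directly inside base_dir are at depth 0) is an assumption about how the recursion limit is counted.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical mirror of the documented selector fields.
struct SelectorSketch {
  std::string base_dir;
  bool recursive = false;
  int32_t max_recursion = std::numeric_limits<int32_t>::max();
};

// Returns the entries of `all_paths` that lie under sel.base_dir.
// Assumption: an entry directly inside base_dir has depth 0, and each
// additional '/' in its relative path adds one level of recursion.
std::vector<std::string> SelectPaths(const std::vector<std::string>& all_paths,
                                     const SelectorSketch& sel) {
  const std::string prefix = sel.base_dir.empty() ? std::string() : sel.base_dir + "/";
  const int64_t max_depth = sel.recursive ? sel.max_recursion : 0;
  std::vector<std::string> selected;
  for (const std::string& path : all_paths) {
    if (path.size() <= prefix.size() || path.compare(0, prefix.size(), prefix) != 0) {
      continue;  // not below base_dir (this also skips base_dir itself)
    }
    const int64_t depth = std::count(path.begin() + prefix.size(), path.end(), '/');
    if (depth <= max_depth) selected.push_back(path);
  }
  return selected;
}
```

The sketch leaves out allow_not_found; it simply returns an empty selection when nothing lies under base_dir.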

class FileSystem : public std::enable_shared_from_this<FileSystem>

Abstract file system API.

Subclassed by arrow::fs::GcsFileSystem, arrow::fs::HadoopFileSystem, arrow::fs::internal::MockFileSystem, arrow::fs::LocalFileSystem, arrow::fs::S3FileSystem, arrow::fs::SlowFileSystem, arrow::fs::SubTreeFileSystem

Public Functions

inline const io::IOContext &io_context() const

EXPERIMENTAL: The IOContext associated with this filesystem.

virtual Result<std::string> NormalizePath(std::string path)

Normalize path for the given filesystem.

The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).

virtual Result<std::string> PathFromUri(const std::string &uri_string) const

Ensure a URI (or path) is compatible with the given filesystem and return the path.

This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.

uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.

Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.

Parameters:

uri_string – A URI representing a resource in the given filesystem.

Returns:

The path inside the filesystem that is indicated by the URI.
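
The scheme check described above can be pictured with the following simplified, standalone sketch. PathIfSchemeMatches is a hypothetical helper, not part of Arrow. It only handles inputs of the form scheme://rest and deliberately ignores the other behaviors mentioned above (absolute local paths, separator normalization).

```cpp
#include <string>

// Hypothetical simplification: accept `uri` only if its scheme equals
// `expected_scheme`, and write what follows "scheme://" to *out_path.
bool PathIfSchemeMatches(const std::string& uri, const std::string& expected_scheme,
                         std::string* out_path) {
  const std::string marker = "://";
  const std::string::size_type pos = uri.find(marker);
  if (pos == std::string::npos) return false;                   // not of the form scheme://...
  if (uri.compare(0, pos, expected_scheme) != 0) return false;  // scheme mismatch
  *out_path = uri.substr(pos + marker.size());
  return true;
}
```

As in the note above, such a check says nothing about whether the region or endpoint behind the URI matches the filesystem.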

virtual Result<FileInfo> GetFileInfo(const std::string &path) = 0

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
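
A practical consequence is that callers check the returned type rather than relying on an error to detect a missing entry. The standalone model below illustrates this with a hypothetical in-memory tree; EntryType mirrors the FileType enumerators, while FakeTree, TypeOf and IsReadableFile are invented for the example.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Mirrors the enumerators documented for arrow::fs::FileType.
enum class EntryType : int8_t { NotFound, Unknown, File, Directory };

// Hypothetical in-memory "filesystem": path -> entry type.
typedef std::map<std::string, EntryType> FakeTree;

// A missing path is reported through the returned type, not through an error.
EntryType TypeOf(const FakeTree& tree, const std::string& path) {
  const FakeTree::const_iterator it = tree.find(path);
  return it == tree.end() ? EntryType::NotFound : it->second;
}

// Callers therefore test the type before treating an entry as a file.
bool IsReadableFile(const FakeTree& tree, const std::string& path) {
  return TypeOf(tree, path) == EntryType::File;
}
```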

virtual Result<FileInfoVector> GetFileInfo(const std::vector<std::string> &paths)

Same, for many targets at once.

virtual Result<FileInfoVector> GetFileInfo(const FileSelector &select) = 0

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

virtual Future<FileInfoVector> GetFileInfoAsync(const std::vector<std::string> &paths)

Async version of GetFileInfo.

virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select)

Streaming async version of GetFileInfo.

The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.

virtual Status CreateDir(const std::string &path, bool recursive = true) = 0

Create a directory and subdirectories.

This function succeeds if the directory already exists.

virtual Status DeleteDir(const std::string &path) = 0

Delete a directory and its contents, recursively.

virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok = false) = 0

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path ("" or "/") is disallowed, see DeleteRootDirContents.

virtual Future<> DeleteDirContentsAsync(const std::string &path, bool missing_dir_ok = false)

Async version of DeleteDirContents.

virtual Status DeleteRootDirContents() = 0

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

virtual Status DeleteFile(const std::string &path) = 0

Delete a file.

virtual Status DeleteFiles(const std::vector<std::string> &paths)

Delete many files.

The default implementation issues individual delete operations in sequence.

virtual Status Move(const std::string &src, const std::string &dest) = 0

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).
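
The destination rules above can be written down as a small decision function. This is a standalone sketch with hypothetical types (Kind, Destination, MoveOutcome); it encodes only the three documented cases and does not perform any move.

```cpp
enum class Kind { File, Directory };
enum class MoveOutcome { Proceed, Error, Unspecified };

// Hypothetical description of what currently sits at the destination path.
struct Destination {
  bool exists;
  Kind kind;
  bool non_empty_directory;
};

MoveOutcome DecideMove(Kind source, const Destination& dest) {
  if (!dest.exists) return MoveOutcome::Proceed;
  // Rule 1: a non-empty directory at the destination is an error.
  if (dest.kind == Kind::Directory && dest.non_empty_directory) return MoveOutcome::Error;
  // Rule 2: same type as the source, so the destination is replaced.
  if (dest.kind == source) return MoveOutcome::Proceed;
  // Rule 3: anything else is implementation-dependent.
  return MoveOutcome::Unspecified;
}
```

Code that must behave the same on every filesystem should avoid the Unspecified case, for example by checking the destination with GetFileInfo first.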

virtual Status CopyFile(const std::string &src, const std::string &dest) = 0

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) = 0

Open an input stream for sequential reading.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info)

Open an input stream for sequential reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) = 0

Open an input file for random access reading.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info)

Open an input file for random access reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const std::string &path)

Async version of OpenInputStream.

virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const FileInfo &info)

Async version of OpenInputStream.

virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const std::string &path)

Async version of OpenInputFile.

virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const FileInfo &info)

Async version of OpenInputFile.

virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) = 0

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) = 0

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.

High-level factory function

Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string &uri, std::string *out_path = NULLPTR)

Create a new FileSystem by URI.

Recognized schemes are “file”, “mock”, “hdfs”, “viewfs”, “s3”, “gs” and “gcs”.

Parameters:
  • uri – [in] A URI-based path, e.g. file:///some/local/path

  • out_path – [out] (Optional) The path inside the filesystem.

Returns:

The FileSystem instance.

Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string &uri, const io::IOContext &io_context, std::string *out_path = NULLPTR)

Create a new FileSystem by URI with a custom IO context.

Recognized schemes are “file”, “mock”, “hdfs”, “viewfs”, “s3”, “gs” and “gcs”.

Parameters:
  • uri – [in] A URI-based path, e.g. file:///some/local/path

  • io_context – [in] An IOContext which will be associated with the filesystem.

  • out_path – [out] (Optional) The path inside the filesystem.

Returns:

The FileSystem instance.

Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(const std::string &uri, std::string *out_path = NULLPTR)

Create a new FileSystem by URI or local path.

Same as FileSystemFromUri, but also recognizes non-URIs and treats them as local filesystem paths. Only absolute local filesystem paths are allowed.
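
The accepted inputs can be summarized with the standalone sketch below. SchemeOf and Classify are hypothetical helpers, not Arrow functions. The sketch assumes POSIX-style absolute paths (a leading "/") and case-sensitive scheme names, and does not model Windows drive paths.

```cpp
#include <cctype>
#include <string>

enum class LocationKind { AbsoluteLocalPath, RecognizedUri, Rejected };

// Returns the URI scheme of `s` (the text before the first ':'), or "" if
// `s` does not start with a syntactically valid scheme.
std::string SchemeOf(const std::string& s) {
  const std::string::size_type colon = s.find(':');
  if (colon == std::string::npos || colon == 0) return "";
  if (!std::isalpha(static_cast<unsigned char>(s[0]))) return "";
  for (std::string::size_type i = 1; i < colon; ++i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') return "";
  }
  return s.substr(0, colon);
}

LocationKind Classify(const std::string& s) {
  // Non-URIs are accepted only as absolute local paths.
  if (!s.empty() && s[0] == '/') return LocationKind::AbsoluteLocalPath;
  // Schemes listed as recognized by FileSystemFromUri.
  static const char* const kSchemes[] = {"file", "mock", "hdfs", "viewfs", "s3", "gs", "gcs"};
  const std::string scheme = SchemeOf(s);
  for (const char* known : kSchemes) {
    if (scheme == known) return LocationKind::RecognizedUri;
  }
  return LocationKind::Rejected;  // relative paths and unknown schemes
}
```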

Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(const std::string &uri, const io::IOContext &io_context, std::string *out_path = NULLPTR)

Create a new FileSystem by URI or local path with a custom IO context.

Same as FileSystemFromUri, but also recognizes non-URIs and treats them as local filesystem paths. Only absolute local filesystem paths are allowed.

Concrete implementations

“Subtree” filesystem wrapper

class SubTreeFileSystem : public arrow::fs::FileSystem

A FileSystem implementation that delegates to another implementation after prepending a fixed base path.

This is useful to expose a logical view of a subtree of a filesystem, for example a directory in a LocalFileSystem. This works on abstract paths, i.e. paths using forward slashes and a single root “/”. Windows paths are not guaranteed to work. This makes no security guarantee. For example, symlinks may make it possible to “escape” the subtree and access other parts of the underlying filesystem.
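
The path translation performed by such a wrapper can be sketched as two string operations: prepend the base path on the way in, strip it on the way out. The helpers below are hypothetical and standalone. They assume that the base is a non-empty abstract path without a trailing “/”.

```cpp
#include <string>

// Hypothetical helper: map a path inside the subtree to the wrapped filesystem.
std::string PrependBase(const std::string& base, const std::string& path) {
  return path.empty() ? base : base + "/" + path;
}

// Hypothetical inverse: map a path reported by the wrapped filesystem back
// into the subtree. Returns false if `full` does not lie under `base`.
bool StripBase(const std::string& base, const std::string& full, std::string* out) {
  if (full == base) {
    out->clear();  // the base itself is the subtree root
    return true;
  }
  const std::string prefix = base + "/";
  if (full.compare(0, prefix.size(), prefix) != 0) return false;
  *out = full.substr(prefix.size());
  return true;
}
```

Because the translation is purely textual, it cannot tell where a symlink inside the subtree actually points, which is why no security guarantee is made.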

Public Functions

virtual Result<std::string> NormalizePath(std::string path) override

Normalize path for the given filesystem.

The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).

virtual Result<std::string> PathFromUri(const std::string &uri_string) const override

Ensure a URI (or path) is compatible with the given filesystem and return the path.

This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.

uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.

Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.

Parameters:

uri_string – A URI representing a resource in the given filesystem.

Returns:

The path inside the filesystem that is indicated by the URI.

virtual Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

virtual Result<FileInfoVector> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select) override

Streaming async version of GetFileInfo.

The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.

virtual Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

virtual Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok = false) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path ("" or "/") is disallowed, see DeleteRootDirContents.

virtual Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

virtual Status DeleteFile(const std::string &path) override

Delete a file.

virtual Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

virtual Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info) override

Open an input stream for sequential reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info) override

Open an input file for random access reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const std::string &path) override

Async version of OpenInputStream.

virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const FileInfo &info) override

Async version of OpenInputStream.

virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const std::string &path) override

Async version of OpenInputFile.

virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const FileInfo &info) override

Async version of OpenInputFile.

virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.

Local filesystem

struct LocalFileSystemOptions

Options for the LocalFileSystem implementation.

Public Members

bool use_mmap = false

Whether OpenInputStream and OpenInputFile return a mmap’ed file, or a regular one.

int32_t directory_readahead = kDefaultDirectoryReadahead

Options related to GetFileInfoGenerator interface.

EXPERIMENTAL: The maximum number of directories processed in parallel by GetFileInfoGenerator.

int32_t file_info_batch_size = kDefaultFileInfoBatchSize

EXPERIMENTAL: The maximum number of entries aggregated into each FileInfoVector chunk by GetFileInfoGenerator.

Since each FileInfo entry needs a separate stat system call, a directory with a very large number of files may take a lot of time to process entirely. By generating a FileInfoVector after this chunk size is reached, we ensure FileInfo entries can start being consumed from the FileInfoGenerator with less initial latency.
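
The effect of the batch size can be seen in the following standalone sketch, which splits a list of entries into consecutive chunks. ChunkEntries is a hypothetical helper. Unlike the real generator, it builds all chunks eagerly rather than yielding them asynchronously.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split `entries` into consecutive chunks of at most `batch_size` items,
// so that a consumer could start on the first chunk before the rest is ready.
std::vector<std::vector<std::string>> ChunkEntries(const std::vector<std::string>& entries,
                                                    std::size_t batch_size) {
  std::vector<std::vector<std::string>> chunks;
  if (batch_size == 0) return chunks;  // guard: a zero batch size makes no progress
  for (std::size_t start = 0; start < entries.size(); start += batch_size) {
    const std::size_t stop = std::min(entries.size(), start + batch_size);
    chunks.push_back(
        std::vector<std::string>(entries.begin() + start, entries.begin() + stop));
  }
  return chunks;
}
```

A smaller batch size lowers the latency before the first chunk is available, at the cost of producing more chunks overall.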

Public Static Functions

static LocalFileSystemOptions Defaults()

Initialize with defaults.

class LocalFileSystem : public arrow::fs::FileSystem

A FileSystem implementation accessing files on the local machine.

This class handles only /-separated paths. If desired, conversion from Windows backslash-separated paths should be done by the caller. Details such as symlinks are abstracted away (symlinks are always followed, except when deleting an entry).
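
A caller-side conversion from backslash-separated paths can be as small as the sketch below. ToForwardSlashes is a hypothetical helper. It only swaps separators and makes no attempt to interpret drive letters or UNC prefixes.

```cpp
#include <algorithm>
#include <string>

// Replace every backslash with a forward slash; other characters are kept.
std::string ToForwardSlashes(std::string path) {
  std::replace(path.begin(), path.end(), '\\', '/');
  return path;
}
```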

Public Functions

virtual Result<std::string> NormalizePath(std::string path) override

Normalize path for the given filesystem.

The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).

virtual Result<std::string> PathFromUri(const std::string &uri_string) const override

Ensure a URI (or path) is compatible with the given filesystem and return the path.

This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.

uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.

Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.

Parameters:

uri_string – A URI representing a resource in the given filesystem.

Returns:

The path inside the filesystem that is indicated by the URI.

virtual Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

virtual Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select) override

Streaming async version of GetFileInfo.

The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.

virtual Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

virtual Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok = false) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path ("" or "/") is disallowed, see DeleteRootDirContents.

virtual Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

virtual Status DeleteFile(const std::string &path) override

Delete a file.

virtual Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

virtual Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.

S3 filesystem

struct S3Options

Options for the S3FileSystem implementation.

Public Functions

void ConfigureDefaultCredentials()

Configure with the default AWS credentials provider chain.

void ConfigureAnonymousCredentials()

Configure with anonymous credentials. This will only let you access public buckets.

void ConfigureAccessKey(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")

Configure with explicit access and secret key.

void ConfigureAssumeRoleCredentials(const std::string &role_arn, const std::string &session_name = "", const std::string &external_id = "", int load_frequency = 900, const std::shared_ptr<Aws::STS::STSClient> &stsClient = NULLPTR)

Configure with credentials from an assumed role.

void ConfigureAssumeRoleWithWebIdentityCredentials()

Configure with credentials from a role assumed using a web identity token.

Public Members

std::string region

AWS region to connect to.

If unset, the AWS SDK will choose a default value. The exact algorithm depends on the SDK version. Before 1.8, the default is hardcoded to “us-east-1”. Since 1.8, several heuristics are used to determine the region (environment variables, configuration profile, EC2 metadata server).

double connect_timeout = -1

Socket connection timeout, in seconds.

If negative, the AWS SDK default value is used (typically 1 second).

double request_timeout = -1

Socket read timeout on Windows and macOS, in seconds.

If negative, the AWS SDK default value is used (typically 3 seconds). This option is ignored on non-Windows, non-macOS systems.

std::string endpoint_override

If non-empty, override region with a connect string such as “localhost:9000”.

std::string scheme = "https"

S3 connection transport, default “https”.

std::string role_arn

ARN of role to assume.

std::string session_name

Optional identifier for an assumed role session.

std::string external_id

Optional external identifier to pass to STS when assuming a role.

int load_frequency = 900

Frequency (in seconds) to refresh temporary credentials from assumed role.

S3ProxyOptions proxy_options

If connection is through a proxy, set options here.

std::shared_ptr<Aws::Auth::AWSCredentialsProvider> credentials_provider

AWS credentials provider.

S3CredentialsKind credentials_kind = S3CredentialsKind::Default

Type of credentials being used. Set along with credentials_provider.

bool background_writes = true

Whether OutputStream writes will be issued in the background, without blocking.

bool allow_bucket_creation = false

Whether to allow creation of buckets.

When S3FileSystem creates new buckets, it does not pass any non-default settings. In AWS S3, the bucket and all objects will not be publicly visible, and there will be no bucket policies and no resource tags. To have more control over how buckets are created, use a different API to create them.

bool allow_bucket_deletion = false

Whether to allow deletion of buckets.

std::shared_ptr<const KeyValueMetadata> default_metadata

Default metadata for OpenOutputStream.

This will be ignored if non-empty metadata is passed to OpenOutputStream.
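
The precedence rule can be stated as a one-line selection, shown here as a standalone sketch. Metadata and EffectiveMetadata are hypothetical, with a std::map standing in for KeyValueMetadata. As described above, the two sets are not merged.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for KeyValueMetadata.
typedef std::map<std::string, std::string> Metadata;

// Metadata passed by the caller wins whenever it is non-empty;
// otherwise the filesystem-wide default applies.
Metadata EffectiveMetadata(const Metadata& passed, const Metadata& defaults) {
  return passed.empty() ? defaults : passed;
}
```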

std::shared_ptr<S3RetryStrategy> retry_strategy

Optional retry strategy to determine which error types should be retried, and the delay between retries.

Public Static Functions

static S3Options Defaults()

Initialize with default credentials provider chain.

This is recommended if you use the standard AWS environment variables and/or configuration file.

static S3Options Anonymous()

Initialize with anonymous credentials.

This will only let you access public buckets.

static S3Options FromAccessKey(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")

Initialize with explicit access and secret key.

Optionally, a session token may also be provided for temporary credentials (from STS).

static S3Options FromAssumeRole(const std::string &role_arn, const std::string &session_name = "", const std::string &external_id = "", int load_frequency = 900, const std::shared_ptr<Aws::STS::STSClient> &stsClient = NULLPTR)

Initialize from an assumed role.

static S3Options FromAssumeRoleWithWebIdentity()

Initialize from an assumed role with web-identity.

Uses the AWS SDK, which reads environment variables to generate temporary credentials.

class S3FileSystem : public arrow::fs::FileSystem

S3-backed FileSystem implementation.

Some implementation notes:

  • buckets are special and the operations available on them may be limited or more expensive than desired.

Public Functions

S3Options options() const

Return the S3 options originally used to construct the filesystem.

std::string region() const

Return the actual region this filesystem connects to.

virtual Result<std::string> PathFromUri(const std::string &uri_string) const override

Ensure a URI (or path) is compatible with the given filesystem and return the path.

This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.

uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.

Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.

Parameters:

uri_string – A URI representing a resource in the given filesystem.

Returns:

The path inside the filesystem that is indicated by the URI.

virtual Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

virtual Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select) override

Streaming async version of GetFileInfo.

The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.

virtual Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

virtual Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok = false) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path ("" or "/") is disallowed, see DeleteRootDirContents.

virtual Future<> DeleteDirContentsAsync(const std::string &path, bool missing_dir_ok = false) override

Async version of DeleteDirContents.

virtual Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

virtual Status DeleteFile(const std::string &path) override

Delete a file.

virtual Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

virtual Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Create a sequential input stream for reading from an S3 object.

NOTE: Reads from the stream will be synchronous and unbuffered. You may want to wrap the stream in a BufferedInputStream or use a custom readahead strategy to avoid idle waits.
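
To show why buffering matters for an unbuffered source, the standalone model below counts how many requests reach the source with and without a buffering wrapper. CountingSource and BufferedReader are hypothetical classes written for this illustration. They are not Arrow's BufferedInputStream and do not talk to S3.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical unbuffered source: every Read() stands for one remote request.
class CountingSource {
 public:
  explicit CountingSource(std::string data)
      : data_(std::move(data)), pos_(0), requests_(0) {}

  std::string Read(std::size_t nbytes) {
    ++requests_;
    const std::string out = data_.substr(pos_, nbytes);
    pos_ += out.size();
    return out;
  }

  int requests() const { return requests_; }

 private:
  std::string data_;
  std::size_t pos_;
  int requests_;
};

// Hypothetical buffering wrapper: refills an internal buffer in large chunks
// and serves small reads from memory.
class BufferedReader {
 public:
  BufferedReader(CountingSource* source, std::size_t buffer_size)
      : source_(source), buffer_size_(buffer_size), offset_(0) {}

  std::string Read(std::size_t nbytes) {
    std::string out;
    while (out.size() < nbytes) {
      if (offset_ == buffer_.size()) {  // buffer exhausted: one request refills it
        buffer_ = source_->Read(buffer_size_);
        offset_ = 0;
        if (buffer_.empty()) break;  // end of data
      }
      const std::size_t take = std::min(nbytes - out.size(), buffer_.size() - offset_);
      out.append(buffer_, offset_, take);
      offset_ += take;
    }
    return out;
  }

 private:
  CountingSource* source_;
  std::size_t buffer_size_;
  std::string buffer_;
  std::size_t offset_;
};
```

In this model, reading 1000 bytes one byte at a time costs 1000 requests directly, but only 4 through a 256-byte buffer.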

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info) override

Create a sequential input stream for reading from an S3 object.

This override avoids a HEAD request by assuming the FileInfo contains correct information.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Create a random access file for reading from an S3 object.

See OpenInputStream for performance notes.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info) override

Create a random access file for reading from an S3 object.

This override avoids a HEAD request by assuming the FileInfo contains correct information.

virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Create a sequential output stream for writing to an S3 object.

NOTE: Writes to the stream will be buffered. Depending on S3Options.background_writes, they can be synchronous or not. It is recommended to enable background_writes unless you prefer implementing your own background execution strategy.

virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.

Public Static Functions

static Result<std::shared_ptr<S3FileSystem>> Make(const S3Options &options, const io::IOContext& = io::default_io_context())

Create an S3FileSystem instance from the given options.

Hadoop filesystem

struct HdfsOptions

Options for the HDFS implementation.

Public Members

io::HdfsConnectionConfig connection_config

HDFS connection configuration options, including host, port and driver.

int32_t buffer_size = 0

Buffer size used by the HDFS OpenWritable interface.

class HadoopFileSystem : public arrow::fs::FileSystem

HDFS-backed FileSystem implementation.

Implementation notes:

  • This is a wrapper around arrow/io/hdfs, so that HDFS can be accessed through the FileSystem API.

Public Functions

virtual Result<std::string> PathFromUri(const std::string &uri_string) const override

Ensure a URI (or path) is compatible with the given filesystem and return the path.

This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.

uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.

Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.

Parameters:

uri_string – A URI representing a resource in the given filesystem.

Returns:

The path inside the filesystem that is indicated by the URI.

virtual Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

virtual Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

virtual Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

virtual Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok = false) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed; see DeleteRootDirContents.

virtual Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

virtual Status DeleteFile(const std::string &path) override

Delete a file.

virtual Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).
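The destination rules above can be read as a small decision table. The following is a minimal sketch that models them in plain C++. EntryKind, MoveOutcome and ClassifyMove are hypothetical names used only for illustration and are not part of Arrow; the sketch also assumes that a move onto a nonexistent destination simply proceeds, which the rules above do not state explicitly.

```cpp
#include <cassert>

// Hypothetical model of what exists at a path (illustration only).
enum class EntryKind { NotFound, File, EmptyDirectory, NonEmptyDirectory };

// Hypothetical summary of what the documented rules say will happen.
enum class MoveOutcome { Proceeds, Error, ReplacesDestination, Unspecified };

// Classify a move of `src` onto `dest` according to the documented rules.
// Precondition: the source exists.
MoveOutcome ClassifyMove(EntryKind src, EntryKind dest) {
  assert(src != EntryKind::NotFound);
  // Assumption: nothing to replace, so the move just proceeds.
  if (dest == EntryKind::NotFound) return MoveOutcome::Proceeds;
  // Rule 1: a non-empty directory destination is an error.
  if (dest == EntryKind::NonEmptyDirectory) return MoveOutcome::Error;
  // Rule 2: same type as the source means the destination is replaced.
  const bool src_is_dir = (src != EntryKind::File);
  const bool dest_is_dir = (dest != EntryKind::File);
  if (src_is_dir == dest_is_dir) return MoveOutcome::ReplacesDestination;
  // Rule 3: anything else is implementation-dependent.
  return MoveOutcome::Unspecified;
}
```

Callers that need portable behavior across implementations may want to avoid the Unspecified case entirely, for example by checking the destination type with GetFileInfo before moving.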

virtual Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata = {}) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.

Public Static Functions

static Result<std::shared_ptr<HadoopFileSystem>> Make(const HdfsOptions &options, const io::IOContext& = io::default_io_context())

Create a HadoopFileSystem instance from the given options.

Google Cloud Storage filesystem

struct GcsOptions

Options for the GcsFileSystem implementation.

Public Functions

GcsOptions()

Equivalent to GcsOptions::Defaults().

Public Members

std::string default_bucket_location

Location to use for creating buckets.

std::optional<double> retry_limit_seconds

If set, controls the total time allowed for retrying underlying errors.

The default policy is to retry for up to 15 minutes.

std::shared_ptr<const KeyValueMetadata> default_metadata

Default metadata for OpenOutputStream.

This will be ignored if non-empty metadata is passed to OpenOutputStream.

std::optional<std::string> project_id

The project to use for creating buckets.

If not set, the library uses the GOOGLE_CLOUD_PROJECT environment variable. Most I/O operations do not need a project id; only applications that create new buckets need one.
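The fallback order described above (explicit value first, then the environment) can be sketched as follows. This is an illustration only: ResolveProjectId and EnvLookup are hypothetical names, not Arrow or Google Cloud APIs, and treating an empty environment value as "not set" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>

// Hypothetical lookup hook so the environment can be replaced in tests.
using EnvLookup = std::function<const char*(const char*)>;

// Return the explicit project id if present; otherwise fall back to the
// GOOGLE_CLOUD_PROJECT environment variable; otherwise return nullopt.
std::optional<std::string> ResolveProjectId(
    const std::optional<std::string>& explicit_id,
    const EnvLookup& lookup = [](const char* name) -> const char* {
      return std::getenv(name);
    }) {
  assert(lookup != nullptr);
  if (explicit_id.has_value()) return explicit_id;
  const char* from_env = lookup("GOOGLE_CLOUD_PROJECT");
  // Assumption: an empty value is treated the same as an unset variable.
  if (from_env != nullptr && *from_env != '\0') return std::string(from_env);
  return std::nullopt;
}
```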

Public Static Functions

static GcsOptions Defaults()

Initialize with Google Default Credentials.

Create options configured to use Application Default Credentials. The details of this mechanism are too involved to describe here, but suffice it to say that applications can override any defaults using an environment variable (GOOGLE_APPLICATION_CREDENTIALS), that the defaults work with most Google Cloud Platform deployment environments (GCE, GKE, Cloud Run, etc.), and that they behave the same as the gcloud CLI tool on your workstation.

static GcsOptions Anonymous()

Initialize with anonymous credentials.

static GcsOptions FromAccessToken(const std::string &access_token, TimePoint expiration)

Initialize with access token.

These credentials are useful when using an out-of-band mechanism to fetch access tokens. Note that access tokens are time-limited; you will need to manually refresh the tokens created by the out-of-band mechanism.
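Because the token carries an expiration, a caller typically decides ahead of time when to fetch a new one. A minimal sketch of such a check follows. NeedsRefresh and the 60-second default margin are hypothetical choices for illustration, and the TimePoint alias here is assumed to be a nanosecond-resolution system-clock time point.

```cpp
#include <cassert>
#include <chrono>

// Assumption: a system-clock time point with nanosecond resolution.
using TimePoint =
    std::chrono::time_point<std::chrono::system_clock, std::chrono::nanoseconds>;

// Hypothetical helper: true when the token expires within `margin` of `now`,
// meaning a fresh token should be fetched before issuing more requests.
bool NeedsRefresh(TimePoint expiration, TimePoint now,
                  std::chrono::seconds margin = std::chrono::seconds(60)) {
  assert(margin.count() >= 0);
  return now + margin >= expiration;
}
```

When the check returns true, the application would obtain a new token through its out-of-band mechanism and build new options from it.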

static GcsOptions FromImpersonatedServiceAccount(const GcsCredentials &base_credentials, const std::string &target_service_account)

Initialize with service account impersonation.

Service account impersonation allows one principal (a user or service account) to impersonate a service account. It requires that the calling principal has the necessary permissions on the service account.

static GcsOptions FromServiceAccountCredentials(const std::string &json_object)

Creates service account credentials from a JSON object in string form.

The json_object is expected to be in the format described by aip/4112. Such an object contains the identity of a service account, as well as a private key that can be used to sign tokens, showing the caller was holding the private key.

In GCP one can create several “keys” for each service account, and these keys are downloaded as a JSON “key file”. The contents of such a file are in the format required by this function. Remember that key files and their contents should be treated as any other secret with security implications. Think of them as passwords (because they are!), and don’t store or output them where unauthorized persons may read them.

Most applications should probably use default credentials, maybe pointing them to a file with these contents. Using this function may be useful when the JSON object is obtained from a Cloud Secret Manager or a similar service.

static Result<GcsOptions> FromUri(const arrow::internal::Uri &uri, std::string *out_path)

Initialize from URIs such as “gs://bucket/object”.

class GcsFileSystem : public arrow::fs::FileSystem

GCS-backed FileSystem implementation.

GCS (Google Cloud Storage - https://cloud.google.com/storage) is a scalable object storage system for any amount of data. The main abstractions in GCS are buckets and objects. A bucket is a namespace for objects. Buckets can store any number of objects; tens of millions or even billions are not uncommon. Each object contains a single blob of data, up to 5TiB in size. Buckets are typically configured to keep a single version of each object, but versioning can be enabled. Versioning is important because objects are immutable: once an object is created, one cannot append data to it or modify its data in any way.

GCS buckets are in a global namespace: if a Google Cloud customer creates a bucket named foo, no other customer can create a bucket with the same name. Note that a principal (a user or service account) may only list the buckets they are entitled to, and then only within a project. It is not possible to list “all” the buckets.

Within each bucket, objects are in a flat namespace. GCS does not have folders or directories. However, by following some conventions it is possible to emulate directories. To this end, this class relies on the following:

  • All buckets are treated as directories at the “root”.

  • Creating a root directory results in a new bucket being created; this may be slower than most GCS operations.

  • The class creates marker objects for a directory, using a metadata attribute to annotate the file.

  • GCS can list all the objects with a given prefix; this is used to emulate listing of directories.

  • In object lists, GCS can summarize all the objects with a common prefix as a single entry; this is used to emulate non-recursive lists. Note that GCS list time is proportional to the number of objects in the prefix. Listing recursively takes almost the same time as non-recursive lists.
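The prefix-based emulation can be illustrated with a toy model of a single bucket. The sketch below is not the GcsFileSystem implementation: ListDir is a hypothetical function, the bucket is modeled as a sorted set of object names, marker objects and metadata are not modeled, and in recursive mode only objects (not intermediate directories) are returned.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model: list the entries "inside" `dir` within a flat, sorted set of
// object names. `dir` has no trailing slash; an empty `dir` means the root.
std::vector<std::string> ListDir(const std::set<std::string>& objects,
                                 const std::string& dir, bool recursive) {
  assert(dir.empty() || dir.back() != '/');
  const std::string prefix = dir.empty() ? std::string() : dir + "/";
  std::set<std::string> entries;  // de-duplicates summarized prefixes
  // In a sorted set, all names sharing `prefix` form one contiguous range.
  for (auto it = objects.lower_bound(prefix);
       it != objects.end() && it->compare(0, prefix.size(), prefix) == 0;
       ++it) {
    const std::string rest = it->substr(prefix.size());
    if (rest.empty()) continue;  // an object named exactly like the prefix
    const std::string::size_type slash = rest.find('/');
    if (recursive || slash == std::string::npos) {
      entries.insert(prefix + rest);
    } else {
      // Non-recursive: everything below the first '/' collapses into a
      // single entry, which plays the role of a subdirectory.
      entries.insert(prefix + rest.substr(0, slash));
    }
  }
  return std::vector<std::string>(entries.begin(), entries.end());
}
```

Note that both modes scan the same range of names, which mirrors the remark above that recursive and non-recursive listings take almost the same time.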

Public Functions

virtual Result<std::string> PathFromUri(const std::string &uri_string) const override

Ensure a URI (or path) is compatible with the given filesystem and return the path.

This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.

uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.

Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.

Parameters:

uri_string – A URI representing a resource in the given filesystem.

Returns:

The path inside the filesystem that is indicated by the URI.
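As a rough illustration of the scheme check, the following sketch accepts a URI only when its scheme matches and returns what follows the scheme separator. PathFromUriSketch is a hypothetical helper, not the Arrow method: it does not handle absolute paths, query parameters or separator normalization, and it assumes the in-filesystem path is simply everything after "scheme://".

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper: return the part of `uri` after "<scheme>://" when the
// scheme matches `expected_scheme`, or nullopt when it does not.
std::optional<std::string> PathFromUriSketch(const std::string& uri,
                                             const std::string& expected_scheme) {
  assert(!expected_scheme.empty());
  const std::string head = expected_scheme + "://";
  // Only the scheme is checked; nothing else about the URI is validated.
  if (uri.compare(0, head.size(), head) != 0) return std::nullopt;
  return uri.substr(head.size());
}
```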

virtual Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

virtual Result<FileInfoVector> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

virtual Status CreateDir(const std::string &path, bool recursive) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

virtual Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok = false) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed; see DeleteRootDirContents.

virtual Status DeleteRootDirContents() override

This is not implemented in GcsFileSystem, as it would be too dangerous.

virtual Status DeleteFile(const std::string &path) override

Delete a file.

virtual Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

virtual Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info) override

Open an input stream for sequential reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info) override

Open an input file for random access reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
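One way to follow the "multiple files" suggestion is to write each batch of data to its own numbered object instead of appending. The sketch below only generates such names. PartPath and the "part-NNNNN" naming scheme are hypothetical conventions chosen for illustration, not something this API prescribes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical naming scheme: "<dir>/part-<5-digit index>.<ext>".
// Each call names a new object, so no existing object needs to be appended to.
std::string PartPath(const std::string& dir, int index, const std::string& ext) {
  assert(index >= 0);
  char name[32];
  std::snprintf(name, sizeof(name), "part-%05d", index);
  return dir + "/" + name + "." + ext;
}
```

Each generated path could then be passed to OpenOutputStream, and readers would treat the directory as a single logical dataset.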

Public Static Functions

static std::shared_ptr<GcsFileSystem> Make(const GcsOptions &options, const io::IOContext& = io::default_io_context())

Create a GcsFileSystem instance from the given options.