Filesystems
Interface
-
enum
arrow::fs
::
FileType
¶ FileSystem entry type.
Values:
-
enumerator
NotFound
¶ Entry is not found.
-
enumerator
Unknown
¶ Entry exists but its type is unknown.
This can designate a special file such as a Unix socket or character device, or Windows NUL / CON / …
-
enumerator
File
¶ Entry is a regular file.
-
enumerator
Directory
¶ Entry is a directory.
-
struct
arrow::fs
::
FileInfo
: public arrow::util::EqualityComparable<FileInfo>¶ FileSystem entry info.
Public Functions
-
inline const std::string &
path
() const¶ The full file path in the filesystem.
-
std::string
base_name
() const¶ The file base name (component after the last directory separator).
-
inline int64_t
size
() const¶ The size in bytes, if available.
Only regular files are guaranteed to have a size.
-
std::string
extension
() const¶ The file extension (excluding the dot).
-
inline TimePoint
mtime
() const¶ The time of last modification, if available.
-
struct
ByPath
¶ Function object implementing less-than comparison and hashing by path, to support sorting infos, using them as keys, and other interactions with the STL.
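As a rough illustration of what base_name(), extension() and ByPath are described as doing, the sketch below re-implements them on plain strings. Entry, BaseName, Extension, LessByPath and SortedBaseNames are hypothetical names, not part of Arrow, and Arrow's handling of edge cases (trailing separators, dotfiles) may differ.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a filesystem entry; only the path is modeled.
struct Entry {
  std::string path;
};

// Component after the last '/' (the whole path if there is no separator).
std::string BaseName(const std::string& path) {
  const auto pos = path.find_last_of('/');
  return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Text after the last '.' of the base name, without the dot; empty if none.
std::string Extension(const std::string& path) {
  const std::string base = BaseName(path);
  const auto dot = base.find_last_of('.');
  return dot == std::string::npos ? std::string() : base.substr(dot + 1);
}

// Less-than comparison by path, in the spirit of FileInfo::ByPath.
struct LessByPath {
  bool operator()(const Entry& a, const Entry& b) const {
    return a.path < b.path;
  }
};

// Sorts entries by path and returns their base names in that order.
std::vector<std::string> SortedBaseNames(std::vector<Entry> entries) {
  std::sort(entries.begin(), entries.end(), LessByPath{});
  std::vector<std::string> names;
  for (const auto& e : entries) names.push_back(BaseName(e.path));
  return names;
}
```

A comparator of this shape can also be passed to std::set or std::map, which is the kind of STL interaction the ByPath description refers to.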
-
struct
arrow::fs
::
FileSelector
¶ File selector for filesystem APIs.
Public Members
-
std::string
base_dir
¶ The directory in which to select files.
If the path exists but doesn’t point to a directory, this should be an error.
-
bool
allow_not_found
¶ The behavior if base_dir isn’t found in the filesystem.
If false, an error is returned. If true, an empty selection is returned.
-
bool
recursive
¶ Whether to recurse into subdirectories.
-
int32_t
max_recursion
¶ The maximum number of subdirectories to recurse into.
-
class
arrow::fs
::
FileSystem
: public std::enable_shared_from_this<FileSystem>¶ Abstract file system API.
Subclassed by arrow::fs::GcsFileSystem, arrow::fs::HadoopFileSystem, arrow::fs::internal::MockFileSystem, arrow::fs::LocalFileSystem, arrow::fs::S3FileSystem, arrow::fs::SlowFileSystem, arrow::fs::SubTreeFileSystem, arrow::py::fs::PyFileSystem
Public Functions
-
inline const io::IOContext &
io_context
() const¶ EXPERIMENTAL: The IOContext associated with this filesystem.
-
virtual Result<std::string>
NormalizePath
(std::string path)¶ Normalize path for the given filesystem.
The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).
-
virtual Result<FileInfo>
GetFileInfo
(const std::string &path) = 0¶ Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
-
virtual Result<FileInfoVector>
GetFileInfo
(const std::vector<std::string> &paths)¶ Same, for many targets at once.
-
virtual Result<FileInfoVector>
GetFileInfo
(const FileSelector &select) = 0¶ Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
-
virtual Future<FileInfoVector>
GetFileInfoAsync
(const std::vector<std::string> &paths)¶ Async version of GetFileInfo.
-
virtual FileInfoGenerator
GetFileInfoGenerator
(const FileSelector &select)¶ Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
-
virtual Status
CreateDir
(const std::string &path, bool recursive = true) = 0¶ Create a directory and subdirectories.
This function succeeds if the directory already exists.
-
virtual Status
DeleteDir
(const std::string &path) = 0¶ Delete a directory and its contents, recursively.
-
virtual Status
DeleteDirContents
(const std::string &path) = 0¶ Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
-
virtual Status
DeleteRootDirContents
() = 0¶ EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
-
virtual Status
DeleteFiles
(const std::vector<std::string> &paths)¶ Delete many files.
The default implementation issues individual delete operations in sequence.
-
virtual Status
Move
(const std::string &src, const std::string &dest) = 0¶ Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
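The destination rules above can be read as a small decision table. The sketch below encodes them as a pure function. Kind, MoveOutcome and DecideMove are hypothetical names introduced for illustration, and the source is assumed to exist.

```cpp
// Entry kinds relevant to the rules; NotFound means "no destination yet".
enum class Kind { NotFound, File, Directory };

// Possible outcomes according to the documented rules.
enum class MoveOutcome { Proceed, Replace, Error, Unspecified };

// Hypothetical decision helper. `dest_is_empty` is only meaningful when
// `dest` is a Directory. `src` is assumed to be File or Directory.
MoveOutcome DecideMove(Kind src, Kind dest, bool dest_is_empty) {
  // No destination: nothing to replace, the move simply goes ahead.
  if (dest == Kind::NotFound) return MoveOutcome::Proceed;
  // Rule 1: a non-empty destination directory is an error.
  if (dest == Kind::Directory && !dest_is_empty) return MoveOutcome::Error;
  // Rule 2: same type as the source, so the destination is replaced.
  if (dest == src) return MoveOutcome::Replace;
  // Rule 3: anything else is implementation-dependent.
  return MoveOutcome::Unspecified;
}
```

Portable code should avoid relying on the Unspecified cases, for example moving a file onto an empty directory.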
-
virtual Status
CopyFile
(const std::string &src, const std::string &dest) = 0¶ Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const std::string &path) = 0¶ Open an input stream for sequential reading.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const FileInfo &info)¶ Open an input stream for sequential reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const std::string &path) = 0¶ Open an input file for random access reading.
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const FileInfo &info)¶ Open an input file for random access reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
-
virtual Future<std::shared_ptr<io::InputStream>>
OpenInputStreamAsync
(const std::string &path)¶ Async version of OpenInputStream.
-
virtual Future<std::shared_ptr<io::InputStream>>
OpenInputStreamAsync
(const FileInfo &info)¶ Async version of OpenInputStream.
-
virtual Future<std::shared_ptr<io::RandomAccessFile>>
OpenInputFileAsync
(const std::string &path)¶ Async version of OpenInputFile.
-
virtual Future<std::shared_ptr<io::RandomAccessFile>>
OpenInputFileAsync
(const FileInfo &info)¶ Async version of OpenInputFile.
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
High-level factory function
-
Result<std::shared_ptr<FileSystem>>
FileSystemFromUri
(const std::string &uri, std::string *out_path = NULLPTR)¶ Create a new FileSystem by URI.
Recognized schemes are “file”, “mock”, “hdfs” and “s3”.
- Parameters
[in] uri – a URI-based path, ex: file:///some/local/path
[out] out_path – (optional) Path inside the filesystem.
- Returns
A FileSystem instance.
-
Result<std::shared_ptr<FileSystem>>
FileSystemFromUri
(const std::string &uri, const io::IOContext &io_context, std::string *out_path = NULLPTR)¶ Create a new FileSystem by URI with a custom IO context.
Recognized schemes are “file”, “mock”, “hdfs” and “s3”.
- Parameters
[in] uri – a URI-based path, ex: file:///some/local/path
[in] io_context – an IOContext which will be associated with the filesystem
[out] out_path – (optional) Path inside the filesystem.
- Returns
A FileSystem instance.
-
Result<std::shared_ptr<FileSystem>>
FileSystemFromUriOrPath
(const std::string &uri, std::string *out_path = NULLPTR)¶ Create a new FileSystem by URI.
Same as FileSystemFromUri, but in addition also recognize non-URIs and treat them as local filesystem paths. Only absolute local filesystem paths are allowed.
-
Result<std::shared_ptr<FileSystem>>
FileSystemFromUriOrPath
(const std::string &uri, const io::IOContext &io_context, std::string *out_path = NULLPTR)¶ Create a new FileSystem by URI with a custom IO context.
Same as FileSystemFromUri, but in addition also recognize non-URIs and treat them as local filesystem paths. Only absolute local filesystem paths are allowed.
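The distinction between a URI, an absolute local path and a rejected relative path can be sketched with plain string handling. ParseUriOrPath and Parsed are hypothetical names. The sketch deliberately models only “file” URIs and POSIX-style absolute paths, so it is a simplification and not Arrow's actual parsing logic.

```cpp
#include <optional>
#include <string>

// Hypothetical result: the scheme and the path inside the filesystem
// (the role played by out_path in the functions above).
struct Parsed {
  std::string scheme;
  std::string path;
};

// Simplified sketch: accepts "file://..." URIs and absolute POSIX paths.
// Other schemes are not modeled here and yield std::nullopt, as do
// relative paths, which the documentation says are not allowed.
std::optional<Parsed> ParseUriOrPath(const std::string& input) {
  const std::string prefix = "file://";
  if (input.compare(0, prefix.size(), prefix) == 0) {
    // Skip the (possibly empty) authority; the path starts at the next '/'.
    const auto slash = input.find('/', prefix.size());
    if (slash == std::string::npos) return std::nullopt;
    return Parsed{"file", input.substr(slash)};
  }
  if (input.find("://") != std::string::npos) {
    return std::nullopt;  // some other scheme, outside this sketch
  }
  if (!input.empty() && input.front() == '/') {
    return Parsed{"file", input};  // non-URI absolute local path
  }
  return std::nullopt;  // relative path: rejected
}
```

For the documented example file:///some/local/path, the sketch yields the scheme “file” and the path /some/local/path.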
Concrete implementations
-
class
arrow::fs
::
SubTreeFileSystem
: public arrow::fs::FileSystem¶ A FileSystem implementation that delegates to another implementation after prepending a fixed base path.
This is useful to expose a logical view of a subtree of a filesystem, for example a directory in a LocalFileSystem. This works on abstract paths, i.e. paths using forward slashes and a single root “/”. Windows paths are not guaranteed to work. This makes no security guarantee. For example, symlinks may make it possible to “escape” the subtree and access other parts of the underlying filesystem.
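The path translation this describes can be pictured with two string helpers: one prepends the base path before delegating, the other strips it from results coming back. PrependBase and StripBase are hypothetical names. The sketch assumes a base path without a trailing slash and, like the class itself, makes no attempt at security (for example, it does not resolve “..”).

```cpp
#include <optional>
#include <string>

// Maps a path inside the subtree to a path in the underlying filesystem.
// `base` is assumed to look like "/data/project" (no trailing slash).
std::string PrependBase(const std::string& base, const std::string& path) {
  if (path.empty()) return base;
  if (path.front() == '/') return base + path;
  return base + "/" + path;
}

// Inverse mapping for results: strips `base` from an underlying path.
// Returns std::nullopt if `full` does not lie under `base`.
std::optional<std::string> StripBase(const std::string& base,
                                     const std::string& full) {
  if (full == base) return std::string();
  const bool under_base = full.size() > base.size() &&
                          full.compare(0, base.size(), base) == 0 &&
                          full[base.size()] == '/';
  if (!under_base) return std::nullopt;
  return full.substr(base.size() + 1);
}
```

Checking for the separator right after the base prevents /data/projectX from being mistaken for a child of /data/project.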
Public Functions
-
virtual Result<std::string>
NormalizePath
(std::string path) override¶ Normalize path for the given filesystem.
The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).
-
virtual Result<FileInfo>
GetFileInfo
(const std::string &path) override¶ Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
-
virtual Result<FileInfoVector>
GetFileInfo
(const FileSelector &select) override¶ Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
-
virtual FileInfoGenerator
GetFileInfoGenerator
(const FileSelector &select) override¶ Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
-
virtual Status
CreateDir
(const std::string &path, bool recursive = true) override¶ Create a directory and subdirectories.
This function succeeds if the directory already exists.
-
virtual Status
DeleteDir
(const std::string &path) override¶ Delete a directory and its contents, recursively.
-
virtual Status
DeleteDirContents
(const std::string &path) override¶ Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
-
virtual Status
DeleteRootDirContents
() override¶ EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
-
virtual Status
Move
(const std::string &src, const std::string &dest) override¶ Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
-
virtual Status
CopyFile
(const std::string &src, const std::string &dest) override¶ Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const std::string &path) override¶ Open an input stream for sequential reading.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const FileInfo &info) override¶ Open an input stream for sequential reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const std::string &path) override¶ Open an input file for random access reading.
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const FileInfo &info) override¶ Open an input file for random access reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
-
virtual Future<std::shared_ptr<io::InputStream>>
OpenInputStreamAsync
(const std::string &path) override¶ Async version of OpenInputStream.
-
virtual Future<std::shared_ptr<io::InputStream>>
OpenInputStreamAsync
(const FileInfo &info) override¶ Async version of OpenInputStream.
-
virtual Future<std::shared_ptr<io::RandomAccessFile>>
OpenInputFileAsync
(const std::string &path) override¶ Async version of OpenInputFile.
-
virtual Future<std::shared_ptr<io::RandomAccessFile>>
OpenInputFileAsync
(const FileInfo &info) override¶ Async version of OpenInputFile.
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
-
struct
arrow::fs
::
LocalFileSystemOptions
¶ Options for the LocalFileSystem implementation.
Public Members
-
bool
use_mmap
= false¶ Whether OpenInputStream and OpenInputFile return a mmap’ed file, or a regular one.
Public Static Functions
-
static LocalFileSystemOptions
Defaults
()¶ Initialize with defaults.
-
class
arrow::fs
::
LocalFileSystem
: public arrow::fs::FileSystem¶ A FileSystem implementation accessing files on the local machine.
This class handles only /-separated paths. If desired, conversion from Windows backslash-separated paths should be done by the caller. Details such as symlinks are abstracted away (symlinks are always followed, except when deleting an entry).
Public Functions
-
virtual Result<std::string>
NormalizePath
(std::string path) override¶ Normalize path for the given filesystem.
The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).
-
virtual Result<FileInfo>
GetFileInfo
(const std::string &path) override¶ Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
-
virtual Result<std::vector<FileInfo>>
GetFileInfo
(const FileSelector &select) override¶ Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
-
virtual Status
CreateDir
(const std::string &path, bool recursive = true) override¶ Create a directory and subdirectories.
This function succeeds if the directory already exists.
-
virtual Status
DeleteDir
(const std::string &path) override¶ Delete a directory and its contents, recursively.
-
virtual Status
DeleteDirContents
(const std::string &path) override¶ Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
-
virtual Status
DeleteRootDirContents
() override¶ EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
-
virtual Status
Move
(const std::string &src, const std::string &dest) override¶ Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
-
virtual Status
CopyFile
(const std::string &src, const std::string &dest) override¶ Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const std::string &path) override¶ Open an input stream for sequential reading.
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const std::string &path) override¶ Open an input file for random access reading.
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
-
struct
arrow::fs
::
S3Options
¶ Options for the S3FileSystem implementation.
Public Functions
-
void
ConfigureDefaultCredentials
()¶ Configure with the default AWS credentials provider chain.
-
void
ConfigureAnonymousCredentials
()¶ Configure with anonymous credentials. This will only let you access public buckets.
-
void
ConfigureAccessKey
(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")¶ Configure with explicit access and secret key.
Configure with credentials from an assumed role.
-
void
ConfigureAssumeRoleWithWebIdentityCredentials
()¶ Configure with credentials from a role assumed using a web identity token.
Public Members
-
std::string
region
¶ AWS region to connect to.
If unset, the AWS SDK will choose a default value. The exact algorithm depends on the SDK version. Before 1.8, the default is hardcoded to “us-east-1”. Since 1.8, several heuristics are used to determine the region (environment variables, configuration profile, EC2 metadata server).
-
std::string
endpoint_override
¶ If non-empty, override region with a connect string such as “localhost:9000”.
-
std::string
scheme
= "https"¶ S3 connection transport, default “https”.
-
std::string
role_arn
¶ ARN of role to assume.
-
std::string
session_name
¶ Optional identifier for an assumed role session.
-
std::string
external_id
¶ Optional external identifier to pass to STS when assuming a role.
-
int
load_frequency
¶ Frequency (in seconds) to refresh temporary credentials from assumed role.
-
S3ProxyOptions
proxy_options
¶ If connection is through a proxy, set options here.
-
std::shared_ptr<Aws::Auth::AWSCredentialsProvider>
credentials_provider
¶ AWS credentials provider.
-
S3CredentialsKind
credentials_kind
= S3CredentialsKind::Default¶ Type of credentials being used. Set along with credentials_provider.
-
bool
background_writes
= true¶ Whether OutputStream writes will be issued in the background, without blocking.
-
std::shared_ptr<const KeyValueMetadata>
default_metadata
¶ Default metadata for OpenOutputStream.
This will be ignored if non-empty metadata is passed to OpenOutputStream.
-
std::shared_ptr<S3RetryStrategy>
retry_strategy
¶ Optional retry strategy to determine which error types should be retried, and the delay between retries.
Public Static Functions
-
static S3Options
Defaults
()¶ Initialize with default credentials provider chain.
This is recommended if you use the standard AWS environment variables and/or configuration file.
-
static S3Options
Anonymous
()¶ Initialize with anonymous credentials.
This will only let you access public buckets.
-
static S3Options
FromAccessKey
(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")¶ Initialize with explicit access and secret key.
Optionally, a session token may also be provided for temporary credentials (from STS).
Initialize from an assumed role.
-
class
arrow::fs
::
S3FileSystem
: public arrow::fs::FileSystem¶ S3-backed FileSystem implementation.
Some implementation notes:
buckets are special and the operations available on them may be limited or more expensive than desired.
Public Functions
-
std::string
region
() const¶ Return the actual region this filesystem connects to.
-
virtual Result<FileInfo>
GetFileInfo
(const std::string &path) override¶ Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
-
virtual Result<std::vector<FileInfo>>
GetFileInfo
(const FileSelector &select) override¶ Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
-
virtual FileInfoGenerator
GetFileInfoGenerator
(const FileSelector &select) override¶ Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
-
virtual Status
CreateDir
(const std::string &path, bool recursive = true) override¶ Create a directory and subdirectories.
This function succeeds if the directory already exists.
-
virtual Status
DeleteDir
(const std::string &path) override¶ Delete a directory and its contents, recursively.
-
virtual Status
DeleteDirContents
(const std::string &path) override¶ Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
-
virtual Status
DeleteRootDirContents
() override¶ EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
-
virtual Status
Move
(const std::string &src, const std::string &dest) override¶ Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
-
virtual Status
CopyFile
(const std::string &src, const std::string &dest) override¶ Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const std::string &path) override¶ Create a sequential input stream for reading from a S3 object.
NOTE: Reads from the stream will be synchronous and unbuffered. You may want to wrap the stream in a BufferedInputStream or use a custom readahead strategy to avoid idle waits.
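To show why buffering helps with an unbuffered, synchronous source, the following self-contained model counts how many underlying reads are issued with and without a buffering wrapper. CountingSource and BufferedReader are hypothetical classes written for this illustration. They do not use Arrow's BufferedInputStream, which is what the note above actually suggests.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical unbuffered source: each Read() stands for one blocking
// round trip to remote storage.
class CountingSource {
 public:
  explicit CountingSource(std::string data) : data_(std::move(data)) {}

  std::string Read(std::size_t n) {
    ++reads_;
    const std::size_t take = std::min(n, data_.size() - pos_);
    std::string out = data_.substr(pos_, take);
    pos_ += take;
    return out;
  }

  int reads() const { return reads_; }

 private:
  std::string data_;
  std::size_t pos_ = 0;
  int reads_ = 0;
};

// Hypothetical buffering wrapper: refills in chunks of `buffer_size`
// and serves small reads from memory.
class BufferedReader {
 public:
  BufferedReader(CountingSource* source, std::size_t buffer_size)
      : source_(source), buffer_size_(buffer_size) {}

  std::string Read(std::size_t n) {
    std::string out;
    while (out.size() < n) {
      if (offset_ == buffer_.size()) {
        buffer_ = source_->Read(buffer_size_);  // one round trip
        offset_ = 0;
        if (buffer_.empty()) break;  // end of data
      }
      const std::size_t take =
          std::min(n - out.size(), buffer_.size() - offset_);
      out.append(buffer_, offset_, take);
      offset_ += take;
    }
    return out;
  }

 private:
  CountingSource* source_;
  std::size_t buffer_size_;
  std::string buffer_;
  std::size_t offset_ = 0;
};
```

Reading 100 bytes one byte at a time costs 100 round trips against the bare source in this model, but only 2 through a wrapper with a 64-byte buffer.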
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const FileInfo &info) override¶ Create a sequential input stream for reading from a S3 object.
This override avoids a HEAD request by assuming the FileInfo contains correct information.
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const std::string &path) override¶ Create a random access file for reading from a S3 object.
See OpenInputStream for performance notes.
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const FileInfo &info) override¶ Create a random access file for reading from a S3 object.
This override avoids a HEAD request by assuming the FileInfo contains correct information.
Create a sequential output stream for writing to a S3 object.
NOTE: Writes to the stream will be buffered. Depending on S3Options.background_writes, they can be synchronous or not. It is recommended to enable background_writes unless you prefer implementing your own background execution strategy.
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Public Static Functions
-
static Result<std::shared_ptr<S3FileSystem>>
Make
(const S3Options &options, const io::IOContext& = io::default_io_context())¶ Create a S3FileSystem instance from the given options.
-
struct
arrow::fs
::
HdfsOptions
¶ Options for the HDFS implementation.
-
class
arrow::fs
::
HadoopFileSystem
: public arrow::fs::FileSystem¶ HDFS-backed FileSystem implementation.
Implementation notes:
This is a wrapper around arrow/io/hdfs, so the FileSystem API can be used to access HDFS.
Public Functions
-
virtual Result<FileInfo>
GetFileInfo
(const std::string &path) override¶ Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
-
virtual Result<std::vector<FileInfo>>
GetFileInfo
(const FileSelector &select) override¶ Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
-
virtual Status
CreateDir
(const std::string &path, bool recursive = true) override¶ Create a directory and subdirectories.
This function succeeds if the directory already exists.
-
virtual Status
DeleteDir
(const std::string &path) override¶ Delete a directory and its contents, recursively.
-
virtual Status
DeleteDirContents
(const std::string &path) override¶ Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
-
virtual Status
DeleteRootDirContents
() override¶ EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
-
virtual Status
Move
(const std::string &src, const std::string &dest) override¶ Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
-
virtual Status
CopyFile
(const std::string &src, const std::string &dest) override¶ Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
-
virtual Result<std::shared_ptr<io::InputStream>>
OpenInputStream
(const std::string &path) override¶ Open an input stream for sequential reading.
-
virtual Result<std::shared_ptr<io::RandomAccessFile>>
OpenInputFile
(const std::string &path) override¶ Open an input file for random access reading.
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Public Static Functions
-
static Result<std::shared_ptr<HadoopFileSystem>>
Make
(const HdfsOptions &options, const io::IOContext& = io::default_io_context())¶ Create a HadoopFileSystem instance from the given options.