Input / output and filesystems

Arrow provides a range of C++ interfaces abstracting the concrete details of input / output operations. They operate on streams of untyped binary data. Those abstractions are used for various purposes such as reading CSV or Parquet data, transmitting IPC streams, and more.

Reading binary data

Interfaces for reading binary data come in two flavours:

Sequential reading: the InputStream interface provides Read methods; reading into a Buffer is recommended, as it can avoid a memory copy in some cases.
Random access reading: the RandomAccessFile interface provides additional facilities for positioning and, most importantly, the ReadAt methods which allow parallel reading from multiple threads.

Concrete implementations are available for in-memory reads, unbuffered file reads, memory-mapped file reads, buffered reads, and compressed reads.

Writing binary data

Writing binary data is mostly done through the OutputStream interface.

Concrete implementations are available for in-memory writes, unbuffered file writes, memory-mapped file writes, buffered writes, and compressed writes.

Filesystems

The filesystem interface allows abstracted access to various data storage backends such as the local filesystem or an S3 bucket. It provides input and output streams as well as directory operations.

Defining new filesystems

Support for additional URI schemes can be added to the FromUri factories by registering a factory for each new URI scheme with RegisterFileSystemFactory(). For the common case where automatic registration is preferred, an instance of FileSystemRegistrar can be defined at namespace scope; it registers a factory whenever the instance is loaded:

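auto kExampleFileSystemModule = ARROW_REGISTER_FILESYSTEM(
  // URI scheme handled by this factory
  "example",
  // Factory invoked for URIs that use the scheme above
  [](const Uri& uri, const io::IOContext& io_context,
      std::string* out_path) -> Result<std::shared_ptr<arrow::fs::FileSystem>> {
    // Perform any required initialization before constructing an instance
    EnsureExampleFileSystemInitialized();
    return std::make_shared<ExampleFileSystem>();
  },
  // Finalizer registered alongside the factory
  &EnsureExampleFileSystemFinalized
);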

If a filesystem implementation requires initialization before any instances may be constructed, this should be included in the corresponding factory or otherwise automatically ensured before the factory is invoked. Likewise, if a filesystem implementation requires teardown before the process ends, this can be wrapped in a function and registered alongside the factory. All finalizers will be called by EnsureFinalized().
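To make the registration mechanism more concrete, here is a simplified, stand-alone model of the pattern in plain C++. It is not Arrow's implementation: every name in it (ToyFileSystem, ToyRegistry, ToyRegistrar, kToyModule, g_toy_initialized) is hypothetical, and it only illustrates how a namespace-scope object can register a factory and a finalizer as a side effect of being constructed.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these names are Arrow APIs.
struct ToyFileSystem {
  std::string scheme;
};

using ToyFactory = std::function<std::shared_ptr<ToyFileSystem>()>;

struct ToyRegistry {
  std::map<std::string, ToyFactory> factories;
  std::vector<std::function<void()>> finalizers;

  // Function-local static: constructed on first use, so registrars defined in
  // other translation units can safely reach it during static initialization.
  static ToyRegistry& Get() {
    static ToyRegistry registry;
    return registry;
  }

  // Runs every registered finalizer once, then forgets them.
  void EnsureFinalized() {
    for (auto& finalizer : finalizers) finalizer();
    finalizers.clear();
  }
};

// Registers a factory, and optionally a finalizer, from its constructor.
struct ToyRegistrar {
  ToyRegistrar(std::string scheme, ToyFactory factory,
               std::function<void()> finalizer = {}) {
    ToyRegistry& registry = ToyRegistry::Get();
    registry.factories.emplace(std::move(scheme), std::move(factory));
    if (finalizer) registry.finalizers.push_back(std::move(finalizer));
  }
};

// 1 while the toy backend is initialized, 0 otherwise.
int g_toy_initialized = 0;

// Namespace-scope instance: registration happens when this object is
// constructed, i.e. when the program or shared library is loaded.
ToyRegistrar kToyModule{
    "toy",
    [] {
      g_toy_initialized = 1;  // initialization is ensured inside the factory
      return std::make_shared<ToyFileSystem>(ToyFileSystem{"toy"});
    },
    [] { g_toy_initialized = 0; }};  // teardown registered with the factory
```

In this model the backend is initialized lazily, the first time the factory runs, and torn down when ToyRegistry::Get().EnsureFinalized() is called. Keeping the registry in a function-local static is one way to avoid depending on the initialization order of namespace-scope objects across translation units.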

Build complexity can be decreased by compartmentalizing a filesystem implementation into a separate shared library, which applications may link or load dynamically. Arrow’s built-in filesystem implementations also follow this pattern. If a shared library containing instances of FileSystemRegistrar must be dynamically loaded, LoadFileSystemFactories() should be used to load it. If such a library might link statically to Arrow, it should have exactly one of its sources #include "arrow/filesystem/filesystem_library.h" in order to ensure the presence of the symbol on which LoadFileSystemFactories() depends.