Input / output and filesystems#
Arrow provides a range of C++ interfaces abstracting the concrete details of input / output operations. They operate on streams of untyped binary data. Those abstractions are used for various purposes such as reading CSV or Parquet data, transmitting IPC streams, and more.
Reading binary data#
Interfaces for reading binary data come in two flavours:
Sequential reading: the InputStream interface provides Read methods; it is recommended to Read to a Buffer as it may in some cases avoid a memory copy.
Random access reading: the RandomAccessFile interface provides additional facilities for positioning and, most importantly, the ReadAt methods which allow parallel reading from multiple threads.
Writing binary data#
Writing binary data is mostly done through the OutputStream interface.
Filesystems#
The filesystem interface allows abstracted access over
various data storage backends such as the local filesystem or an S3 bucket.
It provides input and output streams as well as directory operations.
The filesystem interface exposes a simplified view of the underlying data
storage. Data paths are represented as abstract paths, which are
/-separated, even on Windows, and shouldn’t include special path
components such as "." and "..". Symbolic links, if supported by the
underlying storage, are automatically dereferenced. Only basic
metadata about file entries, such as the file size
and modification time, is made available.
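The path conventions above can be illustrated with a small standalone check. LooksLikeAbstractPath is a hypothetical helper written for this sketch, not an Arrow function, and Arrow's own validation may differ. Treating any backslash as a sign of a native Windows path is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Arrow): returns true if `path` is
// /-separated and contains no "." or ".." components.
bool LooksLikeAbstractPath(const std::string& path) {
  // Assumption: a backslash indicates a native Windows-style path.
  if (path.find('\\') != std::string::npos) return false;

  std::size_t start = 0;
  while (start <= path.size()) {
    std::size_t end = path.find('/', start);
    if (end == std::string::npos) end = path.size();
    const std::string component = path.substr(start, end - start);
    if (component == "." || component == "..") return false;
    start = end + 1;
  }
  return true;
}
```

For instance, "bucket/dir/file.parquet" passes this check, while "dir/../file" and "C:\data\file" do not.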
Tasks that use filesystems will typically run on the I/O thread pool. For filesystems that support high levels of concurrency you may get a benefit from increasing the size of the I/O thread pool.