Reading/writing columnar storage formats
Many Arrow libraries provide convenient methods for reading and writing columnar file formats, including the Arrow IPC file format (“Feather”) and the Apache Parquet format.
In addition to single-file readers, some libraries (C++, Python, R) support reading entire directories of files and treating them as a single dataset. These datasets may be on the local file system or on a remote storage system, such as HDFS, S3, etc.
Sharing memory locally
Arrow IPC files can be memory-mapped locally, which allows you to work with data bigger than memory and to share data across languages and processes.
The Arrow project includes Plasma, a shared-memory object store written in C++ and exposed in Python. Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries.
The Arrow format also defines a C data interface, which allows zero-copy data sharing inside a single process without any build-time or link-time dependency requirements. This allows, for example, R users to access pyarrow-based projects using the reticulate package.
Moving data over the network
The Arrow format allows serializing and shipping columnar data over the network, or any other streaming transport. Apache Spark uses Arrow as a data interchange format, and both PySpark and sparklyr can take advantage of Arrow for significant performance gains when transferring data. Google BigQuery, TensorFlow, AWS Athena, and other systems use Arrow for data interchange as well.
The Arrow project also defines Flight, a client-server RPC framework to build rich services exchanging data according to application-defined semantics.
In-memory data structure for analytics
The Arrow format is designed to enable fast computation, and some projects have begun to take advantage of that design. Within the Apache Arrow project, DataFusion is a query engine written in Rust that uses Arrow as its in-memory data format.