High-Level Overview#

The Arrow C++ library is comprised of different parts, each of which serves a specific purpose.

The physical layer#

Memory management abstractions provide a uniform API over memory that may be allocated through various means, such as heap allocation, the memory mapping of a file or a static memory area. In particular, the buffer abstraction represents a contiguous area of physical data.

The one-dimensional layer#

Data types govern the logical interpretation of physical data. Many operations in Arrow are parametered, at compile-time or at runtime, by a data type.

Arrays assemble one or several buffers with a data type, allowing to view them as a logical contiguous sequence of values (possibly nested).

Chunked arrays are a generalization of arrays, comprising several same-type arrays into a longer logical sequence of values.

The two-dimensional layer#

Schemas describe a logical collection of several pieces of data, each with a distinct name and type, and optional metadata.

Tables are collections of chunked array in accordance to a schema. They are the most capable dataset-providing abstraction in Arrow.

Record batches are collections of contiguous arrays, described by a schema. They allow incremental construction or serialization of tables.

The compute layer#

Datums are flexible dataset references, able to hold for example an array or table reference.

Kernels are specialized computation functions running in a loop over a given set of datums representing input and output parameters to the functions.

Acero (pronounced [aˈsɜɹo] / ah-SERR-oh) is a streaming execution engine that allows computation to be expressed as a graph of operators which can transform streams of data.

The IO layer#

Streams allow untyped sequential or seekable access over external data of various kinds (for example compressed or memory-mapped).

The Inter-Process Communication (IPC) layer#

A messaging format allows interchange of Arrow data between processes, using as few copies as possible.

The file formats layer#

Reading and writing Arrow data from/to various file formats is possible, for example Parquet, CSV, Orc or the Arrow-specific Feather format.

The devices layer#

Basic CUDA integration is provided, allowing to describe Arrow data backed by GPU-allocated memory.

The filesystem layer#

A filesystem abstraction allows reading and writing data from different storage backends, such as the local filesystem or a S3 bucket.