High-Level Overview

The Arrow C++ library is comprised of different parts, each of which serves a specific purpose.

The physical layer

Memory management abstractions provide a uniform API over memory that may be allocated through various means, such as heap allocation, the memory mapping of a file or a static memory area. In particular, the buffer abstraction represents a contiguous area of physical data.

The one-dimensional layer

Data types govern the logical interpretation of physical data. Many operations in Arrow are parametered, at compile-time or at runtime, by a data type.

Arrays assemble one or several buffers with a data type, allowing to view them as a logical contiguous sequence of values (possibly nested).

Chunked arrays are a generalization of arrays, comprising several same-type arrays into a longer logical sequence of values.

The two-dimensional layer

Schemas describe a logical collection of several pieces of data, each with a distinct name and type, and optional metadata.

Tables are collections of chunked array in accordance to a schema. They are the most capable dataset-providing abstraction in Arrow.

Record batches are collections of contiguous arrays, described by a schema. They allow incremental construction or serialization of tables.

The compute layer

Datums are flexible dataset references, able to hold for example an array or table reference.

Kernels are specialized computation functions running in a loop over a given set of datums representing input and output parameters to the functions.

Acero (pronounced [aˈsɜɹo] / ah-SERR-oh) is a streaming execution engine that allows computation to be expressed as a graph of operators which can transform streams of data.

The IO layer

Streams allow untyped sequential or seekable access over external data of various kinds (for example compressed or memory-mapped).

The Inter-Process Communication (IPC) layer

A messaging format allows interchange of Arrow data between processes, using as few copies as possible.

The file formats layer

Reading and writing Arrow data from/to various file formats is possible, for example Parquet, CSV, Orc or the Arrow-specific Feather format.

The devices layer

Basic CUDA integration is provided, allowing to describe Arrow data backed by GPU-allocated memory.

The filesystem layer

A filesystem abstraction allows reading and writing data from different storage backends, such as the local filesystem or a S3 bucket.