High-Level Overview
The Arrow C++ library consists of several parts, each of which serves a specific purpose.
The physical layer
Memory management abstractions provide a uniform API over memory that may be allocated through various means, such as heap allocation, memory-mapping a file, or a static memory area. In particular, the buffer abstraction represents a contiguous area of physical data.
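To make the idea of a uniform view over differently allocated memory concrete, here is a minimal sketch using only the C++ standard library. `ToyBuffer`, `FromHeap`, `FromStatic`, `Slice` and `SumBytes` are hypothetical names invented for this illustration; they are not Arrow's API, and Arrow's real buffer classes are considerably richer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a buffer: a pointer and a size over contiguous bytes.
// Hypothetical type for illustration only, not an Arrow class.
struct ToyBuffer {
  const std::uint8_t* data;
  std::size_t size;
  std::shared_ptr<const void> owner;  // keeps heap storage alive; empty for static memory
};

// Wraps heap-allocated bytes; the buffer shares ownership of the storage.
inline ToyBuffer FromHeap(std::vector<std::uint8_t> bytes) {
  auto storage =
      std::make_shared<const std::vector<std::uint8_t>>(std::move(bytes));
  return ToyBuffer{storage->data(), storage->size(), storage};
}

// Wraps memory that outlives the program's use of it (for example a static array).
inline ToyBuffer FromStatic(const std::uint8_t* data, std::size_t size) {
  return ToyBuffer{data, size, nullptr};
}

// Zero-copy slice: a new view over the same bytes, sharing the owner.
inline ToyBuffer Slice(const ToyBuffer& buf, std::size_t offset,
                       std::size_t length) {
  assert(offset <= buf.size && length <= buf.size - offset);
  return ToyBuffer{buf.data + offset, length, buf.owner};
}

// Code written against ToyBuffer does not need to know where the bytes came from.
inline std::uint64_t SumBytes(const ToyBuffer& buf) {
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < buf.size; ++i) total += buf.data[i];
  return total;
}
```

In this sketch, `SumBytes` works identically on a heap-backed buffer, a static one, or a slice of either, which is the property the paragraph above describes.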
The one-dimensional layer
Data types govern the logical interpretation of physical data. Many operations in Arrow are parameterized by a data type, either at compile time or at runtime.
Arrays assemble one or several buffers with a data type, allowing them to be viewed as a logically contiguous sequence of values (possibly nested).
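As a rough illustration of how buffers plus a data type yield a logical sequence, the sketch below models a nullable 32-bit integer array as a validity bitmap and a values buffer. `ToyInt32Array`, `MakeExample` and `NullCount` are hypothetical names for this example, not Arrow classes; the least-significant-bit-first bitmap layout follows the convention of the Arrow columnar format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy nullable int32 array: two flat buffers and a length.
// Hypothetical type for illustration only, not an Arrow class.
struct ToyInt32Array {
  std::vector<std::uint8_t> validity;  // one bit per value, least-significant bit first
  std::vector<std::int32_t> values;    // one slot per value; null slots hold arbitrary data
  std::size_t length;

  // True when the value at logical index i is not null.
  bool IsValid(std::size_t i) const {
    assert(i < length);
    return ((validity[i / 8] >> (i % 8)) & 1) != 0;
  }

  // Raw value at logical index i; only meaningful when IsValid(i) is true.
  std::int32_t Value(std::size_t i) const {
    assert(i < length);
    return values[i];
  }
};

// Builds the logical sequence [10, null, 30, 40].
inline ToyInt32Array MakeExample() {
  ToyInt32Array a;
  a.validity = {0x0D};  // binary 00001101: bits 0, 2 and 3 are set
  a.values = {10, 0, 30, 40};
  a.length = 4;
  return a;
}

// Counts null entries by scanning the validity bitmap.
inline std::size_t NullCount(const ToyInt32Array& a) {
  std::size_t nulls = 0;
  for (std::size_t i = 0; i < a.length; ++i) {
    if (!a.IsValid(i)) ++nulls;
  }
  return nulls;
}
```

The same two buffers could be reinterpreted under a different data type; it is the type that decides how the bytes are read.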
Chunked arrays are a generalization of arrays, combining several arrays of the same type into a longer logical sequence of values.
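The sketch below shows one way a logical index over such a sequence could be resolved to a chunk and an offset within it. `ToyChunkedArray` and `Resolve` are hypothetical names, and the linear scan is chosen for brevity; it is not a description of how Arrow itself performs the lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Chunk = std::vector<std::int64_t>;

// Toy chunked array: several same-type chunks viewed as one logical sequence.
// Hypothetical type for illustration only, not an Arrow class.
struct ToyChunkedArray {
  std::vector<Chunk> chunks;

  // Total number of values across all chunks.
  std::size_t length() const {
    std::size_t n = 0;
    for (const Chunk& c : chunks) n += c.size();
    return n;
  }

  // Maps a logical index to (chunk index, offset within that chunk).
  // Empty chunks are skipped naturally by the loop condition.
  std::pair<std::size_t, std::size_t> Resolve(std::size_t i) const {
    assert(i < length());
    std::size_t chunk = 0;
    while (i >= chunks[chunk].size()) {
      i -= chunks[chunk].size();
      ++chunk;
    }
    return {chunk, i};
  }

  // Value at a logical index, regardless of which chunk stores it.
  std::int64_t Value(std::size_t i) const {
    const std::pair<std::size_t, std::size_t> pos = Resolve(i);
    return chunks[pos.first][pos.second];
  }
};
```

For chunks `[1, 2, 3]`, `[]` and `[4, 5]`, logical index 3 resolves to chunk 2 at offset 0. No data is copied to form the longer sequence.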
The two-dimensional layer
Schemas describe a logical collection of several pieces of data, each with a distinct name and type, and optional metadata.
Tables are collections of chunked arrays that conform to a schema. They are the most capable dataset-providing abstraction in Arrow.
Record batches are collections of contiguous arrays, described by a schema. They allow incremental construction or serialization of tables.
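The relationship between schemas, record batches and tables can be sketched as follows: each record batch appended to a table contributes one contiguous chunk per column, provided its schema matches. All names here (`ToyField`, `ToyRecordBatch`, `ToyTable`, `AppendBatch`) are hypothetical, and every column stores `double` values purely to keep the example short; the `type` string is only a label.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy schema field: a name and a type label. Not an Arrow class.
struct ToyField {
  std::string name;
  std::string type;  // label only; every toy column below stores doubles
};

inline bool operator==(const ToyField& a, const ToyField& b) {
  return a.name == b.name && a.type == b.type;
}

using ToySchema = std::vector<ToyField>;
using ToyColumn = std::vector<double>;

// One contiguous column per schema field, all of equal length.
struct ToyRecordBatch {
  ToySchema schema;
  std::vector<ToyColumn> columns;
};

// columns[c] is the list of chunks for field c (a toy chunked array).
struct ToyTable {
  ToySchema schema;
  std::vector<std::vector<ToyColumn>> columns;
  std::size_t num_rows = 0;
};

// A batch is well formed when it has one column per field and equal lengths.
inline bool IsWellFormed(const ToyRecordBatch& batch) {
  if (batch.columns.size() != batch.schema.size()) return false;
  for (const ToyColumn& col : batch.columns) {
    if (col.size() != batch.columns.front().size()) return false;
  }
  return true;
}

// Appends a batch as one new chunk per column; rejects mismatched schemas.
inline bool AppendBatch(ToyTable& table, const ToyRecordBatch& batch) {
  if (!IsWellFormed(batch) || !(batch.schema == table.schema)) return false;
  if (table.columns.empty()) table.columns.resize(table.schema.size());
  assert(table.columns.size() == table.schema.size());
  for (std::size_t c = 0; c < batch.columns.size(); ++c) {
    table.columns[c].push_back(batch.columns[c]);
  }
  if (!batch.columns.empty()) table.num_rows += batch.columns.front().size();
  return true;
}
```

Building a table this way is incremental: batches can arrive one at a time, and the table only grows its per-column chunk lists.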
The compute layer
Datums are flexible dataset references, able to hold, for example, an array or a table reference.
Kernels are specialized computation functions running in a loop over a given set of datums representing input and output parameters to the functions.
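A minimal sketch of the datum and kernel ideas, under simplifying assumptions: the toy datum holds either a scalar or an array of `double`, and the toy kernel adds two datums element-wise, broadcasting a scalar against an array. `ToyDatum` and `AddKernel` are hypothetical names and do not mirror Arrow's compute API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy datum: either a single scalar or an array of doubles.
// Hypothetical type for illustration only, not an Arrow class.
struct ToyDatum {
  bool is_scalar;
  double scalar;
  std::vector<double> array;

  static ToyDatum Scalar(double v) { return ToyDatum{true, v, {}}; }
  static ToyDatum Array(std::vector<double> v) {
    return ToyDatum{false, 0.0, std::move(v)};
  }

  // Value seen at position i; a scalar yields the same value everywhere.
  double At(std::size_t i) const { return is_scalar ? scalar : array[i]; }
};

// Toy kernel: a loop over the input datums that fills the output datum.
inline ToyDatum AddKernel(const ToyDatum& a, const ToyDatum& b) {
  if (a.is_scalar && b.is_scalar) return ToyDatum::Scalar(a.scalar + b.scalar);
  // Two array inputs must have the same length.
  assert(a.is_scalar || b.is_scalar || a.array.size() == b.array.size());
  const std::size_t n = a.is_scalar ? b.array.size() : a.array.size();
  std::vector<double> out(n);
  for (std::size_t i = 0; i < n; ++i) out[i] = a.At(i) + b.At(i);
  return ToyDatum::Array(std::move(out));
}
```

Because the kernel only talks to the datum interface, the same loop serves array plus array, array plus scalar, and scalar plus scalar inputs.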
Acero (pronounced [aˈsɜɹo] / ah-SERR-oh) is a streaming execution engine that allows computation to be expressed as a graph of operators which can transform streams of data.
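The notion of a graph of operators transforming streams of data can be sketched with a tiny linear pipeline in which a source pushes batches through a filter and a projection into a sink. This is only a conceptual illustration built on `std::function`; the names (`Consumer`, `FilterEven`, `MultiplyBy`, `RunSource`, `RunExamplePlan`) are hypothetical, and nothing here reflects Acero's actual API, scheduling or threading.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Batch = std::vector<int>;
using Consumer = std::function<void(const Batch&)>;

// Operator: keeps only even values, then forwards the batch downstream.
inline Consumer FilterEven(Consumer downstream) {
  return [downstream](const Batch& in) {
    Batch out;
    for (int v : in) {
      if (v % 2 == 0) out.push_back(v);
    }
    downstream(out);
  };
}

// Operator: multiplies every value by a factor, then forwards downstream.
inline Consumer MultiplyBy(int factor, Consumer downstream) {
  return [factor, downstream](const Batch& in) {
    Batch out;
    for (int v : in) out.push_back(v * factor);
    downstream(out);
  };
}

// Source: pushes batches one at a time into the first operator, so the
// whole input never has to be processed as a single unit.
inline void RunSource(const std::vector<Batch>& batches, const Consumer& first) {
  assert(static_cast<bool>(first));  // the plan must start with a callable operator
  for (const Batch& b : batches) first(b);
}

// Wires source -> filter -> multiply -> sink and returns what the sink collected.
inline std::vector<int> RunExamplePlan(const std::vector<Batch>& batches) {
  std::vector<int> collected;
  Consumer sink = [&collected](const Batch& b) {
    collected.insert(collected.end(), b.begin(), b.end());
  };
  Consumer plan = FilterEven(MultiplyBy(10, sink));
  RunSource(batches, plan);
  return collected;
}
```

Each operator sees one batch at a time and holds no reference to the full dataset, which is what makes the pipeline streaming rather than materializing.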
The IO layer
Streams allow untyped sequential or seekable access over external data of various kinds (for example compressed or memory-mapped).
The Inter-Process Communication (IPC) layer
A messaging format allows interchange of Arrow data between processes, using as few copies as possible.
The file formats layer
Arrow data can be read from and written to various file formats, for example Parquet, CSV, ORC, or the Arrow-specific Feather format.
The devices layer
Basic CUDA integration is provided, making it possible to describe Arrow data backed by GPU-allocated memory.
The filesystem layer
A filesystem abstraction allows reading and writing data from different storage backends, such as the local filesystem or an S3 bucket.