Tabular Data#

While arrays and chunked arrays represent a one-dimensional sequence of homogeneous values, data often comes in the form of two-dimensional sets of heterogeneous data (such as database tables, CSV files…). Arrow provides several abstractions to handle such data conveniently and efficiently.

Fields#

Fields are used to denote the particular columns of a table (and also the particular members of a nested data type such as arrow::StructType). A field, i.e. an instance of arrow::Field, holds together a data type, a field name and some optional metadata.

The recommended way to create a field is to call the arrow::field() factory function.

Schemas#

A schema describes the overall structure of a two-dimensional dataset such as a table. It holds a sequence of fields together with some optional schema-wide metadata (in addition to per-field metadata). The recommended way to create a schema is to call one the arrow::schema() factory function overloads:

// Create a schema describing datasets with two columns:
// a int32 column "A" and a utf8-encoded string column "B"
std::shared_ptr<arrow::Field> field_a, field_b;
std::shared_ptr<arrow::Schema> schema;

field_a = arrow::field("A", arrow::int32());
field_b = arrow::field("B", arrow::utf8());
schema = arrow::schema({field_a, field_b});

Tables#

A arrow::Table is a two-dimensional dataset with chunked arrays for columns, together with a schema providing field names. Also, each chunked column must have the same logical length in number of elements (although each column can be chunked in a different way).

Record Batches#

A arrow::RecordBatch is a two-dimensional dataset of a number of contiguous arrays, each the same length. Like a table, a record batch also has a schema which must match its arrays’ datatypes.

Record batches are a convenient unit of work for various serialization and computation functions, possibly incremental.

A graphical representation of an Arrow Table and a Record Batch, with structure as described in text above.

Record batches can be sent between implementations, such as via IPC or via the C Data Interface. Tables and chunked arrays, on the other hand, are concepts in the C++ implementation, not in the Arrow format itself, so they aren’t directly portable.

However, a table can be converted to and built from a sequence of record batches easily without needing to copy the underlying array buffers. A table can be streamed as an arbitrary number of record batches using a arrow::TableBatchReader. Conversely, a logical sequence of record batches can be assembled to form a table using one of the arrow::Table::FromRecordBatches() factory function overloads.