Tabular Data¶
See also
While arrays and chunked arrays represent a one-dimensional sequence of homogeneous values, data often comes in the form of two-dimensional sets of heterogeneous data (such as database tables, CSV files…). Arrow provides several abstractions to handle such data conveniently and efficiently.
Fields¶
Fields are used to denote the particular columns of a table (and also
the particular members of a nested data type such as arrow::StructType
).
A field, i.e. an instance of arrow::Field
, holds together a data
type, a field name and some optional metadata.
The recommended way to create a field is to call the arrow::field()
factory function.
Schemas¶
A schema describes the overall structure of a two-dimensional dataset such
as a table. It holds a sequence of fields together with some optional
schema-wide metadata (in addition to per-field metadata). The recommended
way to create a schema is to call one the arrow::schema()
factory
function overloads:
// Create a schema describing datasets with two columns:
// a int32 column "A" and a utf8-encoded string column "B"
std::shared_ptr<arrow::Field> field_a, field_b;
std::shared_ptr<arrow::Schema> schema;
field_a = arrow::field("A", arrow::int32());
field_b = arrow::field("B", arrow::utf8());
schema = arrow::schema({field_a, field_b});
Tables¶
A arrow::Table
is a two-dimensional dataset with chunked arrays for
columns, together with a schema providing field names. Also, each chunked
column must have the same logical length in number of elements (although each
column can be chunked in a different way).
Record Batches¶
A arrow::RecordBatch
is a two-dimensional dataset of a number of
contiguous arrays, each the same length. Like a table, a record batch also
has a schema which must match its arrays’ datatypes.
Record batches are a convenient unit of work for various serialization and computation functions, possibly incremental.
Record batches can be sent between implementations, such as via IPC or via the C Data Interface. Tables and chunked arrays, on the other hand, are concepts in the C++ implementation, not in the Arrow format itself, so they aren’t directly portable.
However, a table can be converted to and built from a sequence of record
batches easily without needing to copy the underlying array buffers.
A table can be streamed as an arbitrary number of record batches using
a arrow::TableBatchReader
. Conversely, a logical sequence of
record batches can be assembled to form a table using one of the
arrow::Table::FromRecordBatches()
factory function overloads.