Two-dimensional Datasets

Columns

class Column

An immutable column data structure consisting of a field (type metadata) and a chunked data array.

Public Functions

Column(const std::shared_ptr<Field> &field, const ArrayVector &chunks)

Construct a column from a vector of arrays.

The array chunks’ datatype must match the field’s datatype.

Column(const std::shared_ptr<Field> &field, const std::shared_ptr<ChunkedArray> &data)

Construct a column from a chunked array.

The chunked array’s datatype must match the field’s datatype.

Column(const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &data)

Construct a column from a single array.

The array’s datatype must match the field’s datatype.

Column(const std::string &name, const std::shared_ptr<Array> &data)

Construct a column from a name and an array.

A field with the given name and the array’s datatype is automatically created.

Column(const std::string &name, const std::shared_ptr<ChunkedArray> &data)

Construct a column from a name and a chunked array.

A field with the given name and the array’s datatype is automatically created.

const std::string &name() const

The column name.

Return

the column’s name, as stored in its field metadata

std::shared_ptr<DataType> type() const

The column type.

Return

the column’s type according to the metadata

std::shared_ptr<ChunkedArray> data() const

The column data as a chunked array.

Return

the column’s data as a chunked logical array

std::shared_ptr<Column> Slice(int64_t offset, int64_t length) const

Construct a zero-copy slice of the column with the indicated offset and length.

Return

a new object wrapped in std::shared_ptr<Column>

Parameters
  • [in] offset: the position of the first element in the constructed slice

  • [in] length: the length of the slice. If there are not enough elements in the column, the length will be adjusted accordingly

std::shared_ptr<Column> Slice(int64_t offset) const

Slice from offset until end of the column.

Status Flatten(MemoryPool *pool, std::vector<std::shared_ptr<Column>> *out) const

Flatten this column as a vector of columns.

Parameters
  • [in] pool: The pool for buffer allocations, if any

  • [out] out: The resulting vector of columns

bool Equals(const Column &other) const

Determine if two columns are equal.

Two columns can be equal only if they have equal datatypes. However, they may be equal even if they have different chunkings.

bool Equals(const std::shared_ptr<Column> &other) const

Determine if the two columns are equal.

Status ValidateData()

Verify that the column’s array data is consistent with its field’s metadata.

Tables

class Table

Logical table as sequence of chunked arrays.

Public Functions

std::shared_ptr<Schema> schema() const

Return the table schema.

virtual std::shared_ptr<Column> column(int i) const = 0

Return a column by index.

virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0

Construct a zero-copy slice of the table with the indicated offset and length.

Return

a new object wrapped in std::shared_ptr<Table>

Parameters
  • [in] offset: the index of the first row in the constructed slice

  • [in] length: the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

std::shared_ptr<Table> Slice(int64_t offset) const

Slice from first row at offset until end of the table.

std::shared_ptr<Column> GetColumnByName(const std::string &name) const

Return a column by name.

Return

a Column, or null if no field with the given name was found

Parameters
  • [in] name: field name

virtual Status RemoveColumn(int i, std::shared_ptr<Table> *out) const = 0

Remove column from the table, producing a new Table.

virtual Status AddColumn(int i, const std::shared_ptr<Column> &column, std::shared_ptr<Table> *out) const = 0

Add column to the table, producing a new Table.

virtual Status SetColumn(int i, const std::shared_ptr<Column> &column, std::shared_ptr<Table> *out) const = 0

Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const

Return names of all columns.

Status RenameColumns(const std::vector<std::string> &names, std::shared_ptr<Table> *out) const

Rename columns with provided names.

virtual std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata> &metadata) const = 0

Replace schema key-value metadata with new metadata (EXPERIMENTAL)

Since

0.5.0

Return

new Table

Parameters
  • [in] metadata: new KeyValueMetadata

virtual Status Flatten(MemoryPool *pool, std::shared_ptr<Table> *out) const = 0

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns.

Parameters
  • [in] pool: The pool for buffer allocations, if any

  • [out] out: The returned table

virtual Status Validate() const = 0

Perform any checks to validate the input arguments.

int num_columns() const

Return the number of columns in the table.

int64_t num_rows() const

Return the number of rows (equal to each column’s logical length)

bool Equals(const Table &other) const

Determine if tables are equal.

Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.

Status CombineChunks(MemoryPool *pool, std::shared_ptr<Table> *out) const

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

Parameters
  • [in] pool: The pool for buffer allocations

  • [out] out: The table with chunks combined

Public Static Functions

static std::shared_ptr<Table> Make(const std::shared_ptr<Schema> &schema, const std::vector<std::shared_ptr<Column>> &columns, int64_t num_rows = -1)

Construct a Table from a schema and columns. If columns is zero-length, the table’s number of rows is zero.

Parameters
  • schema: The table schema (column types)

  • columns: The table’s columns

  • num_rows: number of rows in table, -1 (default) to infer from columns

static std::shared_ptr<Table> Make(const std::vector<std::shared_ptr<Column>> &columns, int64_t num_rows = -1)

Construct a Table from columns; the schema is assembled from the column fields. If columns is zero-length, the table’s number of rows is zero.

Parameters
  • columns: The table’s columns

  • num_rows: number of rows in table, -1 (default) to infer from columns

static std::shared_ptr<Table> Make(const std::shared_ptr<Schema> &schema, const std::vector<std::shared_ptr<Array>> &arrays, int64_t num_rows = -1)

Construct a Table from schema and arrays.

Parameters
  • schema: The table schema (column types)

  • arrays: The table’s columns as arrays

  • num_rows: number of rows in table, -1 (default) to infer from columns

static Status FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>> &batches, std::shared_ptr<Table> *table)

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Return

Status (Status::Invalid if there is some problem)

Parameters
  • [in] batches: a std::vector of record batches

  • [out] table: the returned table

static Status FromRecordBatches(const std::shared_ptr<Schema> &schema, const std::vector<std::shared_ptr<RecordBatch>> &batches, std::shared_ptr<Table> *table)

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches.

Return

Status

Parameters
  • [in] schema: the arrow::Schema for each batch

  • [in] batches: a std::vector of record batches

  • [out] table: the returned table

static Status FromChunkedStructArray(const std::shared_ptr<ChunkedArray> &array, std::shared_ptr<Table> *table)

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Return

Status

Parameters
  • [in] array: a chunked StructArray

  • [out] table: the returned table

Status arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>> &tables, std::shared_ptr<Table> *table)

Construct table from multiple input tables.

The tables are concatenated vertically. Therefore, all tables should have the same schema. Each column in the output table is the result of concatenating the corresponding columns in all input tables.

Record Batches

class RecordBatch

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.

Public Functions

bool Equals(const RecordBatch &other) const

Determine if two record batches are exactly equal.

Return

true if batches are equal

bool ApproxEquals(const RecordBatch &other) const

Determine if two record batches are approximately equal.

std::shared_ptr<Schema> schema() const

Return

the record batch’s schema

virtual std::shared_ptr<Array> column(int i) const = 0

Retrieve an array from the record batch.

Return

an Array object

Parameters
  • [in] i: field index, does not boundscheck

std::shared_ptr<Array> GetColumnByName(const std::string &name) const

Retrieve an array from the record batch.

Return

an Array or null if no field was found

Parameters
  • [in] name: field name

virtual std::shared_ptr<ArrayData> column_data(int i) const = 0

Retrieve an array’s internal data from the record batch.

Return

an internal ArrayData object

Parameters
  • [in] i: field index, does not boundscheck

virtual Status AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column, std::shared_ptr<RecordBatch> *out) const = 0

Add column to the record batch, producing a new RecordBatch.

Parameters
  • [in] i: field index, which will be boundschecked

  • [in] field: field to be added

  • [in] column: column to be added

  • [out] out: record batch with column added

virtual Status AddColumn(int i, const std::string &field_name, const std::shared_ptr<Array> &column, std::shared_ptr<RecordBatch> *out) const

Add new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters
  • [in] i: field index, which will be boundschecked

  • [in] field_name: name of field to be added

  • [in] column: column to be added

  • [out] out: record batch with column added

virtual Status RemoveColumn(int i, std::shared_ptr<RecordBatch> *out) const = 0

Remove column from the record batch, producing a new RecordBatch.

Parameters
  • [in] i: field index, does boundscheck

  • [out] out: record batch with column removed

const std::string &column_name(int i) const

Name of the i-th column.

int num_columns() const

Return

the number of columns in the table

int64_t num_rows() const

Return

the number of rows (the corresponding length of each column)

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const

Slice each of the arrays in the record batch.

Return

new record batch

Parameters
  • [in] offset: the starting offset to slice, through end of batch

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0

Slice each of the arrays in the record batch.

Return

new record batch

Parameters
  • [in] offset: the starting offset to slice

  • [in] length: the number of elements to slice from offset

virtual Status Validate() const

Check for schema or length inconsistencies.

Return

Status

Public Static Functions

static std::shared_ptr<RecordBatch> Make(const std::shared_ptr<Schema> &schema, int64_t num_rows, const std::vector<std::shared_ptr<Array>> &columns)

Parameters
  • [in] schema: The record batch schema

  • [in] num_rows: length of fields in the record batch. Each array should have the same length as num_rows

  • [in] columns: the record batch fields as vector of arrays

static std::shared_ptr<RecordBatch> Make(const std::shared_ptr<Schema> &schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> &&columns)

Move-based constructor for a vector of Array instances.

static std::shared_ptr<RecordBatch> Make(const std::shared_ptr<Schema> &schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> &&columns)

Construct record batch from vector of internal data structures.

This class is only provided with an rvalue-reference for the input data, and is intended for internal use, or advanced users.

Since

0.5.0

Parameters
  • schema: the record batch schema

  • num_rows: the number of semantic rows in the record batch. This should be equal to the length of each field

  • columns: the data for the batch’s columns

static std::shared_ptr<RecordBatch> Make(const std::shared_ptr<Schema> &schema, int64_t num_rows, const std::vector<std::shared_ptr<ArrayData>> &columns)

Construct record batch by copying vector of array data.

Since

0.5.0

class RecordBatchReader

Abstract interface for reading stream of record batches.

Subclassed by arrow::ipc::RecordBatchStreamReader, arrow::TableBatchReader

Public Functions

virtual std::shared_ptr<Schema> schema() const = 0

Return

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0

Read the next record batch in the stream.

Sets *batch to null when the end of the stream is reached.

Return

Status

Parameters
  • [out] batch: the next loaded batch, null at end of stream

Status ReadAll(std::vector<std::shared_ptr<RecordBatch>> *batches)

Consume entire stream as a vector of record batches.

Status ReadAll(std::shared_ptr<Table> *table)

Read all batches and concatenate as arrow::Table.

class TableBatchReader : public arrow::RecordBatchReader

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

Public Functions

TableBatchReader(const Table &table)

Construct a TableBatchReader for the given table.

std::shared_ptr<Schema> schema() const

Return

the shared schema of the record batches in the stream

Status ReadNext(std::shared_ptr<RecordBatch> *batch)

Read the next record batch in the stream.

Sets *batch to null when the end of the stream is reached.

Return

Status

Parameters
  • [out] batch: the next loaded batch, null at end of stream

void set_chunksize(int64_t chunksize)

Set the desired maximum chunk size of record batches.

The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.