Two-dimensional Datasets#

Record Batches#

class arrow::RecordBatch#

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.

Public Functions

Result<std::shared_ptr<StructArray>> ToStructArray() const#

Convert record batch to struct array.

Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.

bool Equals(const RecordBatch &other, bool check_metadata = false) const#

Determine if two record batches are exactly equal.

Parameters
  • other[in] the RecordBatch to compare with

  • check_metadata[in] if true, check that Schema metadata is the same

Returns

true if batches are equal

bool ApproxEquals(const RecordBatch &other) const#

Determine if two record batches are approximately equal.

inline const std::shared_ptr<Schema> &schema() const#

Returns

the record batch’s schema

virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0#

Retrieve all columns at once.

virtual std::shared_ptr<Array> column(int i) const = 0#

Retrieve an array from the record batch.

Parameters

i[in] field index, does not boundscheck

Returns

an Array object

std::shared_ptr<Array> GetColumnByName(const std::string &name) const#

Retrieve an array from the record batch.

Parameters

name[in] field name

Returns

an Array or null if no field was found

virtual std::shared_ptr<ArrayData> column_data(int i) const = 0#

Retrieve an array’s internal data from the record batch.

Parameters

i[in] field index, does not boundscheck

Returns

an internal ArrayData object

virtual const ArrayDataVector &column_data() const = 0#

Retrieve all arrays’ internal data from the record batch.

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#

Add column to the record batch, producing a new RecordBatch.

Parameters
  • i[in] field index, which will be boundschecked

  • field[in] field to be added

  • column[in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array> &column) const#

Add new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters
  • i[in] field index, which will be boundschecked

  • field_name[in] name of field to be added

  • column[in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> SetColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#

Replace a column in the record batch, producing a new RecordBatch.

Parameters
  • i[in] field index, does boundscheck

  • field[in] the new field that replaces the existing one

  • column[in] the new column that replaces the existing one

virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0#

Remove column from the record batch, producing a new RecordBatch.

Parameters

i[in] field index, does boundscheck

const std::string &column_name(int i) const#

Name of the i-th column.

int num_columns() const#

Returns

the number of columns in the record batch

inline int64_t num_rows() const#

Returns

the number of rows (the corresponding length of each column)

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const#

Slice each of the arrays in the record batch.

Parameters

offset[in] the starting offset to slice, through end of batch

Returns

new record batch

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0#

Slice each of the arrays in the record batch.

Parameters
  • offset[in] the starting offset to slice

  • length[in] the number of elements to slice from offset

Returns

new record batch

std::string ToString() const#

Returns

PrettyPrint representation suitable for debugging

Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const#

Return new record batch with specified columns.

virtual Status Validate() const#

Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.

This is O(k) where k is the total number of fields and array descendants.

Returns

Status

virtual Status ValidateFull() const#

Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.

This is potentially O(k*n) where n is the number of rows.

Returns

Status

Public Static Functions

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns)#

Parameters
  • schema[in] The record batch schema

  • num_rows[in] length of fields in the record batch. Each array should have the same length as num_rows

  • columns[in] the record batch fields as vector of arrays

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns)#

Construct record batch from vector of internal data structures.

This overload is intended for internal use or for advanced users.

Since

0.5.0

Parameters
  • schema – the record batch schema

  • num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field

  • columns – the data for the batch’s columns

static Result<std::shared_ptr<RecordBatch>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#

Create an empty RecordBatch of a given schema.

The output RecordBatch will be created with DataTypes from the given schema.

Parameters
  • schema[in] the schema of the empty RecordBatch

  • pool[in] the memory pool to allocate memory from

Returns

the resulting RecordBatch

static Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array> &array)#

Construct record batch from struct array.

This constructs a record batch using the child arrays of the given array, which must be a struct array. Note that the struct array’s own null bitmap is not reflected in the resulting record batch.

class arrow::RecordBatchReader#

Abstract interface for reading stream of record batches.

Subclassed by arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::py::PyRecordBatchReader, arrow::TableBatchReader

Public Functions

virtual std::shared_ptr<Schema> schema() const = 0#

Returns

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0#

Read the next record batch in the stream.

Sets batch to null when the end of the stream is reached.

Parameters

batch[out] the next loaded batch, null at end of stream

Returns

Status

inline Result<std::shared_ptr<RecordBatch>> Next()#

Iterator interface.

inline RecordBatchReaderIterator begin()#

Return an iterator to the first record batch in the stream.

inline RecordBatchReaderIterator end()#

Return an iterator to the end of the stream.

Result<RecordBatchVector> ToRecordBatches()#

Consume entire stream as a vector of record batches.

Result<std::shared_ptr<Table>> ToTable()#

Read all batches and concatenate as arrow::Table.

Public Static Functions

static Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR)#

Create a RecordBatchReader from a vector of RecordBatch.

Parameters
  • batches[in] the vector of RecordBatch to read from

  • schema[in] schema to conform to. Will be inferred from the first element if not provided.

class RecordBatchReaderIterator#

class arrow::TableBatchReader : public arrow::RecordBatchReader#

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

Public Functions

explicit TableBatchReader(const Table &table)#

Construct a TableBatchReader for the given table.

virtual std::shared_ptr<Schema> schema() const override#

Returns

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *out) override#

Read the next record batch in the stream.

Sets out to null when the end of the stream is reached.

Parameters

out[out] the next loaded batch, null at end of stream

Returns

Status

void set_chunksize(int64_t chunksize)#

Set the desired maximum chunk size of record batches.

The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.

Tables#

class arrow::Table#

Logical table as sequence of chunked arrays.

Public Functions

inline const std::shared_ptr<Schema> &schema() const#

Return the table schema.

virtual std::shared_ptr<ChunkedArray> column(int i) const = 0#

Return a column by index.

virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0#

Return vector of all columns for table.

inline std::shared_ptr<Field> field(int i) const#

Return a column’s field by index.

std::vector<std::shared_ptr<Field>> fields() const#

Return vector of all fields for table.

virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0#

Construct a zero-copy slice of the table with the indicated offset and length.

Parameters
  • offset[in] the index of the first row in the constructed slice

  • length[in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

Returns

a new object wrapped in std::shared_ptr<Table>

inline std::shared_ptr<Table> Slice(int64_t offset) const#

Slice from first row at offset until end of the table.

inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const#

Return a column by name.

Parameters

name[in] field name

Returns

a ChunkedArray, or null if no field was found

virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0#

Remove column from the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> AddColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0#

Add column to the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> SetColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0#

Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const#

Return names of all columns.

Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const#

Rename columns with provided names.

Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const#

Return new table with specified columns.

virtual std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata> &metadata) const = 0#

Replace schema key-value metadata with new metadata.

Since

0.5.0

Parameters

metadata[in] new KeyValueMetadata

Returns

new Table

virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0#

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns.

Parameters

pool[in] The pool for buffer allocations, if any

std::string ToString() const#

Returns

PrettyPrint representation suitable for debugging

virtual Status Validate() const = 0#

Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.

This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.

Returns

Status

virtual Status ValidateFull() const = 0#

Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.

This is O(k*n) where k is the total number of field descendants, and n is the number of rows.

Returns

Status

inline int num_columns() const#

Return the number of columns in the table.

inline int64_t num_rows() const#

Return the number of rows (equal to each column’s logical length).

bool Equals(const Table &other, bool check_metadata = false) const#

Determine if tables are equal.

Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.

Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const#

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

Parameters

pool[in] The pool for buffer allocations

Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool *pool = default_memory_pool()) const#

Make a new record batch by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk.

Parameters

pool[in] The pool for buffer allocations

Public Static Functions

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns, int64_t num_rows = -1)#

Construct a Table from schema and columns.

If columns is zero-length, the table’s number of rows is zero.

Parameters
  • schema[in] The table schema (column types)

  • columns[in] The table’s columns as chunked arrays

  • num_rows[in] number of rows in table, -1 (default) to infer from columns

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<Array>> &arrays, int64_t num_rows = -1)#

Construct a Table from schema and arrays.

Parameters
  • schema[in] The table schema (column types)

  • arrays[in] The table’s columns as arrays

  • num_rows[in] number of rows in table, -1 (default) to infer from columns

static Result<std::shared_ptr<Table>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#

Create an empty Table of a given schema.

The output Table will be created with a single empty chunk per column.

Parameters
  • schema[in] the schema of the empty Table

  • pool[in] the memory pool to allocate memory from

Returns

the resulting Table

static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)#

Construct a Table from a RecordBatchReader.

Parameters

reader[in] the RecordBatchReader to read the record batches from

static Result<std::shared_ptr<Table>> FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>> &batches)#

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Parameters

batches[in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromRecordBatches(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<RecordBatch>> &batches)#

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches.

Parameters
  • schema[in] the arrow::Schema for each batch

  • batches[in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromChunkedStructArray(const std::shared_ptr<ChunkedArray> &array)#

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Parameters

array[in] a chunked StructArray

Result<std::shared_ptr<Table>> arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>> &tables, ConcatenateTablesOptions options = ConcatenateTablesOptions::Defaults(), MemoryPool *memory_pool = default_memory_pool())#

Construct table from multiple input tables.