Two-dimensional Datasets#

Record Batches#

class RecordBatch#

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each of which is a contiguous Arrow array.

Public Functions

Result<std::shared_ptr<StructArray>> ToStructArray() const#

Convert record batch to struct array.

Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.
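A minimal sketch of the conversion; the wrapper function and the printing are illustrative, not part of the API:

```cpp
#include <arrow/api.h>
#include <iostream>

// Sketch: flatten a record batch into a single StructArray.
arrow::Status BatchToStruct(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // Each column of the batch becomes a child array of the struct.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::StructArray> as_struct,
                        batch->ToStructArray());
  // The struct type mirrors the batch's schema fields (top-level field
  // metadata is not preserved).
  std::cout << as_struct->type()->ToString() << std::endl;
  return arrow::Status::OK();
}
```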

bool Equals(const RecordBatch &other, bool check_metadata = false, const EqualOptions &opts = EqualOptions::Defaults()) const#

Determine if two record batches are exactly equal.

Parameters:
  • other[in] the RecordBatch to compare with

  • check_metadata[in] if true, check that Schema metadata is the same

  • opts[in] the options for equality comparisons

Returns:

true if batches are equal

bool ApproxEquals(const RecordBatch &other, const EqualOptions &opts = EqualOptions::Defaults()) const#

Determine if two record batches are approximately equal.

Parameters:
  • other[in] the RecordBatch to compare with

  • opts[in] the options for equality comparisons

Returns:

true if batches are approximately equal
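A hedged example of both comparisons; the function name and the 1e-6 tolerance are arbitrary choices for illustration:

```cpp
#include <arrow/api.h>
#include <iostream>

// Sketch: exact vs. approximate batch comparison. EqualOptions also controls
// NaN handling, among other things.
void CompareBatches(const arrow::RecordBatch& a, const arrow::RecordBatch& b) {
  // Exact comparison that also requires identical schema metadata.
  std::cout << "exact: " << a.Equals(b, /*check_metadata=*/true) << std::endl;
  // Approximate comparison, useful when floating-point columns may differ
  // by rounding error.
  std::cout << "approx: "
            << a.ApproxEquals(b, arrow::EqualOptions::Defaults().atol(1e-6))
            << std::endl;
}
```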

inline const std::shared_ptr<Schema> &schema() const#
Returns:

the record batch’s schema

Result<std::shared_ptr<RecordBatch>> ReplaceSchema(std::shared_ptr<Schema> schema) const#

Replace the schema with another schema with the same types, but potentially different field names and/or metadata.

virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0#

Retrieve all columns at once.

virtual std::shared_ptr<Array> column(int i) const = 0#

Retrieve an array from the record batch.

Parameters:

i[in] field index; bounds are not checked

Returns:

an Array object

std::shared_ptr<Array> GetColumnByName(const std::string &name) const#

Retrieve an array from the record batch.

Parameters:

name[in] field name

Returns:

an Array or null if no field was found
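A small sketch showing both access paths; the column name "x" is assumed for illustration:

```cpp
#include <arrow/api.h>
#include <iostream>

// Sketch: retrieving columns by index and by name.
void InspectColumns(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // By index: no bounds checking, so stay within [0, num_columns()).
  if (batch->num_columns() > 0) {
    std::shared_ptr<arrow::Array> first = batch->column(0);
    std::cout << first->ToString() << std::endl;
  }
  // By name: returns nullptr if no field with that name exists.
  if (std::shared_ptr<arrow::Array> x = batch->GetColumnByName("x")) {
    std::cout << "x has " << x->length() << " values" << std::endl;
  }
}
```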

virtual std::shared_ptr<ArrayData> column_data(int i) const = 0#

Retrieve an array’s internal data from the record batch.

Parameters:

i[in] field index; bounds are not checked

Returns:

an internal ArrayData object

virtual const ArrayDataVector &column_data() const = 0#

Retrieve all arrays’ internal data from the record batch.

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#

Add column to the record batch, producing a new RecordBatch.

Parameters:
  • i[in] field index, which will be boundschecked

  • field[in] field to be added

  • column[in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array> &column) const#

Add new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters:
  • i[in] field index, which will be boundschecked

  • field_name[in] name of field to be added

  • column[in] column to be added
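As a sketch, appending a nullable column of zeros at the end of a batch; the column name "extra" and the zero-fill are illustrative choices, not part of the API:

```cpp
#include <arrow/api.h>

// Sketch: record batches are immutable, so AddColumn returns a new batch.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> AppendColumn(
    const std::shared_ptr<arrow::RecordBatch>& batch) {
  // Build a double column with the same length as the batch.
  arrow::DoubleBuilder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues(
      std::vector<double>(static_cast<size_t>(batch->num_rows()), 0.0)));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> zeros, builder.Finish());
  // Insert after the last existing column; the index is bounds-checked.
  return batch->AddColumn(batch->num_columns(), "extra", zeros);
}
```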

virtual Result<std::shared_ptr<RecordBatch>> SetColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#

Replace a column in the record batch, producing a new RecordBatch.

Parameters:
  • i[in] field index, which will be boundschecked

  • field[in] the field to set at position i (replacing the existing field)

  • column[in] the column to set at position i (replacing the existing column)

virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0#

Remove column from the record batch, producing a new RecordBatch.

Parameters:

i[in] field index, which will be boundschecked

const std::string &column_name(int i) const#

Name of the i-th column.

int num_columns() const#
Returns:

the number of columns in the record batch

inline int64_t num_rows() const#
Returns:

the number of rows (the corresponding length of each column)

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const#

Slice each of the arrays in the record batch.

Parameters:

offset[in] the starting row offset; the slice extends through the end of the batch

Returns:

new record batch

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0#

Slice each of the arrays in the record batch.

Parameters:
  • offset[in] the starting offset to slice

  • length[in] the number of elements to slice from offset

Returns:

new record batch
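A short sketch of both slicing overloads, assuming the batch has at least 10 rows; the offsets and length are arbitrary:

```cpp
#include <arrow/api.h>
#include <iostream>

// Sketch: zero-copy windows over a batch; the slices share buffers with the
// original batch.
void SliceBatch(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // Rows starting at offset 10, at most 100 of them.
  std::shared_ptr<arrow::RecordBatch> window = batch->Slice(10, 100);
  // Everything from row 10 through the end of the batch.
  std::shared_ptr<arrow::RecordBatch> tail = batch->Slice(10);
  std::cout << window->num_rows() << " / " << tail->num_rows() << std::endl;
}
```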

std::string ToString() const#
Returns:

PrettyPrint representation suitable for debugging

Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const#

Return new record batch with specified columns.

virtual Status Validate() const#

Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.

This is O(k) where k is the total number of fields and array descendants.

Returns:

Status

virtual Status ValidateFull() const#

Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.

This is potentially O(k*n) where k is the total number of fields and array descendants, and n is the number of rows.

Returns:

Status

Public Static Functions

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns)#
Parameters:
  • schema[in] The record batch schema

  • num_rows[in] the number of rows in the record batch; each array in columns must have this length

  • columns[in] the record batch's columns as a vector of arrays
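A minimal construction sketch; the field names, types and values are arbitrary examples:

```cpp
#include <arrow/api.h>

// Sketch: build a two-column batch from scratch.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> MakeExampleBatch() {
  arrow::Int64Builder ids;
  arrow::StringBuilder names;
  ARROW_RETURN_NOT_OK(ids.AppendValues({1, 2, 3}));
  ARROW_RETURN_NOT_OK(names.AppendValues(std::vector<std::string>{"a", "b", "c"}));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> id_array, ids.Finish());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> name_array, names.Finish());

  auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                               arrow::field("name", arrow::utf8())});
  // num_rows must match the length of every column.
  return arrow::RecordBatch::Make(schema, /*num_rows=*/3, {id_array, name_array});
}
```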

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns)#

Construct record batch from vector of internal data structures.

This overload is intended for internal use or advanced users.

Since

0.5.0

Parameters:
  • schema – the record batch schema

  • num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field

  • columns – the data for the batch’s columns

static Result<std::shared_ptr<RecordBatch>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#

Create an empty RecordBatch of a given schema.

The output RecordBatch will be created with DataTypes from the given schema.

Parameters:
  • schema[in] the schema of the empty RecordBatch

  • pool[in] the memory pool to allocate memory from

Returns:

the resulting RecordBatch

static Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array> &array, MemoryPool *pool = default_memory_pool())#

Construct record batch from struct array.

This constructs a record batch using the child arrays of the given array, which must be a struct array.

This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not.

Parameters:
  • array[in] the source array, must be a StructArray

  • pool[in] the memory pool to allocate new validity bitmaps
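A sketch of a struct-array round trip built from the two conversions documented above; the equality check at the end is illustrative:

```cpp
#include <arrow/api.h>

// Sketch: RecordBatch -> StructArray -> RecordBatch, usually zero-copy.
arrow::Status RoundTrip(const std::shared_ptr<arrow::RecordBatch>& batch) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::StructArray> as_struct,
                        batch->ToStructArray());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatch> back,
                        arrow::RecordBatch::FromStructArray(as_struct));
  return back->Equals(*batch) ? arrow::Status::OK()
                              : arrow::Status::Invalid("round trip changed data");
}
```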

class RecordBatchReader#

Abstract interface for reading a stream of record batches.

Subclassed by arrow::TableBatchReader, arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader

Public Functions

virtual std::shared_ptr<Schema> schema() const = 0#
Returns:

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0#

Read the next record batch in the stream.

Sets batch to null when the end of the stream is reached.

Parameters:

batch[out] the next loaded batch, null at end of stream

Returns:

Status
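The canonical drain loop, sketched for any RecordBatchReader implementation:

```cpp
#include <arrow/api.h>

// Sketch: read batches until ReadNext yields a null batch.
arrow::Status ProcessStream(arrow::RecordBatchReader* reader) {
  while (true) {
    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... work with `batch` here ...
  }
  return reader->Close();
}
```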

inline Result<std::shared_ptr<RecordBatch>> Next()#

Iterator interface.

inline virtual Status Close()#

Finalize the reader.

inline RecordBatchReaderIterator begin()#

Return an iterator to the first record batch in the stream.

inline RecordBatchReaderIterator end()#

Return an iterator to the end of the stream.

Result<RecordBatchVector> ToRecordBatches()#

Consume entire stream as a vector of record batches.

Result<std::shared_ptr<Table>> ToTable()#

Read all batches and concatenate as arrow::Table.

Public Static Functions

static Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR)#

Create a RecordBatchReader from a vector of RecordBatch.

Parameters:
  • batches[in] the vector of RecordBatch to read from

  • schema[in] schema to conform to. Will be inferred from the first element if not provided.
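A sketch combining Make with ToTable to turn an in-memory vector of batches into a Table; the wrapper function is illustrative:

```cpp
#include <arrow/api.h>

// Sketch: wrap batches in a reader, then collapse the stream into a Table.
arrow::Result<std::shared_ptr<arrow::Table>> BatchesToTable(
    std::vector<std::shared_ptr<arrow::RecordBatch>> batches) {
  // The schema is inferred from the first batch since none is passed.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatchReader> reader,
                        arrow::RecordBatchReader::Make(std::move(batches)));
  // ToTable() consumes the whole stream and concatenates it.
  return reader->ToTable();
}
```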

static Result<std::shared_ptr<RecordBatchReader>> MakeFromIterator(Iterator<std::shared_ptr<RecordBatch>> batches, std::shared_ptr<Schema> schema)#

Create a RecordBatchReader from an Iterator of RecordBatch.

Parameters:
  • batches[in] an iterator of RecordBatch to read from.

  • schema[in] schema that each record batch in iterator will conform to.

class RecordBatchReaderIterator#
class TableBatchReader : public arrow::RecordBatchReader#

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

Public Functions

explicit TableBatchReader(const Table &table)#

Construct a TableBatchReader for the given table.

virtual std::shared_ptr<Schema> schema() const override#
Returns:

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *out) override#

Read the next record batch in the stream.

Sets the output batch to null when the end of the stream is reached.

Parameters:

out[out] the next loaded batch, null at end of stream

Returns:

Status

void set_chunksize(int64_t chunksize)#

Set the desired maximum chunk size of record batches.

The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.
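A sketch of the usual TableBatchReader loop; the 65536-row chunk size is an arbitrary example:

```cpp
#include <arrow/api.h>

// Sketch: re-chunk a table into batches of at most 65536 rows. Each batch is
// a zero-copy view; actual sizes also depend on the table's own chunking.
arrow::Status ForEachBatch(const arrow::Table& table) {
  arrow::TableBatchReader reader(table);
  reader.set_chunksize(65536);
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
    if (batch == nullptr) break;
    // ... process the view ...
  }
  return arrow::Status::OK();
}
```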

Tables#

class Table#

Logical table as a sequence of chunked arrays.

Public Functions

inline const std::shared_ptr<Schema> &schema() const#

Return the table schema.

virtual std::shared_ptr<ChunkedArray> column(int i) const = 0#

Return a column by index.

virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0#

Return vector of all columns for table.

inline std::shared_ptr<Field> field(int i) const#

Return a column’s field by index.

std::vector<std::shared_ptr<Field>> fields() const#

Return vector of all fields for table.

virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0#

Construct a zero-copy slice of the table with the indicated offset and length.

Parameters:
  • offset[in] the index of the first row in the constructed slice

  • length[in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

Returns:

a new object wrapped in std::shared_ptr<Table>

inline std::shared_ptr<Table> Slice(int64_t offset) const#

Slice from first row at offset until end of the table.

inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const#

Return a column by name.

Parameters:

name[in] field name

Returns:

a ChunkedArray, or null if no field was found

virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0#

Remove column from the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> AddColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0#

Add column to the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> SetColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0#

Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const#

Return names of all columns.

Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const#

Rename columns with provided names.

Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const#

Return new table with specified columns.
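A sketch of projection followed by renaming, assuming the table has at least two columns; the indices and names are illustrative:

```cpp
#include <arrow/api.h>

// Sketch: both operations return new Table objects backed by the same data.
arrow::Result<std::shared_ptr<arrow::Table>> ProjectAndRename(
    const std::shared_ptr<arrow::Table>& table) {
  // Keep only the first two columns, in that order.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> projected,
                        table->SelectColumns({0, 1}));
  // RenameColumns expects exactly one name per remaining column.
  return projected->RenameColumns({"key", "value"});
}
```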

virtual std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata> &metadata) const = 0#

Replace schema key-value metadata with new metadata.

Since

0.5.0

Parameters:

metadata[in] new KeyValueMetadata

Returns:

new Table

virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0#

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns

Parameters:

pool[in] The pool for buffer allocations, if any

std::string ToString() const#
Returns:

PrettyPrint representation suitable for debugging

virtual Status Validate() const = 0#

Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.

This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.

Returns:

Status

virtual Status ValidateFull() const = 0#

Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.

This is O(k*n) where k is the total number of field descendents, and n is the number of rows.

Returns:

Status

inline int num_columns() const#

Return the number of columns in the table.

inline int64_t num_rows() const#

Return the number of rows (equal to each column’s logical length)

bool Equals(const Table &other, bool check_metadata = false) const#

Determine if tables are equal.

Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.

Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const#

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

Parameters:

pool[in] The pool for buffer allocations

Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool *pool = default_memory_pool()) const#

Make a new record batch by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk.

Parameters:

pool[in] The pool for buffer allocations
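A sketch using both methods; calling CombineChunks before CombineChunksToBatch is redundant and is shown here only to illustrate the two calls:

```cpp
#include <arrow/api.h>

// Sketch: defragment a table. CombineChunks leaves at most one chunk per
// column; CombineChunksToBatch goes further and yields a single RecordBatch.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> Defragment(
    const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> combined,
                        table->CombineChunks());
  return combined->CombineChunksToBatch();
}
```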

Public Static Functions

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns, int64_t num_rows = -1)#

Construct a Table from schema and columns.

If columns is zero-length, the table’s number of rows is zero

Parameters:
  • schema[in] The table schema (column types)

  • columns[in] The table’s columns as chunked arrays

  • num_rows[in] number of rows in table, -1 (default) to infer from columns

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<Array>> &arrays, int64_t num_rows = -1)#

Construct a Table from schema and arrays.

Parameters:
  • schema[in] The table schema (column types)

  • arrays[in] The table’s columns as arrays

  • num_rows[in] number of rows in table, -1 (default) to infer from columns
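A minimal construction sketch from plain arrays; the field name, type and values are arbitrary:

```cpp
#include <arrow/api.h>

// Sketch: build a one-column table directly from arrays; each array becomes
// a single-chunk ChunkedArray.
arrow::Result<std::shared_ptr<arrow::Table>> MakeExampleTable() {
  arrow::DoubleBuilder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1.5, 2.5, 3.5}));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> values, builder.Finish());
  auto schema = arrow::schema({arrow::field("value", arrow::float64())});
  // num_rows is inferred from the arrays (default -1).
  return arrow::Table::Make(schema, {values});
}
```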

static Result<std::shared_ptr<Table>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#

Create an empty Table of a given schema.

The output Table will be created with a single empty chunk per column.

Parameters:
  • schema[in] the schema of the empty Table

  • pool[in] the memory pool to allocate memory from

Returns:

the resulting Table

static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)#

Construct a Table from a RecordBatchReader.

Parameters:

reader[in] the arrow::RecordBatchReader that produces batches

static Result<std::shared_ptr<Table>> FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>> &batches)#

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Parameters:

batches[in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromRecordBatches(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<RecordBatch>> &batches)#

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches

Parameters:
  • schema[in] the arrow::Schema for each batch

  • batches[in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromChunkedStructArray(const std::shared_ptr<ChunkedArray> &array)#

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Parameters:

array[in] a chunked StructArray

Result<std::shared_ptr<Table>> arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>> &tables, ConcatenateTablesOptions options = ConcatenateTablesOptions::Defaults(), MemoryPool *memory_pool = default_memory_pool())#

Construct a new table from multiple input tables.

The new table is assembled from the existing column chunks without copying when the schemas are identical. If the schemas do not match exactly and unify_schemas is enabled in options (off by default), an attempt is made to unify them, and the column chunks are then converted to their respective unified datatypes, which will likely incur a copy. arrow::PromoteTableToSchema is used to unify the schemas.

Tables are concatenated in the order they are provided, and the order of rows within each table is preserved.

Parameters:
  • tables[in] a std::vector of Tables to be concatenated

  • options[in] specify how to unify schema of input tables

  • memory_pool[in] MemoryPool to be used if null-filled arrays need to be created or if existing column chunks need to undergo type conversion

Returns:

new Table
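A sketch of concatenation with schema unification enabled; the wrapper function is illustrative:

```cpp
#include <arrow/api.h>

// Sketch: stack two tables vertically. With unify_schemas enabled, an attempt
// is made to unify differing schemas before concatenation.
arrow::Result<std::shared_ptr<arrow::Table>> Stack(
    const std::shared_ptr<arrow::Table>& a,
    const std::shared_ptr<arrow::Table>& b) {
  arrow::ConcatenateTablesOptions options =
      arrow::ConcatenateTablesOptions::Defaults();
  options.unify_schemas = true;  // off by default
  return arrow::ConcatenateTables({a, b}, options);
}
```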

Result<std::shared_ptr<Table>> arrow::PromoteTableToSchema(const std::shared_ptr<Table> &table, const std::shared_ptr<Schema> &schema, MemoryPool *pool = default_memory_pool())#

Result<std::shared_ptr<Table>> arrow::PromoteTableToSchema(const std::shared_ptr<Table> &table, const std::shared_ptr<Schema> &schema, const compute::CastOptions &options, MemoryPool *pool = default_memory_pool())#

Promote a table to conform to the given schema; used by arrow::ConcatenateTables when unify_schemas is enabled.