Two-dimensional Datasets

Record Batches

class RecordBatch

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.

Public Functions

Result<std::shared_ptr<StructArray>> ToStructArray() const

Convert record batch to struct array.

Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.

bool Equals(const RecordBatch &other, bool check_metadata = false, const EqualOptions &opts = EqualOptions::Defaults()) const

Determine if two record batches are exactly equal.

Parameters:

other – [in] the RecordBatch to compare with
check_metadata – [in] if true, check that Schema metadata is the same
opts – [in] the options for equality comparisons

Returns:

true if batches are equal

bool ApproxEquals(const RecordBatch &other, const EqualOptions &opts = EqualOptions::Defaults()) const

Determine if two record batches are approximately equal.

Parameters:

other – [in] the RecordBatch to compare with
opts – [in] the options for equality comparisons

Returns:

true if batches are approximately equal

inline const std::shared_ptr<Schema> &schema() const

Returns:

the record batch’s schema

virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0

Retrieve all columns at once.

virtual std::shared_ptr<Array> column(int i) const = 0

Retrieve an array from the record batch.

Parameters:

i – [in] field index; no bounds checking is performed

Returns:

an Array object

std::shared_ptr<Array> GetColumnByName(const std::string &name) const

Retrieve an array from the record batch.

Parameters:

name – [in] field name

Returns:

an Array or null if no field was found

virtual std::shared_ptr<ArrayData> column_data(int i) const = 0

Retrieve an array’s internal data from the record batch.

Parameters:

i – [in] field index; no bounds checking is performed

Returns:

an internal ArrayData object

virtual const ArrayDataVector &column_data() const = 0

Retrieve all arrays’ internal data from the record batch.

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0

Add column to the record batch, producing a new RecordBatch.

Parameters:

i – [in] field index, which will be bounds-checked
field – [in] field to be added
column – [in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array> &column) const

Add new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters:

i – [in] field index, which will be bounds-checked
field_name – [in] name of field to be added
column – [in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> SetColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0

Replace a column in the record batch, producing a new RecordBatch.

Parameters:

i – [in] field index, which will be bounds-checked
field – [in] the new field that replaces the field at index i
column – [in] the new column that replaces the column at index i

virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0

Remove column from the record batch, producing a new RecordBatch.

Parameters:

i – [in] field index, which will be bounds-checked

const std::string &column_name(int i) const

Name of the i-th column.

int num_columns() const

Returns:

the number of columns in the record batch

inline int64_t num_rows() const

Returns:

the number of rows (the corresponding length of each column)

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const

Slice each of the arrays in the record batch.

Parameters:

offset – [in] the starting offset to slice, through end of batch

Returns:

new record batch

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0

Slice each of the arrays in the record batch.

Parameters:

offset – [in] the starting offset to slice
length – [in] the number of elements to slice from offset

Returns:

new record batch

std::string ToString() const

Returns:

PrettyPrint representation suitable for debugging

Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const

Return new record batch with specified columns.

virtual Status Validate() const

Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.

This is O(k) where k is the total number of fields and array descendants.

Returns:

Status

virtual Status ValidateFull() const

Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.

This is potentially O(k*n) where n is the number of rows.

Returns:

Status

Public Static Functions

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns)

Parameters:

schema – [in] The record batch schema
num_rows – [in] length of fields in the record batch. Each array should have the same length as num_rows
columns – [in] the record batch fields as vector of arrays

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns)

Construct record batch from vector of internal data structures.

This function is intended for internal use, or for advanced users.

Since: 0.5.0

Parameters:

schema – the record batch schema
num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field
columns – the data for the batch’s columns

static Result<std::shared_ptr<RecordBatch>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())

Create an empty RecordBatch of a given schema.

The output RecordBatch will be created with DataTypes from the given schema.

Parameters:

schema – [in] the schema of the empty RecordBatch
pool – [in] the memory pool to allocate memory from

Returns:

the resulting RecordBatch

static Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array> &array, MemoryPool *pool = default_memory_pool())

Construct record batch from struct array.

This constructs a record batch using the child arrays of the given array, which must be a struct array.

This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not.

Parameters:

array – [in] the source array, must be a StructArray
pool – [in] the memory pool to allocate new validity bitmaps

class RecordBatchReader

Abstract interface for reading a stream of record batches.

Subclassed by arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader, arrow::TableBatchReader

Public Functions

virtual std::shared_ptr<Schema> schema() const = 0

Returns:

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0

Read the next record batch in the stream.

Sets batch to null when the end of the stream is reached.

Parameters:

batch – [out] the next loaded batch, null at end of stream

Returns:

Status

inline Result<std::shared_ptr<RecordBatch>> Next()

Iterator interface.

inline virtual Status Close()

Finalize the reader.

inline RecordBatchReaderIterator begin()

Return an iterator to the first record batch in the stream.

inline RecordBatchReaderIterator end()

Return an iterator to the end of the stream.

Result<RecordBatchVector> ToRecordBatches()

Consume entire stream as a vector of record batches.

Result<std::shared_ptr<Table>> ToTable()

Read all batches and concatenate as arrow::Table.

Public Static Functions

static Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR)

Create a RecordBatchReader from a vector of RecordBatch.

Parameters:

batches – [in] the vector of RecordBatch to read from
schema – [in] schema to conform to. Will be inferred from the first element if not provided.

static Result<std::shared_ptr<RecordBatchReader>> MakeFromIterator(Iterator<std::shared_ptr<RecordBatch>> batches, std::shared_ptr<Schema> schema)

Create a RecordBatchReader from an Iterator of RecordBatch.

Parameters:

batches – [in] an iterator of RecordBatch to read from.
schema – [in] schema that each record batch in iterator will conform to.

class RecordBatchReaderIterator

class TableBatchReader : public arrow::RecordBatchReader

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

Public Functions

explicit TableBatchReader(const Table &table)

Construct a TableBatchReader for the given table.

virtual std::shared_ptr<Schema> schema() const override

Returns:

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *out) override

Read the next record batch in the stream.

Sets out to null when the end of the stream is reached.

Parameters:

out – [out] the next loaded batch, null at end of stream

Returns:

Status

void set_chunksize(int64_t chunksize)

Set the desired maximum chunk size of record batches.

The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.

Tables

class Table

Logical table as sequence of chunked arrays.

Public Functions

inline const std::shared_ptr<Schema> &schema() const

Return the table schema.

virtual std::shared_ptr<ChunkedArray> column(int i) const = 0

Return a column by index.

virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0

Return vector of all columns for table.

inline std::shared_ptr<Field> field(int i) const

Return a column’s field by index.

std::vector<std::shared_ptr<Field>> fields() const

Return vector of all fields for table.

virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0

Construct a zero-copy slice of the table with the indicated offset and length.

Parameters:

offset – [in] the index of the first row in the constructed slice
length – [in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

Returns:

a new object wrapped in std::shared_ptr<Table>

inline std::shared_ptr<Table> Slice(int64_t offset) const

Slice from first row at offset until end of the table.

inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const

Return a column by name.

Parameters:

name – [in] field name

Returns:

a ChunkedArray or null if no field was found

virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0

Remove column from the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> AddColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0

Add column to the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> SetColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0

Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const

Return names of all columns.

Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const

Rename columns with provided names.

Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const

Return new table with specified columns.

virtual std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata> &metadata) const = 0

Replace schema key-value metadata with new metadata.

Since: 0.5.0

Parameters:

metadata – [in] new KeyValueMetadata

Returns:

new Table

virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns.

Parameters:

pool – [in] The pool for buffer allocations, if any

std::string ToString() const

Returns:

PrettyPrint representation suitable for debugging

virtual Status Validate() const = 0

Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.

This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.

Returns:

Status

virtual Status ValidateFull() const = 0

Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.

This is O(k*n) where k is the total number of field descendants, and n is the number of rows.

Returns:

Status

inline int num_columns() const

Return the number of columns in the table.

inline int64_t num_rows() const

Return the number of rows (equal to each column’s logical length).

bool Equals(const Table &other, bool check_metadata = false) const

Determine if tables are equal.

Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.

Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

Parameters:

pool – [in] The pool for buffer allocations

Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool *pool = default_memory_pool()) const

Make a new record batch by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk.

Parameters:

pool – [in] The pool for buffer allocations

Public Static Functions

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns, int64_t num_rows = -1)

Construct a Table from schema and columns.

If columns is zero-length, the table’s number of rows is zero.

Parameters:

schema – [in] The table schema (column types)
columns – [in] The table’s columns as chunked arrays
num_rows – [in] number of rows in table, -1 (default) to infer from columns

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<Array>> &arrays, int64_t num_rows = -1)

Construct a Table from schema and arrays.

Parameters:

schema – [in] The table schema (column types)
arrays – [in] The table’s columns as arrays
num_rows – [in] number of rows in table, -1 (default) to infer from columns

static Result<std::shared_ptr<Table>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())

Create an empty Table of a given schema.

The output Table will be created with a single empty chunk per column.

Parameters:

schema – [in] the schema of the empty Table
pool – [in] the memory pool to allocate memory from

Returns:

the resulting Table

static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)

Construct a Table from a RecordBatchReader.

Parameters:

reader – [in] the RecordBatchReader to read the record batches from

static Result<std::shared_ptr<Table>> FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>> &batches)

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Parameters:

batches – [in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromRecordBatches(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<RecordBatch>> &batches)

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches.

Parameters:

schema – [in] the arrow::Schema for each batch
batches – [in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromChunkedStructArray(const std::shared_ptr<ChunkedArray> &array)

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Parameters:

array – [in] a chunked StructArray

Result<std::shared_ptr<Table>> arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>> &tables, ConcatenateTablesOptions options = ConcatenateTablesOptions::Defaults(), MemoryPool *memory_pool = default_memory_pool())

Construct a new table from multiple input tables.

The new table is assembled from existing column chunks without copying, if schemas are identical. If schemas do not match exactly and unify_schemas is enabled in options (off by default), an attempt is made to unify them, and then column chunks are converted to their respective unified datatype, which will probably incur a copy. arrow::PromoteTableToSchema is used to unify schemas.

Tables are concatenated in the order in which they are provided, and the order of rows within each table is preserved.

Parameters:

tables – [in] a std::vector of Tables to be concatenated
options – [in] specify how to unify schema of input tables
memory_pool – [in] MemoryPool to be used if null-filled arrays need to be created or if existing column chunks need to undergo type conversion

Returns:

new Table

Result<std::shared_ptr<Table>> arrow::PromoteTableToSchema(const std::shared_ptr<Table> &table, const std::shared_ptr<Schema> &schema, MemoryPool *pool = default_memory_pool())

Promotes a table to conform to the given schema.

If a field in the schema does not have a corresponding column in the table, a column of nulls will be added to the resulting table. If the corresponding column is of type Null, it will be promoted to the type specified by schema, with null values filled.

An error is returned if the corresponding column’s type is not compatible with the schema, or if there is a column in the table that does not exist in the schema.

Parameters:

table – [in] the input Table
schema – [in] the target schema to promote to
pool – [in] The memory pool to be used if null-filled arrays need to be created.
