Two-dimensional Datasets#

Record Batches#

class RecordBatch#

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.

Public Functions

Result<std::shared_ptr<StructArray>> ToStructArray() const#

Convert record batch to struct array.

Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.

bool Equals(const RecordBatch &other, bool check_metadata = false, const EqualOptions &opts = EqualOptions::Defaults()) const#

Determine if two record batches are exactly equal.

Parameters:

other – [in] the RecordBatch to compare with
check_metadata – [in] if true, check that Schema metadata is the same
opts – [in] the options for equality comparisons

Returns:

true if batches are equal

bool ApproxEquals(const RecordBatch &other, const EqualOptions &opts = EqualOptions::Defaults()) const#

Determine if two record batches are approximately equal.

Parameters:

other – [in] the RecordBatch to compare with
opts – [in] the options for equality comparisons

Returns:

true if batches are approximately equal

inline const std::shared_ptr<Schema> &schema() const#

Returns:: the record batch’s schema

Result<std::shared_ptr<RecordBatch>> ReplaceSchema(std::shared_ptr<Schema> schema) const#: Replace the schema with another schema with the same types, but potentially different field names and/or metadata.

virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0#: Retrieve all columns at once.

virtual std::shared_ptr<Array> column(int i) const = 0#

Retrieve an array from the record batch.

Parameters:: i – [in] field index, does not boundscheck
Returns:: an Array object

std::shared_ptr<Array> GetColumnByName(const std::string &name) const#

Retrieve an array from the record batch.

Parameters:: name – [in] field name
Returns:: an Array or null if no field was found

virtual std::shared_ptr<ArrayData> column_data(int i) const = 0#

Retrieve an array’s internal data from the record batch.

Parameters:: i – [in] field index, does not boundscheck
Returns:: an internal ArrayData object

virtual const ArrayDataVector &column_data() const = 0#: Retrieve all arrays’ internal data from the record batch.

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#

Add column to the record batch, producing a new RecordBatch.

Parameters:

i – [in] field index, which will be boundschecked
field – [in] field to be added
column – [in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array> &column) const#

Add new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters:

i – [in] field index, which will be boundschecked
field_name – [in] name of field to be added
column – [in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> SetColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0#

Replace a column in the record batch, producing a new RecordBatch.

Parameters:

i – [in] field index, does boundscheck
field – [in] field to be replaced
column – [in] column to be replaced

virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0#

Remove column from the record batch, producing a new RecordBatch.

Parameters:: i – [in] field index, does boundscheck

const std::string &column_name(int i) const#: Name of the i-th column.

int num_columns() const#

Returns:: the number of columns in the record batch

inline int64_t num_rows() const#

Returns:: the number of rows (the corresponding length of each column)

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const#

Slice each of the arrays in the record batch.

Parameters:: offset – [in] the starting offset to slice, through end of batch
Returns:: new record batch

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0#

Slice each of the arrays in the record batch.

Parameters:

offset – [in] the starting offset to slice
length – [in] the number of elements to slice from offset

Returns:

new record batch

std::string ToString() const#

Returns:: PrettyPrint representation suitable for debugging

Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const#: Return new record batch with specified columns.

virtual Status Validate() const#

Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.

This is O(k) where k is the total number of fields and array descendants.

Returns:: Status

virtual Status ValidateFull() const#

Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.

This is potentially O(k*n) where n is the number of rows.

Returns:: Status

Public Static Functions

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns)#

Parameters:

schema – [in] The record batch schema
num_rows – [in] length of fields in the record batch. Each array should have the same length as num_rows
columns – [in] the record batch fields as vector of arrays

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns)#

Construct record batch from vector of internal data structures.

This function is intended for internal use or for advanced users.

Since: 0.5.0

Parameters:

schema – the record batch schema
num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field
columns – the data for the batch’s columns

static Result<std::shared_ptr<RecordBatch>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#

Create an empty RecordBatch of a given schema.

The output RecordBatch will be created with DataTypes from the given schema.

Parameters:

schema – [in] the schema of the empty RecordBatch
pool – [in] the memory pool to allocate memory from

Returns:

the resulting RecordBatch

static Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array> &array, MemoryPool *pool = default_memory_pool())#

Construct record batch from struct array.

This constructs a record batch using the child arrays of the given array, which must be a struct array.

This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not.

Parameters:

array – [in] the source array, must be a StructArray
pool – [in] the memory pool to allocate new validity bitmaps

class RecordBatchReader#

Abstract interface for reading stream of record batches.

Subclassed by arrow::TableBatchReader, arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader

Public Functions

virtual std::shared_ptr<Schema> schema() const = 0#

Returns:: the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0#

Read the next record batch in the stream.

Sets batch to null when the end of the stream is reached.

Parameters:: batch – [out] the next loaded batch, null at end of stream
Returns:: Status

inline Result<std::shared_ptr<RecordBatch>> Next()#: Iterator interface.

inline virtual Status Close()#: Finalize the reader.

inline RecordBatchReaderIterator begin()#: Return an iterator to the first record batch in the stream.

inline RecordBatchReaderIterator end()#: Return an iterator to the end of the stream.

Result<RecordBatchVector> ToRecordBatches()#: Consume entire stream as a vector of record batches.

Result<std::shared_ptr<Table>> ToTable()#: Read all batches and concatenate as arrow::Table.

Public Static Functions

static Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR)#

Create a RecordBatchReader from a vector of RecordBatch.

Parameters:

batches – [in] the vector of RecordBatch to read from
schema – [in] schema to conform to. Will be inferred from the first element if not provided.

static Result<std::shared_ptr<RecordBatchReader>> MakeFromIterator(Iterator<std::shared_ptr<RecordBatch>> batches, std::shared_ptr<Schema> schema)#

Create a RecordBatchReader from an Iterator of RecordBatch.

Parameters:

batches – [in] an iterator of RecordBatch to read from.
schema – [in] schema that each record batch in iterator will conform to.

class RecordBatchReaderIterator#

class TableBatchReader : public arrow::RecordBatchReader #

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

Public Functions

explicit TableBatchReader(const Table &table)#: Construct a TableBatchReader for the given table.

virtual std::shared_ptr<Schema> schema() const override#

Returns:: the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch> *out) override#

Read the next record batch in the stream.

Sets out to null when the end of the stream is reached.

Parameters:: out – [out] the next loaded batch, null at end of stream
Returns:: Status

void set_chunksize(int64_t chunksize)#

Set the desired maximum chunk size of record batches.

The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.

Tables#

class Table#

Logical table as sequence of chunked arrays.

Public Functions

inline const std::shared_ptr<Schema> &schema() const#: Return the table schema.

virtual std::shared_ptr<ChunkedArray> column(int i) const = 0#: Return a column by index.

virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0#: Return vector of all columns for table.

inline std::shared_ptr<Field> field(int i) const#: Return a column’s field by index.

std::vector<std::shared_ptr<Field>> fields() const#: Return vector of all fields for table.

virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0#

Construct a zero-copy slice of the table with the indicated offset and length.

Parameters:

offset – [in] the index of the first row in the constructed slice
length – [in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

Returns:

a new object wrapped in std::shared_ptr<Table>

inline std::shared_ptr<Table> Slice(int64_t offset) const#: Slice from first row at offset until end of the table.

inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const#

Return a column by name.

Parameters:: name – [in] field name
Returns:: a ChunkedArray or null if no field was found

virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0#: Remove column from the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> AddColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0#: Add column to the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> SetColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0#: Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const#: Return names of all columns.

Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const#: Rename columns with provided names.

Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const#: Return new table with specified columns.

virtual std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata> &metadata) const = 0#

Replace schema key-value metadata with new metadata.

Since: 0.5.0

Parameters:: metadata – [in] new KeyValueMetadata
Returns:: new Table

virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0#

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns.

Parameters:: pool – [in] The pool for buffer allocations, if any

std::string ToString() const#

Returns:: PrettyPrint representation suitable for debugging

virtual Status Validate() const = 0#

Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.

This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.

Returns:: Status

virtual Status ValidateFull() const = 0#

Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.

This is O(k*n) where k is the total number of field descendants, and n is the number of rows.

Returns:: Status

inline int num_columns() const#: Return the number of columns in the table.

inline int64_t num_rows() const#: Return the number of rows (equal to each column’s logical length).

bool Equals(const Table &other, bool check_metadata = false) const#

Determine if tables are equal.

Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.

Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const#

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

Parameters:: pool – [in] The pool for buffer allocations

Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool *pool = default_memory_pool()) const#

Make a new record batch by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk.

Parameters:: pool – [in] The pool for buffer allocations

Public Static Functions

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns, int64_t num_rows = -1)#

Construct a Table from schema and columns.

If columns is zero-length, the table’s number of rows is zero.

Parameters:

schema – [in] The table schema (column types)
columns – [in] The table’s columns as chunked arrays
num_rows – [in] number of rows in table, -1 (default) to infer from columns

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<Array>> &arrays, int64_t num_rows = -1)#

Construct a Table from schema and arrays.

Parameters:

schema – [in] The table schema (column types)
arrays – [in] The table’s columns as arrays
num_rows – [in] number of rows in table, -1 (default) to infer from columns

static Result<std::shared_ptr<Table>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool())#

Create an empty Table of a given schema.

The output Table will be created with a single empty chunk per column.

Parameters:

schema – [in] the schema of the empty Table
pool – [in] the memory pool to allocate memory from

Returns:

the resulting Table

static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)#

Construct a Table from a RecordBatchReader.

Parameters:: reader – [in] the arrow::RecordBatchReader that produces batches

static Result<std::shared_ptr<Table>> FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>> &batches)#

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Parameters:: batches – [in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromRecordBatches(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<RecordBatch>> &batches)#

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches.

Parameters:

schema – [in] the arrow::Schema for each batch
batches – [in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromChunkedStructArray(const std::shared_ptr<ChunkedArray> &array)#

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Parameters:: array – [in] a chunked StructArray

Result<std::shared_ptr<Table>> arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>> &tables, ConcatenateTablesOptions options = ConcatenateTablesOptions::Defaults(), MemoryPool *memory_pool = default_memory_pool())#

Construct a new table from multiple input tables.

The new table is assembled from existing column chunks without copying if the schemas are identical. If the schemas do not match exactly and unify_schemas is enabled in options (it is off by default), an attempt is made to unify them, and column chunks are then converted to their respective unified data type, which will probably incur a copy. arrow::PromoteTableToSchema is used to unify schemas.

Tables are concatenated in the order in which they are provided, and the order of rows within each table is preserved.

Parameters:

tables – [in] a std::vector of Tables to be concatenated
options – [in] specify how to unify schema of input tables
memory_pool – [in] MemoryPool to be used if null-filled arrays need to be created or if existing column chunks need to undergo type conversion

Returns:

new Table

Overloads of arrow::PromoteTableToSchema:

- Result<std::shared_ptr<Table>> PromoteTableToSchema(const std::shared_ptr<Table> &table, const std::shared_ptr<Schema> &schema, MemoryPool *pool = default_memory_pool())
- Result<std::shared_ptr<Table>> PromoteTableToSchema(const std::shared_ptr<Table> &table, const std::shared_ptr<Schema> &schema, const compute::CastOptions &options, MemoryPool *pool = default_memory_pool())