Two-dimensional Datasets
Record Batches
-
class RecordBatch#
Collection of equal-length arrays matching a particular Schema.
A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.
Public Functions
-
Result<std::shared_ptr<StructArray>> ToStructArray() const#
Convert record batch to struct array.
Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.
-
bool Equals(const RecordBatch &other, bool check_metadata = false, const EqualOptions &opts = EqualOptions::Defaults()) const#
Determine if two record batches are exactly equal.
- Parameters:
other – [in] the RecordBatch to compare with
check_metadata – [in] if true, check that Schema metadata is the same
opts – [in] the options for equality comparisons
- Returns:
true if batches are equal
-
bool ApproxEquals(const RecordBatch &other, const EqualOptions &opts = EqualOptions::Defaults()) const#
Determine if two record batches are approximately equal.
- Parameters:
other – [in] the RecordBatch to compare with
opts – [in] the options for equality comparisons
- Returns:
true if batches are approximately equal
Replace the schema with another schema with the same types, but potentially different field names and/or metadata.
-
virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0#
Retrieve all columns at once.
-
virtual std::shared_ptr<Array> column(int i) const = 0#
Retrieve an array from the record batch.
- Parameters:
i – [in] field index; bounds are not checked
- Returns:
an Array object
-
std::shared_ptr<Array> GetColumnByName(const std::string &name) const#
Retrieve an array from the record batch.
- Parameters:
name – [in] field name
- Returns:
an Array or null if no field was found
-
virtual std::shared_ptr<ArrayData> column_data(int i) const = 0#
Retrieve an array’s internal data from the record batch.
- Parameters:
i – [in] field index; bounds are not checked
- Returns:
an internal ArrayData object
-
virtual const ArrayDataVector &column_data() const = 0#
Retrieve all arrays’ internal data from the record batch.
Add column to the record batch, producing a new RecordBatch.
- Parameters:
i – [in] field index, which will be bounds-checked
field – [in] field to be added
column – [in] column to be added
Add new nullable column to the record batch, producing a new RecordBatch.
For non-nullable columns, use the Field-based version of this method.
- Parameters:
i – [in] field index, which will be bounds-checked
field_name – [in] name of field to be added
column – [in] column to be added
Replace a column in the record batch, producing a new RecordBatch.
- Parameters:
i – [in] field index, which will be bounds-checked
field – [in] the field that replaces the existing field at index i
column – [in] the column that replaces the existing column at index i
-
virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0#
Remove column from the record batch, producing a new RecordBatch.
- Parameters:
i – [in] field index, which will be bounds-checked
-
const std::string &column_name(int i) const#
Name of the i-th column.
-
int num_columns() const#
- Returns:
the number of columns in the record batch
-
inline int64_t num_rows() const#
- Returns:
the number of rows (the corresponding length of each column)
-
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const#
Slice each of the arrays in the record batch.
- Parameters:
offset – [in] the starting offset to slice, through end of batch
- Returns:
new record batch
-
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0#
Slice each of the arrays in the record batch.
- Parameters:
offset – [in] the starting offset to slice
length – [in] the number of elements to slice from offset
- Returns:
new record batch
-
std::string ToString() const#
- Returns:
PrettyPrint representation suitable for debugging
-
Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const#
Return new record batch with specified columns.
Public Static Functions
- Parameters:
schema – [in] The record batch schema
num_rows – [in] length of fields in the record batch. Each array should have the same length as num_rows
columns – [in] the record batch fields as vector of arrays
Construct record batch from vector of internal data structures.
This function is intended for internal use or for advanced users.
- Since
0.5.0
- Parameters:
schema – the record batch schema
num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field
columns – the data for the batch’s columns
Create an empty RecordBatch of a given schema.
The output RecordBatch will be created with DataTypes from the given schema.
- Parameters:
schema – [in] the schema of the empty RecordBatch
pool – [in] the memory pool to allocate memory from
- Returns:
the resulting RecordBatch
Construct record batch from struct array.
This constructs a record batch using the child arrays of the given array, which must be a struct array.
This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not.
- Parameters:
array – [in] the source array, must be a StructArray
pool – [in] the memory pool to allocate new validity bitmaps
-
class RecordBatchReader#
Abstract interface for reading stream of record batches.
Subclassed by arrow::TableBatchReader, arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader
Public Functions
-
virtual std::shared_ptr<Schema> schema() const = 0#
- Returns:
the shared schema of the record batches in the stream
Read the next record batch in the stream.
Returns null for batch when the end of the stream is reached.
- Parameters:
batch – [out] the next loaded batch, null at end of stream
- Returns:
Status
-
inline Result<std::shared_ptr<RecordBatch>> Next()#
Iterator interface.
-
inline RecordBatchReaderIterator begin()#
Return an iterator to the first record batch in the stream.
-
inline RecordBatchReaderIterator end()#
Return an iterator to the end of the stream.
-
Result<std::shared_ptr<Table>> ToTable()#
Read all batches and concatenate as arrow::Table.
Public Static Functions
Create a RecordBatchReader from a vector of RecordBatch.
- Parameters:
batches – [in] the vector of RecordBatch to read from
schema – [in] schema to conform to. Will be inferred from the first element if not provided.
Create a RecordBatchReader from an Iterator of RecordBatch.
- Parameters:
batches – [in] an iterator of RecordBatch to read from.
schema – [in] schema that each record batch in iterator will conform to.
-
class RecordBatchReaderIterator#
-
class TableBatchReader : public arrow::RecordBatchReader#
Compute a stream of record batches from a (possibly chunked) Table.
The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.
Public Functions
-
explicit TableBatchReader(const Table &table)#
Construct a TableBatchReader for the given table.
-
virtual std::shared_ptr<Schema> schema() const override#
- Returns:
the shared schema of the record batches in the stream
Read the next record batch in the stream.
Returns null for batch when the end of the stream is reached.
- Parameters:
batch – [out] the next loaded batch, null at end of stream
- Returns:
Status
-
void set_chunksize(int64_t chunksize)#
Set the desired maximum chunk size of record batches.
The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.
Tables
-
class Table#
Logical table as sequence of chunked arrays.
Public Functions
-
virtual std::shared_ptr<ChunkedArray> column(int i) const = 0#
Return a column by index.
-
virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0#
Return vector of all columns for table.
-
virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0#
Construct a zero-copy slice of the table with the indicated offset and length.
- Parameters:
offset – [in] the index of the first row in the constructed slice
length – [in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly
- Returns:
a new object wrapped in std::shared_ptr<Table>
-
inline std::shared_ptr<Table> Slice(int64_t offset) const#
Slice from first row at offset until end of the table.
-
inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const#
Return a column by name.
- Parameters:
name – [in] field name
- Returns:
a ChunkedArray, or null if no field was found
-
virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0#
Remove column from the table, producing a new Table.
Add column to the table, producing a new Table.
Replace a column in the table, producing a new Table.
-
std::vector<std::string> ColumnNames() const#
Return names of all columns.
-
Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const#
Rename columns with provided names.
-
Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const#
Return new table with specified columns.
Replace schema key-value metadata with new metadata.
- Since
0.5.0
- Parameters:
metadata – [in] new KeyValueMetadata
- Returns:
new Table
-
virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0#
Flatten the table, producing a new Table.
Any column with a struct type will be flattened into multiple columns.
- Parameters:
pool – [in] The pool for buffer allocations, if any
-
std::string ToString() const#
- Returns:
PrettyPrint representation suitable for debugging
-
virtual Status Validate() const = 0#
Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.
This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.
- Returns:
Status
-
virtual Status ValidateFull() const = 0#
Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.
This is O(k*n) where k is the total number of field descendants, and n is the number of rows.
- Returns:
Status
-
inline int num_columns() const#
Return the number of columns in the table.
-
inline int64_t num_rows() const#
Return the number of rows (equal to each column’s logical length)
-
bool Equals(const Table &other, bool check_metadata = false) const#
Determine if tables are equal.
Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.
-
Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const#
Make a new table by combining the chunks this table has.
All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.
- Parameters:
pool – [in] The pool for buffer allocations
-
Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool *pool = default_memory_pool()) const#
Make a new record batch by combining the chunks this table has.
All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk.
- Parameters:
pool – [in] The pool for buffer allocations
Public Static Functions
Construct a Table from schema and columns.
If columns is zero-length, the table’s number of rows is zero.
- Parameters:
schema – [in] The table schema (column types)
columns – [in] The table’s columns as chunked arrays
num_rows – [in] number of rows in table, -1 (default) to infer from columns
Construct a Table from schema and arrays.
- Parameters:
schema – [in] The table schema (column types)
arrays – [in] The table’s columns as arrays
num_rows – [in] number of rows in table, -1 (default) to infer from columns
Create an empty Table of a given schema.
The output Table will be created with a single empty chunk per column.
-
static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)#
Construct a Table from a RecordBatchReader.
- Parameters:
reader – [in] the arrow::RecordBatchReader that produces batches
Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.
- Parameters:
batches – [in] a std::vector of record batches
Construct a Table from RecordBatches, using supplied schema.
There may be zero record batches.
- Parameters:
schema – [in] the arrow::Schema for each batch
batches – [in] a std::vector of record batches
Construct a Table from a chunked StructArray.
One column will be produced for each field of the StructArray.
- Parameters:
array – [in] a chunked StructArray
Construct a new table from multiple input tables.
The new table is assembled from existing column chunks without copying if the schemas are identical. If the schemas do not match exactly and unify_schemas is enabled in options (off by default), an attempt is made to unify them, and the column chunks are then converted to their respective unified datatype, which will probably incur a copy. arrow::PromoteTableToSchema is used to unify schemas.
Tables are concatenated in the order they are provided, and the order of rows within each table is preserved.
- Parameters:
tables – [in] a std::vector of Tables to be concatenated
options – [in] specify how to unify schema of input tables
memory_pool – [in] MemoryPool to be used if null-filled arrays need to be created or if existing column chunks need to undergo type conversion
- Returns:
new Table