Two-dimensional Datasets
Record Batches
class arrow::RecordBatch
Collection of equal-length arrays matching a particular Schema.
A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.
Public Functions
Result<std::shared_ptr<StructArray>> ToStructArray() const
Convert record batch to struct array.
Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.
bool Equals(const RecordBatch &other, bool check_metadata = false) const
Determine if two record batches are exactly equal.
- Parameters
[in] other – the RecordBatch to compare with
[in] check_metadata – if true, check that Schema metadata is the same
- Returns
true if batches are equal
bool ApproxEquals(const RecordBatch &other) const
Determine if two record batches are approximately equal.
virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0
Retrieve all columns at once.
virtual std::shared_ptr<Array> column(int i) const = 0
Retrieve an array from the record batch.
- Parameters
[in] i – field index, not bounds-checked
- Returns
an Array object
std::shared_ptr<Array> GetColumnByName(const std::string &name) const
Retrieve an array from the record batch.
- Parameters
[in] name – field name
- Returns
an Array or null if no field was found
virtual std::shared_ptr<ArrayData> column_data(int i) const = 0
Retrieve an array’s internal data from the record batch.
- Parameters
[in] i – field index, not bounds-checked
- Returns
an internal ArrayData object
virtual const ArrayDataVector &column_data() const = 0
Retrieve all arrays’ internal data from the record batch.
AddColumn
Add column to the record batch, producing a new RecordBatch.
- Parameters
[in] i – field index, which is bounds-checked
[in] field – field to be added
[in] column – column to be added
AddColumn
Add new nullable column to the record batch, producing a new RecordBatch.
For non-nullable columns, use the Field-based version of this method.
- Parameters
[in] i – field index, which is bounds-checked
[in] field_name – name of field to be added
[in] column – column to be added
SetColumn
Replace a column in the record batch, producing a new RecordBatch.
- Parameters
[in] i – field index, which is bounds-checked
[in] field – the field that replaces the existing one
[in] column – the column that replaces the existing one
virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0
Remove column from the record batch, producing a new RecordBatch.
- Parameters
[in] i – field index, which is bounds-checked
const std::string &column_name(int i) const
Name of the i-th column.
int num_columns() const
- Returns
the number of columns in the record batch
inline int64_t num_rows() const
- Returns
the number of rows (the corresponding length of each column)
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const
Slice each of the arrays in the record batch.
- Parameters
[in] offset – the starting offset to slice, through end of batch
- Returns
new record batch
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0
Slice each of the arrays in the record batch.
- Parameters
[in] offset – the starting offset to slice
[in] length – the number of elements to slice from offset
- Returns
new record batch
std::string ToString() const
- Returns
PrettyPrint representation suitable for debugging
Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const
Return new record batch with specified columns.
Public Static Functions
Make
- Parameters
[in] schema – The record batch schema
[in] num_rows – length of fields in the record batch. Each array should have the same length as num_rows
[in] columns – the record batch fields as vector of arrays
Make
Construct record batch from vector of internal data structures.
This function is intended for internal use, or for advanced users.
- Since
0.5.0
- Parameters
schema – the record batch schema
num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field
columns – the data for the batch’s columns
FromStructArray
Construct record batch from struct array.
This constructs a record batch using the child arrays of the given array, which must be a struct array. Note that the struct array’s own null bitmap is not reflected in the resulting record batch.
class arrow::RecordBatchReader
Abstract interface for reading a stream of record batches.
Subclassed by arrow::csv::StreamingReader, arrow::ipc::RecordBatchStreamReader, arrow::py::PyRecordBatchReader, arrow::TableBatchReader
Public Functions
virtual std::shared_ptr<Schema> schema() const = 0
- Returns
the shared schema of the record batches in the stream
ReadNext
Read the next record batch in the stream.
Return null for batch when reaching end of stream.
- Parameters
[out] batch – the next loaded batch, null at end of stream
- Returns
Status
inline Result<std::shared_ptr<RecordBatch>> Next()
Iterator interface.
ReadAll
Read all batches and concatenate as arrow::Table.
Public Static Functions
Make
Create a RecordBatchReader from a vector of RecordBatch.
- Parameters
[in] batches – the vector of RecordBatch to read from
[in] schema – schema to conform to. Will be inferred from the first element if not provided.
class arrow::TableBatchReader : public arrow::RecordBatchReader
Compute a stream of record batches from a (possibly chunked) Table.
The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.
Public Functions
explicit TableBatchReader(const Table &table)
Construct a TableBatchReader for the given table.
virtual std::shared_ptr<Schema> schema() const override
- Returns
the shared schema of the record batches in the stream
ReadNext
Read the next record batch in the stream.
Return null for batch when reaching end of stream.
- Parameters
[out] batch – the next loaded batch, null at end of stream
- Returns
Status
void set_chunksize(int64_t chunksize)
Set the desired maximum chunk size of record batches.
The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.
Tables
class arrow::Table
Logical table as sequence of chunked arrays.
Public Functions
virtual std::shared_ptr<ChunkedArray> column(int i) const = 0
Return a column by index.
virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0
Return vector of all columns for table.
virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0
Construct a zero-copy slice of the table with the indicated offset and length.
- Parameters
[in] offset – the index of the first row in the constructed slice
[in] length – the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly
- Returns
a new object wrapped in std::shared_ptr<Table>
inline std::shared_ptr<Table> Slice(int64_t offset) const
Slice from first row at offset until end of the table.
inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const
Return a column by name.
- Parameters
[in] name – field name
- Returns
a ChunkedArray or null if no field was found
virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0
Remove column from the table, producing a new Table.
AddColumn
Add column to the table, producing a new Table.
SetColumn
Replace a column in the table, producing a new Table.
std::vector<std::string> ColumnNames() const
Return names of all columns.
Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const
Rename columns with provided names.
Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const
Return new table with specified columns.
ReplaceSchemaMetadata
Replace schema key-value metadata with new metadata.
- Since
0.5.0
- Parameters
[in] metadata – new KeyValueMetadata
- Returns
new Table
virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0
Flatten the table, producing a new Table.
Any column with a struct type will be flattened into multiple columns.
- Parameters
[in] pool – The pool for buffer allocations, if any
std::string ToString() const
- Returns
PrettyPrint representation suitable for debugging
virtual Status Validate() const = 0
Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.
This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.
- Returns
Status
virtual Status ValidateFull() const = 0
Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.
This is O(k*n) where k is the total number of field descendants, and n is the number of rows.
- Returns
Status
inline int num_columns() const
Return the number of columns in the table.
inline int64_t num_rows() const
Return the number of rows (equal to each column’s logical length).
bool Equals(const Table &other, bool check_metadata = false) const
Determine if tables are equal.
Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.
Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const
Make a new table by combining the chunks this table has.
All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.
- Parameters
[in] pool – The pool for buffer allocations
Public Static Functions
Make
Construct a Table from schema and columns.
If columns is zero-length, the table’s number of rows is zero.
- Parameters
[in] schema – The table schema (column types)
[in] columns – The table’s columns as chunked arrays
[in] num_rows – number of rows in table, -1 (default) to infer from columns
Make
Construct a Table from schema and arrays.
- Parameters
[in] schema – The table schema (column types)
[in] arrays – The table’s columns as arrays
[in] num_rows – number of rows in table, -1 (default) to infer from columns
static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)
Construct a Table from a RecordBatchReader.
- Parameters
[in] reader – the RecordBatchReader to read the batches from
FromRecordBatches
Construct a Table from RecordBatches, using the schema supplied by the first RecordBatch.
- Parameters
[in] batches – a std::vector of record batches
FromRecordBatches
Construct a Table from RecordBatches, using the supplied schema.
There may be zero record batches.
- Parameters
[in] schema – the arrow::Schema for each batch
[in] batches – a std::vector of record batches
FromChunkedStructArray
Construct a Table from a chunked StructArray.
One column will be produced for each field of the StructArray.
- Parameters
[in] array – a chunked StructArray
ConcatenateTables
Construct table from multiple input tables.