Two-dimensional Datasets
Record Batches
- 
class RecordBatch
- Collection of equal-length arrays matching a particular Schema. - A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array. - Public Functions - 
Result<std::shared_ptr<StructArray>> ToStructArray() const#
- Convert record batch to struct array. - Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array. 
 - 
Result<std::shared_ptr<Tensor>> ToTensor(bool null_to_nan = false, bool row_major = true, MemoryPool *pool = default_memory_pool()) const#
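- Convert record batch with one data type to Tensor. - Create a Tensor object with shape (number of rows, number of columns). The strides (type size in bytes, type size in bytes * number of rows) describe a column-major layout. Note that the row_major parameter in the signature defaults to true, so this stride description presumably applies when row_major is false. 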
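The stride formula above can be sketched in plain C++. This is a minimal illustration, not Arrow code: `TensorStrides` is a hypothetical helper, and the row-major branch is the standard row-major formula, assumed here to be what `row_major = true` selects.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Byte strides for a 2-D tensor of shape (num_rows, num_columns) whose
// elements are `type_size` bytes wide. Hypothetical helper, not an Arrow API.
std::array<int64_t, 2> TensorStrides(int64_t num_rows, int64_t num_columns,
                                     int64_t type_size, bool row_major) {
  assert(num_rows >= 0 && num_columns >= 0 && type_size > 0);
  if (row_major) {
    // Row-major: consecutive elements of a row are adjacent in memory.
    return {type_size * num_columns, type_size};
  }
  // Column-major: consecutive elements of a column are adjacent in memory,
  // matching (type size, type size * number of rows) from the text above.
  return {type_size, type_size * num_rows};
}
```

For a batch of 4 rows and 3 `double` columns, the column-major strides come out as (8, 32) and the row-major strides as (24, 8).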
 - 
bool Equals(const RecordBatch &other, bool check_metadata = false, const EqualOptions &opts = EqualOptions::Defaults()) const#
- Determine if two record batches are exactly equal. - Parameters:
- other – [in] the RecordBatch to compare with 
- check_metadata – [in] if true, check that Schema metadata is the same 
- opts – [in] the options for equality comparisons 
 
- Returns:
- true if batches are equal 
 
 - 
bool ApproxEquals(const RecordBatch &other, const EqualOptions &opts = EqualOptions::Defaults()) const#
- Determine if two record batches are approximately equal. - Parameters:
- other – [in] the RecordBatch to compare with 
- opts – [in] the options for equality comparisons 
 
- Returns:
- true if batches are approximately equal 
 
 - Replace the schema with another schema with the same types, but potentially different field names and/or metadata. 
 - 
virtual const std::vector<std::shared_ptr<Array>> &columns() const = 0#
- Retrieve all columns at once. 
 - 
virtual std::shared_ptr<Array> column(int i) const = 0#
- Retrieve an array from the record batch. - Parameters:
- i – [in] field index, does not boundscheck 
- Returns:
- an Array object 
 
 - 
std::shared_ptr<Array> GetColumnByName(const std::string &name) const#
- Retrieve an array from the record batch. - Parameters:
- name – [in] field name 
- Returns:
- an Array or null if no field was found 
 
 - 
virtual std::shared_ptr<ArrayData> column_data(int i) const = 0#
- Retrieve an array’s internal data from the record batch. - Parameters:
- i – [in] field index, does not boundscheck 
- Returns:
- an internal ArrayData object 
 
 - 
virtual const ArrayDataVector &column_data() const = 0#
- Retrieve all arrays’ internal data from the record batch. 
 - Add column to the record batch, producing a new RecordBatch. - Parameters:
- i – [in] field index, which will be boundschecked 
- field – [in] field to be added 
- column – [in] column to be added 
 
 
 - Add new nullable column to the record batch, producing a new RecordBatch. - For non-nullable columns, use the Field-based version of this method. - Parameters:
- i – [in] field index, which will be boundschecked 
- field_name – [in] name of field to be added 
- column – [in] column to be added 
 
 
 - Replace a column in the record batch, producing a new RecordBatch. - Parameters:
- i – [in] field index, does boundscheck 
- field – [in] field to be replaced 
- column – [in] column to be replaced 
 
 
 - 
virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0#
- Remove column from the record batch, producing a new RecordBatch. - Parameters:
- i – [in] field index, does boundscheck 
 
 - 
const std::string &column_name(int i) const
- Name of the i-th column. 
 - 
int num_columns() const#
- Returns:
- the number of columns in the record batch 
 
 - 
inline int64_t num_rows() const#
- Returns:
- the number of rows (the corresponding length of each column) 
 
 - Copy the entire RecordBatch to destination MemoryManager. - This uses Array::CopyTo on each column of the record batch to create a new record batch where all underlying buffers for the columns have been copied to the destination MemoryManager. This uses MemoryManager::CopyBuffer under the hood. 
 - View or Copy the entire RecordBatch to destination MemoryManager. - This uses Array::ViewOrCopyTo on each column of the record batch to create a new record batch where all underlying buffers for the columns have been zero-copy viewed on the destination MemoryManager, falling back to performing a copy if it can’t be viewed as a zero-copy buffer. This uses Buffer::ViewOrCopy under the hood. 
 - 
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const#
- Slice each of the arrays in the record batch. - Parameters:
- offset – [in] the starting offset to slice, through end of batch 
- Returns:
- new record batch 
 
 - 
virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0#
- Slice each of the arrays in the record batch. - Parameters:
- offset – [in] the starting offset to slice 
- length – [in] the number of elements to slice from offset 
 
- Returns:
- new record batch 
 
 - 
std::string ToString() const#
- Returns:
- PrettyPrint representation suitable for debugging 
 
 - 
std::vector<std::string> ColumnNames() const#
- Return names of all columns. 
 - 
Result<std::shared_ptr<RecordBatch>> RenameColumns(const std::vector<std::string> &names) const#
- Rename columns with provided names. 
 - 
Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int> &indices) const#
- Return new record batch with specified columns. 
 - 
virtual Status Validate() const#
- Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data. - This is O(k) where k is the total number of fields and array descendants. - Returns:
- Status 
 
 - 
virtual Status ValidateFull() const#
- Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data. - This is potentially O(k*n) where k is the total number of fields and array descendants, and n is the number of rows. - Returns:
- Status 
 
 - 
virtual const std::shared_ptr<Device::SyncEvent> &GetSyncEvent() const = 0#
- EXPERIMENTAL: Return a top-level sync event object for this record batch. - If all of the data for this record batch is in CPU memory, then this will return null. If the data for this batch is on a device, then if synchronization is needed before accessing the data the returned sync event will allow for it. - Returns:
- null or a Device::SyncEvent 
 
 - 
Result<std::shared_ptr<Array>> MakeStatisticsArray(MemoryPool *pool = default_memory_pool()) const#
- Create a statistics array of this record batch. - The created array follows the C data interface statistics specification. See https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html for details. - Parameters:
- pool – [in] the memory pool to allocate memory from 
- Returns:
- the statistics array of this record batch 
 
 - Public Static Functions - Parameters:
- schema – [in] The record batch schema 
- num_rows – [in] length of fields in the record batch. Each array should have the same length as num_rows 
- columns – [in] the record batch fields as vector of arrays 
- sync_event – [in] optional synchronization event for non-CPU device memory used by buffers 
 
 
 - Construct record batch from vector of internal data structures. - This class is intended for internal use, or advanced users. - Since
- 0.5.0 
 - Parameters:
- schema – the record batch schema 
- num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field 
- columns – the data for the batch’s columns 
- device_type – the type of the device that the Arrow columns are allocated on 
- sync_event – optional synchronization event for non-CPU device memory used by buffers 
 
 
 - Create an empty RecordBatch of a given schema. - The output RecordBatch will be created with DataTypes from the given schema. - Parameters:
- schema – [in] the schema of the empty RecordBatch 
- pool – [in] the memory pool to allocate memory from 
 
- Returns:
- the resulting RecordBatch 
 
 - Construct record batch from struct array. - This constructs a record batch using the child arrays of the given array, which must be a struct array. - This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not. - Parameters:
- array – [in] the source array, must be a StructArray 
- pool – [in] the memory pool to allocate new validity bitmaps 
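Why pushing the validity bitmap into the children needs an allocation can be sketched as follows. This is a simplified model under stated assumptions: `std::vector<bool>` stands in for an Arrow validity bitmap, and `PushDownValidity` is a hypothetical name, not an Arrow function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A child slot stays valid only if both the parent struct slot and the
// child slot are valid. The result is a newly built bitmap, which is why
// this step cannot be zero-copy.
std::vector<bool> PushDownValidity(const std::vector<bool>& parent,
                                   const std::vector<bool>& child) {
  assert(parent.size() == child.size());
  std::vector<bool> out(child.size());
  for (std::size_t i = 0; i < child.size(); ++i) {
    out[i] = parent[i] && child[i];
  }
  return out;
}
```

Pushing an offset, by contrast, only changes where a child view starts, so no buffer has to be rewritten.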
 
 
 
- 
class RecordBatchReader
- Abstract interface for reading stream of record batches. - Subclassed by arrow::TableBatchReader, arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader - Public Functions - 
virtual std::shared_ptr<Schema> schema() const = 0#
- Returns:
- the shared schema of the record batches in the stream 
 
 - Read the next record batch in the stream. - Return null for batch when reaching end of stream. - Example:

    while (true) {
      std::shared_ptr<RecordBatch> batch;
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (!batch) {
        break;
      }
      // Handle `batch` here; `batch->num_rows()` might be 0.
    }

 - Parameters:
- batch – [out] the next loaded batch, null at end of stream. Returning an empty batch doesn’t mean the end of stream because it is valid data. 
- Returns:
 
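The end-of-stream contract (a null batch ends the stream, an empty batch does not) can be modeled with toy types. `ToyBatch`, `ToyReader`, `StreamTotals` and `Drain` are hypothetical stand-ins written for this sketch; they are not Arrow classes and omit the Status-based error handling of the real interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; these are not Arrow types.
struct ToyBatch {
  int64_t num_rows;
};

class ToyReader {
 public:
  explicit ToyReader(std::vector<int64_t> row_counts)
      : row_counts_(std::move(row_counts)) {}

  // Mirrors the documented contract: *batch is null at end of stream.
  void ReadNext(std::shared_ptr<ToyBatch>* batch) {
    assert(batch != nullptr);
    if (pos_ == row_counts_.size()) {
      batch->reset();
      return;
    }
    *batch = std::make_shared<ToyBatch>(ToyBatch{row_counts_[pos_++]});
  }

 private:
  std::vector<int64_t> row_counts_;
  std::size_t pos_ = 0;
};

struct StreamTotals {
  int64_t batches;
  int64_t rows;
};

// Reads until a null batch is returned, counting batches and rows.
StreamTotals Drain(ToyReader& reader) {
  StreamTotals totals{0, 0};
  while (true) {
    std::shared_ptr<ToyBatch> batch;
    reader.ReadNext(&batch);
    if (!batch) break;  // only a null batch ends the stream
    ++totals.batches;   // a zero-row batch still counts as data
    totals.rows += batch->num_rows;
  }
  return totals;
}
```

A stream with batches of 2, 0 and 3 rows is drained as three batches and five rows; stopping at the zero-row batch would have lost the last three rows.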
 - 
inline Result<std::shared_ptr<RecordBatch>> Next()#
- Iterator interface. 
 - 
inline virtual DeviceAllocationType device_type() const#
- EXPERIMENTAL: Get the device type for record batches this reader produces. - The default implementation returns DeviceAllocationType::kCPU. 
 - 
inline RecordBatchReaderIterator begin()#
- Return an iterator to the first record batch in the stream. 
 - 
inline RecordBatchReaderIterator end()#
- Return an iterator to the end of the stream. 
 - 
Result<std::shared_ptr<Table>> ToTable()#
- Read all batches and concatenate as arrow::Table. 
 - Public Static Functions - Create a RecordBatchReader from a vector of RecordBatch. - Parameters:
- batches – [in] the vector of RecordBatch to read from 
- schema – [in] schema to conform to. Will be inferred from the first element if not provided. 
- device_type – [in] the type of device that the batches are allocated on 
 
 
 - Create a RecordBatchReader from an Iterator of RecordBatch. - Parameters:
- batches – [in] an iterator of RecordBatch to read from. 
- schema – [in] schema that each record batch in iterator will conform to. 
- device_type – [in] the type of device that the batches are allocated on 
 
 
 - 
class RecordBatchReaderIterator
 
- 
class TableBatchReader : public arrow::RecordBatchReader
- Compute a stream of record batches from a (possibly chunked) Table. - The conversion is zero-copy: each record batch is a view over a slice of the table’s columns. - The table is expected to be valid prior to using it with the batch reader. - Public Functions - 
explicit TableBatchReader(const Table &table)#
- Construct a TableBatchReader for the given table. 
 - 
virtual std::shared_ptr<Schema> schema() const override#
- Returns:
- the shared schema of the record batches in the stream 
 
 - Read the next record batch in the stream. - Return null for batch when reaching end of stream. - Example:

    while (true) {
      std::shared_ptr<RecordBatch> batch;
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (!batch) {
        break;
      }
      // Handle `batch` here; `batch->num_rows()` might be 0.
    }

 - Parameters:
- batch – [out] the next loaded batch, null at end of stream. Returning an empty batch doesn’t mean the end of stream because it is valid data. 
- Returns:
 
 - 
void set_chunksize(int64_t chunksize)#
- Set the desired maximum number of rows for record batches. - The actual number of rows in each record batch may be smaller, depending on actual chunking characteristics of each table column. 
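Why batches can be smaller than the requested chunk size follows from the zero-copy property: a batch that is a view over the table's columns cannot cross a chunk boundary in any column. The sketch below models that rule with plain integers. `BatchLengths` is a hypothetical helper, and the assumption that every batch is cut at the nearest chunk boundary of any column is an interpretation of the text above, not a statement of the library's exact algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row counts of the batches produced when every batch must stay inside a
// single chunk of every column and hold at most `max_chunksize` rows.
// `columns[c]` lists the chunk lengths of column c; all columns are assumed
// to have the same total length.
std::vector<int64_t> BatchLengths(
    const std::vector<std::vector<int64_t>>& columns, int64_t max_chunksize) {
  assert(max_chunksize > 0);
  std::vector<int64_t> lengths;
  if (columns.empty()) return lengths;

  int64_t total = 0;
  for (int64_t len : columns[0]) total += len;

  std::vector<std::size_t> chunk(columns.size(), 0);  // current chunk index
  std::vector<int64_t> used(columns.size(), 0);       // rows consumed in it
  int64_t position = 0;
  while (position < total) {
    int64_t take = std::min(max_chunksize, total - position);
    for (std::size_t c = 0; c < columns.size(); ++c) {
      // Move past chunks that are exhausted or empty.
      while (used[c] == columns[c][chunk[c]]) {
        ++chunk[c];
        used[c] = 0;
      }
      take = std::min(take, columns[c][chunk[c]] - used[c]);
    }
    for (std::size_t c = 0; c < columns.size(); ++c) used[c] += take;
    lengths.push_back(take);
    position += take;
  }
  return lengths;
}
```

With one column chunked as 3 + 3 rows and another as 2 + 4 rows, a chunk size of 10 still yields batches of 2, 1 and 3 rows, because chunk boundaries fall at rows 2 and 3.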
 
Tables
- 
class Table
- Logical table as sequence of chunked arrays. - Public Functions - 
virtual std::shared_ptr<ChunkedArray> column(int i) const = 0#
- Return a column by index. 
 - 
virtual const std::vector<std::shared_ptr<ChunkedArray>> &columns() const = 0#
- Return vector of all columns for table. 
 - 
virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0#
- Construct a zero-copy slice of the table with the indicated offset and length. - Parameters:
- offset – [in] the index of the first row in the constructed slice 
- length – [in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly 
 
- Returns:
- a new object wrapped in std::shared_ptr<Table> 
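The length adjustment described for Slice can be sketched as a small calculation. `ClampedSliceLength` is a hypothetical helper for illustration. It assumes the offset is already within the table, since the text above only describes adjusting the length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of rows a slice actually covers when `length` rows are requested
// starting at `offset` in a table of `num_rows` rows.
int64_t ClampedSliceLength(int64_t num_rows, int64_t offset, int64_t length) {
  assert(offset >= 0 && offset <= num_rows);  // assumed precondition
  assert(length >= 0);
  return std::min(length, num_rows - offset);
}
```

Requesting 5 rows at offset 7 of a 10-row table therefore yields a 3-row slice.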
 
 - 
inline std::shared_ptr<Table> Slice(int64_t offset) const#
- Slice from first row at offset until end of the table. 
 - 
inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const#
- Return a column by name. - Parameters:
- name – [in] field name 
- Returns:
- a ChunkedArray, or null if no field was found 
 
 - 
virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0#
- Remove column from the table, producing a new Table. 
 - Add column to the table, producing a new Table. 
 - Replace a column in the table, producing a new Table. 
 - 
std::vector<std::string> ColumnNames() const#
- Return names of all columns. 
 - 
Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const#
- Rename columns with provided names. 
 - 
Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const#
- Return new table with specified columns. 
 - Replace schema key-value metadata with new metadata. - Since
- 0.5.0 
 - Parameters:
- metadata – [in] new KeyValueMetadata 
- Returns:
- new Table 
 
 - 
virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0#
- Flatten the table, producing a new Table. - Any column with a struct type will be flattened into multiple columns - Parameters:
- pool – [in] The pool for buffer allocations, if any 
 
 - 
std::string ToString() const#
- Returns:
- PrettyPrint representation suitable for debugging 
 
 - 
virtual Status Validate() const = 0#
- Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data. - This is O(k*m) where k is the total number of field descendants, and m is the number of chunks. - Returns:
- Status 
 
 - 
virtual Status ValidateFull() const = 0#
- Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data. - This is O(k*n) where k is the total number of field descendants, and n is the number of rows. - Returns:
- Status 
 
 - 
inline int num_columns() const#
- Return the number of columns in the table. 
 - 
inline int64_t num_rows() const#
- Return the number of rows (equal to each column’s logical length) 
 - 
bool Equals(const Table &other, bool check_metadata = false) const#
- Determine if tables are equal. - Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings. 
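Equality that ignores chunking can be modeled by comparing the logical sequence of values rather than the chunk layout. The sketch below uses nested `std::vector<int>` as a stand-in for a chunked column; `LogicalValues` and `LogicallyEqual` are hypothetical names, and a real implementation would compare without materializing copies.

```cpp
#include <vector>

using Chunks = std::vector<std::vector<int>>;

// Concatenates the chunks into the logical sequence of values they hold.
std::vector<int> LogicalValues(const Chunks& chunks) {
  std::vector<int> values;
  for (const auto& chunk : chunks) {
    values.insert(values.end(), chunk.begin(), chunk.end());
  }
  return values;
}

// Two chunked columns are equal when their logical values match,
// regardless of where the chunk boundaries fall.
bool LogicallyEqual(const Chunks& a, const Chunks& b) {
  return LogicalValues(a) == LogicalValues(b);
}
```

A column chunked as [1, 2] [3] compares equal to one chunked as [1] [2, 3].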
 - 
Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const#
- Make a new table by combining the chunks this table has. - All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk. - Parameters:
- pool – [in] The pool for buffer allocations 
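The "zero or one chunk" outcome can be sketched on a toy chunked column. `CombineColumnChunks` is a hypothetical function over nested vectors, not the Arrow method. It assumes that a column with no chunks stays chunkless while any other column ends up with exactly one chunk.

```cpp
#include <utility>
#include <vector>

using Chunks = std::vector<std::vector<int>>;

// Concatenates all chunks of a column into at most one chunk.
Chunks CombineColumnChunks(const Chunks& chunks) {
  Chunks out;
  if (chunks.empty()) return out;  // no chunks in, no chunks out
  std::vector<int> combined;
  for (const auto& chunk : chunks) {
    combined.insert(combined.end(), chunk.begin(), chunk.end());
  }
  out.push_back(std::move(combined));
  return out;
}
```

Unlike slicing, this step copies values into a new buffer, which is why the real method takes a memory pool.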
 
 - 
Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool *pool = default_memory_pool()) const#
- Make a new record batch by combining the chunks this table has. - All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk. - Parameters:
- pool – [in] The pool for buffer allocations 
 
 - Public Static Functions - Construct a Table from schema and columns. - If columns is zero-length, the table’s number of rows is zero - Parameters:
- schema – [in] The table schema (column types) 
- columns – [in] The table’s columns as chunked arrays 
- num_rows – [in] number of rows in table, -1 (default) to infer from columns 
 
 
 - Construct a Table from schema and arrays. - Parameters:
- schema – [in] The table schema (column types) 
- arrays – [in] The table’s columns as arrays 
- num_rows – [in] number of rows in table, -1 (default) to infer from columns 
 
 
 - Create an empty Table of a given schema. - The output Table will be created with a single empty chunk per column. 
 - 
static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)#
- Construct a Table from a RecordBatchReader. - Parameters:
- reader – [in] the arrow::RecordBatchReader that produces batches 
 
 - Construct a Table from RecordBatches, using schema supplied by the first RecordBatch. - Parameters:
- batches – [in] a std::vector of record batches 
 
 - Construct a Table from RecordBatches, using supplied schema. - There may be zero record batches - Parameters:
- schema – [in] the arrow::Schema for each batch 
- batches – [in] a std::vector of record batches 
 
 
 - Construct a Table from a chunked StructArray. - One column will be produced for each field of the StructArray. - Parameters:
- array – [in] a chunked StructArray 
 
 
- Construct a new table from multiple input tables. - The new table is assembled from existing column chunks without copying, if schemas are identical. If schemas do not match exactly and unify_schemas is enabled in options (off by default), an attempt is made to unify them, and then column chunks are converted to their respective unified datatype, which will probably incur a copy. arrow::PromoteTableToSchema is used to unify schemas. - Tables are concatenated in the order they are provided, and the order of rows within each table is preserved. - Parameters:
- tables – [in] a std::vector of Tables to be concatenated 
- options – [in] specify how to unify schema of input tables 
- memory_pool – [in] MemoryPool to be used if null-filled arrays need to be created or if existing column chunks need to undergo type conversion 
 
- Returns:
- new Table 
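The copy-free case (identical schemas) can be modeled as collecting chunk pointers in input order. In this sketch `Chunk`, `Column` and `ConcatenateColumn` are hypothetical stand-ins for one column of each input table; schema unification and type conversion are deliberately left out.

```cpp
#include <memory>
#include <vector>

using Chunk = std::shared_ptr<std::vector<int>>;
using Column = std::vector<Chunk>;

// Builds the output column from the chunks of each input column, in the
// order the inputs are given. Only pointers are copied, never values.
Column ConcatenateColumn(const std::vector<Column>& inputs) {
  Column out;
  for (const Column& column : inputs) {
    out.insert(out.end(), column.begin(), column.end());
  }
  return out;
}
```

The output column shares its chunks with the inputs, so row order within and across inputs is preserved without touching the underlying data.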
 