Struct RecordBatch

pub struct RecordBatch {
    schema: SchemaRef,
    columns: Vec<Arc<dyn Array>>,
    row_count: usize,
}

A two-dimensional batch of column-oriented data with a defined schema.

A RecordBatch is a two-dimensional dataset of a number of contiguous arrays, each the same length. A record batch has a schema which must match its arrays’ datatypes.

Record batches are a convenient unit of work for serialization and computation functions, including ones that process data incrementally.

Use the record_batch! macro to create a RecordBatch from literal slices of values, which is useful for rapid prototyping and testing.

Example:

use arrow_array::record_batch;
let batch = record_batch!(
    ("a", Int32, [1, 2, 3]),
    ("b", Float64, [Some(4.0), None, Some(5.0)]),
    ("c", Utf8, ["alpha", "beta", "gamma"])
);

Fields§

§schema: SchemaRef

§columns: Vec<Arc<dyn Array>>

§row_count: usize

The number of rows in this RecordBatch.

This is stored separately from the columns so that a batch with no columns can still report a row count.
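As an illustration of why a separate count is useful, consider a selection that keeps zero columns. The sketch below is a simplified model using only the standard library; `MiniBatch`, `rows_from_columns`, and `select` are hypothetical names and not part of arrow_array.

```rust
/// Hypothetical stand-in for a batch: columns plus an explicit row count.
#[derive(Debug)]
pub struct MiniBatch {
    pub columns: Vec<Vec<i64>>,
    pub row_count: usize,
}

/// Row count derived from the columns alone; `None` when there are none.
pub fn rows_from_columns(columns: &[Vec<i64>]) -> Option<usize> {
    columns.first().map(Vec::len)
}

impl MiniBatch {
    /// Builds a batch, taking the row count from the first column (0 if empty).
    pub fn new(columns: Vec<Vec<i64>>) -> Self {
        let row_count = rows_from_columns(&columns).unwrap_or(0);
        Self { columns, row_count }
    }

    /// Keeps only the columns at `indices` and carries the row count over,
    /// so selecting zero columns does not lose it. Panics on a bad index.
    pub fn select(&self, indices: &[usize]) -> Self {
        Self {
            columns: indices.iter().map(|&i| self.columns[i].clone()).collect(),
            row_count: self.row_count,
        }
    }
}
```

In this model, selecting no columns from a three-row batch still yields a row count of 3, whereas the columns alone can no longer answer the question.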

Implementations§

impl RecordBatch

pub fn try_new( schema: SchemaRef, columns: Vec<ArrayRef>, ) -> Result<Self, ArrowError>

Creates a RecordBatch from a schema and columns.

Expects the following:

!columns.is_empty()
schema.fields.len() == columns.len()
schema.fields[i].data_type() == columns[i].data_type()
columns[i].len() == columns[j].len()

If the conditions are not met, an error is returned.
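The four conditions can be read as a single validation pass. The sketch below models them with plain standard-library types; `Kind`, `Column`, and `validate` are hypothetical stand-ins for DataType, ArrayRef, and the internal checks, and the error messages are invented for illustration.

```rust
/// Hypothetical stand-in for an Arrow data type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Int32,
    Utf8,
}

/// Hypothetical stand-in for a column: only its type and length matter here.
#[derive(Debug, Clone, Copy)]
pub struct Column {
    pub kind: Kind,
    pub len: usize,
}

/// Applies the four documented conditions and returns the shared row count.
pub fn validate(schema: &[Kind], columns: &[Column]) -> Result<usize, String> {
    // !columns.is_empty()
    let first = columns.first().ok_or("at least one column is required")?;
    // schema.fields.len() == columns.len()
    if schema.len() != columns.len() {
        return Err(format!("{} fields but {} columns", schema.len(), columns.len()));
    }
    for (i, (kind, col)) in schema.iter().zip(columns).enumerate() {
        // schema.fields[i].data_type() == columns[i].data_type()
        if *kind != col.kind {
            return Err(format!("column {}: expected {:?}, found {:?}", i, kind, col.kind));
        }
        // columns[i].len() == columns[j].len()
        if col.len != first.len {
            return Err(format!("column {}: length {} != {}", i, col.len, first.len));
        }
    }
    Ok(first.len)
}
```

Comparing every column against the first one is enough to establish that all lengths are pairwise equal.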

§Example


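use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);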
let schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false)
]);

let batch = RecordBatch::try_new(
    Arc::new(schema),
    vec![Arc::new(id_array)]
).unwrap();

pub unsafe fn new_unchecked( schema: SchemaRef, columns: Vec<Arc<dyn Array>>, row_count: usize, ) -> Self

Creates a RecordBatch from a schema and columns, without validation.

See Self::try_new for the checked version.

§Safety

Expects the following:

schema.fields.len() == columns.len()
schema.fields[i].data_type() == columns[i].data_type()
columns[i].len() == row_count

Note: if the schema does not match the underlying data exactly, it can lead to undefined behavior, for example, via conversion to a StructArray, which in turn could lead to incorrect access.

pub fn try_new_with_options( schema: SchemaRef, columns: Vec<ArrayRef>, options: &RecordBatchOptions, ) -> Result<Self, ArrowError>

Creates a RecordBatch from a schema and columns, with additional options, such as whether to strictly validate field names.

See RecordBatch::try_new for the expected conditions.

pub fn new_empty(schema: SchemaRef) -> Self

Creates a new empty RecordBatch.

fn try_new_impl( schema: SchemaRef, columns: Vec<ArrayRef>, options: &RecordBatchOptions, ) -> Result<Self, ArrowError>

Validates the schema and columns using RecordBatchOptions. Returns an error if any validation check fails; otherwise returns the created RecordBatch.

pub fn into_parts(self) -> (SchemaRef, Vec<ArrayRef>, usize)

Returns the schema, columns, and row count of this RecordBatch.

pub fn with_schema(self, schema: SchemaRef) -> Result<Self, ArrowError>

Overrides the schema of this RecordBatch.

Returns an error if schema is not a superset of the current schema, as determined by Schema::contains.

pub fn schema(&self) -> SchemaRef

Returns the Schema of the record batch.

pub fn schema_ref(&self) -> &SchemaRef

Returns a reference to the Schema of the record batch.

pub fn schema_metadata_mut(&mut self) -> &mut HashMap<String, String>

Mutable access to the metadata of the schema.

This allows you to modify Schema::metadata of Self::schema in a convenient and fast way.

Note that this clones the entire underlying Schema object if it is currently shared.

§Example

use arrow_array::record_batch;
let mut batch = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
// Initially, the metadata is empty
assert!(batch.schema().metadata().get("key").is_none());
// Insert a key-value pair into the metadata
batch.schema_metadata_mut().insert("key".into(), "value".into());
assert_eq!(batch.schema().metadata().get("key"), Some(&String::from("value")));
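The note about cloning a shared schema describes copy-on-write behavior. A minimal sketch of that pattern with the standard library's `Arc::make_mut` follows; `MiniSchema`, `metadata_mut`, and `insert_and_detect_clone` are hypothetical names, and whether RecordBatch is implemented exactly this way is an assumption.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a schema that only carries metadata.
#[derive(Debug, Clone, Default)]
pub struct MiniSchema {
    pub metadata: HashMap<String, String>,
}

/// Returns mutable metadata, cloning the schema first only when another
/// `Arc` still points at it (copy-on-write via `Arc::make_mut`).
pub fn metadata_mut(schema: &mut Arc<MiniSchema>) -> &mut HashMap<String, String> {
    &mut Arc::make_mut(schema).metadata
}

/// Inserts one entry and reports whether the write forced a clone, i.e.
/// whether the `Arc` now points at a different allocation.
pub fn insert_and_detect_clone(schema: &mut Arc<MiniSchema>, key: &str, value: &str) -> bool {
    let before = Arc::as_ptr(schema);
    metadata_mut(schema).insert(key.to_string(), value.to_string());
    before != Arc::as_ptr(schema)
}
```

In this model, a uniquely owned schema is modified in place, while a schema that is also held elsewhere is copied first, so the other holder never observes the change.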

pub fn project(&self, indices: &[usize]) -> Result<RecordBatch, ArrowError>

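Returns a new RecordBatch containing only the columns at the given indices, with the schema projected to match. Returns an error if an index is out of bounds.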
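Read as selecting fields and columns by position, a projection can be modeled as below. This is a standard-library sketch, not the arrow_array implementation; the free function `project`, its parameters, and the error text are hypothetical, and keeping the columns in the order the indices are given is an assumption of the model.

```rust
/// Selects the columns at `indices`, in the order given, pairing each with
/// its field name. Returns an error for any out-of-range index.
pub fn project<T: Clone>(
    names: &[&str],
    columns: &[T],
    indices: &[usize],
) -> Result<Vec<(String, T)>, String> {
    indices
        .iter()
        .map(|&i| match (names.get(i), columns.get(i)) {
            (Some(name), Some(col)) => Ok((name.to_string(), col.clone())),
            _ => Err(format!("index {} out of bounds for {} columns", i, columns.len())),
        })
        .collect()
}
```

Collecting into a `Result` stops at the first invalid index, so no partially projected output is returned.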

pub fn normalize( &self, separator: &str, max_level: Option<usize>, ) -> Result<Self, ArrowError>

Normalize a semi-structured RecordBatch into a flat table.

Nested Fields generate names joined by separator, up to a depth of max_level (unlimited if None).

e.g. given a RecordBatch with schema:

    "foo": StructArray<"bar": Utf8>

A separator of "." would generate a batch with the schema:

    "foo.bar": Utf8

Note that giving a depth of Some(0) to max_level is the same as passing in None; it will be treated as unlimited.

§Example

use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch, StringArray, StructArray};
use arrow_schema::{DataType, Field, Fields, Schema};

let animals: ArrayRef = Arc::new(StringArray::from(vec!["Parrot", ""]));
let n_legs: ArrayRef = Arc::new(Int64Array::from(vec![Some(2), Some(4)]));

let animals_field = Arc::new(Field::new("animals", DataType::Utf8, true));
let n_legs_field = Arc::new(Field::new("n_legs", DataType::Int64, true));

// `animals` and `n_legs` are already `ArrayRef`s, so they can be cloned directly.
let a = Arc::new(StructArray::from(vec![
    (animals_field.clone(), animals.clone()),
    (n_legs_field.clone(), n_legs.clone()),
]));

let schema = Schema::new(vec![
    Field::new(
        "a",
        DataType::Struct(Fields::from(vec![animals_field, n_legs_field])),
        false,
    )
]);

let normalized = RecordBatch::try_new(Arc::new(schema), vec![a])
    .expect("valid conversion")
    .normalize(".", None)
    .expect("valid normalization");

let expected = RecordBatch::try_from_iter_with_nullable(vec![
    ("a.animals", animals.clone(), true),
    ("a.n_legs", n_legs.clone(), true),
])
.expect("valid conversion");

assert_eq!(expected, normalized);
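The naming rule on its own can be sketched without any Arrow types. In the model below, `Node`, `flatten_names`, and `walk` are hypothetical names. Treating `None` and `Some(0)` as unlimited follows the text above; the reading that a limit of n expands n levels of nesting is an assumption.

```rust
/// Hypothetical stand-in for a field: either a leaf or a struct of children.
pub enum Node {
    Leaf(&'static str),
    Struct(&'static str, Vec<Node>),
}

/// Flattens nested field names, joining path segments with `separator`.
/// `None` and `Some(0)` both mean "no depth limit".
pub fn flatten_names(fields: &[Node], separator: &str, max_level: Option<usize>) -> Vec<String> {
    let limit = match max_level {
        None | Some(0) => usize::MAX,
        Some(n) => n,
    };
    let mut out = Vec::new();
    let mut path = Vec::new();
    for field in fields {
        walk(field, &mut path, 0, limit, separator, &mut out);
    }
    out
}

fn walk(
    node: &Node,
    path: &mut Vec<&'static str>,
    depth: usize,
    limit: usize,
    separator: &str,
    out: &mut Vec<String>,
) {
    match node {
        // Expand a struct only while the depth limit has not been reached.
        Node::Struct(name, children) if depth < limit => {
            path.push(*name);
            for child in children {
                walk(child, path, depth + 1, limit, separator, out);
            }
            path.pop();
        }
        // Leaves, and structs beyond the limit, become a single output column.
        Node::Leaf(name) | Node::Struct(name, _) => {
            path.push(*name);
            out.push(path.join(separator));
            path.pop();
        }
    }
}
```

With a "." separator and no limit, a struct "a" holding "animals" and "n_legs" yields the names "a.animals" and "a.n_legs", matching the schema in the example above.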

pub fn num_columns(&self) -> usize

Returns the number of columns in the record batch.

§Example


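use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);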
let schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false)
]);

let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(id_array)]).unwrap();

assert_eq!(batch.num_columns(), 1);

pub fn num_rows(&self) -> usize

Returns the number of rows in each column.

§Example


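use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);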
let schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false)
]);

let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(id_array)]).unwrap();

assert_eq!(batch.num_rows(), 5);

pub fn column(&self, index: usize) -> &ArrayRef

Get a reference to a column’s array by index.

§Panics

Panics if index is outside of 0..num_columns.

pub fn column_by_name(&self, name: &str) -> Option<&ArrayRef>

Get a reference to a column’s array by name.

pub fn columns(&self) -> &[ArrayRef]

Get a reference to all columns in the record batch.

pub fn remove_column(&mut self, index: usize) -> ArrayRef

Removes the column at index and returns it as an ArrayRef.

§Panics

Panics if index is out of bounds.

§Example

use std::sync::Arc;
use arrow_array::{BooleanArray, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
let bool_array = BooleanArray::from(vec![true, false, false, true, true]);
let schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("bool", DataType::Boolean, false),
]);

let mut batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(id_array), Arc::new(bool_array)]).unwrap();

let removed_column = batch.remove_column(0);
assert_eq!(removed_column.as_any().downcast_ref::<Int32Array>().unwrap(), &Int32Array::from(vec![1, 2, 3, 4, 5]));
assert_eq!(batch.num_columns(), 1);

pub fn slice(&self, offset: usize, length: usize) -> RecordBatch

Returns a new RecordBatch in which each column is sliced according to offset and length.

§Panics

Panics if offset + length is greater than the column length.
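The bounds rule and the per-column slicing can be modeled as a window over a shared buffer. The sketch below uses only the standard library; `WindowedColumn` and `slice_batch` are hypothetical names, and sharing the buffer rather than copying it is an assumption of this model.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a column: a shared buffer plus a window into it.
#[derive(Debug, Clone)]
pub struct WindowedColumn {
    buffer: Arc<Vec<i32>>,
    offset: usize,
    len: usize,
}

impl WindowedColumn {
    pub fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { buffer: Arc::new(values), offset: 0, len }
    }

    /// Returns a narrower window over the same buffer; no values are copied.
    /// Panics if `offset + length` exceeds the current length.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(
            offset.checked_add(length).map_or(false, |end| end <= self.len),
            "offset + length exceeds column length"
        );
        Self {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len: length,
        }
    }

    /// The values currently visible through the window.
    pub fn values(&self) -> &[i32] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Applies the same window to every column, as a batch-level slice would.
pub fn slice_batch(columns: &[WindowedColumn], offset: usize, length: usize) -> Vec<WindowedColumn> {
    columns.iter().map(|c| c.slice(offset, length)).collect()
}
```

Because a slice is relative to the current window, slicing an already sliced column composes the offsets.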

pub fn try_from_iter<I, F>(value: I) -> Result<Self, ArrowError>
where I: IntoIterator<Item = (F, ArrayRef)>, F: AsRef<str>,

Create a RecordBatch from an iterable list of pairs of the form (field_name, array), with the same requirements on fields and arrays as RecordBatch::try_new. This method is often used to create a single RecordBatch from arrays, e.g. for testing.

Each column in the resulting schema is marked as nullable if the array for that column has any nulls. To specify nullability explicitly, use RecordBatch::try_from_iter_with_nullable.

Example:


use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray};

let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2]));
let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));

let record_batch = RecordBatch::try_from_iter(vec![
  ("a", a),
  ("b", b),
]);
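The nullability rule for the inferred schema can be sketched with plain `Option` values standing in for array slots. `infer_nullability` is a hypothetical helper, not part of arrow_array.

```rust
/// Builds (name, nullable) pairs following the rule in the text: a field is
/// marked nullable only when its column actually contains a null (`None`).
pub fn infer_nullability<T>(columns: &[(&str, Vec<Option<T>>)]) -> Vec<(String, bool)> {
    columns
        .iter()
        .map(|(name, values)| (name.to_string(), values.iter().any(Option::is_none)))
        .collect()
}
```

One consequence of inferring nullability from the data is that two batches built from different values can end up with different schemas, which is a reason to prefer the explicit variant when a stable schema matters.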

Another way to quickly create a RecordBatch is to use the record_batch! macro, which is particularly helpful for rapid prototyping and testing.

Example:

use arrow_array::record_batch;
let batch = record_batch!(
    ("a", Int32, [1, 2, 3]),
    ("b", Float64, [Some(4.0), None, Some(5.0)]),
    ("c", Utf8, ["alpha", "beta", "gamma"])
);

pub fn try_from_iter_with_nullable<I, F>(value: I) -> Result<Self, ArrowError>
where I: IntoIterator<Item = (F, ArrayRef, bool)>, F: AsRef<str>,

Create a RecordBatch from an iterable list of tuples of the form (field_name, array, nullable), with the same requirements on fields and arrays as RecordBatch::try_new. This method is often used to create a single RecordBatch from arrays, e.g. for testing.

Example:


use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray};

let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2]));
let b: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), Some("b")]));

// Note that neither `a` nor `b` has any actual nulls, but we mark
// `b` as nullable
let record_batch = RecordBatch::try_from_iter_with_nullable(vec![
  ("a", a, false),
  ("b", b, true),
]);

pub fn claim(&self, pool: &dyn MemoryPool)

Registers all buffers in this record batch with the provided MemoryPool.

This claims memory for all columns in the batch by calling Array::claim on each column.

pub fn get_array_memory_size(&self) -> usize

Returns the total number of bytes of memory occupied physically by this batch.

Note that this may overestimate the actual memory usage of a RecordBatch: multiple columns can share the same buffers, or slices of them, so the memory used by shared buffers may be counted more than once.
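The double counting can be reproduced in a small standard-library model. `BufColumn`, `naive_size`, and `deduplicated_size` are hypothetical names; the model counts buffer length in bytes, which may differ from how arrow_array measures a buffer.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a column backed by one shared byte buffer.
pub struct BufColumn {
    pub buffer: Arc<Vec<u8>>,
}

/// Sums buffer sizes column by column, so a buffer shared by two columns is
/// counted twice (the overestimate described above).
pub fn naive_size(columns: &[BufColumn]) -> usize {
    columns.iter().map(|c| c.buffer.len()).sum()
}

/// Counts each distinct allocation once by tracking buffer addresses.
pub fn deduplicated_size(columns: &[BufColumn]) -> usize {
    let mut seen = HashSet::new();
    columns
        .iter()
        .filter(|c| seen.insert(Arc::as_ptr(&c.buffer)))
        .map(|c| c.buffer.len())
        .sum()
}
```

With two columns sharing a 100-byte buffer and a third owning 50 bytes, the per-column sum reports 250 bytes while only 150 are actually allocated.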