pub struct BatchCoalescer {
schema: SchemaRef,
target_batch_size: usize,
in_progress_arrays: Vec<Box<dyn InProgressArray>>,
buffered_rows: usize,
completed: VecDeque<RecordBatch>,
}
Concatenate multiple `RecordBatch`es
Implements the common pattern of incrementally creating output
`RecordBatch`es of a specific size from an input stream of
`RecordBatch`es.
This is useful after operations such as `filter` and `take`, which often
produce smaller batches that we want to coalesce into larger batches for
further processing.
§Motivation
If we use `concat_batches` to implement the same functionality, there are
two potential issues:
- At least 2x peak memory (holding the input and output of concat)
- 2 copies of the data (to create the output of filter and then create the output of concat)
See https://github.com/apache/arrow-rs/issues/6692 for more discussion of the motivation.
§Example
use arrow_array::record_batch;
use arrow_select::coalesce::{BatchCoalescer};
let batch1 = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
let batch2 = record_batch!(("a", Int32, [4, 5])).unwrap();
// Create a `BatchCoalescer` that will produce batches of exactly 4 rows
let target_batch_size = 4;
let mut coalescer = BatchCoalescer::new(batch1.schema(), target_batch_size);
// push the batches
coalescer.push_batch(batch1).unwrap();
// only pushed 3 rows (not yet enough to produce a 4 row batch)
assert!(coalescer.next_completed_batch().is_none());
coalescer.push_batch(batch2).unwrap();
// now we have 5 rows, so we can produce a batch
let finished = coalescer.next_completed_batch().unwrap();
// 4 rows came out (target batch size is 4)
let expected = record_batch!(("a", Int32, [1, 2, 3, 4])).unwrap();
assert_eq!(finished, expected);
// Have no more input, but still have an in-progress batch
assert!(coalescer.next_completed_batch().is_none());
// We can finish the batch, which will produce the remaining rows
coalescer.finish_buffered_batch().unwrap();
let expected = record_batch!(("a", Int32, [5])).unwrap();
assert_eq!(coalescer.next_completed_batch().unwrap(), expected);
// The coalescer is now empty
assert!(coalescer.next_completed_batch().is_none());
§Background
Generally speaking, larger `RecordBatch`es are more efficient to process
than smaller `RecordBatch`es (until the CPU cache is exceeded) because
there is fixed processing overhead per batch. This coalescer builds up these
larger batches incrementally.
┌────────────────────┐
│ RecordBatch │
│ num_rows = 100 │
└────────────────────┘ ┌────────────────────┐
│ │
┌────────────────────┐ Coalesce │ │
│ │ Batches │ │
│ RecordBatch │ │ │
│ num_rows = 200 │ ─ ─ ─ ─ ─ ─ ▶ │ │
│ │ │ RecordBatch │
│ │ │ num_rows = 400 │
└────────────────────┘ │ │
│ │
┌────────────────────┐ │ │
│ │ │ │
│ RecordBatch │ │ │
│ num_rows = 100 │ └────────────────────┘
│ │
└────────────────────┘
§Notes:
- Output rows are produced in the same order as the input rows
- The output is a sequence of batches, with all but the last containing
  exactly `target_batch_size` rows
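The buffering behavior described in the notes above (input order preserved; every chunk but the last exactly the target size) can be sketched with a stdlib-only stand-in. The `MiniCoalescer` type below is hypothetical, buffering plain `i32` "rows" instead of Arrow arrays, and is not part of arrow-select:

```rust
use std::collections::VecDeque;

/// Minimal stand-in for BatchCoalescer that buffers plain i32 "rows".
struct MiniCoalescer {
    target: usize,
    buffer: Vec<i32>,
    completed: VecDeque<Vec<i32>>,
}

impl MiniCoalescer {
    fn new(target: usize) -> Self {
        Self { target, buffer: Vec::new(), completed: VecDeque::new() }
    }

    /// Append rows; emit a completed chunk whenever the buffer reaches `target`.
    fn push(&mut self, rows: &[i32]) {
        for &row in rows {
            self.buffer.push(row);
            if self.buffer.len() == self.target {
                self.completed.push_back(std::mem::take(&mut self.buffer));
            }
        }
    }

    /// Flush whatever is buffered as a final, possibly short, chunk.
    fn finish(&mut self) {
        if !self.buffer.is_empty() {
            self.completed.push_back(std::mem::take(&mut self.buffer));
        }
    }

    /// Remove and return the oldest completed chunk, if any.
    fn next_completed(&mut self) -> Option<Vec<i32>> {
        self.completed.pop_front()
    }
}

fn main() {
    let mut c = MiniCoalescer::new(4);
    c.push(&[1, 2, 3]);
    c.push(&[4, 5]);
    // First chunk is exactly 4 rows, in input order.
    assert_eq!(c.next_completed(), Some(vec![1, 2, 3, 4]));
    c.finish();
    // The final chunk holds the remainder and may be shorter.
    assert_eq!(c.next_completed(), Some(vec![5]));
    assert_eq!(c.next_completed(), None);
}
```

The real coalescer follows the same shape but appends into per-column `InProgressArray`s instead of a single `Vec`, so completed `RecordBatch`es are built without concatenating whole input batches.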
Fields§
§schema: SchemaRef
The input schema
§target_batch_size: usize
The target batch size (and thus size for views allocation). This is a
hard limit: the output batch will be exactly `target_batch_size` rows,
rather than possibly being slightly above.
§in_progress_arrays: Vec<Box<dyn InProgressArray>>
In-progress arrays
§buffered_rows: usize
Buffered row count. Always less than `target_batch_size`
§completed: VecDeque<RecordBatch>
Completed batches
Implementations§
impl BatchCoalescer
pub fn new(schema: SchemaRef, target_batch_size: usize) -> Self
Create a new BatchCoalescer
§Arguments
- `schema` - the schema of the output batches
- `target_batch_size` - the number of rows in each output batch.
  Typical values are 4096 or 8192 rows.
pub fn push_batch_with_filter(
    &mut self,
    batch: RecordBatch,
    filter: &BooleanArray,
) -> Result<(), ArrowError>
Push a batch into the Coalescer after applying a filter
This is semantically equivalent to calling Self::push_batch
with the results from filter_record_batch
§Example
use arrow_array::{record_batch, BooleanArray};
use arrow_select::coalesce::BatchCoalescer;
let batch1 = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
let batch2 = record_batch!(("a", Int32, [4, 5, 6])).unwrap();
// Apply a filter to each batch to pick the first and last row
let filter = BooleanArray::from(vec![true, false, true]);
// create a new Coalescer that targets creating 1000 row batches
let mut coalescer = BatchCoalescer::new(batch1.schema(), 1000);
coalescer.push_batch_with_filter(batch1, &filter).unwrap();
coalescer.push_batch_with_filter(batch2, &filter).unwrap();
// finish and retrieve the created batch
coalescer.finish_buffered_batch().unwrap();
let completed_batch = coalescer.next_completed_batch().unwrap();
// filtered out 2 and 5:
let expected_batch = record_batch!(("a", Int32, [1, 3, 4, 6])).unwrap();
assert_eq!(completed_batch, expected_batch);
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError>
Push all the rows from `batch` into the Coalescer
When buffered data plus incoming rows reach `target_batch_size`,
completed batches are generated eagerly and can be retrieved via
Self::next_completed_batch().
Output batches contain exactly `target_batch_size` rows, so the tail of
the input batch may remain buffered.
Remaining partial data either waits for future input batches or can be
materialized immediately by calling Self::finish_buffered_batch().
§Example
use arrow_array::record_batch;
use arrow_select::coalesce::BatchCoalescer;
let batch1 = record_batch!(("a", Int32, [1, 2, 3])).unwrap();
let batch2 = record_batch!(("a", Int32, [4, 5, 6])).unwrap();
// create a new Coalescer that targets creating 1000 row batches
let mut coalescer = BatchCoalescer::new(batch1.schema(), 1000);
coalescer.push_batch(batch1).unwrap();
coalescer.push_batch(batch2).unwrap();
// finish and retrieve the created batch
coalescer.finish_buffered_batch().unwrap();
let completed_batch = coalescer.next_completed_batch().unwrap();
let expected_batch = record_batch!(("a", Int32, [1, 2, 3, 4, 5, 6])).unwrap();
assert_eq!(completed_batch, expected_batch);
pub fn finish_buffered_batch(&mut self) -> Result<(), ArrowError>
Concatenates any buffered batches into a single `RecordBatch` and
clears any output buffers
Normally this is called when the input stream is exhausted and we want
to finalize the last batch of rows.
See Self::next_completed_batch() for the completed batches.
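At end of input, the usual pattern is flush, then drain. A minimal stdlib-only sketch of that pattern, with plain `Vec`s standing in for Arrow batches (not the arrow-select API):

```rust
fn main() {
    // Hypothetical state at end of input: one full completed chunk plus
    // some buffered, in-progress rows that never reached the target size.
    let mut completed: Vec<Vec<i32>> = vec![vec![1, 2, 3, 4]];
    let mut buffered: Vec<i32> = vec![5, 6];

    // finish_buffered_batch analogue: move the partial chunk into the
    // completed queue so no rows are lost.
    if !buffered.is_empty() {
        completed.push(std::mem::take(&mut buffered));
    }

    // next_completed_batch analogue: drain completed chunks in order.
    let mut rows = Vec::new();
    for chunk in completed.drain(..) {
        rows.extend(chunk);
    }
    assert_eq!(rows, vec![1, 2, 3, 4, 5, 6]);
    println!("drained {} rows", rows.len());
}
```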
pub fn has_completed_batch(&self) -> bool
Returns true if there are any completed batches
pub fn next_completed_batch(&mut self) -> Option<RecordBatch>
Removes and returns the next completed batch, if any.