

Struct ByteBudgetChunker 

pub(crate) struct ByteBudgetChunker {
    page_byte_limit: usize,
    max_def_level: i16,
    static_always_fits: bool,
    dict_page_byte_limit: usize,
    static_dict_always_fits: bool,
}

Picks byte-budget-aware mini-batch sizes for one column.

The parquet column writer checks the data page byte limit only after each mini-batch finishes writing. Mini-batches are sized in rows (write_batch_size, default 1024), so for BYTE_ARRAY columns with large values (e.g. multi-MiB blobs) a single mini-batch can buffer gigabytes into one page before the limit is consulted: 1024 values of 4 MiB each already amount to 4 GiB.

This type isolates the per-chunk decision that prevents that: given a chunk’s level data and the input values, pick the largest sub_batch_size such that one mini-batch fits within one page’s byte budget. For the overwhelmingly common case (small or fixed-width values) the answer is just chunk_size and the decision is O(1), based on the column type alone; only when the input might overflow does the chunker consult the encoder’s byte estimate.
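As a rough illustration of the decision described above, the following sketch picks the longest prefix of values whose cumulative size stays within a byte budget. The function name `values_that_fit` and the plain slice of per-value sizes are hypothetical stand-ins for the encoder’s byte estimate, and the "at least one value" rule is an assumption made here so that a writer would always make progress, not something the documentation states.

```rust
/// Hypothetical helper (not part of the crate): returns how many leading
/// values fit in `budget` bytes, given each value's estimated byte size.
///
/// Assumption: a non-empty input always yields at least 1, so a single
/// oversized value still gets written instead of stalling the writer.
fn values_that_fit(value_sizes: &[usize], budget: usize) -> usize {
    let mut used = 0usize;
    for (i, &size) in value_sizes.iter().enumerate() {
        // Saturating add avoids overflow on pathological size estimates.
        used = used.saturating_add(size);
        if used > budget {
            // Value `i` is the first that does not fit; keep the `i` before it,
            // but never return 0 for a non-empty input.
            return i.max(1);
        }
    }
    // Everything fits: the whole slice can go into one mini-batch.
    value_sizes.len()
}
```

With four 4-byte values and a 10-byte budget this yields 2; a single 100-byte value against the same budget still yields 1.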

Fields§

§page_byte_limit: usize

Configured data page byte limit for the column.

§max_def_level: i16

Max definition level of the column; a level equal to this marks a present (non-null) leaf value. Used to count values per chunk.
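The counting rule can be sketched as follows, assuming the definition levels are available as a plain `&[i16]` slice (the real code receives a LevelDataRef, whose layout is not shown here). The function name is hypothetical.

```rust
/// Hypothetical helper: counts the present (non-null) leaf values in a chunk.
/// A definition level equal to `max_def_level` marks a present value; any
/// smaller level marks a null at some nesting depth.
fn count_present_values(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}
```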

§static_always_fits: bool

true when no chunk of base_batch_size values can ever overflow page_byte_limit regardless of input. Set once at column open from the physical type’s known per-value byte size; lets the per-chunk decision short-circuit with no work for every numeric, bool, or narrow FIXED_LEN_BYTE_ARRAY column.
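One way such a flag could be derived is sketched below. This is an assumption about the shape of the check, not the crate’s actual code: the function name and the `Option<usize>` encoding of "fixed per-value width, or variable-width" are invented for illustration.

```rust
/// Hypothetical sketch: decides once, at column open, whether a full
/// mini-batch of a fixed-width type can ever exceed the page byte limit.
///
/// `fixed_value_width` is `Some(bytes)` for types with a known per-value
/// size and `None` for variable-width types such as BYTE_ARRAY.
fn static_always_fits(
    fixed_value_width: Option<usize>,
    base_batch_size: usize,
    page_byte_limit: usize,
) -> bool {
    match fixed_value_width {
        // Worst case is every slot in the batch holding a present value.
        Some(width) => match width.checked_mul(base_batch_size) {
            Some(total) => total <= page_byte_limit,
            // Multiplication overflowed: certainly does not fit.
            None => false,
        },
        // Variable-width values need per-chunk inspection.
        None => false,
    }
}
```

For example, an 8-byte type with a 1024-value batch needs at most 8 KiB, which fits a 1 MiB limit, so every chunk of that column could skip value inspection entirely.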

§dict_page_byte_limit: usize

Configured dictionary page byte limit for the column.

§static_dict_always_fits: bool

Like Self::static_always_fits, but for the dictionary page: true when one base_batch_size mini-batch of this fixed-width type cannot overshoot dict_page_byte_limit by more than one mini-batch’s worth.

Implementations§

Source§

impl ByteBudgetChunker


pub(crate) fn new( descr: &ColumnDescriptor, props: &WriterProperties, base_batch_size: usize, ) -> Self


pub(crate) fn pick_sub_batch_size<E: ColumnValueEncoder>( &self, encoder: &E, values: &E::Values, value_indices: Option<&[usize]>, chunk_def: LevelDataRef<'_>, values_offset: usize, chunk_size: usize, ) -> usize

Decide how many levels at the start of a chunk belong in one mini-batch, so the mini-batch cannot overflow whichever page is currently accumulating value bytes: the data page when plain-encoding, or the dictionary page while dictionary-encoding. A returned value smaller than chunk_size triggers granular sub-batching in write_batch_internal.

While dictionary-encoding, the data page holds only small RLE indices, but the dictionary page accumulates the distinct values themselves — so it is the dictionary page’s remaining budget that must bound the mini-batch. The per-mini-batch dictionary spill check would otherwise let one mini-batch of large values balloon the dictionary page.
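The choice of budget might look roughly like the sketch below. The `ActiveEncoding` enum, the function name, and the explicit "bytes already buffered" parameters are all hypothetical; the real method obtains this state from the encoder, whose interface is not shown here.

```rust
/// Hypothetical marker for which page is currently accumulating value bytes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ActiveEncoding {
    /// Values go straight into the data page.
    Plain,
    /// Distinct values go into the dictionary page; the data page only
    /// holds small indices.
    Dictionary,
}

/// Hypothetical sketch: remaining byte budget of whichever page is
/// accumulating value bytes. Saturates at zero if the page is already
/// at or over its limit.
fn remaining_budget(
    encoding: ActiveEncoding,
    page_byte_limit: usize,
    data_page_bytes: usize,
    dict_page_byte_limit: usize,
    dict_page_bytes: usize,
) -> usize {
    match encoding {
        ActiveEncoding::Plain => page_byte_limit.saturating_sub(data_page_bytes),
        ActiveEncoding::Dictionary => dict_page_byte_limit.saturating_sub(dict_page_bytes),
    }
}
```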

Returns chunk_size immediately (no value inspection) when the chunk is empty, or when the column is a fixed-width type whose mini-batches statically cannot overshoot the relevant page.

#[inline]: this is a tiny per-chunk dispatcher; the actual byte inspection lives in the out-of-line byte_budget_sub_batch_size.
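The fast-path/slow-path split can be illustrated with a simplified stand-in. `DispatchSketch`, its fields, and the `slow_path_calls` counter are hypothetical and exist only to make the behaviour observable; the sketch also assumes one value per level (no nulls) and a precomputed slice of value sizes.

```rust
use std::cell::Cell;

/// Hypothetical, simplified stand-in for the chunker.
struct DispatchSketch {
    static_always_fits: bool,
    budget: usize,
    /// Counts slow-path invocations, for demonstration only.
    slow_path_calls: Cell<usize>,
}

impl DispatchSketch {
    /// Tiny dispatcher: returns `chunk_size` without touching any value
    /// when the chunk is empty or the column statically always fits.
    #[inline]
    fn pick_sub_batch_size(&self, value_sizes: &[usize], chunk_size: usize) -> usize {
        if chunk_size == 0 || self.static_always_fits {
            return chunk_size;
        }
        self.inspect_values(value_sizes, chunk_size)
    }

    /// Slow path, kept out of line: sums value sizes until the budget is
    /// exceeded. Assumption: at least one value is always admitted.
    #[inline(never)]
    fn inspect_values(&self, value_sizes: &[usize], chunk_size: usize) -> usize {
        self.slow_path_calls.set(self.slow_path_calls.get() + 1);
        let mut used = 0usize;
        for (i, &size) in value_sizes.iter().take(chunk_size).enumerate() {
            used = used.saturating_add(size);
            if used > self.budget {
                return i.max(1);
            }
        }
        chunk_size
    }
}
```

A column flagged as statically fitting returns chunk_size with the counter still at zero, while a column of 100-byte values against a 150-byte budget takes the slow path and is cut to one value per mini-batch.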


fn byte_budget_sub_batch_size<E: ColumnValueEncoder>( &self, values: &E::Values, value_indices: Option<&[usize]>, chunk_def: LevelDataRef<'_>, values_offset: usize, chunk_size: usize, budget: usize, ) -> usize

Inspect value sizes to decide how many of the chunk’s values fit in budget bytes (the data page or dictionary page remaining budget).

#[inline(never)] keeps this slow path out of the hot write_batch_internal loop; numeric and bool columns never reach it.
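Because the return value is counted in levels rather than values, the inspection has to walk definition levels and charge bytes only for present values. A minimal sketch of that idea follows, under several assumptions: definition levels are a plain `&[i16]`, value sizes are precomputed in a slice, nulls cost no value bytes, and at least one level is always admitted. The function name is hypothetical.

```rust
/// Hypothetical sketch: returns how many leading levels of a chunk can go
/// into one mini-batch without the present values exceeding `budget` bytes.
fn levels_that_fit(
    def_levels: &[i16],
    max_def_level: i16,
    value_sizes: &[usize],
    budget: usize,
) -> usize {
    let mut used = 0usize;
    let mut next_value = 0usize;
    for (level_idx, &level) in def_levels.iter().enumerate() {
        // Only a level equal to `max_def_level` consumes a value.
        if level == max_def_level {
            let size = value_sizes.get(next_value).copied().unwrap_or(0);
            next_value += 1;
            used = used.saturating_add(size);
            if used > budget {
                // Cut just before this level, but always admit at least one.
                return level_idx.max(1);
            }
        }
    }
    def_levels.len()
}
```

With levels `[1, 0, 1, 1]`, a max definition level of 1, three 60-byte values, and a 100-byte budget, the first value and the following null fit, so the result is 2 levels.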

Auto Trait Implementations§

Blanket Implementations§

§

impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<T> Ungil for T
where T: Send,

§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

§

fn vzip(self) -> V