pub(crate) struct ByteBudgetChunker {
page_byte_limit: usize,
max_def_level: i16,
static_always_fits: bool,
dict_page_byte_limit: usize,
static_dict_always_fits: bool,
}
Picks byte-budget-aware mini-batch sizes for one column.
The parquet column writer checks the data page byte limit only after
each mini-batch finishes writing. Mini-batches are sized in rows
(write_batch_size, default 1024), so for BYTE_ARRAY columns whose
values are large (e.g. multi-MiB blobs) a single mini-batch can buffer
gigabytes into one page before the limit is consulted.
This type isolates the per-chunk decision that prevents that: given a chunk’s
level data and the input values, pick the largest sub_batch_size such
that one mini-batch fits in one page byte budget. For the overwhelmingly
common case (small or fixed-width values) the answer is just chunk_size
and the decision is O(1), made from the column type alone. Only when the
input might overflow does the chunker consult the encoder’s byte estimate.
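To make the scale of the problem concrete, here is a minimal sketch of the arithmetic, assuming every value in a mini-batch has the same size. The helper names `worst_case_mini_batch_bytes` and `uniform_sub_batch_size` are hypothetical and are not part of this type; the real chunker inspects actual value sizes rather than assuming uniformity.

```rust
/// Hypothetical helper: worst-case bytes one mini-batch buffers before the
/// page byte limit is consulted, assuming every row holds `value_bytes` bytes.
fn worst_case_mini_batch_bytes(write_batch_size: u64, value_bytes: u64) -> u64 {
    write_batch_size.saturating_mul(value_bytes)
}

/// Hypothetical helper: largest row count whose values fit in
/// `page_byte_limit` under the uniform-size assumption. A non-empty chunk
/// always yields at least 1 so that a writer could still make progress
/// (that floor is an assumption of this sketch).
fn uniform_sub_batch_size(chunk_size: u64, value_bytes: u64, page_byte_limit: u64) -> u64 {
    if chunk_size == 0 || value_bytes == 0 {
        return chunk_size;
    }
    (page_byte_limit / value_bytes).clamp(1, chunk_size)
}
```

With 1024 rows of 4 MiB each, one mini-batch would buffer 4 GiB; against an assumed 1 MiB page limit the uniform estimate shrinks the sub-batch to a single row, while 8-byte values leave the full chunk untouched.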
Fields§
page_byte_limit: usize
Configured data page byte limit for the column.
max_def_level: i16
Max definition level of the column; a level equal to this marks a present (non-null) leaf value. Used to count values per chunk.
static_always_fits: bool
true when no chunk of base_batch_size values can ever overflow
page_byte_limit regardless of input. Set once at column open from
the physical type’s known per-value byte size; lets the per-chunk
decision short-circuit with no work for every numeric, bool, or
narrow FIXED_LEN_BYTE_ARRAY column.
dict_page_byte_limit: usize
Configured dictionary page byte limit for the column.
static_dict_always_fits: bool
As Self::static_always_fits but for the dictionary page: true
when one base_batch_size mini-batch of this fixed-width type cannot
overshoot dict_page_byte_limit by more than one mini-batch’s worth.
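The static flags can be pictured as a one-time comparison made when the column is opened. The sketch below is an illustration only: `ValueWidth` and the free function `static_always_fits` are hypothetical stand-ins for whatever `new` derives from the `ColumnDescriptor`, and the exact criterion used for the dictionary flag may differ from this simple product test.

```rust
/// Hypothetical stand-in for the physical type's known per-value byte size.
#[derive(Clone, Copy)]
enum ValueWidth {
    /// Fixed-width types, e.g. 4 for INT32, 8 for INT64 or DOUBLE, or the
    /// declared length of a FIXED_LEN_BYTE_ARRAY.
    Fixed(usize),
    /// BYTE_ARRAY: the size is only known by looking at the values.
    Variable,
}

/// True when `base_batch_size` values of this width can never exceed
/// `byte_limit`, so the per-chunk decision may skip all value inspection.
fn static_always_fits(width: ValueWidth, base_batch_size: usize, byte_limit: usize) -> bool {
    match width {
        ValueWidth::Fixed(bytes) => bytes
            .checked_mul(base_batch_size)
            // An overflowing product certainly does not fit.
            .map_or(false, |total| total <= byte_limit),
        ValueWidth::Variable => false,
    }
}
```

Under this reading, an INT64 column with a 1024-row mini-batch and an assumed 1 MiB limit is always safe, while a 4096-byte FIXED_LEN_BYTE_ARRAY column is not and must take the inspecting path.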
Implementations§
impl ByteBudgetChunker
pub(crate) fn new(
descr: &ColumnDescriptor,
props: &WriterProperties,
base_batch_size: usize,
) -> Self
pub(crate) fn pick_sub_batch_size<E: ColumnValueEncoder>(
&self,
encoder: &E,
values: &E::Values,
value_indices: Option<&[usize]>,
chunk_def: LevelDataRef<'_>,
values_offset: usize,
chunk_size: usize,
) -> usize
Decide how many levels at the start of a chunk belong in one
mini-batch, so the mini-batch cannot overflow whichever page is
currently accumulating value bytes: the data page when plain-encoding,
or the dictionary page while dictionary-encoding. A returned value
smaller than chunk_size triggers granular sub-batching in
write_batch_internal.
While dictionary-encoding, the data page holds only small RLE indices, but the dictionary page accumulates the distinct values themselves — so it is the dictionary page’s remaining budget that must bound the mini-batch. The per-mini-batch dictionary spill check would otherwise let one mini-batch of large values balloon the dictionary page.
Returns chunk_size immediately (no value inspection) when the chunk
is empty, or when the column is a fixed-width type whose mini-batches
statically cannot overshoot the relevant page.
#[inline]: this is a tiny per-chunk dispatcher; the actual byte
inspection lives in the out-of-line byte_budget_sub_batch_size.
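The dispatch described above might look roughly like the following sketch. It is not the actual implementation: `ChunkerSketch` and `pick` are hypothetical, the `dict_encoding` flag and `bytes_in_page` counter stand in for state the real method presumably reads from the encoder, "empty" is taken to mean `chunk_size == 0`, and the slow path is passed as a closure instead of calling `byte_budget_sub_batch_size`.

```rust
/// Hypothetical mirror of the four budget-related fields.
struct ChunkerSketch {
    page_byte_limit: usize,
    dict_page_byte_limit: usize,
    static_always_fits: bool,
    static_dict_always_fits: bool,
}

impl ChunkerSketch {
    /// Returns `chunk_size` on the fast path, otherwise defers to `slow_path`
    /// with the remaining budget of whichever page is accumulating bytes.
    #[inline]
    fn pick(
        &self,
        dict_encoding: bool,   // assumption: supplied by the encoder
        bytes_in_page: usize,  // assumption: bytes already buffered in that page
        chunk_size: usize,
        slow_path: impl FnOnce(usize) -> usize,
    ) -> usize {
        if chunk_size == 0 {
            return chunk_size;
        }
        // Select the page that currently receives the value bytes.
        let (always_fits, limit) = if dict_encoding {
            (self.static_dict_always_fits, self.dict_page_byte_limit)
        } else {
            (self.static_always_fits, self.page_byte_limit)
        };
        if always_fits {
            return chunk_size;
        }
        let budget = limit.saturating_sub(bytes_in_page);
        slow_path(budget).min(chunk_size)
    }
}
```

Keeping the dispatcher this small is what makes `#[inline]` worthwhile: for fixed-width columns the call reduces to two comparisons and a return.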
fn byte_budget_sub_batch_size<E: ColumnValueEncoder>(
&self,
values: &E::Values,
value_indices: Option<&[usize]>,
chunk_def: LevelDataRef<'_>,
values_offset: usize,
chunk_size: usize,
budget: usize,
) -> usize
Inspect value sizes to decide how many of the chunk’s values fit in
budget bytes (the data page or dictionary page remaining budget).
#[inline(never)] keeps this slow path out of the hot
write_batch_internal loop; numeric and bool columns never reach it.
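As a rough illustration of the slow path, the sketch below walks a chunk's definition levels, charges each present value against the budget, and stops at the first level that would overflow it. Everything here is an assumption made for illustration: `levels_that_fit` is a hypothetical name, value sizes arrive as a plain slice instead of coming from the encoder's byte estimate, `values_offset` and `value_indices` are ignored, and the "at least one level" floor is a guess at how forward progress is guaranteed.

```rust
/// Returns how many leading levels of the chunk fit in `budget` bytes.
///
/// `value_sizes` must hold one entry per level equal to `max_def_level`
/// (that is, one per non-null leaf value), in order.
fn levels_that_fit(
    def_levels: &[i16],
    max_def_level: i16,
    value_sizes: &[usize],
    budget: usize,
) -> usize {
    let mut used = 0usize;
    let mut next_value = 0usize;
    for (i, &level) in def_levels.iter().enumerate() {
        // Only levels equal to the max definition level carry a value.
        if level == max_def_level {
            used = used.saturating_add(value_sizes[next_value]);
            if used > budget {
                // Stop before the overflowing value, but never return 0:
                // a single oversized value still has to be written.
                return i.max(1);
            }
            next_value += 1;
        }
    }
    def_levels.len()
}
```

For levels `[1, 0, 1, 1]` with 40-byte values and a 100-byte budget, the first three levels (two values and one null) fit and the fourth is left for the next mini-batch.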