Skip to main content

Module levels

Module levels 

Source
Expand description

Parquet definition and repetition levels

Contains the algorithm for computing definition and repetition levels. The algorithm works by tracking the slots of an array that should ultimately be populated when writing to Parquet. Parquet achieves nesting through definition levels and repetition levels [1]. Definition levels specify how many optional fields in the part for the column are defined. Repetition levels specify at what repeated field (list) in the path a column is defined.

In a nested data structure such as a.b.c, one can see levels as defining whether a record is defined at a, a.b, or a.b.c. Optional fields are nullable fields, thus if all 3 fields are nullable, the maximum definition could be = 3 if there are no lists.

The algorithm in this module computes the necessary information to enable the writer to keep track of which columns are at which levels, and to extract the correct values at the correct slots from Arrow arrays.

It works by walking a record batch’s arrays, keeping track of what values are non-null, their positions and computing what their levels are.

[1] parquet-format#nested-encoding

StructsΒ§

ArrayLevels πŸ”’
LevelContext πŸ”’
The definition and repetition level of an array within a potentially nested hierarchy

EnumsΒ§

LevelData πŸ”’
The data necessary to write a primitive Arrow array to parquet, taking into account any non-primitive parents it may have in the arrow representation
LevelInfoBuilder πŸ”’
A helper to construct ArrayLevels from a potentially nested [Field]

ConstantsΒ§

BULK_FILL_MIN_LEN πŸ”’
Minimum sub-range length before the bulk-fill fast path in write_leaf becomes profitable for null-heavy leaf columns. Below this, per-call slice + popcount overhead regresses list/struct paths that call write_leaf many times with tiny ranges. Picked via threshold sweep; see https://github.com/apache/arrow-rs/pull/9967 for the rationale.

FunctionsΒ§

calculate_array_levels πŸ”’
Performs a depth-first scan of the children of array, constructing ArrayLevels for each leaf column encountered
is_leaf πŸ”’
Returns true if the DataType can be represented as a primitive parquet column, i.e. a leaf array with no children