Parquet definition and repetition levels
Contains the algorithm for computing definition and repetition levels. The algorithm works by tracking the slots of an array that should ultimately be populated when writing to Parquet. Parquet achieves nesting through definition levels and repetition levels [1]. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at what repeated field (list) in the path a column is defined.
In a nested data structure such as a.b.c, the definition level indicates
whether a record is defined at a, a.b, or a.b.c.
Optional fields are nullable fields, so if all three fields are nullable
and there are no lists, the maximum definition level is 3.
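The a.b.c example can be made concrete with a small sketch. It is not the module's implementation: `PathField`, `max_definition_level`, and `definition_level` are hypothetical names introduced here, and the sketch assumes a path with no lists.

```rust
/// One field along a column path such as a.b.c.
/// Hypothetical type for illustration; not part of the parquet crate.
#[derive(Clone, Copy)]
struct PathField {
    /// True if the field is optional (nullable).
    nullable: bool,
    /// True if this record holds a null at this field.
    is_null: bool,
}

/// Maximum definition level for a path without lists:
/// the number of nullable fields along the path.
fn max_definition_level(path: &[PathField]) -> i16 {
    path.iter().filter(|f| f.nullable).count() as i16
}

/// Definition level of a single record: the number of nullable fields
/// that are defined, counted from the root until the first null.
fn definition_level(path: &[PathField]) -> i16 {
    let mut level = 0;
    for field in path {
        if field.is_null {
            // Everything below a null is undefined as well.
            break;
        }
        if field.nullable {
            level += 1;
        }
    }
    level
}
```

Under these assumptions, a record where a is present but a.b is null has definition level 1, and a fully defined a.b.c reaches the maximum of 3.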
The algorithm in this module computes the necessary information to enable the writer to keep track of which columns are at which levels, and to extract the correct values at the correct slots from Arrow arrays.
It works by walking a record batch's arrays, keeping track of which values are non-null and where they are positioned, and computing their levels.
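As a rough illustration of that walk, the sketch below computes levels for a single nullable list column with nullable elements. It is a simplification, not the module's code: rows are modelled as `Option<Vec<Option<i32>>>` rather than Arrow arrays, and `Levels` and `list_levels` are hypothetical names. For this shape the maximum definition level is 3 (list defined, list non-empty, element non-null) and the maximum repetition level is 1.

```rust
/// Levels and value positions for one leaf column.
/// Hypothetical stand-in for illustration only.
#[derive(Debug, Default, PartialEq)]
struct Levels {
    def_levels: Vec<i16>,
    rep_levels: Vec<i16>,
    /// Positions of non-null values in the flattened child values.
    non_null_indices: Vec<usize>,
}

/// Computes levels for a nullable list of nullable i32 values.
fn list_levels(rows: &[Option<Vec<Option<i32>>>]) -> Levels {
    let mut out = Levels::default();
    // Position of the current element in the flattened child values.
    let mut child_index = 0usize;
    for row in rows {
        match row {
            // Null list: nothing along the path is defined.
            None => {
                out.def_levels.push(0);
                out.rep_levels.push(0);
            }
            // Empty list: the list is defined but has no elements.
            Some(items) if items.is_empty() => {
                out.def_levels.push(1);
                out.rep_levels.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // 0 starts a new record, 1 continues the current list.
                    out.rep_levels.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(_) => {
                            out.def_levels.push(3);
                            out.non_null_indices.push(child_index);
                        }
                        None => out.def_levels.push(2),
                    }
                    child_index += 1;
                }
            }
        }
    }
    out
}
```

For the rows `[[1, null, 3], null, [], [4]]` this yields definition levels `3, 2, 3, 0, 1, 3`, repetition levels `0, 1, 1, 0, 0, 0`, and non-null positions `0, 2, 3`.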
Structs
- ArrayLevels 🔒
- LevelContext 🔒 - The definition and repetition level of an array within a potentially nested hierarchy
Enums
- LevelData 🔒 - The data necessary to write a primitive Arrow array to Parquet, taking into account any non-primitive parents it may have in the Arrow representation
- LevelInfoBuilder 🔒 - A helper to construct ArrayLevels from a potentially nested [Field]
Constants
- BULK_FILL_MIN_LEN 🔒 - Minimum sub-range length before the bulk-fill fast path in write_leaf becomes profitable for null-heavy leaf columns. Below this, per-call slice + popcount overhead regresses list/struct paths that call write_leaf many times with tiny ranges. Picked via threshold sweep; see https://github.com/apache/arrow-rs/pull/9967 for the rationale.
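The trade-off behind such a threshold might look like the sketch below. This is an assumption-laden illustration, not the code in write_leaf: `HYPOTHETICAL_MIN_LEN` is a made-up value (the real constant's value is not stated here), `push_def_levels` is a hypothetical helper, and a `&[bool]` slice stands in for a validity bitmap.

```rust
use std::ops::Range;

/// Placeholder threshold for illustration only.
/// This is NOT the value of BULK_FILL_MIN_LEN.
const HYPOTHETICAL_MIN_LEN: usize = 64;

/// Appends one definition level per slot in `range`.
/// Returns true when the bulk-fill path was taken.
fn push_def_levels(
    validity: &[bool],
    range: Range<usize>,
    max_def: i16,
    out: &mut Vec<i16>,
) -> bool {
    let slots = &validity[range];
    if slots.len() >= HYPOTHETICAL_MIN_LEN {
        // Stand-in for slicing a validity bitmap and taking its popcount.
        let valid = slots.iter().filter(|v| **v).count();
        if valid == 0 {
            // Every slot is null: write the same level in one bulk call.
            out.extend(std::iter::repeat(max_def - 1).take(slots.len()));
            return true;
        }
    }
    // Short or mixed ranges: fall back to the per-slot path.
    out.extend(slots.iter().map(|v| if *v { max_def } else { max_def - 1 }));
    false
}
```

The count over the slice is only worth paying for when the range is long enough to amortize it, which is why short ranges skip straight to the per-slot path in this sketch.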
Functions
- calculate_array_levels 🔒 - Performs a depth-first scan of the children of array, constructing ArrayLevels for each leaf column encountered
- is_leaf 🔒 - Returns true if the DataType can be represented as a primitive Parquet column, i.e. a leaf array with no children
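The relationship between the two functions can be sketched with a toy type tree. Everything here is hypothetical and heavily simplified: `SketchType` stands in for Arrow's DataType, `is_leaf_sketch` and `leaf_paths` are illustrative names, and the sketch collects column paths rather than building ArrayLevels.

```rust
/// Toy stand-in for Arrow's DataType; illustration only.
enum SketchType {
    Int32,
    Utf8,
    Struct(Vec<(String, SketchType)>),
    List(Box<SketchType>),
}

/// True for types with no children, which map to one primitive column.
fn is_leaf_sketch(ty: &SketchType) -> bool {
    matches!(ty, SketchType::Int32 | SketchType::Utf8)
}

/// Depth-first scan that records the path of every leaf column.
fn leaf_paths(name: &str, ty: &SketchType, out: &mut Vec<String>) {
    if is_leaf_sketch(ty) {
        out.push(name.to_string());
        return;
    }
    match ty {
        SketchType::Struct(fields) => {
            // Visit each child in declaration order.
            for (child, child_ty) in fields {
                leaf_paths(&format!("{}.{}", name, child), child_ty, out);
            }
        }
        // "item" is an arbitrary label for the list element in this sketch.
        SketchType::List(item) => leaf_paths(&format!("{}.item", name), item, out),
        SketchType::Int32 | SketchType::Utf8 => {
            unreachable!("leaf types return early above")
        }
    }
}
```

For a struct a with an Int32 child b and a list child c of Utf8, the scan visits the leaves in the order a.b, then a.c.item.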