Parquet definition and repetition levels
Contains the algorithm for computing definition and repetition levels. The algorithm works by tracking the slots of an array that should ultimately be populated when writing to Parquet. Parquet achieves nesting through definition levels and repetition levels [1]. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at what repeated field (list) in the path a column is defined.
In a nested data structure such as a.b.c, one can see levels as defining whether a record is defined at a, a.b, or a.b.c. Optional fields are nullable fields; thus, if all 3 fields are nullable and there are no lists, the maximum definition level is 3.
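The counting rule above can be sketched in a few lines. This is a minimal illustration, not code from this module: the `Repetition` enum and `max_levels` function are hypothetical names, and the sketch assumes the usual Parquet convention that a repeated field raises both the maximum definition level and the maximum repetition level.

```rust
/// How a single field in a column path may occur (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Returns (max definition level, max repetition level) for a column path.
/// Assumption: optional fields add one definition level; repeated fields add
/// one definition level and one repetition level; required fields add nothing.
fn max_levels(path: &[Repetition]) -> (i16, i16) {
    let mut max_def = 0;
    let mut max_rep = 0;
    for field in path {
        match field {
            Repetition::Required => {}
            Repetition::Optional => max_def += 1,
            Repetition::Repeated => {
                max_def += 1;
                max_rep += 1;
            }
        }
    }
    (max_def, max_rep)
}
```

For a path a.b.c in which all three fields are nullable, calling `max_levels` with three `Repetition::Optional` entries gives a maximum definition level of 3 and a maximum repetition level of 0, matching the example in the text.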
The algorithm in this module computes the necessary information to enable the writer to keep track of which columns are at which levels, and to extract the correct values at the correct slots from Arrow arrays.
It works by walking a record batch's arrays, keeping track of what values are non-null, their positions, and computing what their levels are.
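To make the walk concrete, here is a simplified sketch for a single nullable list of nullable integers. It is not the module's implementation: it uses plain `Option` and `Vec` values as stand-ins for Arrow arrays, and `LeafLevels` and `list_levels` are hypothetical names. It assumes the standard three-level Parquet list layout, so the maximum definition level is 3 and the maximum repetition level is 1.

```rust
/// Levels and non-null values for one leaf column (hypothetical type for this sketch).
#[derive(Debug, Default, PartialEq)]
struct LeafLevels {
    def_levels: Vec<i16>,
    rep_levels: Vec<i16>,
    values: Vec<i32>,
}

/// Walks rows of a nullable list of nullable i32 values and records levels.
/// Definition levels: 0 = list is null, 1 = list is empty,
/// 2 = element is null, 3 = element is present.
/// Repetition levels: 0 = first slot of a row, 1 = continuation of the same list.
fn list_levels(rows: &[Option<Vec<Option<i32>>>]) -> LeafLevels {
    let mut out = LeafLevels::default();
    for row in rows {
        match row {
            None => {
                out.def_levels.push(0);
                out.rep_levels.push(0);
            }
            Some(items) if items.is_empty() => {
                out.def_levels.push(1);
                out.rep_levels.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    out.rep_levels.push(if i == 0 { 0 } else { 1 });
                    match item {
                        None => out.def_levels.push(2),
                        Some(v) => {
                            out.def_levels.push(3);
                            // Only fully defined slots contribute a value.
                            out.values.push(*v);
                        }
                    }
                }
            }
        }
    }
    out
}
```

Note that null and empty rows still produce one level entry each, even though they contribute no value; this is why the writer needs the levels and the non-null positions separately.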
Structs§
- ArrayLevels 🔒 The data necessary to write a primitive Arrow array to parquet, taking into account any non-primitive parents it may have in the arrow representation
- LevelContext 🔒 The definition and repetition level of an array within a potentially nested hierarchy
Enums§
- LevelInfoBuilder 🔒 A helper to construct ArrayLevels from a potentially nested Field
Functions§
- Performs a depth-first scan of the children of array, constructing ArrayLevels for each leaf column encountered
- is_leaf 🔒 Returns true if the DataType can be represented as a primitive parquet column, i.e. a leaf array with no children