parquet::arrow::arrow_writer

Module levels

Parquet definition and repetition levels

Contains the algorithm for computing definition and repetition levels. The algorithm works by tracking the slots of an array that should ultimately be populated when writing to Parquet. Parquet achieves nesting through definition levels and repetition levels [1]. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at what repeated field (list) in the path a column is defined.
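To make the two kinds of level concrete, here is a minimal sketch for a single list column. It is not the code in this module: `list_levels` is a hypothetical helper, and it assumes a non-nullable list with non-null items, so the maximum definition level and the maximum repetition level are both 1. The `offsets` slice is assumed to follow the Arrow list layout, where row `i` spans `offsets[i]..offsets[i + 1]`.

```rust
/// Hypothetical helper: computes (definition, repetition) level pairs for one
/// list column. Assumes the list itself is non-nullable and its items are
/// non-null, so both maximum levels are 1.
fn list_levels(offsets: &[usize]) -> Vec<(i16, i16)> {
    let mut levels = Vec::new();
    for row in offsets.windows(2) {
        let (start, end) = (row[0], row[1]);
        if start == end {
            // Empty list: one entry, nothing defined at the repeated level.
            levels.push((0, 0));
            continue;
        }
        for i in start..end {
            // The first item starts a new record (repetition 0); later items
            // continue the same list (repetition 1).
            let rep = if i == start { 0 } else { 1 };
            levels.push((1, rep));
        }
    }
    levels
}
```

For the rows `[[1, 2], [], [3]]` the offsets are `[0, 2, 2, 3]`. The empty row still produces one level entry even though it contributes no value.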

In a nested data structure such as a.b.c, the levels can be read as recording whether a record is defined at a, a.b, or a.b.c. Optional fields are nullable fields, so if all three fields are nullable and there are no lists, the maximum definition level is 3.
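The maximum levels for a column path can be sketched as a simple count. The `Repetition` enum and `max_levels` function below are illustrative names, not items from this module. The sketch relies on the Parquet rule that each optional or repeated field on the path adds one definition level, and each repeated field adds one repetition level.

```rust
/// Illustrative stand-in for the three Parquet field repetition kinds.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Returns (max_definition_level, max_repetition_level) for a column path,
/// given the repetition kind of each field from the root to the leaf.
fn max_levels(path: &[Repetition]) -> (i16, i16) {
    let mut def = 0;
    let mut rep = 0;
    for field in path {
        match field {
            // Required fields are always present and add no level.
            Repetition::Required => {}
            // Optional fields add one definition level.
            Repetition::Optional => def += 1,
            // Repeated fields add one definition and one repetition level.
            Repetition::Repeated => {
                def += 1;
                rep += 1;
            }
        }
    }
    (def, rep)
}
```

With a, b and c all optional, `max_levels` returns `(3, 0)`, matching the a.b.c case described above.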

The algorithm in this module computes the necessary information to enable the writer to keep track of which columns are at which levels, and to extract the correct values at the correct slots from Arrow arrays.

It works by walking a record batch’s arrays, keeping track of which values are non-null and where they sit, and computing their levels.
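A simplified sketch of that walk, restricted to a path of nullable fields with no lists, might look like the following. `LeafLevels` and `def_levels_for_nullable_path` are hypothetical names used only for illustration; the module's actual types carry more information. The sketch assumes `validity[d][i]` is true when the field at depth `d` is non-null in slot `i`, and that every depth has the same number of slots.

```rust
/// Hypothetical, simplified per-leaf result: one definition level per slot,
/// plus the indices of the slots that hold an actual value.
#[derive(Debug, PartialEq)]
struct LeafLevels {
    def_levels: Vec<i16>,
    non_null_indices: Vec<usize>,
}

/// Computes definition levels for a path in which every field is nullable
/// and no field is a list.
fn def_levels_for_nullable_path(validity: &[Vec<bool>]) -> LeafLevels {
    let len = validity.first().map_or(0, |v| v.len());
    let max_def = validity.len() as i16;
    let mut def_levels = Vec::with_capacity(len);
    let mut non_null_indices = Vec::new();
    for i in 0..len {
        // Count the fields that are defined before the first null on the path.
        let def = validity.iter().take_while(|v| v[i]).count() as i16;
        if def == max_def {
            // Every field on the path is defined, so the leaf holds a value.
            non_null_indices.push(i);
        }
        def_levels.push(def);
    }
    LeafLevels { def_levels, non_null_indices }
}
```

Stopping at the first null matters: once a parent is null, the validity of its children is irrelevant to the definition level of that slot.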

[1] parquet-format#nested-encoding

Structs§

  • ArrayLevels 🔒
    The data necessary to write a primitive Arrow array to parquet, taking into account any non-primitive parents it may have in the arrow representation
  • LevelContext 🔒
    The definition and repetition level of an array within a potentially nested hierarchy

Enums§

Functions§

  • Performs a depth-first scan of the children of array, constructing ArrayLevels for each leaf column encountered
  • is_leaf 🔒
    Returns true if the DataType can be represented as a primitive parquet column, i.e. a leaf array with no children
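
The leaf check and the depth-first scan described above can be illustrated with a toy type tree. Everything in this sketch is hypothetical: `SketchType` is not the Arrow `DataType` enum, and `is_leaf_sketch` and `leaf_paths` are stand-ins that collect leaf column paths rather than building `ArrayLevels`.

```rust
/// Hypothetical, simplified type tree; not the Arrow `DataType` enum.
#[derive(Debug)]
enum SketchType {
    Int32,
    Utf8,
    List(Box<SketchType>),
    Struct(Vec<(String, SketchType)>),
}

/// A type is treated as a leaf when it has no children.
fn is_leaf_sketch(ty: &SketchType) -> bool {
    matches!(ty, SketchType::Int32 | SketchType::Utf8)
}

/// Depth-first scan that records the dotted path of every leaf column.
fn leaf_paths(name: &str, ty: &SketchType, out: &mut Vec<String>) {
    if is_leaf_sketch(ty) {
        out.push(name.to_string());
        return;
    }
    match ty {
        SketchType::List(item) => leaf_paths(&format!("{}.item", name), item, out),
        SketchType::Struct(fields) => {
            for (child, child_ty) in fields {
                leaf_paths(&format!("{}.{}", name, child), child_ty, out);
            }
        }
        // Leaf types were handled by the early return above.
        SketchType::Int32 | SketchType::Utf8 => {}
    }
}
```

Each leaf found this way corresponds to one Parquet column, which is why the scan produces one result per leaf rather than one per top-level array.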