Arrow and Parquet Part 3: Arbitrary Nesting with Lists of Structs and Structs of Lists

Published 17 Oct 2022
By tustvold and alamb

Introduction

This is the third of a three part series exploring how projects such as Rust Apache Arrow support conversion between Apache Arrow for in memory processing and Apache Parquet for efficient storage. Apache Arrow is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. Apache Parquet is an open, column-oriented data file format designed for very efficient data encoding and retrieval.

Arrow and Parquet Part 1: Primitive Types and Nullability covered the basics of primitive types. Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists covered the Struct and List types. This post builds on this foundation to show how both formats combine these to support arbitrary nesting.

Some libraries, such as Rust parquet implementation, offer complete support for such combinations, and users of those libraries do not need to worry about these details except to satisfy their own curiosity. Other libraries may not handle some corner cases and this post gives some flavor of why it is so complicated to do so.

Structs with Lists

Consider the following three json documents

{                     # <-- First record
  "a": [1],           # <-- top-level field a containing list of integers
  "b": [              # <-- top-level field b containing list of structures
    {                 # <-- list element of b containing two field b1 and b2
      "b1": 1         # <-- b1 is always provided (non nullable)
    },
    {
      "b1": 1,
      "b2": [         # <-- b2 contains list of integers
        3, 4          # <-- list elements of b.b2 always provided (non nullable)
      ]
    }
  ]
}

{
  "b": [              # <-- b is always provided (non nullable)
    {
      "b1": 2
    },
  ]
}

{
  "a": [null, null],  # <-- list elements of a are nullable
  "b": [null]         # <-- list elements of b are nullable
}

Documents of this format could be stored in this Arrow schema

Field(name: "a", nullable: true, datatype: List(
  Field(name: "element", nullable: true, datatype: Int32),
)
Field(name: "b"), nullable: false, datatype: List(
  Field(name: "element", nullable: true, datatype: Struct[
    Field(name: "b1", nullable: false, datatype: Int32),
    Field(name: "b2", nullable: true, datatype: List(
      Field(name: "element", nullable: false, datatype: Int32)
    ))
  ])
))

As explained previously, Arrow chooses to represent this in a hierarchical fashion. StructArrays are stored as child arrays that contain each field of the struct. ListArrays are stored as lists of monotonically increasing integers called offsets, with values stored in a single child array. Each consecutive pair of elements in the offset array identifies a slice of the child array for that array index.

The Arrow encoding of the example would be:

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                     ┌──────────────────┐
│ ┌─────┐   ┌─────┐  │ ┌─────┐   ┌─────┐│ │
  │  1  │   │  0  │  │ │  1  │   │  1  ││
│ ├─────┤   ├─────┤  │ ├─────┤   ├─────┤│ │
  │  0  │   │  1  │  │ │  0  │   │ ??  ││
│ ├─────┤   ├─────┤  │ ├─────┤   ├─────┤│ │
  │  1  │   │  1  │  │ │  0  │   │ ??  ││
│ └─────┘   ├─────┤  │ └─────┘   └─────┘│ │
            │  3  │  │ Validity   Values│
│ Validity  └─────┘  │                  │ │
                     │ child[0]         │
│ "a"       Offsets  │ PrimitiveArray   │ │
  ListArray          └──────────────────┘
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
           ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │
│                    ┌──────────┐ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
  ┌─────┐  │ ┌─────┐ │ ┌─────┐  │   ┌─────┐ ┌─────┐ ┌──────────┐ │ │ │
│ │  0  │    │  1  │ │ │  1  │  │ │ │  0  │ │  0  │ │ ┌─────┐  │
  ├─────┤  │ ├─────┤ │ ├─────┤  │   ├─────┤ ├─────┤ │ │  3  │  │ │ │ │
│ │  2  │    │  1  │ │ │  1  │  │ │ │  1  │ │  0  │ │ ├─────┤  │
  ├─────┤  │ ├─────┤ │ ├─────┤  │   ├─────┤ ├─────┤ │ │  4  │  │ │ │ │
│ │  3  │    │  1  │ │ │  2  │  │ │ │  0  │ │  2  │ │ └─────┘  │
  ├─────┤  │ ├─────┤ │ ├─────┤  │   ├─────┤ ├─────┤ │          │ │ │ │
│ │  4  │    │  0  │ │ │ ??  │  │ │ │ ??  │ │  2  │ │  Values  │
  └─────┘  │ └─────┘ │ └─────┘  │   └─────┘ ├─────┤ │          │ │ │ │
│                    │          │ │         │  2  │ │          │
  Offsets  │ Validity│ Values   │           └─────┘ │          │ │ │ │
│                    │          │ │Validity         │ child[0] │
           │         │ "b1"     │           Offsets │ Primitive│ │ │ │
│                    │ Primitive│ │ "b2"            │ Array    │
           │         │ Array    │   ListArray       └──────────┘ │ │ │
│                    └──────────┘ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
           │ "element"                                             │ │
│            StructArray
  "b"      └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ │
│ ListArray
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

Documents of this format could be stored in this Parquet schema

message schema {
  optional group a (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
  required group b (LIST) {
    repeated group list {
      optional group element {
        required int32 b1;
        optional group b2 (LIST) {
          repeated group list {
            required int32 element;
          }
        }
      }
    }
  }
}

As explained in our previous posts, Parquet uses repetition levels and definition levels to encode nested structures and nullability.

Definition and repetition levels is a non trivial topic. For more detail, you can read the Google Dremel Paper which offers an academic description of the algorithm. You can also explore this gist to see Rust parquet code which generates the example below.

The Parquet encoding of the example would be:

┌───────────────────────────────┐ ┌────────────────────────────────┐
│ ┌─────┐    ┌─────┐    ┌─────┐ │ │  ┌─────┐    ┌─────┐    ┌─────┐ │
│ │  3  │    │  0  │    │  1  │ │ │  │  2  │    │  0  │    │  1  │ │
│ ├─────┤    ├─────┤    └─────┘ │ │  ├─────┤    ├─────┤    ├─────┤ │
│ │  0  │    │  0  │            │ │  │  2  │    │  1  │    │  1  │ │
│ ├─────┤    ├─────┤      Data  │ │  ├─────┤    ├─────┤    ├─────┤ │
│ │  2  │    │  0  │            │ │  │  2  │    │  0  │    │  2  │ │
│ ├─────┤    ├─────┤            │ │  ├─────┤    ├─────┤    └─────┘ │
│ │  2  │    │  1  │            │ │  │  1  │    │  0  │            │
│ └─────┘    └─────┘            │ │  └─────┘    └─────┘     Data   │
│                               │ │                                │
│Definition Repetition          │ │ Definition Repetition          │
│  Levels     Levels            │ │   Levels     Levels            │
│                               │ │                                │
│ "a"                           │ │  "b.b1"                        │
└───────────────────────────────┘ └────────────────────────────────┘

┌───────────────────────────────┐
│  ┌─────┐    ┌─────┐    ┌─────┐│
│  │  2  │    │  0  │    │  3  ││
│  ├─────┤    ├─────┤    ├─────┤│
│  │  4  │    │  1  │    │  4  ││
│  ├─────┤    ├─────┤    └─────┘│
│  │  4  │    │  2  │           │
│  ├─────┤    ├─────┤           │
│  │  2  │    │  0  │           │
│  ├─────┤    ├─────┤     Data  │
│  │  1  │    │  0  │           │
│  └─────┘    └─────┘           │
│Definition  Repetition         │
│  Levels      Levels           │
│                               │
│  "b.b2"                       │
└───────────────────────────────┘

Additional Complications

This series of posts has necessarily glossed over a number of details that further complicate actual implementations:

A ListArray may contain a non-empty offset range that is masked by a validity mask
Reading a given number of rows from a nullable field requires reading the definition levels and determining the number of values to read based on the number of nulls present
Reading a given number of rows from a repeated field requires reading the repetition levels and detecting a new row based on a repetition level of 0
A Parquet file may contain multiple row groups, each containing multiple column chunks
A column chunk may contain multiple pages, and there is no relationship between pages across columns
Parquet has alternative schema for representing lists with varying degrees of nullability
And more…

Summary

Both Parquet and Arrow are columnar formats and support nested structs and lists, however, the way they represent such nesting differs significantly and conversion between the two formats is complex.

Fortunately, with the Rust parquet implementation, reading and writing nested data in Arrow, in Parquet or converting between the two is as simple as reading unnested data. The library handles all the complex record shredding and reconstruction automatically. With this and other exciting features, such as support for reading asynchronously from object storage, it is the fastest and most feature complete Rust parquet implementation available. We look forward to seeing what you build with it!