Arrow and Parquet Part 2: Nested and Hierarchical Data using Structs and Lists

Published 08 Oct 2022
By tustvold and alamb

Introduction

This is the second, in a three part series exploring how projects such as Rust Apache Arrow support conversion between Apache Arrow and Apache Parquet. The first post covered the basics of data storage and validity encoding, and this post will cover the more complex Struct and List types.

Apache Arrow is an open, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations. Apache Parquet is an open, column-oriented data file format designed for very efficient data encoding and retrieval.

Struct / Group Columns

Both Parquet and Arrow have the concept of a struct column, which is a column containing one or more other columns in named fields and is analogous to a JSON object.

For example, consider the following three JSON documents

{              # <-- First record
  "a": 1,      # <-- the top level fields are a, b, c, and d
  "b": {       # <-- b is always provided (not nullable)
    "b1": 1,   # <-- b1 and b2 are "nested" fields of "b"
    "b2": 3    # <-- b2 is always provided (not nullable)
   },
  "d": {
   "d1":  1    # <-- d1 is a "nested" field of "d"
  }
}

{              # <-- Second record
  "a": 2,
  "b": {
    "b2": 4    # <-- note "b1" is NULL in this record
  },
  "c": {       # <-- note "c" was NULL in the first record
    "c1": 6        but when "c" is provided, c1 is also
  },               always provided (not nullable)
  "d": {
    "d1": 2,
    "d2": 1
  }
}

{              # <-- Third record
  "b": {
    "b1": 5,
    "b2": 6
  },
  "c": {
    "c1": 7
  }
}

Documents of this format could be stored in an Arrow StructArray with this schema

Field(name: "a", nullable: true, datatype: Int32)
Field(name: "b", nullable: false, datatype: Struct[
  Field(name: "b1", nullable: true, datatype: Int32),
  Field(name: "b2", nullable: false, datatype: Int32)
])
Field(name: "c"), nullable: true, datatype: Struct[
  Field(name: "c1", nullable: false, datatype: Int32)
])
Field(name: "d"), nullable: true, datatype: Struct[
  Field(name: "d1", nullable: false, datatype: Int32)
  Field(name: "d2", nullable: true, datatype: Int32)
])

Arrow represents each StructArray hierarchically using a parent child relationship, with separate validity masks on each of the individual nullable arrays

  ┌───────────────────┐        ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
  │                   │           ┌─────────────────┐ ┌────────────┐
  │ ┌─────┐   ┌─────┐ │        │  │┌─────┐   ┌─────┐│ │  ┌─────┐   │ │
  │ │  1  │   │  1  │ │           ││  1  │   │  1  ││ │  │  3  │   │
  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
  │ │  1  │   │  2  │ │           ││  0  │   │ ??  ││ │  │  4  │   │
  │ ├─────┤   ├─────┤ │        │  │├─────┤   ├─────┤│ │  ├─────┤   │ │
  │ │  0  │   │ ??  │ │           ││  1  │   │  5  ││ │  │  6  │   │
  │ └─────┘   └─────┘ │        │  │└─────┘   └─────┘│ │  └─────┘   │ │
  │ Validity   Values │           │Validity   Values│ │   Values   │
  │                   │        │  │                 │ │            │ │
  │ "a"               │           │"b.b1"           │ │  "b.b2"    │
  │ PrimitiveArray    │        │  │PrimitiveArray   │ │  Primitive │ │
  └───────────────────┘           │                 │ │  Array     │
                               │  └─────────────────┘ └────────────┘ │
                                    "b"
                               │    StructArray                      │
                                ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ ┌─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
            ┌───────────┐                ┌──────────┐┌─────────────────┐ │
│  ┌─────┐  │ ┌─────┐   │ │ │  ┌─────┐   │┌─────┐   ││ ┌─────┐  ┌─────┐│
   │  0  │  │ │ ??  │   │      │  1  │   ││  1  │   ││ │  0  │  │ ??  ││ │
│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
   │  1  │  │ │  6  │   │      │  1  │   ││  2  │   ││ │  1  │  │  1  ││ │
│  ├─────┤  │ ├─────┤   │ │ │  ├─────┤   │├─────┤   ││ ├─────┤  ├─────┤│
   │  1  │  │ │  7  │   │      │  0  │   ││ ??  │   ││ │ ??  │  │ ??  ││ │
│  └─────┘  │ └─────┘   │ │ │  └─────┘   │└─────┘   ││ └─────┘  └─────┘│
   Validity │  Values   │      Validity  │ Values   ││ Validity  Values│ │
│           │           │ │ │            │          ││                 │
            │ "c.c1"    │                │"d.d1"    ││ "d.d2"          │ │
│           │ Primitive │ │ │            │Primitive ││ PrimitiveArray  │
            │ Array     │                │Array     ││                 │ │
│           └───────────┘ │ │            └──────────┘└─────────────────┘
    "c"                         "d"                                      │
│   StructArray           │ │   StructArray
  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

More technical detail is available in the StructArray format specification.

Definition Levels

Unlike Arrow, Parquet does not encode validity in a structured fashion, instead only storing definition levels for each of the primitive columns, i.e. those that don't contain other columns. The definition level of a given element is the depth in the schema at which it is fully defined.

For example consider the case of d.d2, which contains two nullable levels d and d2.

A definition level of 0 would imply a null at the level of d:

{
}

A definition level of 1 would imply a null at the level of d.d2

{
  "d": { }
}

A definition level of 2 would imply a defined value for d.d2:

{
  "d": { "d2": .. }
}

Going back to the three JSON documents above, they could be stored in Parquet with this schema

message schema {
  optional int32 a;
  required group b {
    optional int32 b1;
    required int32 b2;
  }
  optional group c {
    required int32 c1;
  }
  optional group d {
    required int32 d1;
    optional int32 d2;
  }
}

The Parquet encoding of the example would be:

 ┌────────────────────────┐  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
 │  ┌─────┐     ┌─────┐   │    ┌──────────────────────┐ ┌───────────┐ │
 │  │  1  │     │  1  │   │  │ │  ┌─────┐    ┌─────┐  │ │  ┌─────┐  │
 │  ├─────┤     ├─────┤   │    │  │  1  │    │  1  │  │ │  │  3  │  │ │
 │  │  1  │     │  2  │   │  │ │  ├─────┤    ├─────┤  │ │  ├─────┤  │
 │  ├─────┤     └─────┘   │    │  │  0  │    │  5  │  │ │  │  4  │  │ │
 │  │  0  │               │  │ │  ├─────┤    └─────┘  │ │  ├─────┤  │
 │  └─────┘               │    │  │  1  │             │ │  │  6  │  │ │
 │                        │  │ │  └─────┘             │ │  └─────┘  │
 │  Definition    Data    │    │                      │ │           │ │
 │    Levels              │  │ │  Definition   Data   │ │   Data    │
 │                        │    │    Levels            │ │           │ │
 │  "a"                   │  │ │                      │ │           │
 └────────────────────────┘    │  "b.b1"              │ │  "b.b2"   │ │
                             │ └──────────────────────┘ └───────────┘
                                  "b"                                 │
                             └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─


┌ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
  ┌────────────────────┐ │   ┌────────────────────┐ ┌──────────────────┐ │
│ │  ┌─────┐   ┌─────┐ │   │ │  ┌─────┐   ┌─────┐ │ │ ┌─────┐  ┌─────┐ │
  │  │  0  │   │  6  │ │ │   │  │  1  │   │  1  │ │ │ │  1  │  │  1  │ │ │
│ │  ├─────┤   ├─────┤ │   │ │  ├─────┤   ├─────┤ │ │ ├─────┤  └─────┘ │
  │  │  1  │   │  7  │ │ │   │  │  1  │   │  2  │ │ │ │  2  │          │ │
│ │  ├─────┤   └─────┘ │   │ │  ├─────┤   └─────┘ │ │ ├─────┤          │
  │  │  1  │           │ │   │  │  0  │           │ │ │  0  │          │ │
│ │  └─────┘           │   │ │  └─────┘           │ │ └─────┘          │
  │                    │ │   │                    │ │                  │ │
│ │  Definition  Data  │   │ │  Definition  Data  │ │ Definition Data  │
  │    Levels          │ │   │    Levels          │ │   Levels         │ │
│ │                    │   │ │                    │ │                  │
  │  "c.c1"            │ │   │  "d.d1"            │ │  "d.d2"          │ │
│ └────────────────────┘   │ └────────────────────┘ └──────────────────┘
     "c"                 │      "d"                                      │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─

List / Repeated Columns

Closing out support for nested types are lists, which contain a variable number of other values. For example, the following four documents each have a (nullable) field a containing a list of integers

{                     # <-- First record
  "a": [1],           # <-- top-level field a containing list of integers
}

{                     # <-- "a" is not provided (is null)
}

{                     # <-- "a" is non-null but empty
  "a": []
}

{
  "a": [null, 2],     # <-- "a" has a null and non-null elements
}

Documents of this format could be stored in this Arrow schema

Field(name: "a", nullable: true, datatype: List(
  Field(name: "element", nullable: true, datatype: Int32),
)

As before, Arrow chooses to represent this in a hierarchical fashion as a ListArray. A ListArray contains a list of monotonically increasing integers called offsets, a validity mask if the list is nullable, and a child array containing the list elements. Each consecutive pair of elements in the offset array identifies a slice of the child array for that index in the ListArray

For example, a list with offsets [0, 2, 3, 3] contains 3 pairs of offsets, (0,2), (2,3), and (3,3), and therefore represents a ListArray of length 3 with the following values:

[child[0], child[1]]
[child[2]]
[]

For the example above with 4 JSON documents, this would be encoded in Arrow as

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
                          ┌──────────────────┐ │
│    ┌─────┐   ┌─────┐    │ ┌─────┐   ┌─────┐│
     │  1  │   │  0  │    │ │  1  │   │  1  ││ │
│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
     │  0  │   │  1  │    │ │  0  │   │ ??  ││ │
│    ├─────┤   ├─────┤    │ ├─────┤   ├─────┤│
     │  1  │   │  1  │    │ │  1  │   │  2  ││ │
│    ├─────┤   ├─────┤    │ └─────┘   └─────┘│
     │  1  │   │  1  │    │ Validity   Values│ │
│    └─────┘   ├─────┤    │                  │
               │  3  │    │ child[0]         │ │
│    Validity  └─────┘    │ PrimitiveArray   │
                          │                  │ │
│              Offsets    └──────────────────┘
     "a"                                       │
│    ListArray
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

More technical detail is available in the ListArray format specification.

Parquet Repetition Levels

The example above with 4 JSON documents can be stored in this Parquet schema

message schema {
  optional group a (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
}

In order to encode lists, Parquet stores an integer repetition level in addition to a definition level. A repetition level identifies where in the hierarchy of repeated fields the current value is to be inserted. A value of 0 means a new list in the top-most repeated list, a value of 1 means a new element within the top-most repeated list, a value of 2 means a new element within the second top-most repeated list, and so on.

A consequence of this encoding is that the number of zeros in the repetition levels is the total number of rows in the column, and the first level in a column must be 0.

Each repeated field also has a corresponding definition level, however, in this case rather than indicating a null value, they indicate an empty array.

The example above would therefore be encoded as

┌─────────────────────────────────────┐
│  ┌─────┐      ┌─────┐               │
│  │  3  │      │  0  │               │
│  ├─────┤      ├─────┤               │
│  │  0  │      │  0  │               │
│  ├─────┤      ├─────┤      ┌─────┐  │
│  │  1  │      │  0  │      │  1  │  │
│  ├─────┤      ├─────┤      ├─────┤  │
│  │  2  │      │  0  │      │  2  │  │
│  ├─────┤      ├─────┤      └─────┘  │
│  │  3  │      │  1  │               │
│  └─────┘      └─────┘               │
│                                     │
│ Definition  Repetition      Values  │
│   Levels      Levels                │
│  "a"                                │
│                                     │
└─────────────────────────────────────┘

Next up: Arbitrary Nesting: Lists of Structs and Structs of Lists

In our final blog post, we explain how Parquet and Arrow combine these concepts to support arbitrary nesting of potentially nullable data structures.

If you want to store and process structured types, you will be pleased to hear that the Rust parquet implementation fully supports reading and writing directly into Arrow, as simply as any other type. All the complex record shredding and reconstruction is handled automatically. With this and other exciting features such as reading asynchronously from object storage, and advanced row filter pushdown, it is the fastest and most feature complete Rust parquet implementation. We look forward to seeing what you build with it!