Crate arrow_row

Crate arrow_row 

Source
Expand description

A comparable row-oriented representation of a collection of [Array].

Rows are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post.

For example, given three input [Array], RowConverter creates byte sequences that compare the same as when using lexsort.

   ┌─────┐   ┌─────┐   ┌─────┐
   │     │   │     │   │     │
   ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐              ┏━━━━━━━━━━━━━┓
   │     │   │     │   │     │  ─────────────▶┃             ┃
   ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘              ┗━━━━━━━━━━━━━┛
   │     │   │     │   │     │
   └─────┘   └─────┘   └─────┘
               ...
   ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐              ┏━━━━━━━━┓
   │     │   │     │   │     │  ─────────────▶┃        ┃
   └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘              ┗━━━━━━━━┛
    UInt64      Utf8     F64

          Input Arrays                          Row Format
    (Columns)

Rows must be generated by the same RowConverter for the comparison to be meaningful.

§Basic Example


let a1 = Arc::new(Int32Array::from_iter_values([-1, -1, 0, 3, 3])) as ArrayRef;
let a2 = Arc::new(StringArray::from_iter_values(["a", "b", "c", "d", "d"])) as ArrayRef;
let arrays = vec![a1, a2];

// Convert arrays to rows
let converter = RowConverter::new(vec![
    SortField::new(DataType::Int32),
    SortField::new(DataType::Utf8),
]).unwrap();
let rows = converter.convert_columns(&arrays).unwrap();

// Compare rows
for i in 0..4 {
    assert!(rows.row(i) <= rows.row(i + 1));
}
assert_eq!(rows.row(3), rows.row(4));

// Convert rows back to arrays
let converted = converter.convert_rows(&rows).unwrap();
assert_eq!(arrays, converted);

// Compare rows from different arrays
let a1 = Arc::new(Int32Array::from_iter_values([3, 4])) as ArrayRef;
let a2 = Arc::new(StringArray::from_iter_values(["e", "f"])) as ArrayRef;
let arrays = vec![a1, a2];
let rows2 = converter.convert_columns(&arrays).unwrap();

assert!(rows.row(4) < rows2.row(0));
assert!(rows.row(4) < rows2.row(1));

// Convert selection of rows back to arrays
let selection = [rows.row(0), rows2.row(1), rows.row(2), rows2.row(0)];
let converted = converter.convert_rows(selection).unwrap();
let c1 = converted[0].as_primitive::<Int32Type>();
assert_eq!(c1.values(), &[-1, 4, 0, 3]);

let c2 = converted[1].as_string::<i32>();
let c2_values: Vec<_> = c2.iter().flatten().collect();
assert_eq!(&c2_values, &["a", "f", "c", "e"]);

§Lexicographic Sorts (lexsort)

The row format can also be used to implement a fast multi-column / lexicographic sort

fn lexsort_to_indices(arrays: &[ArrayRef]) -> UInt32Array {
    let fields = arrays
        .iter()
        .map(|a| SortField::new(a.data_type().clone()))
        .collect();
    let converter = RowConverter::new(fields).unwrap();
    let rows = converter.convert_columns(arrays).unwrap();
    let mut sort: Vec<_> = rows.iter().enumerate().collect();
    sort.sort_unstable_by(|(_, a), (_, b)| a.cmp(b));
    UInt32Array::from_iter_values(sort.iter().map(|(i, _)| *i as u32))
}

§Flattening Dictionaries

For performance reasons, dictionary arrays are flattened (“hydrated”) to their underlying values during row conversion. See the issue for more details.

This means that the arrays that come out of RowConverter::convert_rows may not have the same data types as the input arrays. For example, encoding a Dictionary<Int8, Utf8> and then will come out as a Utf8 array.

// Input is a Dictionary array
let dict: DictionaryArray::<Int8Type> = ["a", "b", "c", "a", "b"].into_iter().collect();
let sort_fields = vec![SortField::new(dict.data_type().clone())];
let arrays = vec![Arc::new(dict) as ArrayRef];
let converter = RowConverter::new(sort_fields).unwrap();
// Convert to rows
let rows = converter.convert_columns(&arrays).unwrap();
let converted = converter.convert_rows(&rows).unwrap();
// result was a Utf8 array, not a Dictionary array
assert_eq!(converted[0].data_type(), &DataType::Utf8);

Modules§

fixed 🔒
list 🔒
run 🔒
variable 🔒

Macros§

decode_primitive_helper 🔒

Structs§

OwnedRow
Owned version of a Row that can be moved/cloned freely.
Row
A comparable representation of a row.
RowConfig 🔒
The config of a given set of Row
RowConverter
Converts [ArrayRef] columns into a row-oriented format.
RowParser
A RowParser can be created from a RowConverter and used to parse bytes to Row
Rows
A row-oriented representation of arrow data, that is normalized for comparison.
RowsIter
An iterator over Rows
SortField
Configure the data type and sort order for a given column

Enums§

Codec 🔒
Encoder 🔒
LengthTracker 🔒
Stores the lengths of the rows. Lazily materializes lengths for columns with fixed-size types.

Functions§

decode_column 🔒
Decodes a the provided field from rows
encode_column 🔒
Encodes a column to the provided Rows incrementing the offsets as it progresses
encode_dictionary_values
Encode dictionary values not preserving the dictionary encoding
null_sentinel 🔒
Returns the null sentinel, negated if invert is true
row_lengths 🔒
Computes the length of each encoded Rows and returns an empty Rows