Apache Arrow 2.0.0 Rust Highlights


Published 27 Oct 2020
By The Apache Arrow PMC (pmc)

Apache Arrow 2.0.0 is a significant release for the Apache Arrow project in general (release notes), and for the Rust subproject in particular, with almost 200 issues resolved by 15 contributors. In this blog post, we will go through the main changes affecting core Arrow, Parquet support, and the DataFusion query engine. The full list of resolved issues can be found here.

While the Java and C/C++ (used by Python and R) Arrow implementations likely remain the most feature-rich, with the 2.0.0 release, the Rust implementation is closing the feature gap quickly. Here are some of the highlights for this release.

Core Arrow Crate

Iterator Trait

  • Primitive arrays (e.g., arrays of integers) can now be converted to and initialized from an iterator. This exposes a very popular API - iterators - to Arrow arrays. Work for other array types will continue throughout 3.0.0.
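
For example, a minimal sketch assuming the new FromIterator/iterator support for primitive arrays (the exact set of supported types may vary):

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // Build a primitive array from an iterator of Option<i32>;
    // None becomes a null slot in the resulting array.
    let array: Int32Array = vec![Some(1), None, Some(3)].into_iter().collect();

    // Iterate back over the values and collect into a new array.
    let doubled: Int32Array = array.iter().map(|v| v.map(|x| x * 2)).collect();

    assert_eq!(doubled.value(0), 2);
    assert!(doubled.is_null(1));
}
```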

Improved Variable-sized Arrays

  • Variable-sized arrays (e.g., arrays of strings) have been internally refactored to more easily support their larger (64-bit offset) variants. This allowed us to generalize some of the kernels over both (32-bit and 64-bit) versions and to perform type checks when building them.
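
For instance, the same string data can be stored with 32-bit offsets (StringArray) or 64-bit offsets (LargeStringArray); a small sketch, assuming the usual From<Vec<&str>> constructors:

```rust
use arrow::array::{Array, LargeStringArray, StringArray};

fn main() {
    // StringArray uses i32 offsets; LargeStringArray uses i64 offsets,
    // allowing more than 2 GiB of string data in a single array.
    let small = StringArray::from(vec!["arrow", "rust"]);
    let large = LargeStringArray::from(vec!["arrow", "rust"]);

    assert_eq!(small.value(1), "rust");
    assert_eq!(large.value(1), "rust");
    assert_eq!(small.len(), large.len());
}
```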

Kernels

There have been numerous improvements in the Arrow compute kernels (a short usage sketch follows this list), including:

  • New kernels have been added for string operations, including substring, min, max, concat, and length.
  • Aggregate sum is now implemented with SIMD, giving a 5x improvement over the non-SIMD operation.
  • Many kernels have been improved to support dictionary-encoded arrays.
  • Some kernels were optimized for arrays without nulls, making them significantly faster in that case.
  • Many kernels were optimized to reduce the number of memory copies needed to apply them, and their implementations were further tuned.
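
For example, the aggregate kernels can be called directly on arrays (a minimal sketch; module paths follow the 2.0-era arrow::compute layout):

```rust
use arrow::array::Int32Array;
use arrow::compute::kernels::aggregate::{max, min, sum};

fn main() {
    let values = Int32Array::from(vec![Some(5), None, Some(7), Some(1)]);

    // Aggregate kernels skip null slots; sum takes the SIMD code path when
    // the crate's SIMD feature is enabled.
    assert_eq!(sum(&values), Some(13));
    assert_eq!(min(&values), Some(1));
    assert_eq!(max(&values), Some(7));
}
```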

Other Improvements

The Array trait now has get_buffer_memory_size and get_array_memory_size methods for determining the amount of memory allocated for the array.
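
For example, a short sketch:

```rust
use arrow::array::{Array, Int64Array};

fn main() {
    let array = Int64Array::from(vec![1, 2, 3, 4, 5]);

    // Bytes held by the array's buffers (values and, if present, the null bitmap)...
    println!("buffer bytes: {}", array.get_buffer_memory_size());
    // ...plus the memory occupied by the array structure itself.
    println!("array bytes:  {}", array.get_array_memory_size());
}
```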

Parquet

A significant effort is underway to create a Parquet writer for Arrow data. This work has not been released as part of 2.0.0, and is planned for the 3.0.0 release. The development of this writer is being carried out on the rust-parquet-arrow-writer branch, and the branch is regularly synchronized with the main branch. As part of the writer, the necessary improvements and features are being added to the reader.

The main focus areas are:

  • Supporting nested Arrow types, such as List<Struct<[Dictionary, String]>>
  • Ensuring correct round-trip between the reader and writer by encoding Arrow schemas in the Parquet metadata
  • Improving null value writing for Parquet

A new parquet_derive crate has been created, which allows users to derive Parquet records for simple structs. Refer to the parquet_derive crate for usage examples.
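
A rough sketch of what that looks like, with a hypothetical InventoryItem struct (field types are limited to the simple scalar and string types the derive supports; the writer plumbing is omitted):

```rust
use parquet_derive::ParquetRecordWriter;

// The derive generates a RecordWriter implementation, so a slice of these
// structs can be written into a Parquet row group without hand-written
// per-column code.
#[derive(ParquetRecordWriter)]
struct InventoryItem {
    id: i64,
    name: String,
    in_stock: bool,
}
```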

DataFusion

DataFusion is an in-memory query engine with DataFrame and SQL APIs, built on top of the core Arrow support.

DataFrame API

DataFusion now has a richer DataFrame API with improved documentation showing example usage. It supports the following operations (a usage sketch follows the list):

  • select_columns
  • select
  • filter
  • aggregate
  • limit
  • sort
  • collect
  • explain
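
A sketch of how these operations fit together, assuming a CSV file example.csv with columns a and b (written against the async API introduced in this release; details such as the prelude re-exports and whether collect needs .await may differ slightly between versions):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())?;

    // Build the query with the DataFrame API: filter, aggregate, then limit.
    let df = ctx
        .table("example")?
        .filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?
        .limit(100)?;

    // Execute the plan and collect the results as RecordBatches.
    let results = df.collect().await?;
    println!("collected {} batches", results.len());
    Ok(())
}
```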

Performance & Scalability

DataFusion query execution now uses async/await with the tokio threaded runtime rather than launching dedicated threads, making queries scale much better across available cores.

The hash aggregate physical operator has been largely re-written, resulting in significant performance improvements.

Expressions and Compute

Improved Scalar Functions

DataFusion has many new functions, available in both the SQL and DataFrame APIs (a usage sketch follows this list):

  • Length of a string
  • COUNT(DISTINCT column)
  • to_timestamp
  • IsNull and IsNotNull
  • Min/Max for strings (lexicographic order)
  • Array of columns
  • Concatenation of strings
  • Aliases of aggregate expressions
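
Several of these can be exercised directly from SQL; a sketch, using a hypothetical events table with user_id and name columns:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())?;

    // COUNT(DISTINCT ...), MIN/MAX on strings, and IS NOT NULL in one query.
    let df = ctx.sql(
        "SELECT COUNT(DISTINCT user_id), MIN(name), MAX(name) \
         FROM events WHERE name IS NOT NULL",
    )?;
    let batches = df.collect().await?;
    println!("collected {} batches", batches.len());
    Ok(())
}
```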

Many existing expressions were also significantly optimized (2-3x speedups) by avoiding memory copies and leveraging the Arrow format’s invariants.

Unary mathematical functions (such as sqrt) now support both 32-bit and 64-bit floats and return the corresponding type, thereby allowing faster operations when higher precision is not needed.

Improved User-defined Functions (UDFs)

The API to use and register UDFs has been significantly improved, allowing users to register UDFs and call them both via SQL and the DataFrame API. UDFs now also have the same generality as DataFusion’s functions, including variadic and dynamically typed arguments.
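
The general shape looks roughly like the following sketch, which registers a UDF that squares a Float64 column and calls it from SQL. Treat the details as assumptions rather than the exact 2.0.0 API: the create_udf argument list and module paths have shifted across DataFusion releases.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Float64Array};
use arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::prelude::*;

// The UDF body receives one Arrow array per argument and returns a new array.
fn square(args: &[ArrayRef]) -> Result<ArrayRef> {
    let input = args[0].as_any().downcast_ref::<Float64Array>().unwrap();
    let result: Float64Array = input.iter().map(|v| v.map(|x| x * x)).collect();
    Ok(Arc::new(result) as ArrayRef)
}

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("t", "t.csv", CsvReadOptions::new())?;

    // Describe the argument and return types, then register the UDF so it is
    // callable from both SQL and the DataFrame API.
    let udf = create_udf(
        "square",
        vec![DataType::Float64],
        Arc::new(DataType::Float64),
        Arc::new(square),
    );
    ctx.register_udf(udf);

    let batches = ctx.sql("SELECT square(a) FROM t")?.collect().await?;
    println!("collected {} batches", batches.len());
    Ok(())
}
```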

User-defined Aggregate Functions (UDAFs)

DataFusion now supports user-defined aggregate functions, which can be used to perform operations that span multiple rows, batches, and partitions. UDAFs have the same generality as DataFusion’s built-in functions and support both row updates and batch updates. You can check out this example to learn how to declare and use a UDAF.

User-defined Constants

DataFusion now supports registering constants (e.g. “@version”) that live for the duration of the execution context and can be accessed from SQL.

Query Planning & Optimization

User-defined logical plans

The LogicalPlan enum is now extensible through an Extension variant that accepts a user-defined logical node trait object via dynamic dispatch. Consequently, DataFusion now supports user-defined logical nodes, thereby allowing complex nodes to be planned and executed. You can check out this example to learn how to declare a new node.

Predicate push-down

DataFusion now has a Predicate push-down optimizer rule that pushes filter operations as close as possible to scans, thereby speeding up the physical execution of suboptimal queries created via the DataFrame API.

SQL

DataFusion now uses a more recent release of the sqlparser crate, which has much more comprehensive support for SQL syntax and also supports multiple dialects (Postgres, MS SQL, and MySQL).

It is now possible to see the query plan for a SQL statement using EXPLAIN syntax.
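
For example (a minimal sketch; example.csv and its columns are hypothetical, and pretty printing may require the arrow crate's prettyprint feature):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())?;

    // EXPLAIN returns the query plan as a result set instead of running the query.
    let plan = ctx
        .sql("EXPLAIN SELECT a, b FROM example WHERE a > 10")?
        .collect()
        .await?;
    arrow::util::pretty::print_batches(&plan)?;
    Ok(())
}
```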

Benchmarks

The benchmark crate now contains a new benchmark based on TPC-H that can execute TPC-H query 1 against CSV, Parquet, and memory data sources. This is useful for running benchmarks against larger data sets.

Integration Testing / IPC

Arrow IPC is the format for serialization and interprocess communication. It is described at arrow.apache.org and is used for file and stream I/O between applications wishing to interchange Arrow data.

The Arrow project released version 5 of the Arrow IPC format with Arrow 1.0.0. Before that, a message padding change was made in version 0.15.0, changing the default padding to 8 bytes while remaining on IPC version 4; Arrow 0.14.1 and earlier used the legacy 4-byte alignment. As part of 2.0.0, the Rust implementation was updated to comply with the changes up to Arrow 0.15.0. Work on supporting IPC version 5 is underway and is expected to be completed in time for 3.0.0.

As part of the conformance work, Rust is being added to the Arrow integration suite, which tests that supported language implementations (ARROW-3690):

  • Comply with the Arrow IPC format
  • Can read and write each other’s generated data

IPC version 4 is being verified through this work.

Roadmap for 3.0.0 and Beyond

Here are some of the initiatives that contributors are currently working on for future releases:

  • Support stable Rust
  • Improved DictionaryArray support and performance
  • Implement inner equijoins
  • Support for various platforms like ARMv8
  • Supporting the C Data Interface from Rust to better support interoperability with other Arrow implementations

How to Get Involved

If you are interested in contributing to the Rust subproject in Apache Arrow, you can find a list of open issues suitable for beginners here and the full list here.

Other ways to get involved include trying out Arrow on some of your data, filing bug reports, and helping to improve the documentation.