February 2022 Rust Apache Arrow and Parquet Highlights


Published 13 Feb 2022
By The Apache Arrow PMC (pmc)

The Rust implementation of Apache Arrow has just released version 9.0.2.

While a major version of this magnitude may shock some in the Rust community, to whom it implies a slow-moving, 20-year-old piece of software, nothing could be further from the truth!

With regular and predictable bi-weekly releases, the library continues to evolve rapidly, and 9.0.2 is no exception. Some recent highlights:

parquet: async, performance, safety and nested types

The parquet 9.0.2 release includes an async reader, a long-requested feature. Using the async reader, it is now possible to read only the relevant parts of a parquet file from a networked source such as object storage. Previously, the entire file had to be buffered locally. We are hoping to add an async writer in a future release and would love some help.
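To illustrate why ranged reads help, here is a minimal sketch that uses only the Rust standard library. It is not the parquet crate's API. It relies on one property of the file format: a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1, so a reader can locate the footer metadata by fetching just the last 8 bytes. The names footer_metadata_range and sample_file are hypothetical, and the sample bytes are placeholders rather than real Thrift-encoded metadata.

```rust
use std::io::{Cursor, Error, ErrorKind, Read, Seek, SeekFrom};

/// Magic bytes that open and close every Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: returns the (start, end) byte range of the footer
/// metadata while reading only the final 8 bytes of the source.
fn footer_metadata_range<R: Read + Seek>(src: &mut R) -> std::io::Result<(u64, u64)> {
    let file_len = src.seek(SeekFrom::End(0))?;
    // Smallest possible file: leading magic (4) + length (4) + trailing magic (4).
    if file_len < 12 {
        return Err(Error::new(ErrorKind::InvalidData, "file too small"));
    }

    // One small ranged read instead of buffering the whole file.
    src.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    src.read_exact(&mut tail)?;

    if tail[4..] != MAGIC[..] {
        return Err(Error::new(ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    let meta_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;

    let end = file_len - 8;
    // The metadata must not overlap the leading magic bytes.
    if meta_len > end - 4 {
        return Err(Error::new(ErrorKind::InvalidData, "metadata length out of bounds"));
    }
    Ok((end - meta_len, end))
}

/// Builds an in-memory stand-in for a file (placeholder contents only).
fn sample_file() -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(MAGIC);
    bytes.extend_from_slice(b"column-data"); // stands in for column chunks
    bytes.extend_from_slice(b"metadata"); // stands in for the footer metadata
    bytes.extend_from_slice(&8u32.to_le_bytes());
    bytes.extend_from_slice(MAGIC);
    bytes
}

fn main() -> std::io::Result<()> {
    let mut src = Cursor::new(sample_file());
    let (start, end) = footer_metadata_range(&mut src)?;
    println!("footer metadata occupies bytes {}..{}", start, end);
    Ok(())
}
```

With object storage, the two seeks would correspond to ranged requests. Once the metadata has been decoded, a reader can fetch only the column chunks a query needs.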

It is also significantly faster to read parquet data (up to 60x in some cases) than with previous versions of the parquet crate. Kudos to tustvold and yordan-pavlov for their contributions in these areas.

With 8.0.0 and later, the code that reads and writes RecordBatches to and from Parquet now supports all types, including deeply nested structs and lists. Thanks to helgikrs for cleaning up the last corner cases!

Another notable recent addition to parquet is UTF-8 validation of string data, which improves security against malicious inputs.
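The idea behind this validation can be sketched with the standard library alone. The helper decode_string_value below is hypothetical and is not the parquet crate's code. It only shows the general pattern of checking untrusted bytes before treating them as a Rust string, since building a str from invalid UTF-8 is undefined behavior.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper: validates untrusted bytes before exposing them as &str.
/// The unsafe alternative, `str::from_utf8_unchecked`, skips this check and is
/// only sound when the bytes are already known to be valid UTF-8.
fn decode_string_value(bytes: &[u8]) -> Result<&str, Utf8Error> {
    str::from_utf8(bytes)
}

fn main() {
    // Well-formed input decodes without copying.
    match decode_string_value(b"arrow") {
        Ok(s) => println!("valid: {}", s),
        Err(e) => println!("invalid: {}", e),
    }

    // 0xFF can never appear in UTF-8, so this input is rejected.
    match decode_string_value(&[0xff, 0xfe]) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => println!("invalid: {}", e),
    }
}
```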

Planned upcoming work includes pushing more filtering directly into the parquet scan as well as an async writer.

arrow: performance, dyn kernels, and DecimalArray

The compute kernels have been improved significantly in arrow 9.0.2. Some filter benchmarks are twice as fast and the SIMD kernels are also significantly faster. Many thanks to tustvold and jhorstmann. Additional substantial improvements are likely to land in arrow 10.0.0.

We are working on a new set of “dynamic” dyn_ kernels (for example, eq_dyn) that make it easier to invoke the heavily optimized kernels provided by the arrow crate. Work is underway to expand the breadth of types supported by these new kernels to make them even more useful. Thanks to matthewmturner and viirya for their help in this effort.
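The following toy sketch shows the general shape of such a dynamically dispatched kernel. It uses only the standard library, and every name in it (ToyArray, ToyInt32, ToyUtf8, toy_eq_dyn) is a hypothetical stand-in rather than the arrow crate's API. It also ignores null handling. The point is that the caller passes type-erased arrays, and the kernel works out which typed implementation to run.

```rust
use std::any::Any;

/// Toy stand-in for a type-erased array (not the arrow crate's Array trait).
trait ToyArray {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
}

struct ToyInt32(Vec<i32>);
struct ToyUtf8(Vec<String>);

impl ToyArray for ToyInt32 {
    fn as_any(&self) -> &dyn Any { self }
    fn len(&self) -> usize { self.0.len() }
}

impl ToyArray for ToyUtf8 {
    fn as_any(&self) -> &dyn Any { self }
    fn len(&self) -> usize { self.0.len() }
}

/// The statically typed kernel that does the actual comparison.
fn eq_typed<T: PartialEq>(left: &[T], right: &[T]) -> Vec<bool> {
    left.iter().zip(right).map(|(l, r)| l == r).collect()
}

/// Dynamic entry point: inspects the runtime types and dispatches to the
/// matching typed kernel, so callers do not have to downcast themselves.
fn toy_eq_dyn(left: &dyn ToyArray, right: &dyn ToyArray) -> Result<Vec<bool>, String> {
    if left.len() != right.len() {
        return Err("arrays must have the same length".to_string());
    }
    let (l, r) = (left.as_any(), right.as_any());
    if let (Some(l), Some(r)) = (l.downcast_ref::<ToyInt32>(), r.downcast_ref::<ToyInt32>()) {
        return Ok(eq_typed(&l.0, &r.0));
    }
    if let (Some(l), Some(r)) = (l.downcast_ref::<ToyUtf8>(), r.downcast_ref::<ToyUtf8>()) {
        return Ok(eq_typed(&l.0, &r.0));
    }
    Err("unsupported or mismatched array types".to_string())
}

fn main() {
    let a = ToyInt32(vec![1, 2, 3]);
    let b = ToyInt32(vec![1, 0, 3]);
    println!("{:?}", toy_eq_dyn(&a, &b));

    // Mixing element types is reported as an error rather than a panic.
    let s = ToyUtf8(vec!["x".to_string(), "y".to_string(), "z".to_string()]);
    println!("{:?}", toy_eq_dyn(&a, &s));
}
```

Supporting a new element type in this design means adding one more dispatch arm, which mirrors the ongoing work described above to broaden the types the dyn_ kernels accept.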

While arrow has had basic support for DecimalArray since version 3.0.0, support for the Decimal type in kernels such as sort, take and filter has been expanded thanks to some great contributions from liukun4515. There is ongoing work to improve the API ergonomics and performance of DecimalArray as well.

Security

The 6.4.0 release resolved the last outstanding RUSTSEC advisory on the arrow crate, and the 8.0.0 release resolved the last outstanding known security issues. While these security issues were mostly limited to misuse of the low-level “power user” APIs, which most users do not (and should not) use, it was good to tighten up that area.

Now that arrow-rs is releasing major versions every other week, we are also able to update dependencies at the same pace, helping to ensure that security fixes upstream can flow more quickly to downstream projects.

Final shoutout

It takes a community to build great software, and we would like to thank everyone who has contributed to the arrow-rs repository since the 7.0.0 release:

git shortlog -sn 7.0.0..9.0.0
    22  Raphael Taylor-Davies
    18  Andrew Lamb
     6  Helgi Kristvin Sigurbjarnarson
     6  Remzi Yang
     5  Jörn Horstmann
     4  Liang-Chi Hsieh
     3  Jiayu Liu
     2  dependabot[bot]
     2  Yijie Shen
     1  Matthew Turner
     1  Kun Liu
     1  Yang
     1  Edd Robinson
     1  Patrick More

How to Get Involved

If you are interested in contributing to the Rust subproject in Apache Arrow, you can find a list of open issues suitable for beginners here and the full list here.

Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to improve the documentation.