June 2022 Rust Apache Arrow and Parquet 16.0.0 Highlights

Published 16 Jun 2022
By The Apache Arrow PMC (pmc)

Introduction

We recently celebrated releasing version 16.0.0 of the Rust implementation of Apache Arrow. While we still get a few comments on “most rust libraries use versions 0.x.0, why are you at 16.0.0?”, our versioning scheme appears to be working well, and permits quick releases of new features and API evolution in a semver compatible way without breaking downstream projects.

This post contains highlights from the last four months (versions 10.0.0 to 16.0.0) of arrow-rs and parquet-rs development as well as a roadmap of future work. The full list of awesomeness can be found in the CHANGELOG.

As you may remember, the arrow and parquet implementations are in the same crate, on the same release schedule, and in this same blog. This is not for technical reasons, but helps to keep the maintenance burden for delivering great Apache software reasonable, and allows easier development of optimized conversion between Arrow <–> Parquet formats.

Parquet

The parquet crate has seen a return to substantial improvements after being relatively dormant for several years. The current major areas of focus are

Performance: Improving the raw performance for reading and writing mirroring the efforts that went into the C++ version a few years ago.
API Ease of Use: Improving the API so it is easy to use efficiently with modern Rust for two preeminent use cases: 1) reading from local disk and 2) reading asynchronously from remote object stores.

Some Major Highlights:

Advanced Metadata Access: API access to advanced parquet metadata, such as PageEncoding, BloomFilters and PageIndex.
Improved API Usability: For example, the parquet writer now uses std:io::Write rather than a custom ParquetWriter trait, making it more interoperable with the rest of the Rust ecosystem and the projection API is easier to use with nested types.
Rewritten support for nested types (e.g. struct, lists) : @tustvold has revamped / rewritten support for reading and writing structured types, which both improved the support for arbitrary nested schemas, and is 30% faster.

Looking Forward:

Even Faster: We are actively working to make writing even faster and expect to see some major improvements over the next few releases.
Object Store Integration: Support for easily and efficiently reading/writing to/from object storage is improving, and we expect it will soon work well out of the box, fetching the minimal bytes, etc… More on this to follow in a separate blog post.
Parallel Decode: We intend to transparently support high performance parallel decoding of parquet to arrow arrays, when invoked from a rayon threadpool.

Arrow

The Rust arrow implementation has also had substantial improvements, in addition to bug fixes and performance improvements.

Some Major Highlights:

Ecosystem Compatibility: @viriya has put in a massive effort to improve (and prove) compatibility with other Arrow implementations via the Rust IPC integration tests. There have been major improvements for corner cases involving nested structures, nullability, nested dictionaries, etc.
Safety: We continue to improve the safety of arrow, and it is not possible to trigger undefined behavior using safe apis – checkout the README and the module level rustdocs for more details. Among other things, we have added additional validation checking to string kernels and DecimalArrays and sealed some sensitive traits.
Performance: There have been several major performance improvements such as much faster filter kernels, thanks to @tustvold.
Easier to Use APIs: Several of the APIs are now easier to use (e.g. #1645 and #1739 which lowers the barrier to entry of using arrow-rs, thanks to @HaoYang670.
DataType::Null support: is much improved, such as in the cast kernels, thanks to @WinkerDu.
Improved JSON reader: The JSON reader is now easier to use thanks to @sum12.

Looking Forward:

Make ArrayData Easier to use Safely: Some amount of unsafe will likely always be required in arrow (for fast IPC, for example), but we are also working to improve the underlying ArrayData structure to make it more compatible with the ecosystem (e.g. use Bytes), support faster to decode from parquet, and to avoid bugs related to offsets (slicing) which are a frequent pain point.
FlightSQL – we have some initial support for Flight SQL thank to @wangfenjin and @timvw, though we would love to see some additional contributors. Such help can include a basic FlightSQL server, and starting work on clients.

Some areas looking for help include:

Decimal 256 support: https://github.com/apache/arrow-rs/issues/131
Support for negative Decimal scale: https://github.com/apache/arrow-rs/issues/1785
Support IPC file compression: https://github.com/apache/arrow-rs/issues/1709
Zero-copy bitmap slicing: https://github.com/apache/arrow-rs/issues/1802

Contributors:

While some open source software can be created mostly by a single contributor, we believe the greatest software with the largest impact and reach is built around its community. Thus, Arrow is part of the Apache Software Foundation and our releases both past and present are a result of our amazing community’s effort.

We would like to thank everyone who has contributed to the arrow-rs repository since the 9.0.2 release. Keep up the great work and we look forward to continued improvements:

git shortlog -sn 9.0.0..16.0.0
Liang-Chi Hsieh
Raphael Taylor-Davies
Andrew Lamb
Remzi Yang
Sergey Glushchenko
Jörn Horstmann
Shani Solomon
dependabot[bot]
Yang Jiang
jakevin
Chao Sun
Yijie Shen
kazuhiko kikuchi
Sumit
Ismail-Maj
Kamil Konior
tfeda
Matthew Turner
iyupeng
ryan-jacobs1
Alex Qyoun-ae
tjwilson90
Andy Grove
Atef Sawaed
Daniël Heres
DuRipeng
Helgi Kristvin Sigurbjarnarson
Kun Liu
Kyle Barron
Marc Garcia
Peter C. Jentsch
Remco Verhoef
Sven Cattell
Thomas Peiselt
Tiphaine Ruy
Trent Feda
Wang Fenjin
Ze'ev Maor
diana

Join the community

If you are interested in contributing to the Rust subproject in Apache Arrow, you can find a list of open issues suitable for beginners here and the full list here.

Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to improve the documentation.