June 2022 Rust Apache Arrow and Parquet 16.0.0 Highlights
16 Jun 2022
By The Apache Arrow PMC (pmc)
We recently celebrated releasing version 16.0.0 of the Rust implementation of Apache Arrow. While we still get a few comments on “most rust libraries use versions 0.x.0, why are you at 16.0.0?”, our versioning scheme appears to be working well, and permits quick releases of new features and API evolution in a semver compatible way without breaking downstream projects.
This post contains highlights from the last four months (versions 10.0.0 to 16.0.0) of arrow-rs and parquet-rs development as well as a roadmap of future work. The full list of awesomeness can be found in the CHANGELOG.
As you may remember, the arrow and parquet implementations are in the same crate, on the same release schedule, and in this same blog. This is not for technical reasons, but helps to keep the maintenance burden for delivering great Apache software reasonable, and allows easier development of optimized conversion between Arrow <–> Parquet formats.
The parquet crate has seen a return to substantial improvements after being relatively dormant for several years. The current major areas of focus are
- Performance: Improving the raw performance for reading and writing mirroring the efforts that went into the C++ version a few years ago.
- API Ease of Use: Improving the API so it is easy to use efficiently with modern Rust for two preeminent use cases: 1) reading from local disk and 2) reading
asynchronously from remote object stores.
Some Major Highlights:
- Advanced Metadata Access: API access to advanced parquet metadata, such as PageEncoding, BloomFilters and PageIndex.
- Improved API Usability: For example, the parquet writer now uses
std:io::Writerather than a custom
ParquetWritertrait, making it more interoperable with the rest of the Rust ecosystem and the projection API is easier to use with nested types.
- Rewritten support for nested types (e.g. struct, lists) : @tustvold has revamped / rewritten support for reading and writing structured types, which both improved the support for arbitrary nested schemas, and is 30% faster.
- Even Faster: We are actively working to make writing even faster and expect to see some major improvements over the next few releases.
- Object Store Integration: Support for easily and efficiently reading/writing to/from object storage is improving, and we expect it will soon work well out of the box, fetching the minimal bytes, etc… More on this to follow in a separate blog post.
- Parallel Decode: We intend to transparently support high performance parallel decoding of parquet to arrow arrays, when invoked from a rayon threadpool.
The Rust arrow implementation has also had substantial improvements, in addition to bug fixes and performance improvements.
Some Major Highlights:
- Ecosystem Compatibility: @viriya has put in a massive effort to improve (and prove) compatibility with other Arrow implementations via the Rust IPC integration tests. There have been major improvements for corner cases involving nested structures, nullability, nested dictionaries, etc.
- Safety: We continue to improve the safety of arrow, and it is not possible to trigger undefined behavior using
safeapis – checkout the README and the module level rustdocs for more details. Among other things, we have added additional validation checking to string kernels and
DecimalArraysand sealed some sensitive traits.
- Performance: There have been several major performance improvements such as much faster filter kernels, thanks to @tustvold.
- Easier to Use APIs: Several of the APIs are now easier to use (e.g. #1645 and #1739 which lowers the barrier to entry of using
arrow-rs, thanks to @HaoYang670.
- DataType::Null support: is much improved, such as in the cast kernels, thanks to @WinkerDu.
- Improved JSON reader: The JSON reader is now easier to use thanks to @sum12.
- Make ArrayData Easier to use Safely: Some amount of
unsafewill likely always be required in arrow (for fast IPC, for example), but we are also working to improve the underlying
ArrayDatastructure to make it more compatible with the ecosystem (e.g. use
Bytes), support faster to decode from parquet, and to avoid bugs related to offsets (slicing) which are a frequent pain point.
- FlightSQL – we have some initial support for Flight SQL thank to @wangfenjin and @timvw, though we would love to see some additional contributors. Such help can include a basic FlightSQL server, and starting work on clients.
Some areas looking for help include:
- Decimal 256 support: https://github.com/apache/arrow-rs/issues/131
- Support for negative Decimal scale: https://github.com/apache/arrow-rs/issues/1785
- Support IPC file compression: https://github.com/apache/arrow-rs/issues/1709
- Zero-copy bitmap slicing: https://github.com/apache/arrow-rs/issues/1802
While some open source software can be created mostly by a single contributor, we believe the greatest software with the largest impact and reach is built around its community. Thus, Arrow is part of the Apache Software Foundation and our releases both past and present are a result of our amazing community’s effort.
We would like to thank everyone who has contributed to the arrow-rs repository since the
9.0.2 release. Keep up the great work and we look forward to continued improvements:
git shortlog -sn 9.0.0..16.0.0 47 Liang-Chi Hsieh 45 Raphael Taylor-Davies 43 Andrew Lamb 40 Remzi Yang 8 Sergey Glushchenko 7 Jörn Horstmann 6 Shani Solomon 6 dependabot[bot] 5 Yang Jiang 4 jakevin 4 Chao Sun 4 Yijie Shen 3 kazuhiko kikuchi 2 Sumit 2 Ismail-Maj 2 Kamil Konior 2 tfeda 2 Matthew Turner 1 iyupeng 1 ryan-jacobs1 1 Alex Qyoun-ae 1 tjwilson90 1 Andy Grove 1 Atef Sawaed 1 Daniël Heres 1 DuRipeng 1 Helgi Kristvin Sigurbjarnarson 1 Kun Liu 1 Kyle Barron 1 Marc Garcia 1 Peter C. Jentsch 1 Remco Verhoef 1 Sven Cattell 1 Thomas Peiselt 1 Tiphaine Ruy 1 Trent Feda 1 Wang Fenjin 1 Ze'ev Maor 1 diana
Join the community
Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to improve the documentation.