June 2022 Rust Apache Arrow and Parquet 16.0.0 Highlights


Published 16 Jun 2022
By The Apache Arrow PMC (pmc)

Introduction

We recently celebrated releasing version 16.0.0 of the Rust implementation of Apache Arrow. While we still get a few comments on “most rust libraries use versions 0.x.0, why are you at 16.0.0?”, our versioning scheme appears to be working well, and permits quick releases of new features and API evolution in a semver compatible way without breaking downstream projects.
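To make the versioning point concrete, here is a minimal sketch of how a Cargo-style caret requirement (the default when a dependency is written as `arrow = "16.0.0"`) decides which releases count as compatible. The `caret_compatible` function and the `Version` alias are hypothetical names invented for this illustration; the sketch ignores pre-release tags and is not Cargo's actual implementation.

```rust
/// A (major, minor, patch) version triple. Hypothetical helper type for this sketch.
type Version = (u64, u64, u64);

/// Returns true if `candidate` satisfies the caret requirement `^req`.
/// Rule: the leftmost non-zero component of `req` must stay the same.
fn caret_compatible(req: Version, candidate: Version) -> bool {
    // A compatible release is never older than the requirement
    // (tuples compare lexicographically: major, then minor, then patch).
    if candidate < req {
        return false;
    }
    match req {
        // ^0.0.z matches only 0.0.z itself.
        (0, 0, _) => candidate == req,
        // ^0.y.z allows patch updates within 0.y only.
        (0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        // ^x.y.z allows minor and patch updates within major version x.
        (major, _, _) => candidate.0 == major,
    }
}

fn main() {
    // With major versions, a breaking release is signalled by 16 -> 17.
    assert!(caret_compatible((16, 0, 0), (16, 1, 0)));
    assert!(!caret_compatible((16, 0, 0), (17, 0, 0)));
    // With 0.x versions, the minor number plays the same role.
    assert!(caret_compatible((0, 16, 0), (0, 16, 3)));
    assert!(!caret_compatible((0, 16, 0), (0, 17, 0)));
    println!("all caret checks passed");
}
```

Under either scheme a downstream project only picks up a breaking release when it opts in, which is why frequent major version bumps do not break existing users.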

This post contains highlights from the last four months (versions 10.0.0 to 16.0.0) of arrow-rs and parquet-rs development as well as a roadmap of future work. The full list of awesomeness can be found in the CHANGELOG.

As you may remember, the arrow and parquet implementations are in the same repository, on the same release schedule, and in this same blog. This is not for technical reasons, but helps to keep the maintenance burden for delivering great Apache software reasonable, and allows easier development of optimized conversion between Arrow <–> Parquet formats.

Parquet

The parquet crate has seen a return to substantial improvements after being relatively dormant for several years. The current major areas of focus are:

  1. Performance: Improving the raw performance for reading and writing, mirroring the efforts that went into the C++ version a few years ago.
  2. API Ease of Use: Improving the API so it is easy to use efficiently with modern Rust for two preeminent use cases: 1) reading from local disk and 2) reading asynchronously from remote object stores.

Some Major Highlights:

Looking Forward:

  • Even Faster: We are actively working to make writing even faster and expect to see some major improvements over the next few releases.
  • Object Store Integration: Support for easily and efficiently reading/writing to/from object storage is improving, and we expect it will soon work well out of the box, fetching the minimal bytes, etc… More on this to follow in a separate blog post.
  • Parallel Decode: We intend to transparently support high performance parallel decoding of parquet to arrow arrays, when invoked from a rayon threadpool.

Arrow

The Rust arrow implementation has also had substantial improvements, in addition to bug fixes and performance improvements.

Some Major Highlights:

  • Ecosystem Compatibility: @viirya has put in a massive effort to improve (and prove) compatibility with other Arrow implementations via the Rust IPC integration tests. There have been major improvements for corner cases involving nested structures, nullability, nested dictionaries, etc.
  • Safety: We continue to improve the safety of arrow, and it is not possible to trigger undefined behavior using safe APIs – check out the README and the module level rustdocs for more details. Among other things, we have added additional validation checking to string kernels and DecimalArrays and sealed some sensitive traits.
  • Performance: There have been several major performance improvements such as much faster filter kernels, thanks to @tustvold.
  • Easier to Use APIs: Several of the APIs are now easier to use (e.g. #1645 and #1739), which lowers the barrier to entry for using arrow-rs, thanks to @HaoYang670.
  • DataType::Null support: Support for DataType::Null is much improved, such as in the cast kernels, thanks to @WinkerDu.
  • Improved JSON reader: The JSON reader is now easier to use thanks to @sum12.

Looking Forward:

  • Make ArrayData Easier to use Safely: Some amount of unsafe will likely always be required in arrow (for fast IPC, for example), but we are also working to improve the underlying ArrayData structure to make it more compatible with the ecosystem (e.g. use Bytes), to support faster decoding from parquet, and to avoid bugs related to offsets (slicing), which are a frequent pain point.
  • FlightSQL: We have some initial support for Flight SQL thanks to @wangfenjin and @timvw, though we would love to see some additional contributors. Such help can include a basic FlightSQL server, and starting work on clients.
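To illustrate why offsets (slicing) are a frequent pain point, here is a minimal sketch using only the standard library. `SlicedInts` is a hypothetical, heavily simplified stand-in for an array that shares a buffer and tracks an offset and length; it is not the actual ArrayData structure or API. The bug class it shows is reading the underlying buffer without applying the slice offset.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a sliced array: a shared buffer plus a
/// logical window described by `offset` and `len`.
struct SlicedInts {
    buffer: Arc<[i32]>,
    offset: usize,
    len: usize,
}

impl SlicedInts {
    fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { buffer: Arc::from(values), offset: 0, len }
    }

    /// Zero-copy slice: shares the buffer and only adjusts offset and length.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    /// Correct access: the logical index is shifted by the slice offset.
    fn value(&self, i: usize) -> i32 {
        assert!(i < self.len, "index out of bounds");
        self.buffer[self.offset + i]
    }

    /// Buggy access: indexes the raw buffer and forgets the offset.
    fn value_ignoring_offset(&self, i: usize) -> i32 {
        self.buffer[i]
    }
}

fn main() {
    let array = SlicedInts::new(vec![10, 20, 30, 40, 50]);
    let sliced = array.slice(2, 2); // logically [30, 40]

    assert_eq!(sliced.value(0), 30); // correct
    assert_eq!(sliced.value_ignoring_offset(0), 10); // silently wrong

    // Offsets accumulate when a slice is sliced again.
    let nested = sliced.slice(1, 1); // logically [40]
    assert_eq!(nested.value(0), 40);
    println!("offset checks passed");
}
```

The buggy accessor returns a plausible-looking value rather than failing, which is what makes this kind of mistake hard to spot in code that works with raw buffers.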

Some areas looking for help include:

Contributors:

While some open source software can be created mostly by a single contributor, we believe the greatest software with the largest impact and reach is built around its community. Thus, Arrow is part of the Apache Software Foundation and our releases both past and present are a result of our amazing community’s effort.

We would like to thank everyone who has contributed to the arrow-rs repository since the 9.0.2 release. Keep up the great work and we look forward to continued improvements:

git shortlog -sn 9.0.0..16.0.0
    47  Liang-Chi Hsieh
    45  Raphael Taylor-Davies
    43  Andrew Lamb
    40  Remzi Yang
     8  Sergey Glushchenko
     7  Jörn Horstmann
     6  Shani Solomon
     6  dependabot[bot]
     5  Yang Jiang
     4  jakevin
     4  Chao Sun
     4  Yijie Shen
     3  kazuhiko kikuchi
     2  Sumit
     2  Ismail-Maj
     2  Kamil Konior
     2  tfeda
     2  Matthew Turner
     1  iyupeng
     1  ryan-jacobs1
     1  Alex Qyoun-ae
     1  tjwilson90
     1  Andy Grove
     1  Atef Sawaed
     1  Daniël Heres
     1  DuRipeng
     1  Helgi Kristvin Sigurbjarnarson
     1  Kun Liu
     1  Kyle Barron
     1  Marc Garcia
     1  Peter C. Jentsch
     1  Remco Verhoef
     1  Sven Cattell
     1  Thomas Peiselt
     1  Tiphaine Ruy
     1  Trent Feda
     1  Wang Fenjin
     1  Ze'ev Maor
     1  diana

Join the community

If you are interested in contributing to the Rust subproject in Apache Arrow, you can find a list of open issues suitable for beginners here and the full list here.

Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to improve the documentation.