Apache Arrow Rust 5.0.0 Release

Published 29 Jul 2021
By The Apache Arrow PMC (pmc)

We recently released version 5.0.0 of the Rust implementation of Apache Arrow, which coincides with the Arrow 5.0.0 release. This post highlights some of the improvements in the Rust implementation. The full changelog can be found here.

The Rust Arrow implementation would not be possible without the wonderful work and support of our community, and the 5.0.0 release is no exception. It includes 161 commits from 34 individual contributors, many of them with their first contribution. Thank you all very much.

Arrow

Feature-wise, this release adds:

A new kernel which lexicographically partitions points
Expanded support for the FFI/C data interface, easing integration with the broader Arrow ecosystem
Usability enhancements for creating and manipulating RecordBatches
Improved usability for Arrow Flight’s API
Slimmer dependency stack when default features are disabled
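
To give a feel for what lexicographic partitioning means, here is a minimal sketch that uses only the Rust standard library. It does not call the arrow crate and is not the kernel's implementation: the function name `partition_ranges` is hypothetical, and the sketch assumes every column has the same element type and that the rows are already sorted lexicographically. Given such columns, it returns the ranges of consecutive rows whose values are equal in every column.

```rust
use std::ops::Range;

/// Splits the row indices `0..len` into maximal runs of consecutive rows
/// that are equal across every column.
///
/// Assumption: the columns are already sorted lexicographically (by the
/// first column, with ties broken by the second, and so on).
fn partition_ranges<T: PartialEq>(columns: &[Vec<T>]) -> Vec<Range<usize>> {
    // With no columns there are no rows to partition.
    let len = columns.first().map_or(0, |c| c.len());
    assert!(
        columns.iter().all(|c| c.len() == len),
        "columns must have equal length"
    );

    let mut ranges = Vec::new();
    let mut start = 0;
    for row in 1..len {
        // A partition boundary sits wherever any column changes value.
        if columns.iter().any(|c| c[row] != c[row - 1]) {
            ranges.push(start..row);
            start = row;
        }
    }
    // Close the final run, if there were any rows at all.
    if len > 0 {
        ranges.push(start..len);
    }
    ranges
}
```

For the rows (1, 10), (1, 10), (1, 20), (2, 20), (2, 20), stored as two columns, this yields the ranges 0..2, 2..3 and 3..5. Ranges like these are the sort of building block a query engine can use to process sorted data group by group, for example in window functions.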

We continue to leverage the Rust ecosystem to deliver reliable and performant code. We made significant progress towards running the Rust test suite under Miri (an interpreter that detects undefined behavior, a sort of Valgrind for Rust) to catch memory access violations, and we expect it to be fully enabled in CI for the 5.1.0 release.

Of course, this release also contains bug fixes, performance improvements, and improved documentation examples. For the full list of changes, please consult the changelog.

Parquet

The parquet-derive crate now automatically derives the required parquet schema, and the parquet crate had several bug fixes and enhancements.

More Frequent Releases

Arrow releases major versions every three months. The Rust implementation has been experimenting with releasing minor version updates to speed the flow of new features and fixes. By implementing a new development process, as described in A New Development Workflow for Arrow’s Rust Implementation, we have successfully shipped four minor releases on the 4.x.x line, one every other week, without any reports of breakage.

You can always find the latest releases on crates.io: arrow, parquet, arrow-flight, and parquet-derive.

DataFusion & Ballista

DataFusion is an in-memory query engine with DataFrame and SQL APIs, built on top of Arrow. Ballista is a distributed compute platform. These projects are now in their own repository, and are no longer released in lock-step with Arrow. Expect further news in this area soon.

Roadmap for 6.0.0 and Beyond

Here are some of the initiatives that contributors are currently working on for future releases:

Improved performance of compute kernels
Date/time/timestamp/interval compute kernels
MapArray support
Preparations for removing the use of unsafe to make arrow faster and more secure – see the mailing list discussion for more details.

Contributors to 5.0.0:

Again, thank you to all the contributors for this release. Here is the raw git listing:

Jorge Leitao
Andrew Lamb
Jiayu Liu
Ritchie Vink
Wakahisa
Raphael Taylor-Davies
Daniël Heres
Andy Grove
Navin
Jörn Horstmann
Ádám Lippai
Dominik Moritz
Marco Neumann
Roee Shlomo
Michael Edwards
Steven
Krisztián Szűcs
Gary Pennington
Ben Chambers
Max Meldrum
Edd Robinson
Gang Liao
Chojan Shang
Boaz
Wes McKinney
Yordan Pavlov
baishen
hulunbier
kazuhiko kikuchi
Dmitry Patsura
Kornelijus Survila
Laurent Mazare
Manish Gill
Marc van Heerden

How to Get Involved

If you are interested in contributing to the Rust implementation of Apache Arrow, we would love to have you! You can help by trying out Arrow on your own data and projects and filing bug reports, or by contributing to the documentation, tests, or code. A list of open issues suitable for beginners is here and the full list is here.