Apache Arrow Rust 5.0.0 Release
Published
29 Jul 2021
By
The Apache Arrow PMC (pmc)
We recently released the 5.0.0 Rust version of Apache Arrow, which coincides with the Arrow 5.0.0 release. This post highlights some of the improvements in the Rust implementation. The full changelog can be found here.
The Rust Arrow implementation would not be possible without the wonderful work and support of our community, and the 5.0.0 release is no exception. It includes 161 commits from 34 individual contributors, many of them with their first contribution. Thank you all very much.
Arrow
Feature-wise, this release adds:
- A new kernel which lexicographically partitions points.
- Expanded support for the FFI/C data interface, easing integration with the broader Arrow ecosystem
- Usability enhancements for creating and manipulating RecordBatches.
- Improved usability for Arrow Flight’s API.
- Slimmer dependency stack when default features are disabled
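To give a feel for what lexicographic partitioning means, here is a minimal sketch using only the Rust standard library. It is not the arrow crate's API: `partition_ranges` is a hypothetical helper written for this illustration, and it assumes the kernel's job is to take rows that are already sorted lexicographically across several columns and return the index ranges in which all rows are equal.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of the arrow crate): given rows that are
/// already sorted lexicographically, return the index ranges of consecutive
/// equal rows. Each tuple stands in for one row across several columns.
fn partition_ranges<T: PartialEq>(rows: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..rows.len() {
        // A partition boundary sits wherever a row differs from its predecessor.
        if rows[i] != rows[i - 1] {
            ranges.push(start..i);
            start = i;
        }
    }
    // Close the final partition, unless there were no rows at all.
    if !rows.is_empty() {
        ranges.push(start..rows.len());
    }
    ranges
}
```

For the sorted rows (1, "a"), (1, "a"), (1, "b"), (2, "a"), (2, "a"), this helper returns the ranges 0..2, 2..3 and 3..5. Ranges of this kind are what a grouping or windowing operation would consume after a sort.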
We continue to leverage the Rust ecosystem to deliver reliable and performant code. We made significant progress towards running the Rust test suite under the MIRI checker (a sort of valgrind for Rust) for memory access violations, and we expect it to be fully enabled in CI for the 5.1.0 release.
Of course, this release also contains bug fixes, performance improvements, and improved documentation examples. For the full list of changes, please consult the changelog.
Parquet
The parquet-derive crate now automatically derives the required parquet schema, and the parquet crate had several bug fixes and enhancements.
More Frequent Releases
Arrow releases major versions every three months. The Rust implementation has been experimenting with releasing minor version updates to speed the flow of new features and fixes. By implementing a new development process, as described in A New Development Workflow for Arrow’s Rust Implementation, we have successfully shipped 4 minor releases on the 4.x.x line, one every other week, without any reports of breakage.
You can always find the latest releases on crates.io: arrow, parquet, arrow-flight, and parquet-derive.
DataFusion & Ballista
DataFusion is an in-memory query engine with DataFrame and SQL APIs, built on top of Arrow. Ballista is a distributed compute platform. These projects are now in their own repository, and are no longer released in lock-step with Arrow. Expect further news in this area soon.
Roadmap for 6.0.0 and Beyond
Here are some of the initiatives that contributors are currently working on for future releases:
- Improved performance of compute kernels
- Date/time/timestamp/interval compute kernels
- MapArray support
- Preparations for removing the use of unsafe to make arrow faster and more secure – see the mailing list discussion for more details.
Contributors to 5.0.0:
Again, thank you to all the contributors for this release. Here is the raw git listing:
28 Jorge Leitao
27 Andrew Lamb
15 Jiayu Liu
12 Ritchie Vink
10 Wakahisa
8 Raphael Taylor-Davies
6 Daniël Heres
5 Andy Grove
5 Navin
5 Jörn Horstmann
4 Ádám Lippai
4 Dominik Moritz
4 Marco Neumann
3 Roee Shlomo
3 Michael Edwards
2 Steven
2 Krisztián Szűcs
2 Gary Pennington
1 Ben Chambers
1 Max Meldrum
1 Edd Robinson
1 Gang Liao
1 Chojan Shang
1 Boaz
1 Wes McKinney
1 Yordan Pavlov
1 baishen
1 hulunbier
1 kazuhiko kikuchi
1 Dmitry Patsura
1 Kornelijus Survila
1 Laurent Mazare
1 Manish Gill
1 Marc van Heerden
How to Get Involved
If you are interested in contributing to the Rust implementation of Apache Arrow, we would love to have you! You can help by trying out Arrow on your own data and projects and filing bug reports, or by contributing to the documentation, tests, or code. A list of open issues suitable for beginners is here and the full list is here.