February 2023 Rust Apache Arrow Highlights


Published 13 Feb 2023
By The Apache Arrow PMC (pmc)

Introduction

With the recent release of version 32.0.0 of the Rust implementation of Apache Arrow, it seemed timely to highlight some of the community's work since the last update.

The most recent list of detailed changes can always be found in the CHANGELOG, with the full historical list available here.

Arrow

arrow and arrow-flight are native Rust implementations of Apache Arrow. Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

The Rust language offers best-in-class performance, memory safety, and the developer productivity of a modern programming language. These features make Rust an excellent choice for building modern, high-performance analytical systems. When combined, Rust and the Apache Arrow ecosystem are a compelling toolkit for building the next generation of systems.

The repository recently passed 1400 stars on GitHub, and the community has been focused on performance and feature completeness.

Major Highlights

  • New CSV and JSON Readers: The CSV and JSON readers have been revamped. Their performance has more than doubled, and they now support push-based parsing, which facilitates asynchronous streaming decoding from object storage.
  • Faster Build Times and Reduced Codegen: The arrow crate has been split into multiple smaller crates, and large kernels have been moved behind optional feature flags. These changes let downstream projects opt into a smaller dependency footprint and shorter build times, if desired.
  • Support for Copy-On-Write: Arrow arrays now support copy-on-write via the into_builder methods.
  • Comparable Row Format: Much faster multi-column sorting and grouping is now possible with the new spillable, comparable row format.
  • FlightSQL Support: FlightSQL support has been expanded
  • Mid-Level Flight Client: A new FlightClient is available that handles lower-level protocol details and offers easier-to-use encoding and decoding APIs.
  • IPC File Compression: Arrow IPC file compression with ZSTD and LZ4 is now fully supported.
  • Full Decimal Support: 256-bit decimals and negative scales can be created and manipulated using many kernels, such as arithmetic.
  • Improved Dictionary Support: Dictionaries are now transparently supported in most kernels.
  • Improved Temporal Support: Timestamps with Timezones and other temporal types are supported in many more kernels.
  • Improved Generics: Improved generics allow writing code generic over all arrays, or all arrays with the same layout.
  • Downcast Macros: Various helper macros are now available to simplify dynamic dispatch to statically typed implementations.
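To illustrate why a comparable row format speeds up multi-column sorting, here is a minimal, standard-library-only sketch of the idea: each row is encoded into a byte string whose plain byte-wise ordering matches the ordering of the original column values, so a sort needs only one byte comparison per pair of rows instead of a column-by-column comparison. This is not the byte layout used by the arrow crates. The functions `encode_i32`, `encode_str`, `encode_row`, and `sort_rows` are hypothetical names chosen for this example, and nulls, descending order, and other data types are left out.

```rust
/// Encode an i32 so that unsigned byte-wise comparison matches signed order:
/// flip the sign bit, then write the value big-endian.
fn encode_i32(v: i32, out: &mut Vec<u8>) {
    out.extend_from_slice(&((v as u32) ^ 0x8000_0000).to_be_bytes());
}

/// Encode a string so that byte-wise comparison matches lexicographic order,
/// even when one string is a prefix of another. Each 0x00 byte is escaped as
/// 0x00 0xFF, and the value is terminated with 0x00 0x00.
fn encode_str(s: &str, out: &mut Vec<u8>) {
    for &b in s.as_bytes() {
        out.push(b);
        if b == 0 {
            out.push(0xFF);
        }
    }
    out.extend_from_slice(&[0x00, 0x00]);
}

/// Encode one (name, score) row; columns are concatenated in sort-key order.
fn encode_row(name: &str, score: i32) -> Vec<u8> {
    let mut row = Vec::with_capacity(name.len() + 6);
    encode_str(name, &mut row);
    encode_i32(score, &mut row);
    row
}

/// Return the row indices sorted by (name, score), comparing only encoded bytes.
fn sort_rows(rows: &[(&str, i32)]) -> Vec<usize> {
    let keys: Vec<Vec<u8>> = rows.iter().map(|(n, s)| encode_row(n, *s)).collect();
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&a, &b| keys[a].cmp(&keys[b]));
    idx
}

fn main() {
    let rows = [("bob", 7), ("al", -3), ("alice", 10), ("al", 5)];
    for i in sort_rows(&rows) {
        println!("{:?}", rows[i]);
    }
}
```

Because every row becomes a single contiguous byte string, the same encoded keys can also serve as hash or grouping keys, which is the property that makes this style of format attractive for grouping as well as sorting.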

Parquet

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. The Apache Parquet implementation in Rust is one of the fastest and most sophisticated open source implementations available.

Major Highlights

  • Arbitrarily Nested Schema: Arbitrarily nested schemas can be read into and written from Arrow, as described in the series of blog posts on the topic.
  • Predicate Pushdown: The arrow reader now supports advanced predicate pushdown, including late materialization, as described here. See RowSelection and ArrowPredicate.
  • Bloom Filter Support: Support for both reading and writing bloom filters has been added.
  • CLI Tools: Additional CLI tools have been added to introspect and manipulate parquet data.
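The late materialization mentioned under Predicate Pushdown can be sketched conceptually as follows: the predicate is evaluated on the filter column first, the result is compressed into runs of rows to skip or select, and the remaining columns are then materialized only for the selected runs. This standard-library-only sketch does not use the parquet crate. `Run`, `selection_from_mask`, and `materialize` are hypothetical names for illustration, and in-memory slices stand in for encoded column data that a real reader could avoid decoding.

```rust
/// A run of consecutive rows that are either all selected or all skipped.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Run {
    len: usize,
    select: bool,
}

/// Collapse a per-row boolean mask into alternating skip/select runs.
fn selection_from_mask(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &keep in mask {
        if let Some(last) = runs.last_mut() {
            if last.select == keep {
                last.len += 1;
                continue;
            }
        }
        runs.push(Run { len: 1, select: keep });
    }
    runs
}

/// Copy only the selected runs of a column, stepping over skipped runs.
fn materialize<T: Clone>(column: &[T], selection: &[Run]) -> Vec<T> {
    let mut out = Vec::new();
    let mut pos = 0;
    for run in selection {
        if run.select {
            out.extend_from_slice(&column[pos..pos + run.len]);
        }
        pos += run.len;
    }
    out
}

fn main() {
    let price = [5, 20, 30, 1, 2, 50];
    let name = ["a", "b", "c", "d", "e", "f"];

    // Step 1: evaluate the predicate on the filter column only.
    let mask: Vec<bool> = price.iter().map(|&p| p >= 20).collect();
    let selection = selection_from_mask(&mask);

    // Step 2: materialize the other column only for the selected rows.
    let names = materialize(&name, &selection);
    println!("{:?} -> {:?}", selection, names);
}
```

Representing the selection as runs rather than a per-row mask is a design choice that pays off when predicates are selective: long skipped runs can be stepped over in one operation.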

Object Store

Modern analytic workloads increasingly make use of blob storage facilities, such as S3, to store large volumes of queryable data. A native Rust object storage implementation that works well with the Rust ecosystem in general, and with the Arrow IO abstractions in particular, is an important building block for many applications. The object_store crate was donated to the Apache Arrow project in July 2022 to fill this need, and while it follows a separate release schedule from the arrow and parquet crates, it forms an integral part of the overarching Arrow IO story.

Recent improvements include the following:

Major Highlights

  • Streaming Upload: Multipart uploads are now supported.
  • Minimised dependency footprint: Upstream SDKs are no longer used, improving consistency and reducing dependencies.
  • HTTP / WebDAV Support: Applications can read from arbitrary HTTP servers, with mutation and listing supported on WebDAV-compatible endpoints.
  • Configurable Networking: SOCKS proxies and advanced HTTP client configuration are now supported.
  • Serializable Configuration: Configuration information can now be easily serialized and deserialized.
  • Additional Authentication: Additional authentication options are now available for the various cloud providers.

Contributors

While some open source software can be created mostly by a single contributor, we believe the greatest software with the largest impact and reach is built around its community. Thus, Arrow is proud to be part of the Apache Software Foundation, and our releases, both past and present, are a result of our amazing community’s effort.

We would like to thank everyone who has contributed to the arrow-rs repository since the 16.0.0 release. Keep up the great work, and we look forward to continued improvements:

% git shortlog -sn 16.0.0..32.0.0
   347  Raphael Taylor-Davies
   166  Liang-Chi Hsieh
    94  Andrew Lamb
    36  Remzi Yang
    30  Kun Liu
    21  Yang Jiang
    20  askoa
    17  dependabot[bot]
    15  Vrishabh
    12  Dan Harris
    12  Wei-Ting Kuo
    11  Daniël Heres
    11  Jörn Horstmann
     9  Brent Gardner
     9  Ian Alexander Joiner
     9  Jiayu Liu
     9  Martin Grigorov
     8  Palladium
     7  Jeffrey
     7  Marco Neumann
     6  Robert Pack
     6  Will Jones
     4  Andy Grove
     4  comphead
     3  Adrián Gallego Castellanos
     3  Markus Westerlind
     3  Quentin
     2  Alex Qyoun-ae
     2  Dmitry Patsura
     2  Frank
     2  Jiacai Liu
     2  Marc Garcia
     2  Marko Grujic
     2  Max Burke
     2  Your friendly neighborhood geek
     2  sachin agarwal
     1  Aarash Heydari
     1  Adam Gutglick
     1  Andrey Frolov
     1  Anthony Poncet
     1  Artjoms Iskovs
     1  Ben Kimock
     1  Brian Phillips
     1  Carol (Nichols || Goulding)
     1  Christian Salvati
     1  Dalton Modlin
     1  Daniel Martinez Maqueda
     1  Daniel Poelzleithner
     1  Davis Silverman
     1  Dhruv Vats
     1  Fabio Silva
     1  GeauxEric
     1  George Andronchik
     1  Ismail-Maj
     1  Ismaël Mejía
     1  JanKaul
     1  JasonLi
     1  Javier Goday
     1  Jayjeet Chakraborty
     1  Jean-Charles Campagne
     1  Jie Han
     1  John Hughes
     1  Jon Mease
     1  Kevin Lim
     1  Kohei Suzuki
     1  Konstantin Fastov
     1  Marius S
     1  Masato Kato
     1  Matthijs Brobbel
     1  Michael Edwards
     1  Pier-Olivier Thibault
     1  Remco Verhoef
     1  Rutvik Patel
     1  Sean Smith
     1  Sid
     1  Stanislav Lukeš
     1  Steve Vaughan
     1  Stuart Carnie
     1  Sumit
     1  Trent Feda
     1  Valeriy V. Vorotyntsev
     1  Wenjun L
     1  X
     1  aksharau
     1  bmmeijers
     1  chunshao.rcs
     1  jakevin
     1  kastolars
     1  nvartolomei
     1  xudong.w
     1  哇呜哇呜呀咦耶
     1  尹吉峰

Join the community

If you are interested in contributing to the Rust subproject in Apache Arrow, we encourage you to try out Arrow on some of your data, help improve the documentation, or submit a PR. You can find a list of open issues suitable for beginners here and the full list here.