Apache Arrow 0.13.0 Release


Published 02 Apr 2019
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.13.0 release. This covers more than 2 months of development work and includes 550 resolved issues from 81 distinct contributors.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

While it’s a large release, this post gives some brief highlights of work in the project since the 0.12.0 release in January.

New committers and PMC member

The Arrow team is growing! Since the 0.12.0 release we have increased the size of our committer and PMC rosters.

Thank you for all your contributions!

Rust DataFusion Query Engine donation

Since the last release, we received a donation of DataFusion, a Rust-native query engine for the Arrow columnar format, whose development had previously been led by Andy Grove. Read more about DataFusion in our February blog post.

This is an exciting development for the Rust community, and we look forward to developing more analytical query processing within the Apache Arrow project.

Arrow Flight gRPC progress

Over the last couple of months, we have made significant progress on Arrow Flight, an Arrow-native data messaging framework. We have integration tests to check C++ and Java compatibility, and we have added Python bindings for the C++ library. We will write a future blog post to go into more detail about how Flight works.

C++ notes

There were 231 issues relating to C++ in this release, far too many to summarize in a blog post. Some notable items include:

  • An experimental ExtensionType was developed for creating user-defined data types that can be embedded in the Arrow binary protocol. This is not yet finalized, but feedback would be welcome.
  • We have undertaken a significant reworking of our CMake build system for C++ to make the third party dependencies more configurable. Among other things, this eases work on packaging for Linux distributions. Read more about this in the C++ developer documentation.
  • We laid more groundwork for an Arrow-native in-memory query engine
  • We began building a reader for line-delimited JSON files
  • Gandiva can now be compiled on Windows with Visual Studio
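
For readers unfamiliar with the line-delimited JSON format mentioned in the list above, here is a minimal sketch of what such a file looks like and how its records map onto columns. It uses only the Python standard library and is not the Arrow C++ reader; the helper name read_json_lines and the sample data are assumptions made for illustration.

```python
import io
import json


def read_json_lines(stream):
    """Parse line-delimited JSON into a dict of column name -> list of values.

    Hypothetical helper for illustration only; not part of any Arrow API.
    """
    # Each non-empty line holds one complete JSON object (one row).
    rows = [json.loads(line) for line in stream if line.strip()]

    # Collect column names in first-seen order across all rows.
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)

    # A key missing from a row becomes None, similar to a null slot in a column.
    return {name: [row.get(name) for row in rows] for name in names}


# Made-up sample input: three records, one of them missing the "name" field.
sample = io.StringIO(
    '{"id": 1, "name": "a"}\n'
    '{"id": 2}\n'
    '{"id": 3, "name": "c"}\n'
)
columns = read_json_lines(sample)
print(columns)  # {'id': [1, 2, 3], 'name': ['a', None, 'c']}
```

The sketch reads all rows before pivoting them, which is fine for a small example; a production reader would typically process the input in blocks and infer or enforce column types.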

C# notes

C# .NET development has picked up since the initial code donation last fall. 11 issues were resolved this release cycle.

The Arrow C# package is now available via NuGet.

Go notes

8 Go-related issues were resolved. A notable feature is the addition of a CSV file writer.

Java notes

26 Java issues were resolved. Outside of Flight-related work, some notable items include:

  • Migration to Java 8 date and time APIs from Joda
  • Array type support in JDBC adapter

JavaScript notes

The recent JavaScript 0.4.1 release is the last JavaScript-only release of Apache Arrow. Starting with 0.13, the JavaScript implementation is included in mainline Arrow releases! The version number of the released JavaScript packages will now be in sync with the mainline version number.

Python notes

86 Python-related issues were resolved. Some highlights include:

  • The Gandiva LLVM expression compiler is now available in the Python wheels through the pyarrow.gandiva module.
  • Flight RPC bindings
  • Improved pandas serialization performance with RangeIndex
  • pyarrow can be used without pandas installed

Note that Apache Arrow will continue to support Python 2.7 until January 2020.

Ruby and C GLib notes

36 C/GLib- and Ruby-related issues were resolved. The work continues to follow the upstream work in the C++ project.

  • Arrow::RecordBatch#raw_records was added. It can convert a record batch to a Ruby array 10x-200x faster than the same conversion in a pure-Ruby implementation.

Rust notes

69 Rust-related issues were resolved. Many of these relate to ongoing work in the DataFusion query engine. Some notable items include:

  • Date/time support
  • SIMD for arithmetic operations
  • Writing CSV and reading line-delimited JSON
  • Parquet data source support for DataFusion
  • Prototype DataFrame-style API for DataFusion
  • Continued evolution of Parquet file reader

R development progress

The Arrow R developers have expanded the scope of the R language bindings and additionally worked on packaging support to be able to submit the package to CRAN in the near future. 23 issues were resolved for this release.

We wrote in January about ongoing work to accelerate R work on Apache Spark using Arrow.

Community discussions ongoing

There are a number of active discussions ongoing on the dev@arrow.apache.org developer mailing list. We look forward to hearing from the community there:

  • Benchmarking: we are working to create tools for tracking all of our benchmark results on a commit-by-commit basis in a centralized database schema so that we can monitor for performance regressions over time. We hope to develop a publicly viewable benchmark result dashboard.
  • C++ Datasets: development of a unified API for reading and writing datasets stored in various common formats like Parquet, JSON, and CSV.
  • C++ Query Engine: architecture of a parallel Arrow-native query engine for C++
  • Arrow Flight Evolution: adding features to support different real-world data messaging use cases
  • Arrow Columnar Format evolution: we are discussing a new “duration” or “time interval” type and some other additions to the Arrow columnar format.
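
To make the benchmarking item in the list above more concrete, here is a small sketch of how per-commit benchmark results could be stored and checked for regressions. It is only an illustration using Python's built-in sqlite3 module; the table name, column names, sample timings, and the 1.2x threshold are all assumptions and do not reflect any schema the project has adopted.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE benchmark_result (
        commit_seq  INTEGER NOT NULL,  -- position of the commit in history
        commit_sha  TEXT    NOT NULL,
        benchmark   TEXT    NOT NULL,
        seconds     REAL    NOT NULL,
        PRIMARY KEY (commit_sha, benchmark)
    )
    """
)

# Made-up timings for two benchmarks across three consecutive commits.
conn.executemany(
    "INSERT INTO benchmark_result VALUES (?, ?, ?, ?)",
    [
        (1, "aaa111", "csv_read", 1.00),
        (2, "bbb222", "csv_read", 1.02),
        (3, "ccc333", "csv_read", 1.40),
        (1, "aaa111", "json_read", 2.00),
        (2, "bbb222", "json_read", 1.95),
        (3, "ccc333", "json_read", 1.90),
    ],
)


def find_regressions(conn, threshold=1.2):
    """Return (benchmark, commit_sha, ratio) rows where a commit is more than
    `threshold` times slower than the commit immediately before it."""
    query = """
        SELECT cur.benchmark, cur.commit_sha, cur.seconds / prev.seconds AS ratio
        FROM benchmark_result AS cur
        JOIN benchmark_result AS prev
          ON prev.benchmark = cur.benchmark
         AND prev.commit_seq = cur.commit_seq - 1
        WHERE cur.seconds > prev.seconds * ?
        ORDER BY cur.benchmark, cur.commit_seq
    """
    return conn.execute(query, (threshold,)).fetchall()


regressions = find_regressions(conn)
for benchmark, sha, ratio in regressions:
    print(f"{benchmark} regressed at {sha}: {ratio:.2f}x slower")
```

With the sample data, only the csv_read benchmark at the third commit is flagged. Comparing each commit to its immediate predecessor keeps the query simple; a real tool would likely compare against a rolling baseline to reduce noise.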