Apache Arrow 0.13.0 Release
02 Apr 2019
By Wes McKinney (wesm)
Although this is a large release, this post gives only brief highlights of project activity since the 0.12.0 release in January.
New committers and PMC member
The Arrow team is growing! Since the 0.12.0 release we have increased the size of our committer and PMC rosters.
- Andy Grove was promoted to PMC member
- Paddy Horan was added as a committer
- Micah Kornfield was added as a committer
- Ravindra Pindikura was added as a committer
- Chao Sun was added as a committer
Thank you for all your contributions!
Rust DataFusion Query Engine donation
Since the last release, we received a donation of DataFusion, a Rust-native query engine for the Arrow columnar format, whose development had previously been led by Andy Grove. Read more about DataFusion in our February blog post.
This is an exciting development for the Rust community, and we look forward to developing more analytical query processing within the Apache Arrow project.
Arrow Flight gRPC progress
Over the last couple of months, we have made significant progress on Arrow Flight, an Arrow-native data messaging framework. We have integration tests to check C++ and Java compatibility, and we have added Python bindings for the C++ library. We will write a future blog post to go into more detail about how Flight works.
There were 231 issues relating to C++ in this release, far too many to summarize in a blog post. Some notable items include:
- An experimental ExtensionType was developed for creating user-defined data types that can be embedded in the Arrow binary protocol. This is not yet finalized, but feedback would be welcome.
- We have undertaken a significant reworking of our CMake build system for C++ to make the third party dependencies more configurable. Among other things, this eases work on packaging for Linux distributions. Read more about this in the C++ developer documentation.
- Laying more groundwork for an Arrow-native in-memory query engine
- We began building a reader for line-delimited JSON files
- Gandiva can now be compiled on Windows with Visual Studio
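To make the line-delimited JSON item above concrete, here is a minimal sketch in pure Python of what such a reader has to do: parse one JSON object per line and pivot the rows into columns. This is not the Arrow C++ reader; the function name `read_json_lines` and the choice to fill missing fields with `None` are assumptions made for illustration only.

```python
import io
import json


def read_json_lines(stream):
    """Parse line-delimited JSON into a dict of column name -> list of values.

    Hypothetical helper for illustration. Rows that lack a field get None in
    that column, so every column ends up with one entry per row.
    """
    columns = {}
    num_rows = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name in record:
            if name not in columns:
                # Back-fill a newly seen column with nulls for earlier rows.
                columns[name] = [None] * num_rows
        for name, values in columns.items():
            values.append(record.get(name))
        num_rows += 1
    return columns


data = io.StringIO(
    '{"a": 1, "b": "x"}\n'
    '{"a": 2}\n'
    '{"a": 3, "b": "z", "c": true}\n'
)
table = read_json_lines(data)
```

A real reader also has to infer or enforce a type for each column and handle nested values, which this sketch leaves out.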
C# .NET development has picked up since the initial code donation last fall. 11 issues were resolved this release cycle.
The Arrow C# package is now available via NuGet.
8 Go-related issues were resolved. A notable feature is the addition of a CSV file writer.
26 Java issues were resolved. Outside of Flight-related work, some notable items include:
- Migration to Java 8 date and time APIs from Joda
- Array type support in JDBC adapter
86 Python-related issues were resolved. Some highlights include:
- The Gandiva LLVM expression compiler is now available in the Python wheels
- Flight RPC bindings
- Improved pandas serialization performance with RangeIndex
- pyarrow can be used without pandas installed
Note that Apache Arrow will continue to support Python 2.7 until January 2020.
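As a rough illustration of why a RangeIndex is cheap to serialize, the sketch below describes a range by its start, stop, and step instead of materializing every value. It uses Python's built-in `range` as a stand-in for a pandas RangeIndex, and the helper names `describe_index` and `restore_index` are hypothetical; this shows the general idea only and is not a description of pyarrow's actual implementation.

```python
def describe_index(index):
    """Return a compact, JSON-friendly description of an index.

    Hypothetical helper: a range is captured by three integers, while any
    other index is materialized as a list of its values.
    """
    if isinstance(index, range):
        return {
            "kind": "range",
            "start": index.start,
            "stop": index.stop,
            "step": index.step,
        }
    return {"kind": "values", "values": list(index)}


def restore_index(desc):
    """Rebuild the index from the description produced by describe_index."""
    if desc["kind"] == "range":
        return range(desc["start"], desc["stop"], desc["step"])
    return desc["values"]


# A million-row range index is described by three integers.
desc = describe_index(range(0, 1000000))
restored = restore_index(desc)
```

The description stays the same size no matter how many rows the range covers, whereas a materialized index grows with the row count.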
Ruby and C GLib notes
36 C/GLib- and Ruby-related issues were resolved. The work continues to follow the upstream work in the C++ project.
Arrow::RecordBatch#raw_records was added. It converts a record batch to a Ruby array 10x-200x faster than the same conversion done by a pure-Ruby implementation.
69 Rust-related issues were resolved. Many of these relate to ongoing work in the DataFusion query engine. Some notable items include:
- Date/time support
- SIMD for arithmetic operations
- Writing CSV and reading line-delimited JSON
- Parquet data source support for DataFusion
- Prototype DataFrame-style API for DataFusion
- Continued evolution of Parquet file reader
R development progress
The Arrow R developers have expanded the scope of the R language bindings and worked on packaging support so that the package can be submitted to CRAN in the near future. 23 issues were resolved for this release.
We wrote in January about ongoing work to accelerate R work on Apache Spark using Arrow.
Community Discussions Ongoing
There are a number of active discussions on the developer mailing list. We look forward to hearing from the community there:
- Benchmarking: we are working to create tools for tracking all of our benchmark results on a commit-by-commit basis in a centralized database schema so that we can monitor for performance regressions over time. We hope to develop a publicly viewable benchmark result dashboard.
- C++ Datasets: development of a unified API for reading and writing datasets stored in various common formats like Parquet, JSON, and CSV.
- C++ Query Engine: architecture of a parallel Arrow-native query engine for C++
- Arrow Flight Evolution: adding features to support different real-world data messaging use cases
- Arrow Columnar Format evolution: we are discussing a new “duration” or “time interval” type and some other additions to the Arrow columnar format.
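The benchmarking item above describes storing results commit by commit in a database so that regressions can be spotted over time. The sketch below shows one way such a store could look, using an in-memory SQLite database. The table name `benchmark_result`, its columns, the sample numbers, and the 20% threshold are all hypothetical and chosen for illustration; the schema under discussion in the project may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE benchmark_result (
        commit_seq INTEGER NOT NULL,  -- position of the commit in history
        commit_sha TEXT NOT NULL,     -- hypothetical shortened commit ids
        benchmark  TEXT NOT NULL,
        seconds    REAL NOT NULL,
        PRIMARY KEY (commit_sha, benchmark)
    )
    """
)

# Made-up sample data: two benchmarks measured on three consecutive commits.
rows = [
    (1, "aaa111", "csv_read", 0.50),
    (1, "aaa111", "json_read", 0.80),
    (2, "bbb222", "csv_read", 0.51),
    (2, "bbb222", "json_read", 1.20),
    (3, "ccc333", "csv_read", 0.49),
    (3, "ccc333", "json_read", 1.19),
]
conn.executemany("INSERT INTO benchmark_result VALUES (?, ?, ?, ?)", rows)


def find_regressions(conn, threshold=1.2):
    """Return benchmarks that slowed down by more than `threshold` times
    relative to the immediately preceding commit."""
    query = """
        SELECT cur.benchmark, prev.commit_sha, cur.commit_sha,
               cur.seconds / prev.seconds AS ratio
        FROM benchmark_result AS cur
        JOIN benchmark_result AS prev
          ON prev.benchmark = cur.benchmark
         AND prev.commit_seq = cur.commit_seq - 1
        WHERE cur.seconds > prev.seconds * ?
        ORDER BY cur.commit_seq, cur.benchmark
    """
    return conn.execute(query, (threshold,)).fetchall()


regressions = find_regressions(conn)
```

With this sample data, only `json_read` between commits `aaa111` and `bbb222` is flagged. A production setup would also need to record the machine and build configuration and to account for measurement noise.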