Apache Arrow 0.12.0 Release
21 Jan 2019
By Wes McKinney (wesm)
The Apache Arrow team is pleased to announce the 0.12.0 release. This is the largest release yet in the project, covering 3 months of development work and including 614 resolved issues from 77 distinct contributors.
It’s a huge release, but we’ll give some brief highlights and news from the project to help guide you to the parts that may be of interest.
New committers and PMC member
The Arrow team is growing! Since the 0.11.0 release we have added 3 new committers:
- Sebastien Binet, who has mainly worked on the Go implementation
- Romain Francois, who has mainly worked on the R implementation
- Yosuke Shiro, who has mainly worked on the GLib (C) and Ruby implementations
We are also pleased to announce that Krisztián Szűcs has been promoted from committer to PMC (Project Management Committee) member.
Thank you for all your contributions!
Code donations
Since the last release, we have received 3 code donations into the Apache project:
- A native C# .NET library donated by Feyen Zylstra LLC.
- A Ruby library for Parquet files which uses the existing GLib bindings to the C++ Parquet library.
- A native Rust Parquet library
We are excited to continue to grow the Apache Arrow development community.
Combined project-level documentation
Since the last release, we have merged the Python and C++ documentation to create a combined project-wide documentation site: https://arrow.apache.org/docs. There is now some prose documentation about many parts of the C++ library. We intend to keep adding documentation for other parts of Apache Arrow to this site.
Packages for Linux distributions
We have started providing official APT and Yum repositories for C++ and GLib (C). See the install document for details.
C++ notes
Much of the C++ development work over the last 3 months concerned internal code refactoring and performance improvements. Some user-visible highlights of note:
- Experimental support for in-memory sparse tensors (or ndarrays), with support for zero-copy IPC
- Support for building on Alpine Linux
- Significantly improved hash table utilities, with improved hash table performance in many parts of the library
- IO library improvements for both read and write buffering
- A fast trie implementation for string searching
- Many performance and feature improvements to the parallel CSV reader; see the changelog for details
Gandiva: LLVM expression compiler
Since the LLVM-based Gandiva expression compiler was donated to Apache Arrow during the last release cycle, development there has been moving along. We expect to have Windows support for Gandiva and to ship this in downstream packages (like Python) in the 0.13 release time frame.
Go notes
The Arrow Go development team has been expanding. The Go library has gained support for many previously missing features of the columnar format, as well as semantic constructs like chunked arrays and tables that are used heavily in the C++ project.
GLib and Ruby notes
Development of the GLib-based C bindings and the corresponding Ruby interfaces has advanced in lock-step with the C++, Python, and R libraries. In this release, there are many new features in C and Ruby:
- Compressed file read/write support
- Support for using the C++ parallel CSV reader
- Feather file support
- Gandiva bindings
- Plasma bindings
Python notes
We fixed a ton of bugs and made many improvements throughout the Python project. Some highlights from the Python side include:
- Python 3.7 support: wheels and conda packages are now available for Python 3.7
- Substantially reduced memory use when converting string types to pandas format, including when reading Parquet files; Parquet users should notice significantly lower memory use in common use cases
- Support for reading and writing compressed files, which can be used for CSV files, IPC, or any other form of IO
- The new pyarrow.input_stream and pyarrow.output_stream functions support read and write buffering. This is analogous to BufferedIOBase from the Python standard library, but the internals are implemented natively in C++.
- Gandiva (LLVM expression compiler) bindings, though not yet available via pip or conda. Look for this in 0.13.0.
- Many improvements to Arrow CUDA integration, including interoperability with Numba
R notes
The R library made huge progress in 0.12, with work led by new committer Romain Francois. The R project’s features are not far behind the Python library, and we are hoping to be able to make the R library available to CRAN users for use with Apache Spark or for reading and writing Parquet files over the next quarter.
Users of the feather R library will see significant speed increases in many cases when reading Feather files with the new Arrow R library.
Rust notes
Rust development has been active over the last 3 months; see the changelog for details.
A native Rust Parquet library was just donated to the project, and the community intends to provide a similar level of functionality for reading and writing Parquet files using the Arrow in-memory columnar format as an intermediary.
Upcoming Roadmap, Outlook for 2019
Apache Arrow has become a large, diverse open source project. It is now being used in dozens of downstream open source and commercial projects. Work will be proceeding in many areas in 2019:
- Development of in-memory query execution engines (e.g. in C++, Rust)
- Expanded support for reading and writing the Apache Parquet format, and other common data formats like Apache Avro, CSV, JSON, and Apache ORC
- New Flight RPC system for fast messaging of Arrow datasets
- Expanded support in existing programming languages
- New programming language bindings or native implementations
It promises to be an exciting 2019. We look forward to having you involved in the development community.