Apache Arrow 0.12.0 Release


Published 21 Jan 2019
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.12.0 release. This is the largest release yet in the project, covering 3 months of development work and including 614 resolved issues from 77 distinct contributors.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

It’s a huge release, so we’ll give some brief highlights and news from the project to help guide you to the parts that may be of interest.

New committers and PMC member

The Arrow team is growing! Since the 0.11.0 release we have added 3 new committers.

We are also pleased to announce that Krisztián Szűcs has been promoted from committer to PMC (Project Management Committee) member.

Thank you for all your contributions!

Code donations

Since the last release, we have received 3 code donations into the Apache project.

We are excited to continue to grow the Apache Arrow development community.

Combined project-level documentation

Since the last release, we have merged the Python and C++ documentation to create a combined project-wide documentation site: https://arrow.apache.org/docs. There is now some prose documentation about many parts of the C++ library. We intend to keep adding documentation for other parts of Apache Arrow to this site.

Packages

We have started providing official APT and Yum repositories for the C++ and GLib (C) libraries. See the install document for details.

C++ notes

Much of the C++ development work the last 3 months concerned internal code refactoring and performance improvements. Some user-visible highlights of note:

  • Experimental support for in-memory sparse tensors (or ndarrays), with support for zero-copy IPC
  • Support for building on Alpine Linux
  • Significantly improved hash table utilities, with improved hash table performance in many parts of the library
  • IO library improvements for both read and write buffering
  • A fast trie implementation for string searching
  • Many performance and feature improvements to the parallel CSV reader; see the changelog

Since the LLVM-based Gandiva expression compiler was donated to Apache Arrow during the last release cycle, development there has been moving along. We expect to have Windows support for Gandiva and to ship this in downstream packages (like Python) in the 0.13 release time frame.

Go notes

The Arrow Go development team has been expanding. The Go library has gained support for many missing features from the columnar format as well as semantic constructs like chunked arrays and tables that are used heavily in the C++ project.

GLib and Ruby notes

Development of the GLib-based C bindings and corresponding Ruby interfaces has advanced in lock-step with the C++, Python, and R libraries. In this release, there are many new features in C and Ruby:

  • Compressed file read/write support
  • Support for using the C++ parallel CSV reader
  • Feather file support
  • Gandiva bindings
  • Plasma bindings

Python notes

We fixed a ton of bugs and made many improvements throughout the Python project. Some highlights from the Python side include:

  • Python 3.7 support: wheels and conda packages are now available for Python 3.7
  • Substantially improved memory use when converting string types to pandas format, including when reading Parquet files. Parquet users should notice significantly lower memory use in common use cases
  • Support for reading and writing compressed files, which can be used for CSV files, IPC, or any other form of IO
  • The new pyarrow.input_stream and pyarrow.output_stream functions support read and write buffering. This is analogous to BufferedIOBase from the Python standard library, but the internals are implemented natively in C++.
  • Gandiva (LLVM expression compiler) bindings, though not yet available in pip/conda. Look for this in 0.13.0.
  • Many improvements to Arrow CUDA integration, including interoperability with Numba

R notes

The R library made huge progress in 0.12, with work led by new committer Romain Francois. The R project’s features are not far behind the Python library, and we are hoping to be able to make the R library available to CRAN users for use with Apache Spark or for reading and writing Parquet files over the next quarter.

Users of the feather R library will see significant speed increases in many cases when reading Feather files with the new Arrow R library.

Rust notes

Rust development was active over the last 3 months; see the changelog for details.

A native Rust implementation of Parquet was just donated to the project, and the community intends to provide a similar level of functionality for reading and writing Parquet files, using the Arrow in-memory columnar format as an intermediary.

Upcoming Roadmap, Outlook for 2019

Apache Arrow has become a large, diverse open source project. It is now being used in dozens of downstream open source and commercial projects. Work will be proceeding in many areas in 2019:

  • Development of in-memory query execution engines (e.g. in C++, Rust)
  • Expanded support for reading and writing the Apache Parquet format, and other common data formats like Apache Avro, CSV, JSON, and Apache ORC
  • New Flight RPC system for fast messaging of Arrow datasets
  • Expanded support in existing programming languages
  • New programming language bindings or native implementations

It promises to be an exciting 2019. We look forward to having you involved in the development community.