Apache Arrow 0.16.0 Release


Published 12 Feb 2020
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.16.0 release. This covers about 4 months of development work and includes 735 resolved issues from 99 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes and improvements have been made; we refer you to the complete changelog.

New committers

Since the 0.15.0 release, we’ve added two new committers:

Thank you for all your contributions!

Columnar Format Notes

We still have work to do to complete comprehensive columnar format integration testing between the Java and C++ libraries. Once this work is completed, we intend to make a 1.0.0 release with forward and backward compatibility guarantees.

We clarified some ambiguity on dictionary encoding in the specification. Work is ongoing to implement these features in the Arrow libraries.

Arrow Flight RPC notes

Flight development work has recently focused on robustness and stability. If you are not yet familiar with Flight, read the introductory blog post from October.

We are also discussing adding a “bidirectional” RPC, which would enable request-response workflows requiring both client and server to send data streams to be performed in a single RPC request.

C++ notes

Some work has been done to make the default build configuration of Arrow C++ as lean as possible. The Arrow C++ core can now be built without any external dependencies other than a new enough C++ compiler (gcc 4.9 or higher). Notably, Boost is no longer required. We invested effort to vendor some small essential dependencies: Flatbuffers, double-conversion, and uriparser.

Many optional features requiring external libraries, such as compression and GLog integration, are now disabled by default. Several subcomponents of the C++ project, such as the filesystem API, the CSV, compute, dataset, and JSON layers, and the command-line utilities, are likewise disabled by default. The only toolchain dependency enabled by default is jemalloc, the recommended memory allocator, but it too can be disabled if desired. For illustration, see the example minimal build script and Dockerfile.

When enabled, the default jemalloc configuration has been tweaked to return memory more aggressively to the OS (ARROW-6910, ARROW-6994). We welcome feedback from users about our memory allocation configuration and performance in applications.
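
If you want to gather data points about allocator behavior in your application, the statistics exposed by the Arrow memory pool are a simple starting point. A minimal sketch of our own:

    #include <arrow/api.h>
    #include <iostream>

    int main() {
      // The default pool is backed by jemalloc when it is enabled at build time.
      arrow::MemoryPool* pool = arrow::default_memory_pool();

      arrow::Int64Builder builder(pool);
      for (int64_t i = 0; i < 1000000; ++i) {
        (void)builder.Append(i);  // error handling elided for brevity
      }
      std::shared_ptr<arrow::Array> array;
      (void)builder.Finish(&array);

      // Counters like these are useful when reporting memory behavior.
      std::cout << "currently allocated: " << pool->bytes_allocated() << " bytes\n"
                << "peak allocation:     " << pool->max_memory() << " bytes\n";
      return 0;
    }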

The array validation facilities have been vastly expanded and now come in two flavors: the Validate method does a lightweight validation that is O(1) in array size, while the potentially O(N) ValidateFull method does thorough data validation (ARROW-6157).
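
A minimal sketch of the two flavors (our own illustration, with error handling elided):

    #include <arrow/api.h>
    #include <iostream>

    int main() {
      arrow::StringBuilder builder;
      (void)builder.AppendValues({"foo", "bar", "baz"});
      std::shared_ptr<arrow::Array> array;
      (void)builder.Finish(&array);

      // O(1) in array length: checks metadata such as buffer sizes and null count.
      std::cout << "Validate:     " << array->Validate().ToString() << std::endl;
      // Potentially O(N): also inspects the data itself, e.g. that string offsets
      // are monotonic and dictionary indices are in range.
      std::cout << "ValidateFull: " << array->ValidateFull().ToString() << std::endl;
      return 0;
    }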

The IO APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7235).
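
For example, opening and reading a file now looks like the sketch below (the file name is a placeholder); the ARROW_ASSIGN_OR_RAISE macro unwraps a Result<T> or propagates its error Status to the caller. The same convention is applied to the CSV, filesystem, and Tensor APIs in this release, as noted in the sections below.

    #include <arrow/buffer.h>
    #include <arrow/io/file.h>
    #include <arrow/result.h>
    #include <arrow/status.h>

    arrow::Status ReadWholeFile() {
      // Formerly: Status Open(path, std::shared_ptr<ReadableFile>* out).
      // Now the result value is carried in the returned arrow::Result<T>.
      ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open("data.bin"));
      ARROW_ASSIGN_OR_RAISE(int64_t size, file->GetSize());
      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buffer, file->Read(size));
      return file->Close();
    }

Result<T> also exposes ok() and ValueOrDie() for callers that prefer explicit handling over the macro.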

C++: CSV

An option has been added to attempt automatic dictionary encoding of string columns while reading a CSV file, until a cardinality limit is reached. When successful, this makes reading faster and the resulting Arrow data much more memory-efficient (ARROW-3408).
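
In C++ this is controlled through ConvertOptions. A sketch (the file name and the cardinality limit are illustrative):

    #include <arrow/csv/api.h>
    #include <arrow/io/file.h>
    #include <arrow/memory_pool.h>
    #include <arrow/result.h>
    #include <arrow/table.h>

    arrow::Status ReadCsvWithAutoDict() {
      ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open("data.csv"));

      auto read_options = arrow::csv::ReadOptions::Defaults();
      auto parse_options = arrow::csv::ParseOptions::Defaults();
      auto convert_options = arrow::csv::ConvertOptions::Defaults();
      // Attempt to dictionary-encode string columns...
      convert_options.auto_dict_encode = true;
      // ...falling back to plain strings once a column has too many distinct values.
      convert_options.auto_dict_max_cardinality = 1000;

      ARROW_ASSIGN_OR_RAISE(
          auto reader,
          arrow::csv::TableReader::Make(arrow::default_memory_pool(), input,
                                        read_options, parse_options, convert_options));
      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, reader->Read());
      return arrow::Status::OK();
    }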

The CSV APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7236).

C++: Datasets

The 0.16 release introduces the Datasets API to the C++ library, along with bindings in Python and R. This API allows you to treat multiple files as a single logical dataset entity and make efficient selection queries against it. This release includes support for the Parquet and Arrow IPC file formats. Factory objects allow you to discover files in a directory recursively, inspect the schemas in the files, and perform some basic schema unification. You may specify how file path segments map to partition keys, and there is support for auto-detecting some partition information, including Hive-style partitioning. The Datasets API includes a filter expression syntax as well as column selection. These are evaluated with predicate pushdown; for Parquet, evaluation is pushed down to row groups.

C++: Filesystem layer

An HDFS implementation of the FileSystem class is available (ARROW-6720). We plan to deprecate the prior bespoke C++ HDFS class in favor of the standardized filesystem API.

The filesystem APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7161).

C++: IPC

The Arrow IPC reader is being fuzzed continuously by the OSS-Fuzz infrastructure, to detect undesirable behavior on invalid or malicious input. Several issues have already been found and fixed.

C++: Parquet

Modular encryption is now supported (PARQUET-1300).
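
As a rough sketch of the write-side API (the 16-byte key and its metadata label below are placeholders; per-column keys can be configured similarly):

    #include <parquet/encryption.h>
    #include <parquet/properties.h>

    #include <memory>
    #include <string>

    std::shared_ptr<parquet::WriterProperties> MakeEncryptedWriterProperties() {
      // Placeholder 128-bit AES key; real deployments should obtain keys from a KMS.
      const std::string footer_key = "0123456789012345";

      parquet::FileEncryptionProperties::Builder encryption_builder(footer_key);
      std::shared_ptr<parquet::FileEncryptionProperties> encryption =
          encryption_builder.footer_key_metadata("footer_key_id")->build();

      // Attach the encryption configuration to the usual writer properties.
      return parquet::WriterProperties::Builder().encryption(encryption)->build();
    }

Reading an encrypted file works analogously, by attaching matching decryption properties to the reader properties.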

A performance regression when reading a file with a large number of columns has been fixed (ARROW-6876, ARROW-7059), along with several other bugs (PARQUET-1766, ARROW-6895).

C++: Tensors

CSC sparse matrices are supported (ARROW-4225).

The Tensor APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7420).

C# Notes

There were a number of C# bug fixes in this release. Note that the C# library is not yet tested in CI against the other native Arrow implementations (integration tests). We are looking for more contributors to the C# project to help with this and with other new feature development.

Java notes

  • Added prose documentation describing how to work with the Java libraries.
  • Additional algorithms have been added to the “contrib” algorithms package, such as multithreaded searching of ValueVectors.
  • The memory modules have been refactored so that non-Netty allocators can now be used.
  • A new utility for populating ValueVectors more concisely for testing was introduced.
  • The “contrib” Avro adapter now supports converting Avro logical types to the corresponding Arrow types.
  • Various bug fixes were made across all packages.

Python notes

pyarrow 0.16 will be the last release to support Python 2.7.

Python now has bindings for the datasets API (ARROW-6341) as well as the S3 (ARROW-6655) and HDFS (ARROW-7310) filesystem implementations.

The Duration (ARROW-5855) and Fixed Size List (ARROW-7261) types are exposed in Python.

Sparse tensors can be converted to dense tensors (ARROW-6624). They are also interoperable with the pydata/sparse and scipy.sparse libraries (ARROW-4223, ARROW-4224).

Pandas extension arrays can now roundtrip through Arrow conversion (ARROW-2428).

A memory leak when converting Arrow data to Pandas “object” data has been fixed (ARROW-6874).

Arrow is now tested against Python 3.8, and we now build manylinux2014 wheels for Python 3 (ARROW-7344).

R notes

This release includes a dplyr interface to Arrow Datasets, which lets you work efficiently with large, multi-file datasets as a single entity. See vignette("dataset", package = "arrow") for more.

Another major area of work in this release was to improve the installation experience on Linux. A source package installation (as from CRAN) will now handle its C++ dependencies automatically, with no system dependencies beyond what R requires. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library. See vignette("install", package = "arrow") for details.

For more on what’s in the 0.16 R package, see the R changelog.

Ruby and C GLib notes

The Ruby and C GLib bindings continue to follow the features in the C++ project.

Ruby

Ruby includes the following improvements.

  • Improve CSV save performance (ARROW-7474).
  • Add support for saving/loading TSV (ARROW-7454).
  • Add Arrow::Schema#build_expression to improve building Gandiva::Expression (ARROW-6619).

C GLib

C GLib includes the following changes.

  • Add support for LargeList, LargeBinary, and LargeString (ARROW-6285, ARROW-6286).
  • Add filter and take API for GArrowTable, GArrowChunkedArray, and GArrowRecordBatch (ARROW-7110, ARROW-7111).
  • Add garrow_table_combine_chunks() API (ARROW-7369).

Rust notes

Support for Arrow data types has been improved, with the following array types now supported (ARROW-3690):

  • Fixed Size List and Fixed Size Binary
  • String, for UTF-8 string data, with the Binary array retained for general binary data
  • Duration and Interval

Initial work on Arrow IPC support has been completed, with readers and writers for streams and files implemented (ARROW-5180).

Rust: DataFusion

Query execution has been reimplemented on top of an extensible physical query plan. This allows other projects to add their own plans, such as for distributed computing or for specific database servers (ARROW-5227).

Added support for writing query results to CSV (ARROW-6274).

The new Parquet -> Arrow reader is now used to read Parquet files (ARROW-6700).

Various other query improvements have been implemented, especially on grouping and aggregate queries (ARROW-6689).

Rust: Parquet

The Arrow reader integration has been completed, allowing Parquet files to be read into Arrow memory (ARROW-4059).

Development notes

Arrow has moved away from Travis CI and is now using GitHub Actions for PR-based continuous integration. The new CI configuration relies heavily on docker-compose, making it easier for developers to reproduce builds locally, thanks to tremendous work by Krisztián Szűcs (ARROW-7101).

Community Discussions Ongoing

There are a number of active discussions ongoing on the dev@arrow.apache.org developer mailing list. We look forward to hearing from the community there.

  • Mandatory fields in IPC format: to ease input validation, it is being proposed to mark some fields in our Flatbuffers schema as “required”. Those fields are already semantically required, but are not treated as such by the generated Flatbuffers verifier. Before accepting this proposal, we need to ensure that it does not break binary compatibility with existing valid data.
  • The C Data Interface has not yet been formally adopted, though the community has reached consensus to move forward after addressing various design questions and concerns.
  • Guidelines for the use of “unsafe” in the Rust implementation are being discussed.