The Apache Arrow team is pleased to announce the 0.5.0 release. It includes 130 resolved JIRAs with some new features, expanded integration testing between implementations, and bug fixes. The Arrow memory format remains stable since the 0.3.x and 0.4.x releases.
In this release, we added compatibility tests for dictionary-encoded data between Java and C++. Dictionary encoding allows the distinct values (the dictionary) in a vector to be transmitted as part of an Arrow schema, while the record batches contain integer indices referencing the dictionary.
So we might have:
data (string): ['foo', 'bar', 'foo', 'bar']
In dictionary-encoded form, this could be represented as:
indices (int8): [0, 1, 0, 1]
dictionary (string): ['foo', 'bar']
In upcoming releases, we plan to complete integration testing for the remaining data types (including some more complicated types like unions and decimals) on the road to a 1.0.0 release.
We completed a number of significant pieces of work in the C++ part of Apache Arrow.
We decided to use jemalloc as the default memory allocator unless it is explicitly disabled. This memory allocator has significant performance advantages in Arrow workloads over the default malloc implementation. We will publish a blog post going into more detail about this and why you might care.
We imported the compression library interfaces and dictionary encoding algorithms from the Apache Parquet C++ library. The Parquet library now depends on this code in Arrow, and we will be able to use it more easily for data compression in Arrow use cases.
As part of incorporating Parquet’s dictionary encoding utilities, we have added an arrow::DictionaryBuilder class to enable building dictionary-encoded arrays iteratively. This can help save memory and yield better performance when interacting with databases, Parquet files, or other sources which may have columns containing many duplicate values.
We added LZ4 and ZSTD compression library support. In ARROW-300 and other planned work, we intend to add some compression features for data sent via RPC.
We fixed many bugs affecting Parquet and Feather users and smoothed several other rough edges in normal Arrow use. We also added some additional Arrow type conversions: structs, lists embedded in pandas objects, and Arrow time types (which deserialize to Python datetime.time objects).
In upcoming releases we plan to continue to improve Dask support and performance for distributed processing of Apache Parquet files with pyarrow.
We have much work ahead of us to build out Arrow integrations in other data systems to improve their processing performance and interoperability with other systems.
We are discussing the roadmap to a future 1.0.0 release on the developer mailing list. Please join the discussion there.