Apache Arrow 0.5.0 Release
Published 25 Jul 2017 by Wes McKinney (wesm)
The Apache Arrow team is pleased to announce the 0.5.0 release. It includes 130 resolved JIRAs, with new features, expanded integration testing between implementations, and bug fixes. The Arrow memory format has remained stable since the 0.3.x and 0.4.x releases.
See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.
Expanded Integration Testing
In this release, we added compatibility tests for dictionary-encoded data between Java and C++. This enables the distinct values (the dictionary) in a vector to be transmitted as part of an Arrow schema, while the record batches contain integer indices that refer to entries in the dictionary.
So we might have:
data (string): ['foo', 'bar', 'foo', 'bar']
In dictionary-encoded form, this could be represented as:
indices (int8): [0, 1, 0, 1]
dictionary (string): ['foo', 'bar']
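For illustration, here is a minimal sketch of assembling such a dictionary-encoded array in Python; the `DictionaryArray.from_arrays` spelling is from the current pyarrow API, so treat the exact call as an assumption rather than the 0.5.0 interface:

```python
import pyarrow as pa

# The dictionary stores each distinct value once
dictionary = pa.array(['foo', 'bar'])

# The data itself is reduced to small integer indices into the dictionary
indices = pa.array([0, 1, 0, 1], type=pa.int8())

# Combine the two pieces into a dictionary-encoded array
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array)
```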
In upcoming releases, we plan to complete integration testing for the remaining data types (including some more complicated types like unions and decimals) on the road to a 1.0.0 release in the future.
C++ Activity
We completed a number of significant pieces of work in the C++ part of Apache Arrow.
Using jemalloc as default memory allocator
We decided to use jemalloc as the default memory allocator unless it is explicitly disabled. This memory allocator has significant performance advantages in Arrow workloads over the default malloc implementation. We will publish a blog post going into more detail about this and why you might care.
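As a rough illustration of where the allocator sits, the sketch below inspects the memory pool from Python; the function names (`default_memory_pool`, `allocate_buffer`) are taken from the present-day pyarrow API and are an assumption relative to 0.5.0:

```python
import pyarrow as pa

# Buffer allocations made by pyarrow go through the library's memory pool;
# with jemalloc enabled, that pool is backed by jemalloc.
pool = pa.default_memory_pool()
print(pool.bytes_allocated())

buf = pa.allocate_buffer(1024)
print(pool.bytes_allocated())  # reflects the 1 KB allocation above
```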
Sharing more C++ code with Apache Parquet
We imported the compression library interfaces and dictionary encoding algorithms from the Apache Parquet C++ library. The Parquet library now depends on this code in Arrow, and we will be able to use it more easily for data compression in Arrow use cases.
As part of incorporating Parquet’s dictionary encoding utilities, we have developed an arrow::DictionaryBuilder class to enable building dictionary-encoded arrays iteratively. This can help save memory and yield better performance when interacting with databases, Parquet files, or other sources whose columns contain many duplicate values.
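The C++ builder is what does the iterative work; from Python, a comparable result can be obtained with `dictionary_encode()` on an existing array (a current-API sketch, not necessarily available in 0.5.0), which produces the dictionary and the index array in one pass:

```python
import pyarrow as pa

# A column with many repeated values
values = pa.array(['us', 'us', 'de', 'fr', 'us', 'de'])

# Distinct values move into a small dictionary; the data becomes
# compact integer indices into that dictionary.
encoded = values.dictionary_encode()
print(encoded.dictionary)  # ['us', 'de', 'fr']
print(encoded.indices)     # [0, 0, 1, 2, 0, 1]
```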
Support for LZ4 and ZSTD compressors
We added LZ4 and ZSTD compression library support. In ARROW-300 and other planned work, we intend to add some compression features for data sent via RPC.
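The new codecs live in the C++ library; the round trip below uses the pyarrow wrappers from the current Python API (an assumption for 0.5.0, where the support was C++-only) purely to show what the codecs do:

```python
import pyarrow as pa

data = b'arrow ' * 1000

# Compress with ZSTD; LZ4 works the same way via codec='lz4'
compressed = pa.compress(data, codec='zstd', asbytes=True)
print(len(data), '->', len(compressed))

# The raw codecs do not record the uncompressed length, so it must be
# supplied when decompressing.
restored = pa.decompress(compressed, decompressed_size=len(data),
                         codec='zstd', asbytes=True)
assert restored == data
```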
Python Activity
We fixed many bugs affecting Parquet and Feather users and smoothed several other rough edges in everyday Arrow use. We also added more Arrow type conversions: structs, lists embedded in pandas objects, and Arrow time types (which deserialize to the datetime.time type).
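A short sketch of those conversions with pyarrow and pandas; the behavior shown is from the current Python API, so the details in 0.5.0 may differ slightly:

```python
import datetime
import pandas as pd
import pyarrow as pa

# Lists and dicts embedded in pandas objects map to Arrow list and struct types
lists = pa.array(pd.Series([[1, 2], [3]]))
structs = pa.array([{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}])

# Arrow time values round-trip back to Python as datetime.time objects
times = pa.array([datetime.time(12, 30), datetime.time(8, 0)])

print(lists.type, structs.type, times.type)
print(times.to_pandas()[0])  # a datetime.time value
```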
In upcoming releases we plan to continue to improve Dask support and performance for distributed processing of Apache Parquet files with pyarrow.
The Road Ahead
We have much work ahead of us to build out Arrow integrations in other data systems, improving their processing performance and interoperability.
We are discussing the roadmap to a future 1.0.0 release on the developer mailing list. Please join the discussion there.