The Apache Arrow team is pleased to announce the 0.5.0 release. It includes 130 resolved JIRAs with some new features, expanded integration testing between implementations, and bug fixes. The Arrow memory format remains stable since the 0.3.x and 0.4.x releases.
In this release, we added compatibility tests for dictionary-encoded data between Java and C++. Dictionary encoding allows the distinct values (the dictionary) in a vector to be transmitted as part of an Arrow schema, while the record batches contain integer indices referencing the dictionary.
So we might have:
data (string): ['foo', 'bar', 'foo', 'bar']
In dictionary-encoded form, this could be represented as:
indices (int8): [0, 1, 0, 1]
dictionary (string): ['foo', 'bar']
In upcoming releases, we plan to complete integration testing for the remaining data types (including some more complicated types like unions and decimals) on the road to a 1.0.0 release.
We completed a number of significant pieces of work in the C++ part of Apache Arrow.
We decided to use jemalloc as the default memory allocator unless it is explicitly disabled. This memory allocator has significant performance advantages in Arrow workloads over the default malloc implementation. We will publish a blog post going into more detail about this and why you might care.
We imported the compression library interfaces and dictionary encoding algorithms from the Apache Parquet C++ library. The Parquet library now depends on this code in Arrow, and we will be able to use it more easily for data compression in Arrow use cases.
As part of incorporating Parquet’s dictionary encoding utilities, we have added an arrow::DictionaryBuilder class to enable building dictionary-encoded arrays iteratively. This can help save memory and yield better performance when interacting with databases, Parquet files, or other sources which may have columns containing many duplicate values.
We added LZ4 and ZSTD compression library support. In ARROW-300 and other planned work, we intend to add some compression features for data sent via RPC.
We fixed many bugs affecting Parquet and Feather users and smoothed several other rough edges in normal Arrow use. We also added some additional Arrow type conversions: structs, lists embedded in pandas objects, and Arrow time types (which deserialize to Python datetime.time objects).
In upcoming releases we plan to continue to improve Dask support and performance for distributed processing of Apache Parquet files with pyarrow.
We have much work ahead of us to build out Arrow integrations in other data systems to improve their processing performance and interoperability with other systems.
We are discussing the roadmap to a future 1.0.0 release on the developer mailing list. Please join the discussion there.