Apache Arrow 0.8.0 Release ∞
18 Dec 2017
By Wes McKinney (wesm)
The Apache Arrow team is pleased to announce the 0.8.0 release. It is the product of 10 weeks of development and includes 286 resolved JIRAs with many new features and bug fixes to the various language implementations. This is the largest release since 0.3.0 earlier this year.
As part of work towards a stabilizing the Arrow format and making a 1.0.0 release sometime in 2018, we made a series of backwards-incompatible changes to the serialized Arrow metadata that requires Arrow readers and writers (0.7.1 and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We expect future backwards-incompatible changes to be rare going forward.
We discuss some highlights from the release and other project news in this post.
Projects “Powered By” Apache Arrow
A growing ecosystem of projects are using Arrow to solve in-memory analytics and data interchange problems. We have added a new Powered By page to the Arrow website where we can acknowledge open source projects and companies which are using Arrow. If you would like to add your project to the list as an Arrow user, please let us know.
New Arrow committers
Since the last release, we have added 5 new Apache committers:
- Phillip Cloud, who has mainly contributed to C++ and Python
- Bryan Cutler, who has mainly contributed to Java and Spark integration
- Li Jin, who has mainly contributed to Java and Spark integration
- Siddharth Teotia, who has mainly contributed to Java
Welcome to the Arrow team, and thank you for your contributions!
Improved Java vector API, performance improvements
Siddharth Teotia led efforts to revamp the Java vector API to make things simpler and faster. As part of this, we removed the dichotomy between nullable and non-nullable vectors.
See Sidd’s blog post for more about these changes.
Decimal support in C++, Python, consistency with Java
Phillip Cloud led efforts this release to harden details about exact decimal values in the Arrow specification and ensure a consistent implementation across Java, C++, and Python.
Arrow now supports decimals represented internally as a 128-bit little-endian integer, with a set precision and scale (as defined in many SQL-based systems). As part of this work, we needed to change Java’s internal representation from big- to little-endian.
We are now integration testing decimals between Java, C++, and Python, which will facilitate Arrow adoption in Apache Spark and other systems that use both Java and Python.
Decimal data can now be read and written by the Apache Parquet C++ library, including via pyarrow.
In the future, we may implement support for smaller-precision decimals represented by 32- or 64-bit integers.
C++ improvements: expanded kernels library and more
In C++, we have continued developing the new
consisting of native computation fuctions for Arrow data. New contributor
Licht Takeuchi helped expand the supported types for type casting in
compute::Cast. We have also implemented new kernels
DictionaryEncode for computing the distinct elements of an array and
dictionary encoding (conversion to categorical), respectively.
We expect the C++ computation “kernel” library to be a major expansion area for the project over the next year and beyond. Here, we can also implement SIMD- and GPU-accelerated versions of basic in-memory analytics functionality.
As minor breaking API change in C++, we have made the
APIs “virtual” or abstract interfaces, to enable different implementations of a
record batch or table which conform to the standard interface. This will help
enable features like lazy IO or column loading.
There was significant work improving the C++ library generally and supporting work happening in Python and C. See the change log for full details.
GLib C improvements: Meson build, GPU support
Developing of the GLib-based C bindings has generally tracked work happening in the C++ library. These bindings are being used to develop data science tools for Ruby users and elsewhere.
The C bindings now support the Meson build system in addition to autotools, which enables them to be built on Windows.
The Arrow GPU extension library is now also supported in the C bindings.
In addition to some of the new features mentioned above, we have made a variety of usability and performance improvements for integrations with pandas, NumPy, Dask, and other Python projects which may make use of pyarrow, the Arrow Python library.
Some of these improvements include:
- Component-based serialization for more flexible and memory-efficient transport of large or complex Python objects
- Substantially improved serialization performance for pandas objects when
pyarrow.deserialize. This includes a special
pyarrow.pandas_serialization_contextwhich further accelerates certain internal details of pandas serialization * Support zero-copy reads for
pyarrow.deserializefor objects without Python objects
- Multithreaded conversions from
pyarrow.Table(we already supported multithreaded conversions from Arrow back to pandas)
- More efficient conversion from 1-dimensional NumPy arrays to Arrow format
- New generic buffer compression and decompression APIs
- Enhanced Parquet cross-compatibility with fastparquet and improved Dask support
- Python support for accessing Parquet row group column statistics
The 0.8.0 release includes some API and format changes, but upcoming releases will focus on ompleting and stabilizing critical functionality to move the project closer to a 1.0.0 release.
With the ecosystem of projects using Arrow expanding rapidly, we will be working to improve and expand the libraries in support of downstream use cases.