Apache Arrow 0.8.0 Release

Published 18 Dec 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.8.0 release. It is the product of 10 weeks of development and includes 286 resolved JIRAs with many new features and bug fixes to the various language implementations. This is the largest release since 0.3.0 earlier this year.

As part of work towards a stabilizing the Arrow format and making a 1.0.0 release sometime in 2018, we made a series of backwards-incompatible changes to the serialized Arrow metadata that requires Arrow readers and writers (0.7.1 and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We expect future backwards-incompatible changes to be rare going forward.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We discuss some highlights from the release and other project news in this post.

Projects “Powered By” Apache Arrow

A growing ecosystem of projects are using Arrow to solve in-memory analytics and data interchange problems. We have added a new Powered By page to the Arrow website where we can acknowledge open source projects and companies which are using Arrow. If you would like to add your project to the list as an Arrow user, please let us know.

New Arrow committers

Since the last release, we have added 5 new Apache committers:

  • Phillip Cloud, who has mainly contributed to C++ and Python
  • Bryan Cutler, who has mainly contributed to Java and Spark integration
  • Li Jin, who has mainly contributed to Java and Spark integration
  • Paul Taylor, who has mainly contributed to JavaScript
  • Siddharth Teotia, who has mainly contributed to Java

Welcome to the Arrow team, and thank you for your contributions!

Improved Java vector API, performance improvements

Siddharth Teotia led efforts to revamp the Java vector API to make things simpler and faster. As part of this, we removed the dichotomy between nullable and non-nullable vectors.

See Sidd’s blog post for more about these changes.

Decimal support in C++, Python, consistency with Java

Phillip Cloud led efforts this release to harden details about exact decimal values in the Arrow specification and ensure a consistent implementation across Java, C++, and Python.

Arrow now supports decimals represented internally as a 128-bit little-endian integer, with a set precision and scale (as defined in many SQL-based systems). As part of this work, we needed to change Java’s internal representation from big- to little-endian.

We are now integration testing decimals between Java, C++, and Python, which will facilitate Arrow adoption in Apache Spark and other systems that use both Java and Python.

Decimal data can now be read and written by the Apache Parquet C++ library, including via pyarrow.

In the future, we may implement support for smaller-precision decimals represented by 32- or 64-bit integers.

C++ improvements: expanded kernels library and more

In C++, we have continued developing the new arrow::compute submodule consisting of native computation fuctions for Arrow data. New contributor Licht Takeuchi helped expand the supported types for type casting in compute::Cast. We have also implemented new kernels Unique and DictionaryEncode for computing the distinct elements of an array and dictionary encoding (conversion to categorical), respectively.

We expect the C++ computation “kernel” library to be a major expansion area for the project over the next year and beyond. Here, we can also implement SIMD- and GPU-accelerated versions of basic in-memory analytics functionality.

As minor breaking API change in C++, we have made the RecordBatch and Table APIs “virtual” or abstract interfaces, to enable different implementations of a record batch or table which conform to the standard interface. This will help enable features like lazy IO or column loading.

There was significant work improving the C++ library generally and supporting work happening in Python and C. See the change log for full details.

GLib C improvements: Meson build, GPU support

Developing of the GLib-based C bindings has generally tracked work happening in the C++ library. These bindings are being used to develop data science tools for Ruby users and elsewhere.

The C bindings now support the Meson build system in addition to autotools, which enables them to be built on Windows.

The Arrow GPU extension library is now also supported in the C bindings.

JavaScript: first independent release on NPM

Brian Hulette and Paul Taylor have been continuing to drive efforts on the TypeScript-based JavaScript implementation.

Since the last release, we made a first JavaScript-only Apache release, version 0.2.0, which is now available on NPM. We decided to make separate JavaScript releases to enable the JS library to release more frequently than the rest of the project.

Python improvements

In addition to some of the new features mentioned above, we have made a variety of usability and performance improvements for integrations with pandas, NumPy, Dask, and other Python projects which may make use of pyarrow, the Arrow Python library.

Some of these improvements include:

  • Component-based serialization for more flexible and memory-efficient transport of large or complex Python objects
  • Substantially improved serialization performance for pandas objects when using pyarrow.serialize and pyarrow.deserialize. This includes a special pyarrow.pandas_serialization_context which further accelerates certain internal details of pandas serialization * Support zero-copy reads for
  • pandas.DataFrame using pyarrow.deserialize for objects without Python objects
  • Multithreaded conversions from pandas.DataFrame to pyarrow.Table (we already supported multithreaded conversions from Arrow back to pandas)
  • More efficient conversion from 1-dimensional NumPy arrays to Arrow format
  • New generic buffer compression and decompression APIs pyarrow.compress and pyarrow.decompress
  • Enhanced Parquet cross-compatibility with fastparquet and improved Dask support
  • Python support for accessing Parquet row group column statistics

Upcoming Roadmap

The 0.8.0 release includes some API and format changes, but upcoming releases will focus on ompleting and stabilizing critical functionality to move the project closer to a 1.0.0 release.

With the ecosystem of projects using Arrow expanding rapidly, we will be working to improve and expand the libraries in support of downstream use cases.

We continue to look for more JavaScript, Julia, R, Rust, and other programming language developers to join the project and expand the available implementations and bindings to more languages.