Apache Arrow 1.0.0 Release


Published 24 Jul 2020
By The Apache Arrow PMC

The Apache Arrow team is pleased to announce the 1.0.0 release. This covers over 3 months of development work and includes 810 resolved issues from 100 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

Despite the “1.0.0” version number, this is the 18th major release of Apache Arrow. It marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries.

The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes and improvements have been made; we refer you to the complete changelog.

1.0.0 Columnar Format and Stability Guarantees

The 1.0.0 release indicates that the Arrow columnar format is declared stable, with forward and backward compatibility guarantees.

The Arrow columnar format received several recent changes and additions, leading to the 1.0.0 format version:

  • The metadata version was bumped to a new version V5, indicating an incompatible change in the buffer layout of Union types. All other types keep the same layout as in V4. V5 also includes format additions to assist with forward compatibility (detecting unsupported changes sent by future library versions). Libraries remain backward compatible with data generated by all libraries back to 0.8.0 (December 2017) and the Java and C++ libraries are capable of generating V4-compatible messages (for sending data to applications using 0.8.0 to 0.17.1).

  • Dictionary indices are now allowed to be unsigned integers rather than only signed integers; see the sketch after this list. Using UInt64 is still discouraged because of poor Java support.

  • A “Feature” enum has been added to announce the use of specific optional features in an IPC stream, such as buffer compression. This new field is not used by any implementation yet.

  • Optional buffer compression using LZ4 or ZStandard was added to the IPC format.

  • Decimal types now have an optional “bitWidth” field, defaulting to 128.
    This will allow for future support of other decimal widths such as 32- and 64-bit.

  • The validity bitmap buffer has been removed from Union types. The nullity of a slot in a Union array is determined exclusively by the constituent arrays forming the union.
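
As a quick illustration of the unsigned-index change in Python (a minimal sketch, assuming pyarrow 1.0; the values and index width are chosen only for illustration):

```python
import pyarrow as pa

# Dictionary indices may now be unsigned integers: here uint8 indices
# reference entries in a small string dictionary.
indices = pa.array([0, 1, 1, 0], type=pa.uint8())
dictionary = pa.array(["apple", "orange"])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array.type)  # dictionary<values=string, indices=uint8, ordered=0>
```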

Integration testing has been expanded to test for extension types and nested dictionaries. See the implementation matrix for details.

Community

Since the last release, we have added two new committers:

  • Liya Fan
  • Ji Liu

Thank you for all your contributions!

Arrow Flight RPC notes

Flight now offers DoExchange, a fully bidirectional data endpoint, in addition to DoGet and DoPut, in C++, Java, and Python. Middleware in all languages now exposes binary-valued headers. Additionally, servers and clients can set Arrow IPC read/write options in all languages, making it easier to stay compatible with earlier versions of Arrow Flight.
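
As a rough sketch of what a bidirectional exchange looks like from the Python client side (the server address and the "echo" command are hypothetical, and the call assumes a running Flight service):

```python
import pyarrow as pa
import pyarrow.flight as flight

# DoExchange lets a client stream record batches to the server and read
# results back within the same call.
client = flight.FlightClient("grpc://localhost:8815")   # hypothetical server
descriptor = flight.FlightDescriptor.for_command(b"echo")

writer, reader = client.do_exchange(descriptor)
table = pa.table({"x": [1, 2, 3]})
writer.begin(table.schema)   # announce the schema of the outgoing stream
writer.write_table(table)    # send data to the server
writer.done_writing()        # half-close: no more data from the client
print(reader.read_all())     # read whatever the server streamed back
writer.close()
```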

In C++ and Python, Flight now exposes more options from gRPC, including the address of the client (on the server) and the ability to set low-level gRPC client options. Flight also supports mutual TLS authentication and the ability for a client to control the size of a data message on the wire.

C++ notes

  • Support for static linking with Arrow has been vastly improved, including the introduction of a libarrow_bundled_dependencies.a library bundling all external dependencies that are built from source by Arrow’s build system rather than installed by an external package manager. This makes it significantly easier to create dependency-free applications with all libraries statically-linked.
  • Following the Arrow format changes, Union arrays cannot have a top-level bitmap anymore.
  • A number of improvements were made to reduce the overall generated binary size in the Arrow library.
  • A convenience API GetBuildInfo allows querying the characteristics of the Arrow library. We encourage you to suggest any desired addition to the returned information.
  • We added an optional dependency on the utf8proc library, used in several compute functions (see below).
  • Instead of sharing the same concrete classes, sparse and dense unions now have separate classes (SparseUnionType and DenseUnionType, as well as SparseUnionArray, DenseUnionArray, SparseUnionScalar, and DenseUnionScalar).
  • Arrow can now be built for iOS using the right set of CMake options, though we don’t officially support it. See this writeup for details.

Compute functions

The compute kernel layer was extensively reworked. It now offers a generic function lookup, dispatch and execution mechanism. Furthermore, new internal scaffolding makes it vastly easier to write new function kernels, with many common details such as type checking and dispatch on type combinations handled by the framework rather than implemented manually by the function developer.

Around 30 new array compute functions have been added. For example, Unicode-compliant predicates and transforms, such as lowercase and uppercase conversion, are now available.
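
For instance, in Python (a minimal sketch; the input strings are arbitrary):

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["Hello", "Ångström", None])

# Unicode-aware case transforms backed by utf8proc
print(pc.utf8_upper(arr))    # ["HELLO", "ÅNGSTRÖM", null]

# The same kernels can also be looked up by name and invoked through the
# generic dispatch mechanism.
print(pc.call_function("utf8_lower", [arr]))
```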

The available compute functions are listed exhaustively in the Sphinx-generated documentation.

Datasets

Datasets can now be read from CSV files.

Datasets can be expanded to their component fragments, enabling fine-grained interoperability with other consumers of data files. Where applicable, metadata is available as a property of the fragment, including partition information and (for the Parquet format) per-column statistics.
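
A minimal Python sketch of both features, assuming a hypothetical directory of CSV files partitioned by year:

```python
import pyarrow.dataset as ds

# Discover a dataset of CSV files; the path and partitioning scheme are
# assumptions for illustration.
dataset = ds.dataset("data/events/", format="csv",
                     partitioning=ds.partitioning(field_names=["year"]))

for fragment in dataset.get_fragments():
    # Each fragment maps to one file and carries its partition information.
    print(fragment.path, fragment.partition_expression)
```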

Datasets of Parquet files can now be assembled from a single _metadata file, such as those created by systems like Dask and Spark. _metadata contains the metadata of all fragments, allowing construction of a statistics-aware dataset with a single I/O call.
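
A sketch of what this looks like in Python, assuming a Dask- or Spark-written directory containing a top-level _metadata file:

```python
import pyarrow.dataset as ds

# The dataset is assembled from the single _metadata file, without having
# to open every data file; the path is hypothetical.
dataset = ds.parquet_dataset("data/taxi/_metadata")
print(dataset.schema)
```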

Feather

The Feather format is now available in version 2, which is simply the Arrow IPC file format with another name.
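
In Python, the version can be selected when writing (a minimal sketch; version 2 is the default):

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"a": [1, 2, 3]})

# version=2 writes the Arrow IPC file format; version=1 keeps the legacy
# Feather format for older readers.
feather.write_feather(table, "example.feather", version=2)
print(feather.read_table("example.feather"))
```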

IPC

By default, we now write IPC streams with metadata V5. However, metadata V4 can be requested by setting the appropriate member in IpcWriteOptions. Both V4 and V5 metadata IPC streams can be read properly, with one exception: reading a V4 metadata stream containing Union arrays with top-level null values will be refused.

As noted above, apart from the Union layout change, there are no changes between V4 and V5 that break backward compatibility. For forward-compatibility scenarios (where you need to generate data to be read by an older Arrow library), you can set the V4 compatibility mode.
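
For example, in Python (a sketch that assumes the bindings expose the C++ IpcWriteOptions, including its metadata_version field):

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# Write a stream with V4 metadata so that readers on Arrow 0.8.0 through
# 0.17.1 can consume it.
options = pa.ipc.IpcWriteOptions(metadata_version=pa.ipc.MetadataVersion.V4)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)
```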

Support for dictionary replacement and dictionary delta was implemented.

Parquet

Writing files with the LZ4 codec is disabled because it produces files incompatible with the widely-used Hadoop Parquet implementation. Support will be reenabled once we align the LZ4 implementation with the special buffer encoding expected by Hadoop.

Java notes

The Java package introduces a number of low-level changes in this release. Most notable are the work supporting allocation of large Arrow buffers and the removal of Netty from the public API. Users will have to update their dependencies to use one of the two supported allocators: Netty (arrow-memory-netty) or Unsafe (arrow-memory-unsafe, built on the internal Java API for direct memory).

The Java vector implementation has improved interoperability: LargeVarChar, LargeBinary, LargeList, Union, and Extension types, as well as duplicate field names in Structs, have been verified to be binary compatible with C++ and the specification.

Python notes

The size of wheel packages has been significantly reduced, by up to 75%. One side effect is that these wheels no longer enable Gandiva (which requires the LLVM runtime to be statically linked). We are interested in providing Gandiva as a separate add-on Python wheel in the future.

The Scalar class hierarchy was reworked to more closely follow its C++ counterpart.

TLS CA certificates are looked up more reliably when using the S3 filesystem, especially with manylinux wheels.

The encoding of CSV files can now be specified explicitly, defaulting to UTF-8. Custom timestamp parsers can now be used for CSV files.
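
For example (a sketch with a hypothetical Latin-1 encoded file using day-first timestamps):

```python
import pyarrow.csv as csv

table = csv.read_csv(
    "events.csv",                                       # hypothetical file
    read_options=csv.ReadOptions(encoding="latin-1"),   # explicit encoding
    convert_options=csv.ConvertOptions(
        timestamp_parsers=["%d/%m/%Y %H:%M"],           # custom parser
    ),
)
```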

Filesystems can now be implemented in pure Python. As a result, fsspec-based filesystems can now be used in datasets.
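
A minimal sketch of wrapping an fsspec filesystem (the in-memory filesystem and path are purely for illustration):

```python
import fsspec
import pyarrow.fs as fs
import pyarrow.dataset as ds

# Wrap any fsspec filesystem so it can be used wherever a pyarrow
# filesystem is expected, including in datasets.
fsspec_fs = fsspec.filesystem("memory")
wrapped = fs.PyFileSystem(fs.FSSpecHandler(fsspec_fs))

dataset = ds.dataset("bucket/path/", format="parquet", filesystem=wrapped)
```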

parquet.read_table is now backed by the dataset API by default, enabling filters on any column and more flexible partitioning.
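
For example (the file path and column name are hypothetical):

```python
import pyarrow.parquet as pq

# Filters are no longer restricted to partition keys: with the dataset-backed
# reader, row groups are pruned using statistics on any column.
table = pq.read_table("data/taxi/", filters=[("passenger_count", ">", 2)])
```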

R notes

The R package added support for converting to and from many additional Arrow types. Tables showing how R types are mapped to Arrow types and vice versa have been added to the introductory vignette, and nearly all types are handled. In addition, R attributes like custom classes and metadata are now preserved when converting a data.frame to an Arrow Table and are restored when loading them back into R.

For more on what’s in the 1.0.0 R package, see the R changelog.

Ruby and C GLib notes

The Ruby and C GLib packages added support for the new compute function framework, with which users can look up a compute function dynamically and call it. Users don’t need to wait for a C GLib binding for new compute functions: if the C++ package provides a new compute function, users can use it without additional code in the Ruby and C GLib packages.

The Ruby and C GLib packages added support for Apache Arrow Dataset. The Ruby package provides a new gem for Apache Arrow Dataset, red-arrow-dataset. The C GLib package provides a new module for Apache Arrow Dataset, arrow-dataset-glib. They provide only a few features for now, but we will add more in future releases.

The Ruby and C GLib packages added support for reading only the specified row group in an Apache Parquet file.

Ruby

The Ruby package added support for column level compression in writing Apache Parquet files.

The Ruby package changed the Arrow::DictionaryArray#[] behavior. It now returns the dictionary value instead of the dictionary index. This is a backwards-incompatible change.

Rust notes

  • A new integration test crate has been added, allowing the Rust implementation to participate in integration testing.
  • A new benchmark crate has been added for benchmarking performance against popular data sets. The initial examples run SQL queries against the NYC Taxi data set using DataFusion. This is useful for comparing performance against other Arrow implementations.
  • Rust toolchain has been upgraded to 1.44 nightly.

Arrow Core

  • Support has been added for binary, string, and list arrays with i64 offsets, enabling large lists.
  • A new sort kernel has been added.
  • There have been various improvements to dictionary array support.
  • CSV reader enhancements include a new CsvReadOptions struct and support for schema inference from multiple CSV files.
  • There are significant (10x - 40x) performance improvements to SIMD comparison kernels.

DataFusion

  • There are numerous UX improvements to LogicalPlan and LogicalPlanBuilder, including support for named columns.
  • General improvements to code base, such as removing many uses of Arc and using slices instead of &Vec as function arguments.
  • ParquetScanExec performance has improved (almost 2x).
  • ExecutionContext can now be shared between threads.
  • Rust closures can now be used as Scalar UDFs.
  • Sort support has been added to SQL and LogicalPlan.