Apache Arrow 2.0.0 Release
22 Oct 2020
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 2.0.0 release. This covers over 3 months of development work and includes 511 resolved issues from 81 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Since the 1.0.0 release, Jorge Leitão has been added as a committer. Thank you for your contributions!
As this is the first major release since 1.0.0, we remind everyone that we have moved to a “split” versioning system, where the Library version (now 2.0.0) evolves separately from the Format version (still 1.0.0). Major releases of the libraries may contain non-backward-compatible API changes, but they will not contain any incompatible format changes. See the Versioning and Stability page in the documentation for more.
Columnar format notes
The columnar format metadata has been updated to permit 256-bit decimal values in addition to 128-bit decimals. This change is backward and forward compatible.
Arrow Flight RPC notes
For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in
DoPut have been addressed. In C++ and Python, a deadlock
has been fixed in an edge case. Additionally, when supported by gRPC, TLS
verification can be disabled.
C++ notes
Parquet reading now fully supports round trips of arbitrarily nested data, including extension types with a nested storage type. In the process, several bugs in writing nested data and FixedSizeList were fixed. If you write data with these types, we recommend upgrading to this release and validating old data, as there is potential for data loss.
Datasets can now be written with partitions, including these features:
- Writing to Parquet, including control over accumulation of statistics for individual columns.
- Writing to IPC/Feather, including body buffer compression.
- Basenames of written files can be specified with a string template, allowing non-colliding writes into the same partitioned dataset.
Other notable features in the release include
- Compute kernels for standard deviation, variance, and mode
- Improvements to S3 support, including automatic region detection
- CSV reading now supports parsing Date types and creating Dictionary types
.NET notes
The .NET package has added a number of new features this release:
- Full support for …
- Synchronous write APIs for ArrowFileWriter. These are complementary to the existing async write APIs and can be used in situations where the async APIs can’t be used.
- The ability to use DateTime instances with …
Java notes
The Java package has gained a number of new features. Users can now validate vectors more thoroughly, at the cost of additional time. In dictionary encoding, dictionary indices can be expressed as unsigned integers. A framework for data compression has been set up for IPC.
The calculation for vector capacity has been simplified, so users should experience notable performance improvements for various ‘setSafe’ methods.
Bugs for JDBC adapters, sort algorithms, and ComplexCopier have been resolved to make them more usable.
JavaScript notes
The JavaScript build has been upgraded to use TypeScript 3.9, fixing generated …
Python notes
Parquet reading now supports round trips of arbitrarily nested data, and several bugs in writing nested data and FixedSizeList have been fixed. If you write data with these types, we recommend validating old data (there is potential for data loss) and upgrading to 2.0. Extension types with a nested storage type now round trip through Parquet.
The pyarrow.filesystem submodule is deprecated in favor of the new filesystem …
The custom serialization functionality (pyarrow.deserialize(), etc.) is deprecated. Those functions provided a Python-specific (not cross-language) serialization format that was not compatible with the standardized Arrow IPC serialization format. For arbitrary objects, you can use the standard library … instead. For pyarrow objects, you can use the IPC serialization format through the pyarrow.ipc module.
The pyarrow.compute module now has complete coverage of the available C++ compute kernels in the Python API. Several new kernels have been added.
The pyarrow.dataset module was further improved. In addition to reading, it is now also possible to write partitioned datasets (with …).
The Arrow <-> Python conversion code was refactored, fixing several bugs and corner cases.
Conversion of arrays of pyarrow.MapType to pandas has been added.
Conversion of timezone-aware datetimes to and from pyarrow arrays, including via pandas, now round-trips preserving the timezone. To use the old behavior (e.g. for Spark), set the environment variable PYARROW_IGNORE_TIMEZONE to a truthy value.
R notes
Highlights of the R release include
- Writing multi-file datasets with partitioning to Parquet or Feather
- Reading and writing directly to AWS S3, both individual files and multi-file datasets
- Bindings for Flight which use reticulate
In addition, the R package benefits from the various improvements in the C++ library listed above, including the ability to read and write Parquet files with nested struct and list types.
For more on what’s in the 2.0.0 R package, see the R changelog.
Ruby and C GLib notes
In the Ruby bindings, Arrow::Table#save now uses the number of rows as the chunk_size parameter by default when the table is saved to a Parquet file.
The GLib bindings newly support …
Moreover, the GLib bindings support new accessors for GArrowLargeListArray: …
Rust notes
Due to the high volume of activity in the Rust subproject in this release, we’re writing a separate blog post dedicated to those changes.