Apache Arrow 3.0.0 Release


Published 25 Jan 2021
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 3.0.0 release. This covers over 3 months of development work and includes 666 resolved issues from 106 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes and improvements have been made; see the complete changelog for details.

Columnar Format Notes

The Decimal256 data type, which was already supported by the Arrow columnar format specification, is now implemented in C++ and Java (ARROW-9747).
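As a rough illustration of what a 256-bit decimal holds, the sketch below encodes a value as a 32-byte little-endian two's complement integer plus a separately tracked scale. The helper names `to_decimal256_bytes` and `from_decimal256_bytes` are hypothetical and are not part of any Arrow API; only finite values are handled.

```python
from decimal import Decimal


def to_decimal256_bytes(value: Decimal, scale: int) -> bytes:
    """Encode a finite Decimal as a 32-byte little-endian signed integer.

    The stored integer is value * 10**scale; the scale itself lives in the
    type metadata, not in the bytes. (Hypothetical helper for illustration.)
    """
    sign, digits, exponent = value.as_tuple()
    shift = exponent + scale
    if shift < 0:
        raise ValueError("value has more fractional digits than the scale allows")
    unscaled = int("".join(map(str, digits))) * 10 ** shift
    if sign:
        unscaled = -unscaled
    # Raises OverflowError if the value does not fit in 256 bits.
    return unscaled.to_bytes(32, "little", signed=True)


def from_decimal256_bytes(raw: bytes, scale: int) -> Decimal:
    """Decode the 32-byte representation back into an exact Decimal."""
    unscaled = int.from_bytes(raw, "little", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))
```

For example, `to_decimal256_bytes(Decimal("-123.45"), 2)` stores the integer -12345 in 32 bytes, and `from_decimal256_bytes` with the same scale recovers the original value.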

Arrow Flight RPC notes

Authentication in C++/Java/Python has been overhauled, allowing more flexible authentication methods and use of standard headers. Support for cookies has also been added. The C++/Java implementations are now more permissive when parsing messages in order to interoperate better with other Flight implementations.

A basic Flight implementation for C#/.NET has been added. See the implementation status matrix for details.

C++ notes

The default memory pool can now be changed at runtime using the environment variable ARROW_DEFAULT_MEMORY_POOL (ARROW-11009). The environment variable is inspected at process startup. This is useful when trying to diagnose memory consumption issues with Arrow.
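Because the variable is read at process startup, it has to be set before the Arrow-using process launches. A minimal sketch of one way to do that from Python follows. The helper `env_with_memory_pool` is hypothetical, and the backend names listed are an assumption about the accepted values; which backends are actually available depends on how Arrow was built.

```python
import os

# Assumed set of backend names; availability depends on the Arrow build.
KNOWN_BACKENDS = {"jemalloc", "mimalloc", "system"}


def env_with_memory_pool(backend, base_env=None):
    """Return a copy of the environment with ARROW_DEFAULT_MEMORY_POOL set.

    Hypothetical helper: pass the result as `env=` when spawning the
    process that uses Arrow.
    """
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown memory pool backend: {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["ARROW_DEFAULT_MEMORY_POOL"] = backend
    return env
```

The returned mapping can then be handed to `subprocess.run([...], env=env_with_memory_pool("system"))` so that the child process picks up the chosen pool when it starts.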

STL-like iterators are now provided over concrete arrays. They are useful for tasks that are not performance-critical, such as testing (ARROW-10776).

It is now possible to concatenate dictionary arrays with unequal dictionaries. The dictionaries are unified when concatenating, for supported data types (ARROW-5336).
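The idea behind unification can be sketched in plain Python: build one combined dictionary, then rewrite each chunk's indices through a per-chunk remapping table. This is only a conceptual model with a hypothetical function name, not the C++ implementation; arrays are modeled as `(dictionary, indices)` pairs and `None` stands for a null.

```python
def concat_dictionary_arrays(arrays):
    """Concatenate dictionary-encoded arrays with possibly different dictionaries.

    `arrays` is a list of (dictionary, indices) pairs. Returns a single
    (unified_dictionary, indices) pair. Hypothetical helper for illustration.
    """
    unified = []       # unified dictionary values, in first-seen order
    position = {}      # value -> index in `unified`
    out_indices = []
    for dictionary, indices in arrays:
        # Map this chunk's dictionary positions to unified positions.
        remap = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            remap.append(position[value])
        out_indices.extend(None if i is None else remap[i] for i in indices)
    return unified, out_indices
```

Concatenating `(["x", "y"], [0, 1, None, 0])` with `(["y", "z"], [1, 0])` yields the dictionary `["x", "y", "z"]` and the indices `[0, 1, None, 0, 2, 1]`.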

Threads in a thread pool are now spawned lazily as needed for enqueued tasks, up to the configured capacity. They used to be spawned upfront on creation of the thread pool (ARROW-10038).

Compute layer

Comprehensive documentation for compute functions is now available: https://arrow.apache.org/docs/cpp/compute.html

Compute functions for string processing have been added for:

  • splitting on whitespace (ASCII and Unicode flavors) and splitting on a pattern (ARROW-9991);
  • trimming characters (ARROW-9128).

Behavior of the index_in and is_in compute functions with nulls has been changed for consistency (ARROW-10663).

Multiple-column sort kernels are now available for tables and record batches (ARROW-8199, ARROW-10796, ARROW-10790).
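Conceptually, a multiple-column sort returns the row indices that would order the table by several keys, each ascending or descending. A minimal sketch of that idea follows, assuming columns are plain Python lists without nulls. The function name and the `(column, order)` key format are illustrative assumptions, not the Arrow API.

```python
def sort_indices(columns, sort_keys):
    """Return row indices that sort `columns` by `sort_keys`.

    `columns` maps column name -> list of values (no nulls in this sketch).
    `sort_keys` is a list of (name, "ascending" | "descending") pairs, most
    significant key first.
    """
    n_rows = len(next(iter(columns.values()))) if columns else 0
    indices = list(range(n_rows))
    # Apply stable sorts from the least to the most significant key, so ties
    # on a later pass keep the order established by earlier passes.
    for name, order in reversed(sort_keys):
        column = columns[name]
        indices.sort(key=lambda i: column[i], reverse=(order == "descending"))
    return indices
```

For a table with `city = ["b", "a", "b", "a"]` and `temp = [1, 5, 3, 2]`, sorting by `city` ascending and then `temp` descending gives the indices `[1, 3, 2, 0]`.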

Performance of table filtering has been vastly improved (ARROW-10569).

Scalar arguments are now accepted for more compute functions.

Compute functions quantile (ARROW-10831) and is_nan (ARROW-11043) have been added for numeric data.

Aggregation functions any (ARROW-1846) and all (ARROW-10301) have been added for boolean data.

Dataset

The Expression hierarchy has been simplified to a wrapper around literals, field references, or calls to named functions. This enables usage of any compute function while filtering, with no boilerplate.

Parquet statistics are lazily parsed in ParquetDatasetFactory and ParquetFileFragment for shorter construction time.

CSV

Conversion of string columns is now faster thanks to faster UTF-8 validation of small strings (ARROW-10313).

Conversion of floating-point columns is now faster thanks to optimized string-to-double conversion routines (ARROW-10328).

Parsing of ISO8601 timestamps is now more liberal: trailing zeros can be omitted in the fractional part (ARROW-10337).
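The sketch below shows what accepting a shortened fractional part means in practice: "01.5" and "01.500000000" denote the same instant. It is a simplified stand-in written for illustration, not a reproduction of the rules the Arrow CSV reader actually applies. Timestamps without a zone are treated as UTC here, which is an assumption of the sketch.

```python
import re
from datetime import datetime

# Date, 'T' or space, time, and an optional fraction of 1 to 9 digits.
_TIMESTAMP = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2})[T ]([0-9]{2}:[0-9]{2}:[0-9]{2})(?:\.([0-9]{1,9}))?"
)
_EPOCH = datetime(1970, 1, 1)


def parse_timestamp_ns(text):
    """Parse a timestamp into nanoseconds since the UNIX epoch (sketch)."""
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized timestamp: {text!r}")
    date_part, time_part, fraction = match.groups()
    moment = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    delta = moment - _EPOCH
    seconds = delta.days * 86400 + delta.seconds
    # Omitted trailing zeros are restored by right-padding to 9 digits.
    nanos = int((fraction or "").ljust(9, "0"))
    return seconds * 10**9 + nanos
```

With this helper, `parse_timestamp_ns("1970-01-01T00:00:01.5")` and `parse_timestamp_ns("1970-01-01 00:00:01.500000000")` return the same value.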

Fixed a bug where null detection could give the wrong results on some platforms (ARROW-11067).

Added type inference for Date32 columns for values in the form YYYY-MM-DD (ARROW-11247).
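A Date32 value is a count of days since the UNIX epoch. The following sketch shows the general shape of such an inference step: a column is treated as dates only if every value matches YYYY-MM-DD and names a real calendar day. The function name is hypothetical, and treating empty strings as nulls is an assumption of this sketch rather than a statement about the CSV reader's defaults.

```python
import re
from datetime import date

_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
_EPOCH = date(1970, 1, 1)


def infer_date32(values):
    """Return days-since-epoch for each value, or None if the column is not all dates.

    Empty strings become None (null) in the output; this is an assumption
    made for the sketch.
    """
    days = []
    for value in values:
        if value == "":
            days.append(None)
            continue
        if not _DATE.fullmatch(value):
            return None  # fall back to another type, e.g. string
        try:
            parsed = date.fromisoformat(value)
        except ValueError:
            return None  # right shape but not a valid calendar date
        days.append((parsed - _EPOCH).days)
    return days
```

For instance, `infer_date32(["1970-01-02", ""])` returns `[1, None]`, while a column containing `"hello"` or `"2021-02-30"` is rejected as a whole.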

Feather

Fixed reading of compressed Feather files written with Arrow 0.17 (ARROW-11163).

Filesystem layer

S3 recursive tree walks now benefit from a parallel implementation, in which reads of multiple child directories are issued concurrently (ARROW-10788).
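The benefit comes from overlapping the latency of many listing requests. The general pattern can be sketched with a thread pool that lists all directories at one depth concurrently before descending. This level-by-level scheme is a simplification chosen for the example and is not a description of how the C++ filesystem layer schedules its requests. `list_dir` is a hypothetical callback returning `(subdirectories, files)` for a path.

```python
from concurrent.futures import ThreadPoolExecutor


def walk_parallel(list_dir, root, max_workers=8):
    """Recursively collect file paths under `root`, listing siblings concurrently.

    `list_dir(path)` must return a (subdirs, files) pair of name lists.
    """
    files = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        level = [root]
        while level:
            # Issue the listings for every directory at this depth at once.
            results = list(pool.map(list_dir, level))
            next_level = []
            for path, (subdirs, names) in zip(level, results):
                files.extend(f"{path}/{name}" for name in names)
                next_level.extend(f"{path}/{sub}" for sub in subdirs)
            level = next_level
    return sorted(files)
```

In a real setting `list_dir` would wrap a remote listing call; for experimentation, a dictionary lookup such as `tree.__getitem__` over an in-memory tree works as a stand-in.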

Improved empty directory detection to be mindful of differences between Amazon and Minio S3 implementations (ARROW-10942).

Flight RPC

IPv6 host addresses are now supported (ARROW-10475).

IPC

The IPC stream writer can now emit dictionary deltas when the new dictionary allows it. This is governed by a new variable in the IpcWriteOptions class (ARROW-6883).
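A delta is only applicable when a later dictionary extends the one already sent: the writer then transmits just the appended entries instead of the whole dictionary. The check can be sketched as follows. The function name is hypothetical and the code models dictionaries as Python lists, so it illustrates the decision only and not the writer's actual code path.

```python
def dictionary_delta(previous, current):
    """Return the entries to send as a delta, or None if a delta cannot be used.

    A delta is possible only when `current` starts with all of `previous`,
    so existing indices keep their meaning.
    """
    if len(current) >= len(previous) and current[: len(previous)] == previous:
        return current[len(previous):]
    return None  # dictionary changed incompatibly; a delta does not apply
```

For example, going from `["a", "b"]` to `["a", "b", "c"]` needs only `["c"]` to be sent, whereas going to `["b", "a", "c"]` reorders existing entries and rules out a delta.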

It is now possible to read wider tables, which used to fail due to reaching a limit during Flatbuffers verification (ARROW-10056).

Parquet

Fixed reading of LZ4-compressed Parquet columns emitted by the Java Parquet implementation (ARROW-11301).

Fixed a bug where writing multiple batches of nullable nested strings to Parquet would not write any data in batches after the first one (ARROW-10493).

The Decimal256 data type can be read from and written to Parquet (ARROW-10607).

LargeString and LargeBinary data can now be written to Parquet (ARROW-10426).

C# notes

The .NET package added initial support for Arrow Flight clients and servers. Support is enabled through two new NuGet packages: Apache.Arrow.Flight (client) and Apache.Arrow.Flight.AspNetCore (server).

Also fixed an issue where ArrowStreamWriter wasn’t writing schema metadata to Arrow streams.

Julia notes

This is the first release to officially include an implementation for the Julia language. The pure Julia implementation covers a wide portion of the format specification. Additional details can be found in the julialang.org blog post.

Python notes

Support for Python 3.9 was added (ARROW-10224), and support for Python 3.5 was removed (ARROW-5679).

Support for building manylinux1 packages has been removed (ARROW-11212). PyArrow continues to be available as manylinux2010 and manylinux2014 wheels.

The minimal required version for NumPy is now 1.16.6. Note that when upgrading NumPy to 1.20, you also need to upgrade pyarrow to 3.0.0 to ensure compatibility, as this pyarrow release fixed a compatibility issue with NumPy 1.20 (ARROW-10833).

Compute functions are now automatically exported from C++ to the pyarrow.compute module, and they have docstrings matching their C++ definition.

An iter_batches() method is now available for reading a Parquet file iteratively (ARROW-7800).

Alternate memory pools (such as mimalloc, jemalloc or the C malloc-based memory pool) are now available from Python (ARROW-11049).

Fixed a potential deadlock when importing pandas from several threads (ARROW-10519).

See the C++ notes above for additional details.

R notes

This release contains new features for the Flight RPC wrapper, better support for saving R metadata (including sf spatial data) to Feather and Parquet files, several significant improvements to speed and memory management, and many other enhancements.

For more on what’s in the 3.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

The Ruby bindings add 256-bit decimal support and Arrow::FixedBinaryArrayBuilder, mirroring the C GLib additions described below.

C GLib

Version 3.0.0 of C GLib includes many new features.

Chunked arrays, record batches, and tables now support a sort_indices function, as arrays already did. These functions, including the array one, accept sorting options. garrow_array_sort_to_indices has been renamed to garrow_array_sort_indices and the previous name has been deprecated.

GArrowField supports functions to handle metadata. GArrowSchema supports the garrow_schema_has_metadata() function.

GArrowArrayBuilder supports adding a single null, multiple nulls, a single empty value, and multiple empty values. GArrowFixedSizedBinaryArrayBuilder is newly supported.

256-bit decimal and extension types are newly supported. The Filesystem module supports Mock, HDFS, and S3 file systems. The Dataset module supports CSV, IPC, and Parquet file formats.

Rust notes

Core Arrow Crate

The development of the arrow crate was focused on four main aspects:

  • Make the crate usable in stable Rust
  • Bug fixing and removal of unsafe code
  • Extend functionality to keep up with the specification
  • Increase performance of existing kernels

Stable Rust

Possibly the biggest news for this release is that all project crates, including arrow, parquet, and datafusion, now build with stable Rust by default. Nightly / unstable Rust is still required when enabling the SIMD feature.

Parquet Arrow writer

The Parquet Writer for Arrow arrays is now available, allowing Rust programs to easily read and write Parquet files and making it easier to integrate with the overall Arrow ecosystem. The reader and writer include both basic and nested type support (List, Dictionary, Struct).

First Class Arrow Flight IPC Support

In this release, the Arrow Flight IPC implementation in Rust became fully featured enough to participate in the regular cross-language integration tests, ensuring that Rust applications written using Arrow can interoperate with the rest of the ecosystem.

Performance

There have been numerous performance improvements in this release across the board. This includes both kernel operations, such as take, filter, and cast, as well as more fundamental parts such as bitwise comparison and reading and writing to CSV.

Increased Data Type Support

New DataTypes:

  • Decimal data type for fixed-width decimal values

Improved operation support (filter, take, etc.) for the nested structures Dictionary and List.

Other improvements:

  • Added support for Date and time on FFI
  • Added support for Binary type on FFI
  • Added support for i64 sized arrays to “take” kernel
  • Support for the i128 Decimal Type
  • Added support to cast string to date
  • Added API to create arrays out of existing arrays (e.g. for join, merge-sort, concatenate)
  • The simd feature is now also available on aarch64

API Changes

  • BooleanArray is no longer a PrimitiveArray
  • ArrowNativeType no longer includes bool, since Arrow's boolean type is represented using bit-packing
  • Several Buffer methods are now infallible instead of returning a Result
  • DataType::List now contains a Field to track metadata about the contained elements
  • The PrimitiveArray::raw_values, values_slice and values methods were replaced by a single values method returning a slice
  • Buffer::data and raw_data were renamed to as_slice and as_ptr
  • MutableBuffer::data_mut and freeze were renamed to as_slice_mut and into to be more consistent with the stdlib naming conventions
  • The generic type parameter for BufferBuilder was changed from ArrowPrimitiveType to ArrowNativeType
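The bit-packing mentioned above for the boolean type means eight values share one byte, with value i stored in bit (i mod 8) of byte (i div 8), counting from the least significant bit. The language-neutral sketch below, written in Python with hypothetical helper names, shows why a boolean cannot be treated as a native fixed-width element the way an i32 can.

```python
def pack_bools(values):
    """Pack booleans into bytes, least significant bit first."""
    packed = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def get_bool(packed, i):
    """Read the boolean at position `i` from a packed buffer."""
    return bool((packed[i // 8] >> (i % 8)) & 1)
```

Nine values therefore occupy two bytes, and reading one back requires a shift and a mask rather than a plain indexed load.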

DataFusion

SQL

In this release, we clarified that DataFusion will standardize on the PostgreSQL SQL dialect.

New SQL support:

  • JOIN, LEFT JOIN, RIGHT JOIN
  • COUNT DISTINCT
  • CASE WHEN
  • USING
  • BETWEEN
  • IS IN
  • Nested SELECT statements
  • Nested expressions in aggregations
  • LOWER(), UPPER(), TRIM()
  • NULLIF()
  • SHA224(), SHA256(), SHA384(), SHA512()
  • DATE_TRUNC()

Performance

There have been numerous performance improvements in this release:

  • Optimizations for JOINs such as using vectorized hashing.
  • We started adding statistics and cost-based optimizations: the smaller side of a join is chosen as the build side when possible.
  • Improved parallelism when reading partitioned Parquet data sources
  • Concurrent writes of CSV and Parquet partitions to file
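The build-side choice mentioned in the list above can be illustrated with a toy hash join: the hash table is built over the smaller input so that it uses less memory, and the larger input is streamed past it. This is a conceptual sketch with a hypothetical function name. It uses row counts directly, whereas a query planner would rely on statistics, and it is unrelated to DataFusion's actual Rust code.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner equi-join of two lists of dict rows on `key`.

    The smaller input is used as the build side; the other is probed.
    """
    build_is_left = len(left) <= len(right)
    build, probe = (left, right) if build_is_left else (right, left)

    # Build phase: hash the smaller input by join key.
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    # Probe phase: stream the larger input and emit matches.
    joined = []
    for row in probe:
        for match in table.get(row[key], ()):
            left_row, right_row = (match, row) if build_is_left else (row, match)
            joined.append({**left_row, **right_row})
    return joined
```

Swapping the two inputs changes which side is hashed, but the set of joined rows stays the same.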

Parquet Crate

The Parquet crate has the following improvements:

  • Nested reading
  • Support to write booleans
  • Support to write temporal types

Roadmap for 4.0.0

We have also started building up a shared community roadmap for 4.0: Apache Arrow: Crowd Sourced Rust Roadmap for Arrow 4.0, January 2021.