Apache Arrow 4.0.0 Release

Published 03 May 2021
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from 114 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions!

Arrow Flight RPC notes

In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default.

Arrow Flight is now packaged for C#/.NET.

Linux packages notes

There are Linux packages for C++ and C GLib. They were provided by Bintray but Bintray is no longer available as of 2021-05-01. They are provided by Artifactory now. Users needs to change the install instructions because the URL has changed. See the install page for new instructions. Here is a summary of needed changes.

For Debian GNU Linux and Ubuntu users:

Users need to change the apache-arrow-archive-keyring install instruction:
- Package name is changed to apache-arrow-apt-source.
- Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/....

For CentOS and Red Hat Enterprise Linux users:

Users need to change the apache-arrow-release install instruction:
- Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/....

C++ notes

The Arrow C++ library now includes a vcpkg.json manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG to simplify installation of dependencies using the vcpkg package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the Building Arrow C++ and Developing on Windows docs pages for details.

The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316).

Non-monotonic dense union offsets are now disallowed as per the Arrow format specification, and return an error in Array::ValidateFull (ARROW-10580).

Compute layer

Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal.

Compute functions quantile (ARROW-10831) and power (ARROW-11070) have been added for numeric data.

Compute functions for string processing have been added for:

Trimming characters (ARROW-9128).
Extracting substrings captured by a regex pattern (extract_regex, ARROW-10195).
Computing UTF8 string lengths (utf8_length, ARROW-11693).
Matching strings against regex pattern (match_substring_regex, ARROW-12134).
Replacing non-overlapping substrings that match a literal pattern or regular expression (replace_substring and replace_substring_regex, ARROW-10306).

It is now possible to sort decimal and fixed-width binary data (ARROW-11662).

The precision of the sum kernel was improved (ARROW-11758).

CSV

A CSV writer has been added (ARROW-2229).

The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031).

Dataset

Arrow Datasets received various performance improvements and new features. Some highlights:

New columns can be projected from arbitrary expressions at scan time (ARROW-11174)
Read performance was improved for Parquet on high-latency filesystems (ARROW-11601) and in general when there are thousands of files or more (ARROW-8658)
Null partition keys can be written (ARROW-10438)
Compressed CSV files can be read (ARROW-10372)
Filesystems support async operations (ARROW-10846)
Usage and API documentation were added (ARROW-11677)

Files and filesystems

Fixed some rare instances of GZip files could not be read properly (ARROW-12169).

Support for setting S3 proxy parameters has been added (ARROW-8900).

The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391).

IPC

The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838).

The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797).

The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406).

An interoperability issue with the C# implementation was fixed (ARROW-12100).

JSON

A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065).

ORC

The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data.

Parquet

The Parquet C++ library version is now synced with the Arrow version (ARROW-7830).

Parquet DECIMAL statistics were previously calculated incorrectly, this has now been fixed (PARQUET-1655).

Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318).

C# notes

Arrow Flight is now packaged for C#/.NET.

Go notes

The go implementation now supports IPC buffer compression

Java notes

Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow).

JavaScript notes

The Arrow JS module is now tree-shakeable.
Iterating over Tables or Vectors is ~2X faster. Demo
The default bundles use modern JS.

Python notes

Limited support for writing out CSV files (only types that have cast implementation to String) is now available.
Writing parquet list types now has the option of enabling the canonical group naming according to the Parquet specification.
The ORC Writer is now available.

Creating a dataset with pyarrow.dataset.write_dataset is now possible from a Python iterator of record batches (ARROW-10882). The Dataset interface can now use custom projections using expressions when scanning (ARROW-11750). The expressions gained basic support for arithmetic operations (e.g. ds.field('a') / ds.field('b')) (ARROW-12058). See the Dataset docs for more details.

See the C++ notes above for additional details.

R notes

The dplyr interface to Arrow data gained many new features in this release, including support for mutate(), relocate(), and more. You can also call in filter() or mutate() over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl(), gsub(), etc.) and stringr (str_detect(), str_replace()) spellings.

Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset() to partition a very large file without having to read the whole file into memory.

For more on what’s in the 4.0.0 R package, see the R changelog.

C GLib and Ruby notes

C GLib

In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++.

gandiva-glib supports filtering by using the newly introduced GGandivaFilter, GGandivaCondition, and GGandivaSelectableProjector
The input property is added in GArrowCSVReader and GArrowJSONReader
GNU Autotools, namely configure script, support is dropped
GADScanContext is removed, and use_threads property is moved to GADScanOptions
garrow_chunked_array_combine function is added
garrow_array_concatenate function is added
GADFragment and its subclass GADInMemoryFragment are added
GADScanTask now holds the corresponding GADFragment
gad_scan_options_replace_schema function is removed
The name of Decimal128DataType is changed to decimal128

Ruby

In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib.

ArrowDataset::ScanContext is removed, and use_threads attribute is moved to ArrowDataset::ScanOptions
Arrow::Array#concatenate is added; it can concatenate not only an Arrow::Array but also a normal Array
Arrow::SortKey and Arrow::SortOptions are added for accepting Ruby objects as sort key and options
ArrowDataset::InMemoryFragment is added

Rust notes

This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold.

Arrow

Improved LargeUtf8 support
Improved null handling in AND/OR kernels
Added JSON writer support (ARROW-11310)
JSON reader improvements
LargeUTF8
- Improved schema inference for nested list and struct types
Various performance improvements
IPC writer no longer calls finish() implicitly on drop
Compute kernels
- Support for optional limit in sort kernel
- Divide by a single scalar
- Support for casting to timestamps
- Cast: Improved support between casting List, LargeList, Int32, Int64, Date64
- Kernel to combine two arrays based on boolean mask
- Pow kernel
new_null_array for creating Arrays full of nulls.

Parquet

Added support for filtering row groups (used by DataFusion to implement filter push-down)
Added support for Parquet v 2.0 logical types

DataFusion

New Features

SQL Support
- CTEs
- UNION
- HAVING
- EXTRACT
- SHOW TABLES
- SHOW COLUMNS
- INTERVAL
- SQL Information schema
- Support GROUP BY for more data types, including dictionary columns, boolean, Date32
Extensibility API
- Catalogs and schemas support
- Table deregistration
- Better support for multiple optimizers
- User defined functions can now provide specialized implementations for scalar values
Physical Plans
Hash Repartitioning
SQL Metrics
Additional Postgres compatible function library:
- Length functions
- Pad/trim functions
- Concat functions
- Ascii/Unicode functions
- Regex
Proper identifier case identification (e.g. “Foo” vs Foo vs foo)
Upgraded to Tokio 1.x

Performance Improvements:

LIMIT pushdown
Constant folding
Partitioned hash join
Create hashes vectorized in hash join
Improve parallelism using repartitioning pass
Improved hash aggregate performance with large number of grouping values
Predicate pushdown support for table scans
Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics

API Changes

DataFrame methods now take Vec<Expr> rather than &[Expr]
TableProvider now consistently uses Arc<TableProvider> rather than Box<TableProvider>

Ballista

Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0