Apache Arrow 4.0.0 Release


Published 03 May 2021
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from 114 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions!

Arrow Flight RPC notes

In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default.

Arrow Flight is now packaged for C#/.NET.

Linux packages notes

There are Linux packages for C++ and C GLib. They were provided by Bintray but Bintray is no longer available as of 2021-05-01. They are provided by Artifactory now. Users needs to change the install instructions because the URL has changed. See the install page for new instructions. Here is a summary of needed changes.

For Debian GNU Linux and Ubuntu users:

  • Users need to change the apache-arrow-archive-keyring install instruction:
    • Package name is changed to apache-arrow-apt-source.
    • Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/....

For CentOS and Red Hat Enterprise Linux users:

  • Users need to change the apache-arrow-release install instruction:
    • Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/....

C++ notes

The Arrow C++ library now includes a vcpkg.json manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG to simplify installation of dependencies using the vcpkg package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the Building Arrow C++ and Developing on Windows docs pages for details.

The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316).

Non-monotonic dense union offsets are now disallowed as per the Arrow format specification, and return an error in Array::ValidateFull (ARROW-10580).

Compute layer

Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal.

Compute functions quantile (ARROW-10831) and power (ARROW-11070) have been added for numeric data.

Compute functions for string processing have been added for:

  • Trimming characters (ARROW-9128).
  • Extracting substrings captured by a regex pattern (extract_regex, ARROW-10195).
  • Computing UTF8 string lengths (utf8_length, ARROW-11693).
  • Matching strings against regex pattern (match_substring_regex, ARROW-12134).
  • Replacing non-overlapping substrings that match a literal pattern or regular expression (replace_substring and replace_substring_regex, ARROW-10306).

It is now possible to sort decimal and fixed-width binary data (ARROW-11662).

The precision of the sum kernel was improved (ARROW-11758).

CSV

A CSV writer has been added (ARROW-2229).

The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031).

Dataset

Arrow Datasets received various performance improvements and new features. Some highlights:

  • New columns can be projected from arbitrary expressions at scan time (ARROW-11174)
  • Read performance was improved for Parquet on high-latency filesystems (ARROW-11601) and in general when there are thousands of files or more (ARROW-8658)
  • Null partition keys can be written (ARROW-10438)
  • Compressed CSV files can be read (ARROW-10372)
  • Filesystems support async operations (ARROW-10846)
  • Usage and API documentation were added (ARROW-11677)

Files and filesystems

Fixed some rare instances of GZip files could not be read properly (ARROW-12169).

Support for setting S3 proxy parameters has been added (ARROW-8900).

The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391).

IPC

The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838).

The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797).

The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406).

An interoperability issue with the C# implementation was fixed (ARROW-12100).

JSON

A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065).

ORC

The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data.

Parquet

The Parquet C++ library version is now synced with the Arrow version (ARROW-7830).

Parquet DECIMAL statistics were previously calculated incorrectly, this has now been fixed (PARQUET-1655).

Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318).

C# notes

Arrow Flight is now packaged for C#/.NET.

Go notes

The go implementation now supports IPC buffer compression

Java notes

Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow).

JavaScript notes

  • The Arrow JS module is now tree-shakeable.
  • Iterating over Tables or Vectors is ~2X faster. Demo
  • The default bundles use modern JS.

Python notes

  • Limited support for writing out CSV files (only types that have cast implementation to String) is now available.
  • Writing parquet list types now has the option of enabling the canonical group naming according to the Parquet specification.
  • The ORC Writer is now available.

Creating a dataset with pyarrow.dataset.write_dataset is now possible from a Python iterator of record batches (ARROW-10882). The Dataset interface can now use custom projections using expressions when scanning (ARROW-11750). The expressions gained basic support for arithmetic operations (e.g. ds.field('a') / ds.field('b')) (ARROW-12058). See the Dataset docs for more details.

See the C++ notes above for additional details.

R notes

The dplyr interface to Arrow data gained many new features in this release, including support for mutate(), relocate(), and more. You can also call in filter() or mutate() over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl(), gsub(), etc.) and stringr (str_detect(), str_replace()) spellings.

Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset() to partition a very large file without having to read the whole file into memory.

For more on what’s in the 4.0.0 R package, see the R changelog.

C GLib and Ruby notes

C GLib

In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++.

  • gandiva-glib supports filtering by using the newly introduced GGandivaFilter, GGandivaCondition, and GGandivaSelectableProjector
  • The input property is added in GArrowCSVReader and GArrowJSONReader
  • GNU Autotools, namely configure script, support is dropped
  • GADScanContext is removed, and use_threads property is moved to GADScanOptions
  • garrow_chunked_array_combine function is added
  • garrow_array_concatenate function is added
  • GADFragment and its subclass GADInMemoryFragment are added
  • GADScanTask now holds the corresponding GADFragment
  • gad_scan_options_replace_schema function is removed
  • The name of Decimal128DataType is changed to decimal128

Ruby

In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib.

  • ArrowDataset::ScanContext is removed, and use_threads attribute is moved to ArrowDataset::ScanOptions
  • Arrow::Array#concatenate is added; it can concatenate not only an Arrow::Array but also a normal Array
  • Arrow::SortKey and Arrow::SortOptions are added for accepting Ruby objects as sort key and options
  • ArrowDataset::InMemoryFragment is added

Rust notes

This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold.

Arrow

  • Improved LargeUtf8 support
  • Improved null handling in AND/OR kernels
  • Added JSON writer support (ARROW-11310)
  • JSON reader improvements
  • LargeUTF8
    • Improved schema inference for nested list and struct types
  • Various performance improvements
  • IPC writer no longer calls finish() implicitly on drop
  • Compute kernels
    • Support for optional limit in sort kernel
    • Divide by a single scalar
    • Support for casting to timestamps
    • Cast: Improved support between casting List, LargeList, Int32, Int64, Date64
    • Kernel to combine two arrays based on boolean mask
    • Pow kernel
  • new_null_array for creating Arrays full of nulls.

Parquet

  • Added support for filtering row groups (used by DataFusion to implement filter push-down)
  • Added support for Parquet v 2.0 logical types

DataFusion

New Features

  • SQL Support
    • CTEs
    • UNION
    • HAVING
    • EXTRACT
    • SHOW TABLES
    • SHOW COLUMNS
    • INTERVAL
    • SQL Information schema
    • Support GROUP BY for more data types, including dictionary columns, boolean, Date32
  • Extensibility API
    • Catalogs and schemas support
    • Table deregistration
    • Better support for multiple optimizers
    • User defined functions can now provide specialized implementations for scalar values
  • Physical Plans
  • Hash Repartitioning
  • SQL Metrics

  • Additional Postgres compatible function library:
    • Length functions
    • Pad/trim functions
    • Concat functions
    • Ascii/Unicode functions
    • Regex
  • Proper identifier case identification (e.g. “Foo” vs Foo vs foo)
  • Upgraded to Tokio 1.x

Performance Improvements:

  • LIMIT pushdown
  • Constant folding
  • Partitioned hash join
  • Create hashes vectorized in hash join
  • Improve parallelism using repartitioning pass
  • Improved hash aggregate performance with large number of grouping values
  • Predicate pushdown support for table scans
  • Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics

API Changes

  • DataFrame methods now take Vec<Expr> rather than &[Expr]
  • TableProvider now consistently uses Arc<TableProvider> rather than Box<TableProvider>

Ballista

Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0