Apache Arrow 4.0.0 Release
Published
03 May 2021
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from 114 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions!
Arrow Flight RPC notes
In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default.
Arrow Flight is now packaged for C#/.NET.
Linux packages notes
There are Linux packages for C++ and C GLib. They were provided by Bintray but Bintray is no longer available as of 2021-05-01. They are provided by Artifactory now. Users needs to change the install instructions because the URL has changed. See the install page for new instructions. Here is a summary of needed changes.
For Debian GNU Linux and Ubuntu users:
- Users need to change the
apache-arrow-archive-keyring
install instruction:- Package name is changed to
apache-arrow-apt-source
. - Download URL is changed to
https://apache.jfrog.io/artifactory/arrow/...
fromhttps://apache.bintray.com/arrow/...
.
- Package name is changed to
For CentOS and Red Hat Enterprise Linux users:
- Users need to change the
apache-arrow-release
install instruction:- Download URL is changed to
https://apache.jfrog.io/artifactory/arrow/...
fromhttps://apache.bintray.com/arrow/...
.
- Download URL is changed to
C++ notes
The Arrow C++ library now includes a vcpkg.json
manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG
to
simplify installation of dependencies using the vcpkg
package manager. This provides an alternative means of installing C++ library
dependencies on Linux, macOS, and Windows. See the
Building Arrow C++
and Developing on Windows
docs pages for details.
The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316).
Non-monotonic dense union offsets are now disallowed as per the Arrow format
specification, and return an error in Array::ValidateFull
(ARROW-10580).
Compute layer
Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal.
Compute functions quantile
(ARROW-10831) and power
(ARROW-11070) have been
added for numeric data.
Compute functions for string processing have been added for:
- Trimming characters (ARROW-9128).
- Extracting substrings captured by a regex pattern (
extract_regex
, ARROW-10195). - Computing UTF8 string lengths (
utf8_length
, ARROW-11693). - Matching strings against regex pattern (
match_substring_regex
, ARROW-12134). - Replacing non-overlapping substrings that match a literal pattern or regular
expression (
replace_substring
andreplace_substring_regex
, ARROW-10306).
It is now possible to sort decimal and fixed-width binary data (ARROW-11662).
The precision of the sum
kernel was improved (ARROW-11758).
CSV
A CSV writer has been added (ARROW-2229).
The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031).
Dataset
Arrow Datasets received various performance improvements and new features. Some highlights:
- New columns can be projected from arbitrary expressions at scan time (ARROW-11174)
- Read performance was improved for Parquet on high-latency filesystems (ARROW-11601) and in general when there are thousands of files or more (ARROW-8658)
- Null partition keys can be written (ARROW-10438)
- Compressed CSV files can be read (ARROW-10372)
- Filesystems support async operations (ARROW-10846)
- Usage and API documentation were added (ARROW-11677)
Files and filesystems
Fixed some rare instances of GZip files could not be read properly (ARROW-12169).
Support for setting S3 proxy parameters has been added (ARROW-8900).
The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391).
IPC
The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838).
The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797).
The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406).
An interoperability issue with the C# implementation was fixed (ARROW-12100).
JSON
A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065).
ORC
The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data.
Parquet
The Parquet C++ library version is now synced with the Arrow version (ARROW-7830).
Parquet DECIMAL statistics were previously calculated incorrectly, this has now been fixed (PARQUET-1655).
Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318).
C# notes
Arrow Flight is now packaged for C#/.NET.
Go notes
The go implementation now supports IPC buffer compression
Java notes
Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow).
JavaScript notes
- The Arrow JS module is now tree-shakeable.
- Iterating over Tables or Vectors is ~2X faster. Demo
- The default bundles use modern JS.
Python notes
- Limited support for writing out CSV files (only types that have cast implementation to String) is now available.
- Writing parquet list types now has the option of enabling the canonical group naming according to the Parquet specification.
- The ORC Writer is now available.
Creating a dataset with pyarrow.dataset.write_dataset
is now possible from a
Python iterator of record batches (ARROW-10882).
The Dataset interface can now use custom projections using expressions when
scanning (ARROW-11750). The expressions gained basic support for arithmetic
operations (e.g. ds.field('a') / ds.field('b')
) (ARROW-12058). See
the Dataset docs for more details.
See the C++ notes above for additional details.
R notes
The dplyr
interface to Arrow data gained many new features in this release, including support for mutate()
, relocate()
, and more. You can also call in filter()
or mutate()
over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl()
, gsub()
, etc.) and stringr
(str_detect()
, str_replace()
) spellings.
Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset()
to partition a very large file without having to read the whole file into memory.
For more on what’s in the 4.0.0 R package, see the R changelog.
C GLib and Ruby notes
C GLib
In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++.
- gandiva-glib supports filtering by using the newly introduced
GGandivaFilter
,GGandivaCondition
, andGGandivaSelectableProjector
- The
input
property is added inGArrowCSVReader
andGArrowJSONReader
- GNU Autotools, namely
configure
script, support is dropped GADScanContext
is removed, anduse_threads
property is moved toGADScanOptions
garrow_chunked_array_combine
function is addedgarrow_array_concatenate
function is addedGADFragment
and its subclassGADInMemoryFragment
are addedGADScanTask
now holds the correspondingGADFragment
gad_scan_options_replace_schema
function is removed- The name of
Decimal128DataType
is changed todecimal128
Ruby
In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib.
ArrowDataset::ScanContext
is removed, anduse_threads
attribute is moved toArrowDataset::ScanOptions
Arrow::Array#concatenate
is added; it can concatenate not only anArrow::Array
but also a normalArray
Arrow::SortKey
andArrow::SortOptions
are added for accepting Ruby objects as sort key and optionsArrowDataset::InMemoryFragment
is added
Rust notes
This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold.
Arrow
- Improved LargeUtf8 support
- Improved null handling in AND/OR kernels
- Added JSON writer support (ARROW-11310)
- JSON reader improvements
- LargeUTF8
- Improved schema inference for nested list and struct types
- Various performance improvements
- IPC writer no longer calls finish() implicitly on drop
- Compute kernels
- Support for optional
limit
in sort kernel - Divide by a single scalar
- Support for casting to timestamps
- Cast: Improved support between casting List, LargeList, Int32, Int64, Date64
- Kernel to combine two arrays based on boolean mask
- Pow kernel
- Support for optional
new_null_array
for creating Arrays full of nulls.
Parquet
- Added support for filtering row groups (used by DataFusion to implement filter push-down)
- Added support for Parquet v 2.0 logical types
DataFusion
New Features
- SQL Support
-
- CTEs
- UNION
- HAVING
- EXTRACT
- SHOW TABLES
- SHOW COLUMNS
- INTERVAL
- SQL Information schema
- Support GROUP BY for more data types, including dictionary columns, boolean, Date32
- Extensibility API
- Catalogs and schemas support
- Table deregistration
- Better support for multiple optimizers
- User defined functions can now provide specialized implementations for scalar values
- Physical Plans
- Hash Repartitioning
-
SQL Metrics
- Additional Postgres compatible function library:
- Length functions
- Pad/trim functions
- Concat functions
- Ascii/Unicode functions
- Regex
- Proper identifier case identification (e.g. “Foo” vs Foo vs foo)
- Upgraded to Tokio 1.x
Performance Improvements:
- LIMIT pushdown
- Constant folding
- Partitioned hash join
- Create hashes vectorized in hash join
- Improve parallelism using repartitioning pass
- Improved hash aggregate performance with large number of grouping values
- Predicate pushdown support for table scans
- Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics
API Changes
- DataFrame methods now take
Vec<Expr>
rather than&[Expr]
- TableProvider now consistently uses
Arc<TableProvider>
rather thanBox<TableProvider>
Ballista
Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0