Apache Arrow 3.0.0 Release
Published
25 Jan 2021
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 3.0.0 release. This covers over 3 months of development work and includes 666 resolved issues from 106 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Columnar Format Notes
The Decimal256 data type, which was already supported by the Arrow columnar format specification, is now implemented in C++ and Java (ARROW-9747).
Arrow Flight RPC notes
Authentication in C++/Java/Python has been overhauled, allowing more flexible authentication methods and use of standard headers. Support for cookies has also been added. The C++/Java implementations are now more permissive when parsing messages in order to interoperate better with other Flight implementations.
A basic Flight implementation for C#/.NET has been added. See the implementation status matrix for details.
C++ notes
The default memory pool can now be changed at runtime using the environment
variable ARROW_DEFAULT_MEMORY_POOL
(ARROW-11009). The environment variable
is inspected at process startup. This is useful when trying to diagnose memory
consumption issues with Arrow.
STL-like iterators are now provided over concrete arrays. Those are useful for non-performance critical tasks, for example testing (ARROW-10776).
It is now possible to concatenate dictionary arrays with unequal dictionaries. The dictionaries are unified when concatenating, for supported data types (ARROW-5336).
Threads in a thread pool are now spawned lazily as needed for enqueued tasks, up to the configured capacity. They used to be spawned upfront on creation of the thread pool (ARROW-10038).
Compute layer
Comprehensive documentation for compute functions is now available: https://arrow.apache.org/docs/cpp/compute.html
Compute functions for string processing have been added for:
- splitting on whitespace (ASCII and Unicode flavors) and splitting on a pattern (ARROW-9991);
- trimming characters (ARROW-9128).
Behavior of the index_in
and is_in
compute functions with nulls has been
changed for consistency (ARROW-10663).
Multiple-column sort kernels are now available for tables and record batches (ARROW-8199, ARROW-10796, ARROW-10790).
Performance of table filtering has been vastly improved (ARROW-10569).
Scalar arguments are now accepted for more compute functions.
Compute functions quantile
(ARROW-10831) and is_nan
(ARROW-11043) have been
added for numeric data.
Aggregation functions any
(ARROW-1846) and all
(ARROW-10301) have been
added for boolean data.
Dataset
The Expression
hierarchy has simplified to a wrapper around literals, field references,
or calls to named functions. This enables usage of any compute function while filtering
with no boilerplate.
Parquet statistics are lazily parsed in ParquetDatasetFactory
and
ParquetFileFragment
for shorter construction time.
CSV
Conversion of string columns is now faster thanks to faster UTF-8 validation of small strings (ARROW-10313).
Conversion of floating-point columns is now faster thanks to optimized string-to-double conversion routines (ARROW-10328).
Parsing of ISO8601 timestamps is now more liberal: trailing zeros can be omitted in the fractional part (ARROW-10337).
Fixed a bug where null detection could give the wrong results on some platforms (ARROW-11067).
Added type inference for Date32 columns for values in the form YYYY-MM-DD
(ARROW-11247).
Feather
Fixed reading of compressed Feather files written with Arrow 0.17 (ARROW-11163).
Filesystem layer
S3 recursive tree walks now benefit from a parallel implementation, where reads of multiple child directories are now issued concurrently (ARROW-10788).
Improved empty directory detection to be mindful of differences between Amazon and Minio S3 implementations (ARROW-10942).
Flight RPC
IPv6 host addresses are now supported (ARROW-10475).
IPC
It is now possible to emit dictionary deltas where possible using the IPC
stream writer. This is governed by a new variable in the IpcWriteOptions
class
(ARROW-6883).
It is now possible to read wider tables, which used to fail due to reaching a limit during Flatbuffers verification (ARROW-10056).
Parquet
Fixed reading of LZ4-compressed Parquet columns emitted by the Java Parquet implementation (ARROW-11301).
Fixed a bug where writing multiple batches of nullable nested strings to Parquet would not write any data in batches after the first one (ARROW-10493)
The Decimal256 data type can be read from and written to Parquet (ARROW-10607).
LargeString and LargeBinary data can now be written to Parquet (ARROW-10426).
C# notes
The .NET package added initial support for Arrow Flight clients and servers. Support is enabled through two new NuGet packages Apache.Arrow.Flight (client) and Apache.Arrow.Flight.AspNetCore (server).
Also fixed an issue where ArrowStreamWriter wasn’t writing schema metadata to Arrow streams.
Julia notes
This is the first release to officially include an implementation for the Julia language. The pure Julia implementation includes support for wide coverage of the format specification. Additional details can be found in the julialang.org blog post.
Python notes
Support for Python 3.9 was added (ARROW-10224), and support for Python 3.5 was removed (ARROW-5679).
Support for building manylinux1 packages has been removed (ARROW-11212). PyArrow continues to be available as manylinux2010 and manylinux2014 wheels.
The minimal required version for NumPy is now 1.16.6. Note that when upgrading NumPy to 1.20, you also need to upgrade pyarrow to 3.0.0 to ensure compatibility, as this pyarrow release fixed a compatibility issue with NumPy 1.20 (ARROW-10833).
Compute functions are now automatically exported from C++ to the pyarrow.compute
module, and they have docstrings matching their C++ definition.
An iter_batches()
method is now available for reading a Parquet file iteratively
(ARROW-7800).
Alternate memory pools (such as mimalloc, jemalloc or the C malloc-based memory pool) are now available from Python (ARROW-11049).
Fixed a potential deadlock when importing pandas from several threads (ARROW-10519).
See the C++ notes above for additional details.
R notes
This release contains new features for the Flight RPC wrapper, better
support for saving R metadata (including sf
spatial data) to Feather and
Parquet files, several significant improvements to speed and memory management,
and many other enhancements.
For more on what’s in the 3.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
In Ruby binding, 256-bit decimal support and Arrow::FixedBinaryArrayBuilder
are added likewise C GLib below.
C GLib
In the version 3.0.0 of C GLib consists of many new features.
A chunked array, a record batch, and a table support sort_indices
function as well as an array.
These functions including array’s support to specify sorting option.
garrow_array_sort_to_indices
has been renamed to garrow_array_sort_indices
and the previous name has been deprecated.
GArrowField
supports functions to handle metadata.
GArrowSchema
supports garrow_schema_has_metadata()
function.
GArrowArrayBuilder
supports to add single null, multiple nulls, single empty value, and multiple empty values.
GArrowFixedSizedBinaryArrayBuilder
is newly supported.
256-bit decimal and extension types are newly supported. Filesystem module supports Mock, HDFS, S3 file systems. Dataset module supports CSV, IPC, and Parquet file formats.
Rust notes
Core Arrow Crate
The development of the arrow crate was focused on four main aspects:
- Make the crate usable in stable Rust
- Bug fixing and removal of
unsafe
code - Extend functionality to keep up with the specification
- Increase performance of existing kernels
Stable Rust
Possibly the biggest news for this release is that all project crates, including arrow, parquet, and datafusion now build with stable Rust by default. Nightly / unstable Rust is still required when enabling the SIMD feature.
Parquet Arrow writer
The Parquet Writer for Arrow arrays is now available, allowing the Rust programs to easily read and write Parquet files and making it easier to integrate with the overall Arrow ecosystem. The reader and writer include both basic and nested type support (List, Dictionary, Struct)
First Class Arrow Flight IPC Support
This release the Arrow Flight IPC implementation in Rust became fully-featured enough to participate in the regular cross-language integration tests, thus ensuring Rust applications written using Arrow can interoperate with the rest of the ecosystem
Performance
There have been numerous performance improvements in this release across the board. This includes both kernel
operations, such as take
, filter
, and cast
, as well as more fundamental parts such as bitwise comparison
and reading and writing to CSV.
Increased Data Type Support
New DataTypes:
- Decimal data type for fixed-width decimal values
Improved operation support for nested structures Dictionary, and Lists (filter, take, etc)
Other improvements:
- Added support for Date and time on FFI
- Added support for Binary type on FFI
- Added support for i64 sized arrays to “take” kernel
- Support for the i128 Decimal Type
- Added support to cast string to date
- Added API to create arrays out of existing arrays (e.g. for join, merge-sort, concatenate)
- The simd feature is now also available on aarch64
API Changes
- BooleanArray is no longer a PrimitiveArray
- ArrowNativeType no longer includes bool since arrows boolean type is represented using bitpacking
- Several Buffer methods are now infallible instead of returning a Result
- DataType::List now contains a Field to track metadata about the contained elements
- PrimitiveArray::raw_values, values_slice and values methods got replaced by a values method returning a slice
- Buffer::data and raw_data were renamed to as_slice and as_ptr
- MutableBuffer::data_mut and freeze were renamed to as_slice_mut and into to be more consistent with the stdlib naming conventions
- The generic type parameter for BufferBuilder was changed from ArrowPrimitiveType to ArrowNativeType
DataFusion
SQL
In this release, we clarified that DataFusion will standardize on the PostgreSQL SQL dialect.
New SQL support:
- JOIN, LEFT JOIN, RIGHT JOIN
- COUNT DISTINCT
- CASE WHEN
- USING
- BETWEEN
- IS IN
- Nested SELECT statements
- Nested expressions in aggregations
- LOWER(), UPPER(), TRIM()
- NULLIF()
- SHA224(), SHA256(), SHA384(), SHA512()
- DATE_TRUNC()
Performance
There have been numerous performance improvements in this release:
- Optimizations for JOINs such as using vectorized hashing.
- We started with adding statistics and cost-based optimizations. We choose the smaller side of a join as the build side if possible.
- Improved parallelism when reading partitioned Parquet data sources
- Concurrent writes of CSV and Parquet partitions to file
Parquet Crate
The Parquet has the following improvements:
- Nested reading
- Support to write booleans
- Add support to write temporal types
Roadmap for 4.0.0
We have also started building up a shared community roadmap for 4.0: Apache Arrow: Crowd Sourced Rust Roadmap for Arrow 4.0, January 2021.