Apache Arrow 5.0.0 Release
29 Jul 2021
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 5.0.0 release. This covers 3 months of development work and includes 684 commits from 99 distinct contributors in 2 repositories. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bugfixes and improvements have been made: we refer
you to the complete changelogs for the
Since the 4.0.0 release, Daniël Heres, Kazuaki Ishizaki, Dominik Moritz, and Weston Pace have been invited as committers to Arrow, and Benjamin Kietzman and David Li have joined the Project Management Committee (PMC). Thank you for all of your contributions!
Columnar Format Notes
.arrow as the IPC file format file extension and
the IPC streaming format file extension.
Arrow Flight RPC notes
The Go implementation now supports custom metadata and middleware, and has been added to integration testing.
In Python, some operations can now be interrupted via Control-C.
C++ notes
MakeArrayFromScalar now works for fixed-size binary types (ARROW-13321).
The following compute functions were added:
scalar arithmetic and math functions:
scalar bitwise functions:
scalar string functions:
scalar temporal functions:
other scalar functions:
Duplicates are now allowed in
Decimal types are now supported by some basic arithmetic functions (ARROW-12074).
take function now supports dense unions (ARROW-13005).
It is now possible to cast between dictionary types with different index types (ARROW-11673).
Sorting is now implemented for boolean input (ARROW-12016).
The streaming CSV reader can now take some advantage of multiple threads (ARROW-11889).
The CSV reader tries to make its errors more informative by adding the row number when it is known, i.e. when parallel reading is disabled (ARROW-12675).
A new option
ReadOptions::skip_rows_after_names allows skipping a number
of rows after reading the column names (as opposed to
Quoted strings can now be treated as always non-null (ARROW-10115).
The asynchronous scanner introduced in 4.0.0 has been improved with truly
asynchronous readers implemented for CSV, Parquet, and IPC file formats and
file-level parallelism added. This mode is controlled by a flag
can be passed into methods which scan a dataset. Setting this flag to True
should significantly improve performance on filesystems with high latency or
parallel reads (e.g. S3).
A CountRows method has been added to count rows matching a predicate; where possible, this will use metadata in files instead of reading the data itself.
CSV datasets can now be written, and when reading a CSV dataset, explicit types can now be specified for a subset of columns while allowing the rest to still be inferred.
IO and Filesystem layer
The I/O thread pool size can now be adjusted at runtime (ARROW-12760). The default size remains 8 threads.
Streams now can have auxiliary metadata, depending on the backend. This
has been implemented for the S3 filesystems, where a couple of metadata
keys are supported such as
ACL (ARROW-11161, ARROW-12719).
The HadoopFileSystem implementation now implements the FileSystem abstraction more faithfully (ARROW-12790).
The new LZ4_RAW compression scheme was implemented (PARQUET-1998). Unlike the legacy LZ4 compression scheme, it is defined unambiguously and should provide better portability once other Parquet implementations catch up.
Go notes
- Flight Client and Server now support Custom Metadata through the functions
flight.CreateServerBearerTokenAuthInterceptors have been deprecated in favor of using the new middleware. #10633
- Flight Client AuthHandler no longer overwrites outgoing metadata; new metadata is now appended to any existing metadata. #10297
- Flight AppMetadata field is now exposed both for Reading and Writing via
- Map and Extension Datatypes are now implemented for Arrow Arrays
- Schema package and first part of Encoding package added for Golang Parquet Implementation
Java notes
Highlighted improvements and fixes:
- Improved support for extension types using a complex storage type, e.g. struct, map or union. These can now extend
- Union vectors now extend AbstractContainerVector to be consistent with other vectors.
- Guava dependency updated to 30.1.1
- Memory leak fixed if an exception occurs when reading IPC messages from a channel. #10423
- Flight error metadata is now propagated to the client. #10370
- JDBC adapter now preserves nullability. #10285
- The memory rounding policy is respected when allocating vector buffers, which helps save memory. #10576
API compatibility changes:
- Complex vectors now return covariant types from
JavaScript notes
- Tables do not extend DataFrames anymore. This enables smaller bundles. #10277
- Arrow uses closure compiler for all UMD bundles, making them smaller. #10281
- The npm package now comes with declaration maps for better navigation from types to source code. #10673
- Updated dependencies and improvements to the code.
Python notes
- Datasets can now scan files asynchronously when the
use_async=True option is provided to
Dataset.to_batches methods. This should provide better performance in environments where I/O can be slow, such as with remote sources.
- Arrow now provides builtin support for writing CSV files through
- Wheels for Apple M1 Macs are now provided.
- Many new pyarrow.compute functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
- It is now possible to access ORC file metadata from Python
- Building a StructArray now accepts a mask like other arrays
- Many updates and fixes for the documentation
R notes
In this release, we’ve more than doubled the number of functions you can call on Arrow Datasets inside
arrange(), including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We’ve also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.
For more on what’s in the 5.0.0 R package, see the R changelog.
Ruby and C GLib notes
Apache Arrow Flight support has been started, but only ListFlights is supported for now. More features will be implemented in the next major release.
You need the gobject-introspection gem 3.4.5 or later to implement your Apache Arrow Flight server. If you only use the Apache Arrow Flight client, gobject-introspection gem 3.4.5 or later isn’t required.
Ruby
Here are highlighted improvements:
Compute functions accept raw Ruby objects such as
add_function = Arrow::Function.find("add")

# Not shortcut version
augend = Arrow::Int8Array.new([1, 2, 3])
addend = Arrow::Int8Scalar.new(5)
args = [
  Arrow::ArrayDatum.new(augend),
  Arrow::ScalarDatum.new(addend),
]
add_function.execute(args).value.to_a # => [6, 7, 8]

# Shortcut version
add_function.execute([[1, 2, 3], 5]).value.to_a # => [6, 7, 8]
Arrow::Buffer can be used as a MemoryView, which was added in Ruby 3.0.
There are some backward incompatible changes:
Arrow::CountMode are removed. Use
C GLib
There are some backward incompatible changes:
GArrowCountMode are removed. Use
- Prefix in arrow-dataset-glib is changed to
GADInMemoryScanTask are removed. Use
garrow_*_array_compare() are removed. Use
greater_than_equal compute functions directly instead.
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 5.0.0 release of the Rust implementation, see the Arrow Rust changelog and the Apache Arrow Rust 5.0.0 Release blog post.