Apache Arrow 18.0.0 Release
Published
28 Oct 2024
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 18.0.0 release. This covers over 3 months of development work and includes 334 resolved issues on 530 distinct commits from 89 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 17.0.0 release, Will Ayd and Rossi Sun have been invited to be committers. No new members have joined the Project Management Committee (PMC).
Thanks for your contributions and participation in the project!
Columnar format
The Arrow columnar format now allows 32-bit and 64-bit decimal data, in addition to the already existing 128-bit and 256-bit decimal data types (GH-43956).
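As a rough illustration of how the four widths relate, the sketch below derives the largest precision (number of decimal digits) that fits in a signed integer of each width, then picks the narrowest width for a requested precision. The helper names max_precision and narrowest_decimal_bits are hypothetical and not part of any Arrow API; this is plain Python arithmetic, not Arrow code.

```python
# Bit widths of the decimal types named in the columnar format.
DECIMAL_WIDTHS = (32, 64, 128, 256)


def max_precision(bits):
    """Largest digit count p such that every p-digit value fits in a
    signed two's-complement integer of the given width."""
    limit = 2 ** (bits - 1) - 1
    precision = 0
    while 10 ** (precision + 1) - 1 <= limit:
        precision += 1
    return precision


def narrowest_decimal_bits(precision):
    """Hypothetical helper: smallest decimal width able to hold `precision` digits."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    for bits in DECIMAL_WIDTHS:
        if precision <= max_precision(bits):
            return bits
    raise ValueError(f"precision {precision} does not fit in any decimal width")


if __name__ == "__main__":
    for bits in DECIMAL_WIDTHS:
        print(bits, max_precision(bits))
```

Under these assumptions, a column whose values need at most 9 digits fits in 32 bits, and one that needs at most 18 digits fits in 64 bits, which is where the new types can save space compared to 128-bit decimals.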
Arrow Flight RPC notes
The Java implementation now transparently handles compressed Arrow data when reading, instead of requiring explicit configuration. (GH-43469) The Ruby bindings now support implementing DoPut on the server. (GH-43814)
C++ notes
The default memory pool has changed to mimalloc on all platforms (GH-43254).
Previously, jemalloc was used by default on Linux. Using mimalloc by default
provides a more consistent experience across different platforms, and
makes configuration easier. It is expected that this might either increase
or decrease performance on user workloads that use the default memory pool;
please benchmark accordingly. Jemalloc can still be selected by setting
the ARROW_DEFAULT_MEMORY_POOL
environment variable to “jemalloc”.
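One way to compare allocators on your own workload is to run it in a child process with the variable set, since the variable needs to be in place before Arrow is loaded. The sketch below uses PyArrow to report which backend was picked up; the helper names env_with_memory_pool and report_backend are hypothetical, and it assumes PyArrow is installed and that the requested allocator was compiled into your build.

```python
import os
import subprocess
import sys

# Backend names accepted by ARROW_DEFAULT_MEMORY_POOL.
KNOWN_BACKENDS = ("mimalloc", "jemalloc", "system")


def env_with_memory_pool(backend, base_env=None):
    """Return a copy of the environment with ARROW_DEFAULT_MEMORY_POOL set."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["ARROW_DEFAULT_MEMORY_POOL"] = backend
    return env


def report_backend(backend):
    """Start a fresh interpreter with the variable set and ask PyArrow
    which default memory pool it ended up using."""
    code = "import pyarrow as pa; print(pa.default_memory_pool().backend_name)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_memory_pool(backend),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(report_backend("jemalloc"))
```

The same environment can be passed to a benchmark script instead of the one-line check, which makes it straightforward to time a workload under each allocator.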
A new class arrow::ArrayStatistics
has been added to encode basic statistics
about an Arrow array. It provides a source-agnostic representation for statistics
provided by third-party sources such as Parquet files (GH-41909).
The new Decimal32 and Decimal64 types have been made available (GH-43956).
Several canonical extension types have been implemented:
- the Opaque extension type (GH-43454);
- the 8-bit boolean extension type (GH-17682);
- the UUID extension type (GH-15058);
- the JSON extension type (GH-32538).
Flight UCX is deprecated. We plan to remove this experiment in the next couple of releases.
Acero
- Enhanced the row-oriented representation by widening the offset type from 32-bit to 64-bit, resolving crashes and data corruption in aggregation and hash join on large datasets due to offset overflow (GH-43495).
- Improved ordered aggregation performance by reducing complexity from O(n*m) to O(n), where n is the number of rows and m the number of segments in the batch (GH-44052).
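To see why a 32-bit offset can become a problem on large inputs, the arithmetic below computes the byte offset of the last row in a row-oriented buffer. It is a simplified sketch that assumes fixed-width rows and unsigned 32-bit offsets; the function names are hypothetical and the actual Acero row layout is more involved.

```python
UINT32_MAX = 2 ** 32 - 1


def last_row_offset(num_rows, row_width_bytes):
    """Byte offset at which the last row starts, for fixed-width rows."""
    return (num_rows - 1) * row_width_bytes


def fits_in_uint32(offset):
    """True if the offset can be stored in an unsigned 32-bit integer."""
    return 0 <= offset <= UINT32_MAX


def wrap_uint32(offset):
    """Value an unsigned 32-bit integer would hold after overflowing."""
    return offset % 2 ** 32


if __name__ == "__main__":
    offset = last_row_offset(100_000_000, 64)
    print(offset, fits_in_uint32(offset), wrap_uint32(offset))
```

With 100 million rows of 64 bytes each, the last row starts at byte 6,399,999,936, which is beyond the 32-bit limit of 4,294,967,295. A wrapped offset would point at byte 2,105,032,640 instead, that is, at another row's data. A 64-bit offset represents the true value without wrapping.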
Compute
Casting between string-like and string-view-like types has been implemented (GH-42247).
Dataset
Filesystems
Writing small files to S3 can use a single S3 API call instead of three,
provided the new option allow_delayed_open
is enabled (GH-40557).
Files larger than 5 MB still go through the regular multipart
upload mechanism.
Background writes are now implemented and enabled by default for the Azure filesystem, dramatically improving the performance of writing to remote files (GH-40036).
Finalization of the S3 filesystem layer should hopefully be more robust (GH-44071).
Gandiva
LLVM 19.1 is now supported (GH-44222).
GPU
IPC
The seed corpus used for fuzzing the IPC reader has been improved, hopefully helping make the IPC reader even more robust against corrupt or malicious IPC streams (GH-38041).
Parquet
A new command line utility parquet-dump-footer
allows dumping the Thrift-encoded
footer metadata of a Parquet file, optionally scrubbing confidential data
(GH-42102). This is part of the effort to collect real-world Parquet metadata
so as to evaluate the efficiency of future improvements to the Parquet format.
Please see https://github.com/apache/parquet-benchmark for instructions to submit
footers representative of your own workloads.
C# notes
- Partial support has been added for LargeBinary, LargeString and LargeList. Column sizes cannot exceed 2 GB (GH-43266).
- Changes to Flight support were made for better control and compatibility, and to allow Flight Server to be hosted in pre-Kestrel versions of .NET (GH-43907, GH-43672, GH-41347).
- Support has been added for newly-defined types decimal32 and decimal64 (GH-44271).
- The import of sliced arrays through the C Data interface now works correctly. (GH-43267)
Java notes
Java 8 is no longer supported. (GH-38051)
Gandiva may not work in this release. For details, please see GH-43576.
Basic support for RunEndEncoded was added (GH-39982). The ListView/StringView vector implementations are now more complete, including C Data support (multiple issues).
Several APIs have been updated to accept long for addresses in preparation for FFM/large buffer support (GH-43902). We no longer expose sun.misc.Unsafe (GH-43479). We no longer ship the shaded flight-core JARs (GH-43217).
More options were added to the Dataset ScanOptions API (GH-28866).
JavaScript notes
- Accessing individual rows in Tables or Structs should now be more performant (GH-30863).
Python notes
Compatibility notes:
- NumPy is no longer a required dependency in pyarrow packaging GH-43846 and has been made an optional runtime dependency GH-25118.
- Support for Python 3.8 has been dropped GH-43518.
- The PyArrow C++ serialize/deserialize functions, which are no longer used, have been deprecated GH-44063.
- Passing build flags to setup.py (e.g. setup.py --with-parquet) has been deprecated GH-43514.
New features:
- Non-CPU work has continued with GH-43973, GH-43728, GH-43727, GH-43391, GH-42222 and GH-41665.
- The Arrow C++ arrow::dataset::Partitioning::Format method has been exposed in Python GH-43684.
- UUID canonical extension type is now supported in Python GH-15058.
- Opaque canonical extension type has been implemented GH-43454.
- StructArray.from_arrays now accepts a type in addition to names or fields GH-42014.
- New attributes have been added to StructType in order to access all its fields GH-30058.
Other improvements:
- Additional work has been done to support the free-threaded build of CPython 3.13: GH-44046, GH-44355 and GH-43964. Umbrella issue GH-43536.
- PyCapsule interface now has precedence over others in pa.schema(..) GH-43388.
- Usage of deprecated pkg_resources in setup.py has been replaced with numpy.get_include() GH-43532.
- Conversion from Arrow to JAX via DLPack was added to the documentation examples GH-44229.
Relevant bug fixes:
- pyarrow.Table.rename_columns should have accepted a tuple, not only a list or dict. This has been fixed GH-43588.
- Python reference handling in UDF implementation has been sanitized GH-43487.
- Files included when building wheels have been cleaned (unnecessary files removed) GH-43299.
R notes
For more on what’s in the 18.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Add workaround for install failure due to re2.pc on Ubuntu 20.04: GH-41396
- Add support for 0 decimal value: GH-43877
C GLib related improvements are also available in Ruby.
C GLib
- Add support for Azure file system: GH-43738
- FlightRPC: Add support for DoPut: GH-41056
- FlightRPC: Add support for timeout: GH-44178
- Parquet: Add support for writing a record batch: GH-40860
- Add support for pull style IPC stream format decoder: GH-40493
Rust notes and Go notes
The Rust and Go projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog. For notes on the latest release of the Go implementation, see the latest Arrow Go changelog.
Linux packages notes
The Azure file system is now enabled.