Apache Arrow 19.0.0 Release
Published 16 Jan 2025 by The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 19.0.0 release. This release covers over 2 months of development work and includes 202 resolved issues across 330 distinct commits from 67 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and cover only selected highlights of the release. Many other bugfixes and improvements have been made; see the complete changelog for the full list.
Community
Since the 18.1.0 release, Adam Reeve and Laurent Goujon have been invited to become committers. Gang Wu has been invited to join the Project Management Committee (PMC).
Thanks for your contributions and participation in the project!
Release Highlights
A bug has been identified in the 19.0.0 versions of the C++ and Python libraries that prevents reading Parquet files written by Arrow Rust v53.0.0 or higher. The files written by Arrow Rust are correct; the bug is in the patch that added support for Parquet’s SizeStatistics feature to Arrow C++ and Python. See #45293 for more details, including a potential workaround.
As a result, we plan to create a 19.0.1 release that includes a fix, which should be available in the next few weeks.
Columnar Format
We’ve added a new experimental specification for representing statistics on Arrow Arrays as Arrow Arrays. This is useful for preserving and exchanging statistics between systems such as when converting Parquet data to Arrow. See the statistics schema documentation for details.
We’ve expanded the Arrow C Device Data Interface to include an experimental Async Device Stream Interface. While the existing Arrow C Device Data Interface is a pull-oriented API, the Async interface provides a push-oriented design for other workflows. See the documentation for more information. It currently has implementations in the C++ and Go libraries.
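To make the pull versus push distinction concrete, here is a minimal sketch in plain Python. It does not use the Arrow C Device Data Interface or any Arrow API; the names pull_batches, PushProducer, on_next, and on_done are hypothetical and chosen only for illustration, and a list of integers stands in for a batch of data.

```python
from typing import Callable, Iterator, List

Batch = List[int]  # stand-in for a batch of data; not an Arrow type


def pull_batches(source: List[Batch]) -> Iterator[Batch]:
    """Pull style: the consumer asks for the next batch when it is ready."""
    for batch in source:
        yield batch


class PushProducer:
    """Push style: the producer calls the consumer's handlers as data arrives."""

    def __init__(self, source: List[Batch]) -> None:
        self._source = source

    def run(
        self,
        on_next: Callable[[Batch], None],
        on_done: Callable[[], None],
    ) -> None:
        for batch in self._source:
            on_next(batch)  # the producer drives the flow
        on_done()  # signal that no more batches will follow


data = [[1, 2], [3], [4, 5, 6]]

# Pull: the consumer controls the loop and requests each batch.
pulled = [sum(batch) for batch in pull_batches(data)]

# Push: the consumer only registers callbacks and reacts to them.
pushed: List[int] = []
events: List[str] = []
PushProducer(data).run(
    on_next=lambda batch: pushed.append(sum(batch)),
    on_done=lambda: events.append("done"),
)
```

Both styles deliver the same batches; the difference is which side decides when the next batch is handed over.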
Arrow Flight RPC Notes
The precision of a Timestamp (used for timeouts) is now nanoseconds on all platforms; previously it was platform-dependent. This may be a breaking change depending on your use case. (#44679)
The Python bindings now support various new fields that were added to FlightEndpoint/FlightInfo (like expiration_time). (#36954)
C++ Notes
Compute
- It is now possible to cast from a struct type to another struct type with additional columns, provided the additional columns are nullable (#44555).
- The compute function expm1 has been added to compute exp(x) - 1 with better accuracy when the input value is close to 0 (#44903).
- Hyperbolic trigonometric functions and their reciprocals have also been added (#44952).
- The new Decimal32 and Decimal64 types have been further supported by allowing casting between numeric, string, and other decimal types (#43956).
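The accuracy benefit of an expm1-style function can be seen with a short sketch. It uses math.expm1 from Python's standard library as a stand-in, not the Arrow compute kernel, and it assumes the Taylor approximation x + x**2/2 is a good enough reference value for a very small x.

```python
import math

x = 1e-10

# Naive form: exp(x) is rounded to a double very close to 1.0, so the
# subtraction cancels most of the significant digits.
naive = math.exp(x) - 1.0

# Dedicated form: computes exp(x) - 1 without that cancellation.
accurate = math.expm1(x)

# For small x, exp(x) - 1 = x + x**2/2 + ..., so this is a close reference.
reference = x + x * x / 2.0

naive_error = abs(naive - reference) / reference
accurate_error = abs(accurate - reference) / reference

print(f"naive: {naive!r} (relative error {naive_error:.1e})")
print(f"expm1: {accurate!r} (relative error {accurate_error:.1e})")
```

The naive form loses several significant digits here, while the dedicated function stays close to full double precision.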
Acero
- Added AVX2 support for decoding row tables in the Swiss join specialization of hash joins, enabling up to 40% performance improvement for build-heavy workloads. (#43693)
Filesystems
- The S3 filesystem has gained support for server-side encryption with customer-provided keys, also known as SSE-C. (#43535)
- The S3 filesystem also gained an option to disable the SIGPIPE signals that may be emitted on some network events. (#44695)
- The Azure filesystem now supports SAS token authentication. (#44308)
Flight RPC
- The precision of a Timestamp (used for timeouts) is now nanoseconds on all platforms; previously it was platform-dependent. This may be a breaking change depending on your use case. (#44679)
- The Python bindings now support various new fields that were added to FlightEndpoint/FlightInfo (like expiration_time). (#36954)
- The UCX backend has been deprecated and is scheduled for removal. (#45079)
Parquet
- The initial footer read size can now be configured to reduce the number of potential round-trips on high-latency filesystems such as S3. (#45015)
- The new SizeStatistics format feature has been implemented, though it is disabled by default when writing. (#40592)
- We’ve added a new method to the ParquetFileReader class, GetReadRanges, which can calculate the byte ranges necessary to read a given set of columns and row groups. This may be useful to pre-buffer file data via caching mechanisms. (#45092)
- We’ve added arrow::Result-returning variants for parquet::arrow::OpenFile() and parquet::arrow::FileReader::GetRecordBatchReader(). (#44784, #44808)
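The idea of calculating byte ranges for a set of columns and row groups can be sketched in plain Python. This is not the GetReadRanges implementation or its signature: the read_ranges function, the ChunkIndex layout, and the sample offsets are all hypothetical, and merging directly adjacent ranges is an illustrative choice rather than documented behavior.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (row_group, column) -> (byte offset, byte length).
# In a real Parquet file this information comes from the footer metadata.
ChunkIndex = Dict[Tuple[int, str], Tuple[int, int]]


def read_ranges(
    index: ChunkIndex, row_groups: List[int], columns: List[str]
) -> List[Tuple[int, int]]:
    """Return sorted (offset, length) ranges covering the requested chunks,
    merging ranges that are directly adjacent in the file."""
    wanted = sorted(index[(rg, col)] for rg in row_groups for col in columns)
    merged: List[Tuple[int, int]] = []
    for offset, length in wanted:
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This chunk starts where the previous range ends: extend it.
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


# Made-up file layout with two row groups and three columns each.
index: ChunkIndex = {
    (0, "a"): (4, 100), (0, "b"): (104, 50), (0, "c"): (154, 200),
    (1, "a"): (354, 100), (1, "b"): (454, 50), (1, "c"): (504, 200),
}

ranges = read_ranges(index, row_groups=[0, 1], columns=["a", "b"])
single = read_ranges(index, row_groups=[0], columns=["c"])
```

A caching layer could fetch the resulting ranges ahead of time so that later column reads do not each trigger a separate round-trip.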
C# Notes
- The PrimitiveArrayBuilder constructor has been made public to allow writing custom builders. (#23995)
- Improved the performance of looking up schema fields by name. (#44575)
Java, Go, and Rust Notes
The Java, Go, and Rust projects have moved to separate repositories outside the main Arrow monorepo.
- For notes on the latest release of the Java implementation, see the latest Arrow Java changelog.
- For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.
- For notes on the latest release of the Go implementation, see the latest Arrow Go changelog.
Linux Packaging Notes
- Debian: Fixed keyring format to support newer libapt (e.g., used by Trixie). (#45118)
Python Notes
New features:
- The upcoming pandas 3.0 string dtype is now supported by PyArrow’s to_pandas routine. In the future, when using pandas >=3.0, the new pandas behavior will be enabled by default. You can opt into the new behavior under pandas >=2.3 by setting pd.options.future.infer_string = True. This may be considered a breaking change. (#43683)
- Support for 32-bit and 64-bit decimal types was added. (#44713)
- Arrow PyCapsule stream objects are supported in write_dataset. (#43410)
- New Flight features have been exposed. (#36954)
- Bindings for JsonExtensionType and JsonArray were added. (#44066)
- Hyperbolic trigonometry functions added to the Arrow C++ compute kernels are also available in PyArrow. (#44952)
Other improvements:
- The strings_to_categorical keyword in to_pandas can now be used for the string view type. (#45175)
- from_buffers is updated to work with StringView. (#44651)
- Version suffixes are also set for Arrow Python C++ (libarrow_python*) libraries. (#44614)
Ruby and C GLib Notes
Ruby
- Added basic support for JRuby with an implementation based on Arrow Java (#44346). The plan is to release this as a gem once it covers a base set of features. See #45324 for more information.
- Added support for 32-bit and 64-bit decimal, binary view, and string view. See issues listed in the 19.0.0 milestone for more details.
- Fixed a bug where an empty struct list couldn’t be built. (#44742)
- Fixed a bug where record_batch[:column].size raised an exception. (#45119)
C GLib
- Added support for 32-bit and 64-bit decimal, binary view, and string view. See issues listed in the 19.0.0 milestone for more details.