Apache Arrow 15.0.0 Release
Published 21 Jan 2024
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 15.0.0 release. This covers over 3 months of development work and includes 344 resolved issues on 536 distinct commits from 101 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 14.0.0 release, Curt Hagenlocher, Xuwei Fu, James Duong and Felipe Oliveira Carvalho have been invited to be committers. Jonathan Keane and Raúl Cumplido have joined the Project Management Committee (PMC).
As per our tradition of rotating the PMC chair once a year, Andy Grove was elected as the new PMC chair and VP.
Thanks for your contributions and participation in the project!
C Data Interface notes
New format strings have been added for ListView, LargeListView, BinaryView and StringView array types.
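As a quick reference, the format strings for these four types can be collected in a small lookup table. The strings below are taken from the Arrow C Data Interface specification as we understand it; the table and the `describe_format` helper are hypothetical names used only for illustration and are not part of any Arrow library.

```python
# Format strings for the view types, as listed in the Arrow C Data
# Interface specification (an assumption to verify against the spec).
VIEW_FORMAT_STRINGS = {
    "vz": "BinaryView",
    "vu": "StringView",
    "+vl": "ListView",
    "+vL": "LargeListView",
}


def describe_format(fmt: str) -> str:
    """Return the view type name for a format string, or 'other'.

    Hypothetical helper: it only knows the four view-type strings above.
    """
    return VIEW_FORMAT_STRINGS.get(fmt, "other")
```

A consumer of an exported `ArrowSchema` would compare its `format` field against strings like these before deciding how to interpret the buffers.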
Arrow Flight RPC notes
Flight SQL is now considered stable (GH-39037). The Flight SQL specification was clarified regarding how the result set schema of a prepared statement is affected by bound parameters (GH-37061).
The JDBC Arrow Flight SQL driver now supports mTLS authentication (GH-38460) and bind parameters (GH-33475), follows the Flight RPC spec when fetching data (GH-34532), and can reuse credentials across metadata and data connections (GH-38576). On macOS it will also use the system keychain to be consistent with other platforms (GH-39014). Applications can also retrieve the underlying Flight RPC metadata from the JDBC driver (GH-38024, GH-38022).
C++ notes
For C++ notes refer to the full changelog.
Parquet
New features:
- Support row group filtering for nested paths for C++ and Parquet (GH-39064)
- Implement Parquet Float16 logical type (GH-36036)
- Expose sorting_columns in RowGroupMetaData for Parquet files (GH-35331)
- Support decompressing concatenated gzip members (stream) (GH-38271)
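To illustrate what "concatenated gzip members" means, here is a minimal sketch using only the Python standard library. It is not the Arrow C++ implementation; the function name `decompress_concatenated_gzip` is made up for this example. A gzip stream may consist of several complete members written back to back, and a decoder that stops after the first member silently drops the rest.

```python
import gzip
import zlib


def decompress_concatenated_gzip(data: bytes) -> bytes:
    """Decompress a buffer holding one or more back-to-back gzip members."""
    out = []
    while data:
        # wbits=31 selects the gzip container format (header and trailer).
        decoder = zlib.decompressobj(wbits=31)
        out.append(decoder.decompress(data))
        out.append(decoder.flush())
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        # Bytes left over after this member belong to the next member.
        data = decoder.unused_data
    return b"".join(out)


# Two independently compressed members, concatenated into one stream.
stream = gzip.compress(b"hello, ") + gzip.compress(b"arrow")
```

Decoding `stream` with a single `zlib.decompressobj(wbits=31)` yields only the first member; the loop above continues with `unused_data` until the input is exhausted.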
API change:
- Move EstimatedBufferedValueBytes from TypedColumnWriter to ColumnWriter (GH-38887)
- Change parquet TypedComparator operation to const methods (GH-38874)
- Remove deprecated AppendRowGroup(int64_t num_rows) (GH-39208)
- Add API to get RecordReader from RowGroupReader (GH-37002)
Bug fixes:
- Add more closed-file checks to ParquetFileWriter to prevent use-after-close (GH-38390)
Performance enhancement:
- Faster Scalar BYTE_STREAM_SPLIT encoding/decoding (GH-38542)
- Faster reading Parquet FLBA (GH-39124, GH-39413)
- Use bloom_filter_length from Parquet format 2.10 to optimize bloom filter reads (GH-38860)
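For readers unfamiliar with BYTE_STREAM_SPLIT, the following is a rough scalar sketch of the idea in pure Python, assuming little-endian 32-bit floats. It is unrelated to the optimized C++ code in this release, and the function names are invented for the example. The encoding does not shrink the data by itself: it regroups bytes so that byte 0 of every value is stored together, then byte 1, and so on, which tends to compress better afterwards.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter the bytes of fixed-width values into one stream per byte position."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream k holds byte k of every value; the streams are stored back to back.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Gather the per-position streams back into the original values."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    streams = [data[k * count:(k + 1) * count] for k in range(width)]
    # Interleave one byte from each stream to rebuild every value.
    raw = bytes(b for group in zip(*streams) for b in group)
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(count)]
```

Round-tripping values that are exactly representable as 32-bit floats, such as `[1.0, 2.5, -3.25]`, returns the same list.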
Miscellaneous:
- Upgrade ORC to 1.9.2 (GH-39340)
C# notes
Removal of build targets:
- Remove out-of-support versions of .NET and update C# README GH-31579
New features:
- Better support for decimal values which exceed the range of the BCL’s System.Decimal GH-38351, GH-38483
- Expose ArrayDataConcatenator.Concatenate publicly GH-38153
- Add ToString methods to Arrow classes GH-36566
- Implement common interfaces for structure arrays and record batches GH-38757
- Make primitive arrays support IReadOnlyList<T?> GH-38348, GH-39223
- Add ToList to Decimal128Array and Decimal256Array GH-37359
- Support additional types Interval, Utf8View, BinaryView and ListView GH-38316, GH-39341
- Support creating FlightClient with Grpc.Core.Channel GH-39335
Fixes and improved compatibility:
- Make dictionaries in file and memory implementations work correctly and support integration tests GH-32662
- Support blank column names and enable more integration tests GH-36588
Go Notes
Bug Fixes
Arrow
- Ensured reliability of AuthenticateBasicToken behind proxies (GH-38198)
- Ensured release callback is properly called on C Data imported arrays/batches (GH-38281)
- Fixed rounding errors in decimal256 string functions (GH-38395)
- Added ValueLen to Binary and String array interface (GH-38458)
- Fixed Decimal128 rounding issues (GH-38477)
- Fixed memory leak in IPC LZ4 decompressor (GH-38728)
- Addressed data race in GetToTimeFunc for fixed timestamp data types (GH-38795)
- Fixed “index out of range” error for empty resultsets of FlightSQL driver (GH-39238)
Parquet
- Fixed issue with max definition levels when writing a Parquet file under certain circumstances (GH-38503)
- File writer now properly tracks the number of rows written beyond the last row group (GH-38516)
Enhancements
Arrow
- Added an Avro OCF reader for converting Avro files directly to Arrow record batches (GH-36760)
- Added support for StringView (GH-38718) and C Data ABI StringViews (GH-39013)
- GC Checks were enabled for CI running integration tests (GH-38824)
Parquet
- Implemented Float16 logical type for Parquet files (GH-37582)
- Added proper boolean RLE encoding/decoding (GH-38462)
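As background on the Float16 logical type: to our understanding of the Parquet format, it annotates a 2-byte FIXED_LEN_BYTE_ARRAY holding an IEEE 754 half-precision value in little-endian order. The sketch below shows that byte layout with Python's `struct` module; the helper names are hypothetical and this is not how the Go or C++ writers are implemented.

```python
import struct


def float16_to_flba(value: float) -> bytes:
    """Pack a value as a 2-byte little-endian IEEE 754 half-precision float."""
    return struct.pack("<e", value)


def flba_to_float16(raw: bytes) -> float:
    """Unpack a 2-byte little-endian half-precision float."""
    if len(raw) != 2:
        raise ValueError("Float16 values occupy exactly 2 bytes")
    return struct.unpack("<e", raw)[0]
```

Values such as 1.5 survive the round trip exactly, while values like 0.1 are rounded to the nearest representable half-precision number.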
Java notes
We expect a breaking change in the next release, Arrow 16.0.0. Support for Java 9 modules is coming, but that will require changing the JVM flags used to launch your application (GH-38998). Arrow 15.0.0 is not affected.
A bill-of-materials (BOM) package was added to make it easier to depend on multiple Arrow libraries (GH-38264).
The JDBC adapter (separate from the JDBC driver) now supports 256-bit decimals (GH-39484) and throws more informative exceptions (GH-39355).
Various improvements were made to utilities for working with vectors (GH-38662, GH-38614, GH-38511, GH-38254, GH-38246).
JavaScript notes
This release comes with new features and APIs. We also removed getByteLength to reduce bundle sizes.
New Features with API changes
- GH-39017: [JS] Add typeId as attribute
- GH-39257: [JS] LargeBinary
- GH-15060: [JS] Add LargeUtf8 type
- GH-39259: [JS] Remove getByteLength
- GH-39435: [JS] Add Vector.nullable
- GH-39255: [JS] Allow customization of schema when passing vectors to table constructor
- GH-37983: [JS] Allow nullable fields in table when constructed from vector with nulls
Python notes
Compatibility notes:
- Legacy ParquetDataset custom implementation has been removed and only the new dataset API is now in use GH-31303.
New features:
- PyArrow version 14.0.0 included a new specification for Arrow PyCapsules and related dunder methods GH-35531. A public RecordBatchReader constructor from a stream object implementing the PyCapsule Protocol has now been added GH-39217, together with some additional documentation GH-39196.
- DLPack protocol support (producer) was added to Arrow C++ and is exposed in Python through the __dlpack__ and __dlpack_device__ dunder methods GH-33984.
- Python now exposes enabling CRC checksums for read and write operations in Parquet GH-37242. CRC checksums are optional and can detect data corruption.
- CacheOptions are now configurable from Python as part of pyarrow.dataset.ParquetFragmentScanOptions GH-36441.
- Parquet metadata indicating the sort order of the data is now exposed in RowGroupMetaData GH-35331.
- Parquet now supports writing and validating page CRCs (GH-37242).
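The idea behind page checksums can be sketched without PyArrow. The example below assumes, per our reading of the Parquet format, that the checksum is a standard CRC-32 (the same algorithm gzip uses) computed over the page bytes as stored. The function names and the sample payload are made up for illustration, and PyArrow performs the equivalent check internally when verification is enabled.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Compute a standard CRC-32 over the page bytes as an unsigned 32-bit value."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Return True if the bytes still match the checksum recorded at write time."""
    return page_crc(page_bytes) == stored_crc


# Writer side: record the checksum alongside the page (illustrative payload).
page = b"example page payload"
stored = page_crc(page)

# Simulate corruption on disk or in transit by flipping a single bit.
corrupted = bytes([page[0] ^ 0x01]) + page[1:]
```

Here `verify_page(page, stored)` succeeds and `verify_page(corrupted, stored)` fails, which is the kind of corruption the optional checksums are meant to catch.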
Other improvements:
- Append parameter from FileOutputStream is exposed for the OSFile class GH-38857.
- File size can be passed to make_fragment in the pyarrow datasets (pyarrow.dataset.FileFormat and pyarrow.dataset.ParquetFileFormat) GH-37857.
- Support for mask parameter is added to FixedSizeListArray.from_arrays GH-34316.
- to/from_struct_array are added to the pyarrow.Table class GH-33500.
- GIL is released in .nbytes, which improves performance when calculating the data size GH-39096.
- Usage of pandas internals DatetimeTZBlock has been removed GH-38341.
- DataType instance can be passed to MapType.from_arrays constructor GH-39515.
Relevant bug fixes:
- S3FileSystem equals None segfault has been fixed GH-38535.
- No-op kernel is added for dictionary_encode(dictionary) GH-34890.
- PrettyPrint for Timestamp type now adds “Z” at the end of the print string when tz is defined in order to add minimum information about the values being stored in UTC GH-30117.
R notes
New features:
- Bindings for base::prod have been added so you can now use it in your dplyr pipelines (i.e., tbl |> summarize(prod(col))) without having to pull the data into R GH-38601.
- Calling dimnames or colnames on Dataset objects now returns a useful result rather than just NULL GH-38377.
- The code() method on Schema objects now takes an optional namespace argument which, when TRUE, prefixes names with arrow:: which makes the output more portable GH-38144.
Other improvements:
- To make debugging problems easier when using arrow with AWS S3 (e.g., s3_bucket, S3FileSystem), the debug log level for S3 can be set with the AWS_S3_LOG_LEVEL environment variable. See ?S3FileSystem for more information. GH-38267
- An error is now thrown instead of warning and pulling the data into R when any of sub, gsub, stringr::str_replace, stringr::str_replace_all are passed a length > 1 vector of values in pattern GH-39219.
- Missing documentation was added to ?open_dataset documenting how to use the ND-JSON support added in arrow 13.0.0 GH-38258.
- Using arrow with duckdb (i.e., to_duckdb()) no longer results in warnings when quitting your R session. GH-38495
For more on what’s in the 15.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
C GLib
- Follow C++ changes.
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.