Apache Arrow 17.0.0 Release


Published 16 Jul 2024
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 17.0.0 release. This covers over 3 months of development work and includes 331 resolved issues on 529 distinct commits from 92 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 16.0.0 release, Dane Pitkin has been invited to be a committer. No new members have joined the Project Management Committee (PMC).

Thanks for your contributions and participation in the project!

Linux packages notes

  • We dropped support for Debian GNU/Linux bullseye

C Data Interface notes

  • ArrowDeviceArrayStream can now be imported and exported (GH-40078)

Arrow Flight RPC notes

  • Flight SQL was formally stabilized (GH-39204).
  • Flight SQL added a bulk ingestion command (GH-38255).
  • The JDBC Flight SQL driver now accepts “catalog” as a connection parameter (GH-41947).
  • “Stateless” prepared statements are now supported (GH-37220, GH-41262).
  • Java added FlightStatusCode.RESOURCE_EXHAUSTED (GH-35888).
  • C++ has some basic support for logging with OpenTelemetry (GH-39898).

C++ notes

For C++ notes refer to the full changelog.

Highlights

  • Half-float values can now be parsed and formatted correctly (GH-41089).
  • Record batches can now be converted to row-major tensors, not only column-major (GH-40866).
  • The CSV writer can now write large string arrays that exceed 2 GiB (GH-40270).
  • A possible invalid memory access in BooleanArray.true_count() has been fixed (GH-41016).
  • A new method FlattenRecursively allows recursive flattening of nested list and fixed-size list arrays (GH-41055).
  • The scratch space in some Scalar subclasses is now immutable. This is required for proper concurrent access to Scalar instances (GH-40069).
  • Calling the bit_width or byte_width method of an extension type now defers to the underlying storage type (GH-41353).
  • Fixed a bug where MapArray::FromArrays would behave incorrectly if the given offsets array has a non-zero offset (GH-40750).
  • MapArray::FromArrays now accepts an optional null bitmap argument (GH-41684).
  • The ARROW_NO_DEPRECATED_API macro was unused and has been removed (GH-41343).
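
To make the row-major versus column-major distinction in the tensor conversion item above concrete, here is a minimal pure-Python sketch. It does not call any Arrow API: plain lists stand in for columns, and the helper names to_column_major and to_row_major are hypothetical. It only shows how the same two-column batch would be ordered in a flat buffer under each layout.

```python
# Conceptual sketch only: plain Python lists stand in for Arrow columns.
# The helper names below are hypothetical, not part of any Arrow API.

def to_column_major(columns):
    """Flatten column by column (all of column 0, then all of column 1, ...)."""
    return [value for column in columns for value in column]


def to_row_major(columns):
    """Flatten row by row (row 0 across all columns, then row 1, ...)."""
    return [value for row in zip(*columns) for value in row]


# A stand-in "record batch" with two columns of three rows each.
batch_columns = [[1, 2, 3], [10, 20, 30]]

column_major = to_column_major(batch_columns)
row_major = to_row_major(batch_columns)

print(column_major)  # [1, 2, 3, 10, 20, 30]
print(row_major)     # [1, 10, 2, 20, 3, 30]
```

Column-major order keeps each column contiguous, which mirrors how columnar data is already stored; row-major order interleaves the columns so that each row is contiguous, which is the layout many tensor consumers expect by default.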

Acero

  • The left anti join filter no longer crashes when the filter rows are empty (GH-41121).
  • A race condition was fixed in the asof join (GH-41149).
  • A potential stack overflow has been fixed (GH-41334, GH-41738).
  • A potential crash on very large data has been fixed (GH-41813).
  • Asof join and sort merge join now support single threaded mode (GH-41190).

Compute

  • List views and maps are now supported by the if_else, case_when and coalesce functions (GH-41418).
  • List views are now supported by the functions list_slice (GH-42065), list_parent_indices (GH-42235), take and filter (GH-42116).
  • list_flatten can now be recursive, based on a new optional argument (GH-41183, GH-41055).
  • The take and filter functions have been made significantly faster on fixed-width types, including fixed-size lists of fixed-width types (GH-39798).
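
As a rough illustration of what recursive flattening means for list_flatten, the following pure-Python sketch contrasts removing one level of nesting with removing all of them. It is not Arrow's implementation and calls no Arrow API: nested Python lists stand in for list arrays, None stands in for a null list, and flatten_once and flatten_recursively are hypothetical names.

```python
# Conceptual sketch only: nested Python lists stand in for Arrow list arrays.
# flatten_once and flatten_recursively are hypothetical helper names.

def flatten_once(values):
    """Remove one level of list nesting, keeping deeper levels intact."""
    flattened = []
    for item in values:
        if item is None:
            continue  # a null list contributes no elements
        flattened.extend(item)
    return flattened


def flatten_recursively(values):
    """Keep removing list levels until no nested lists remain."""
    while any(isinstance(item, list) for item in values):
        values = flatten_once(values)
    return values


nested = [[[1, 2], [3]], [[4]], None]

print(flatten_once(nested))         # [[1, 2], [3], [4]]
print(flatten_recursively(nested))  # [1, 2, 3, 4]
```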

Dataset

  • Repeated scanning of an encrypted Parquet dataset now works correctly (GH-41431).

Filesystems

  • Standard filesystem implementations are now tracked in a global registry which also allows loading third-party filesystem implementations, for example from runtime-loaded DLLs (GH-40342,
  • Directory metadata operations on Azure filesystems are now more aligned with the common expectations for filesystems (GH-41034).
  • CopyFile is now supported for Azure filesystems with hierarchical namespace enabled (GH-41095).
  • Azure credentials can now be loaded explicitly from the environment (GH-39345), or using the Azure CLI (GH-39344).
  • A potential deadlock was fixed when closing an S3 output stream (GH-41862).

GPU

  • Non-CPU data can now be pretty-printed (GH-41664).
  • Non-CPU data with offsets, such as list and binary data, can now be properly sent over IPC (GH-42198).

IPC

  • Flatbuffers serialization is now more deterministic (GH-40361).

Parquet

  • A crash was fixed when reading an invalid Parquet file where columns claim to be of different lengths (GH-41317).
  • Definition and repetition levels are now more strictly checked, avoiding later crashes when reading an invalid Parquet file (GH-41321).
  • A crash was fixed when reading an invalid encrypted Parquet file (GH-43070).
  • Fixed a bug where DeltaLengthByteArrayEncoder::EstimatedDataEncodedSize could return an invalid estimate in some situations (GH-41545).
  • Delimiting records is now faster for columns with nested repeating (GH-41361).

Substrait

  • Support for more Arrow data types was added: some temporal types, half floats, large string and large binary (GH-40695).

C# notes

  • The performance of building Decimal arrays using SqlDecimal values was improved for .NET 7+ (GH-41349)
  • Scalar arrays now implement ICollection<T?> (GH-38692)
  • Concatenating arrays with a non-zero offset with ArrowArrayConcatenator was fixed (GH-41164)
  • Concatenating union arrays with ArrowArrayConcatenator was fixed (GH-41198)
  • Accessing values of decimal arrays with a non-zero offset was fixed (GH-41199)

Go notes

Bug Fixes

Arrow

  • Prevent exposure of invalid Go pointers in CGO code (GH-43062)
  • Fix memory leak for 0-length C array imports (GH-41534)
  • Ensure statement handle is updated so stateless prepared statements work properly (GH-41427)

Parquet

  • Fix memory leak in BufferedPageWriter (GH-41697)
  • Fix performance regression in PooledBufferWriter (GH-41541)

Enhancements

Arrow

  • Arrow Schemas and Records can now be created from Protobuf messages (GH-40494)

Parquet

  • Performance improvement for BitWriter VlqInt (GH-41160)

Java notes

Some changes are coming up in the next version, Arrow 18. Java 8 support will be removed. The version of the flight-core artifact with shaded gRPC will no longer be distributed.

  • Basic support for ListView (GH-41287) and StringView (GH-40339) has been added. These types should still be considered experimental.

JavaScript notes

  • General maintenance. Clean up packaging (GH-39722), update dependencies (GH-41905).

Python notes

Compatibility notes:

  • To ensure Python 3.13 compatibility, _Py_IsFinalizing has been replaced with a public API (GH-41475).
  • The C Data Interface now supports CUDA devices (GH-40384).

New features:

  • Added support for Emscripten via Pyodide (GH-41910).

Other improvements:

  • The ParquetWriter added the store_decimal_as_integer option (GH-42168).
  • The Float16 logical type is supported in Parquet (GH-42016).
  • Exposed bit_width and byte_width to extension types (GH-41389).
  • Added bindings for Device and MemoryManager classes (GH-41126).
  • The PyCapsule interface now exposes the device interface (GH-38325).
  • Various PyArrow APIs have been updated to work with non-CPU architectures gracefully. (GH-42112, GH-41664, GH-41662,
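
For background on the store_decimal_as_integer option listed above, the sketch below shows the idea of an integer-backed decimal: the value is kept as an unscaled integer and the scale is recorded separately. It uses only Python's decimal module and does not call PyArrow. The helper names are hypothetical, and the precision thresholds reflect the Parquet format's rules for integer-backed decimals as we understand them, so treat them as an assumption rather than a statement of the writer's exact behavior.

```python
from decimal import Decimal

# Conceptual sketch only: no PyArrow or Parquet library is used here.
# unscaled_value and physical_type_for are hypothetical helper names.


def unscaled_value(value, scale):
    """Return the integer that represents `value` at the given decimal scale."""
    scaled = value.scaleb(scale)  # shift the decimal point right by `scale`
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    return int(scaled)


def physical_type_for(precision):
    """Pick an integer width by precision (assumed thresholds, see lead-in)."""
    if 1 <= precision <= 9:
        return "INT32"
    if 10 <= precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"  # too wide for a 64-bit integer


print(unscaled_value(Decimal("123.45"), 2))  # 12345
print(unscaled_value(Decimal("-0.07"), 2))   # -7
print(physical_type_for(5))                  # INT32
print(physical_type_for(12))                 # INT64
```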

Relevant bug fixes:

  • Fixed Numpy 2.0 compatibility issues (GH-42170, GH-41924).
  • Fixed sporadic asof join failures (GH-41149).
  • Fixed a bug in RecordBatch.filter() when passing a ChunkedArray, which would cause a segfault (GH-38770).
  • Fixed a bug in RecordBatch.from_arrays() when passing a storage array, which would cause a segfault (GH-37669).
  • Fixed a bug where constructing a MapArray from Array could drop nulls (GH-41684).
  • Fixed a bug where RunEndEncodedArray.from_arrays failed if run_ends was a pyarrow.Array (GH-40560).
  • Fixed a regression introduced in PyArrow v16 in RecordBatchReader.cast() (GH-41884).

R notes

  • User-written R functions that call functions Arrow supports in dataset queries can now be used in queries too. Previously, only functions that used arithmetic operators worked. For example, time_hours <- function(mins) mins / 60 worked, but time_hours_rounded <- function(mins) round(mins / 60) did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see register_scalar_function(). GH-41223
  • mutate() expressions can now include aggregations, such as x - mean(x). GH-41350
  • summarize() supports more complex expressions, and correctly handles cases where column names are reused in expressions. GH-41323
  • The na_matches argument to the dplyr::*_join() functions is now supported. This argument controls whether NA values are considered equal when joining. GH-41223
  • R metadata, stored in the Arrow schema to support round-tripping data between R and Arrow/Parquet, is now serialized and deserialized more strictly. This makes it safer to load data from files from unknown sources into R data.frames. GH-41223
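
The effect of na_matches on a join can be sketched independently of R. The pure-Python example below is a conceptual illustration only, not the arrow package's implementation: None stands in for NA, the inner_join helper is hypothetical, and the two modes follow dplyr's documented "na" (NA keys match each other) and "never" (NA keys match nothing) semantics.

```python
# Conceptual sketch only: None plays the role of R's NA, and inner_join is a
# hypothetical helper, not part of dplyr or the arrow R package.

def inner_join(left, right, na_matches="na"):
    """Join two lists of (key, value) pairs on key."""
    joined = []
    for left_key, left_value in left:
        for right_key, right_value in right:
            if left_key is None or right_key is None:
                # Missing keys match each other only when na_matches == "na".
                keys_match = (
                    na_matches == "na"
                    and left_key is None
                    and right_key is None
                )
            else:
                keys_match = left_key == right_key
            if keys_match:
                joined.append((left_key, left_value, right_value))
    return joined


left = [(1, "a"), (None, "b")]
right = [(1, "x"), (None, "y")]

print(inner_join(left, right, na_matches="na"))     # [(1, 'a', 'x'), (None, 'b', 'y')]
print(inner_join(left, right, na_matches="never"))  # [(1, 'a', 'x')]
```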

For more on what’s in the 17.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

  • Improved Arrow::Table#to_s format
    • This is a breaking change

C GLib

  • Added support for Microsoft Visual C++
  • Added gadataset_dataset_to_record_batch_reader()

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.