Apache Arrow 14.0.0 Release


Published 01 Nov 2023
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 14.0.0 release. This covers over 3 months of development work and includes 483 resolved issues from 116 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 13.0.0 release, Metehan Yildirim and Oleks V. have been invited to be committers.

Thanks for your contributions and participation in the project!

Columnar Format Notes

Motivated by recent innovations in DuckDB and Meta’s Velox engine, new “view” data types were added to the Arrow columnar format spec:

  • 16-byte StringView and BinaryView data types which enable better buffer reuse, faster “false” string comparisons (due to maintaining a prefix) and short string inlining (GH-35627).
  • ListView and LargeListView types for more performant “out-of-order” building and processing of lists and better buffer reuse (GH-37876).
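
To make the two layouts above concrete, here is a minimal pure-Python sketch. It is not the Arrow implementation: the helper names (encode_view, quick_not_equal, list_view_get) are hypothetical, and the buffer index and offset passed in are placeholder values. It assumes the view layout from the format spec, where values of up to 12 bytes are stored inline and longer values keep a 4-byte prefix plus a buffer index and offset.

```python
import struct

INLINE_MAX = 12  # values of up to 12 bytes fit inside the 16-byte view


def encode_view(data: bytes, buffer_index: int, offset: int) -> bytes:
    """Pack one 16-byte view for `data` (hypothetical helper).

    Short values are inlined. Longer values keep a 4-byte prefix plus a
    (buffer index, offset) pair that points at the out-of-line bytes.
    """
    if len(data) <= INLINE_MAX:
        # "12s" zero-pads the inlined bytes up to 12 bytes.
        return struct.pack("<i12s", len(data), data)
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)


def quick_not_equal(view_a: bytes, view_b: bytes) -> bool:
    """True when length or prefix already differ, so no data buffer is read."""
    return view_a[:8] != view_b[:8]


def list_view_get(values, offsets, sizes, i):
    """Element i of a list view: a slice described by its own offset and size."""
    return values[offsets[i]:offsets[i] + sizes[i]]


short = encode_view(b"arrow", 0, 0)
long_ = encode_view(b"apache arrow columnar", 0, 0)
first = list_view_get([1, 2, 3, 4, 5], offsets=[3, 0], sizes=[2, 3], i=0)
```

Note that the list view offsets in the last line are not in increasing order, which is what allows lists to be built "out of order" over a shared values buffer.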

A VariableShapeTensorType was added to the Arrow specification as a canonical extension type (GH-24868).

C Data Interface notes

Integration testing has been added for the C Data Interface across Arrow implementations, ensuring mutual compatibility (GH-37537). The C++, C# and Go implementations are covered, with Arrow Java soon to come.

Arrow Flight RPC notes

A new RPC method was added to allow polling for completion in long-running queries as an alternative to the blocking GetFlightInfo call (GH-36155). Also, app_metadata was added to FlightInfo and FlightEndpoint (GH-37635).
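
The difference between a blocking call and polling can be sketched with an in-process simulation. Everything below is hypothetical: FakeServer, PollResult, poll and run_to_completion are stand-ins invented for illustration and are not part of any Flight client API. The sketch only shows the shape of the loop, where each call returns immediately and the client retries until the server stops handing back something to retry with.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PollResult:
    """Hypothetical stand-in for a poll response."""
    progress: float             # fraction of the query completed so far
    next_ticket: Optional[str]  # None once the query has finished


class FakeServer:
    """In-process simulation of a long-running query (not a real server)."""

    def __init__(self, steps: int) -> None:
        self._steps = steps
        self._done = 0

    def poll(self, ticket: str) -> PollResult:
        # Each call returns immediately with whatever progress exists.
        self._done = min(self._done + 1, self._steps)
        finished = self._done == self._steps
        return PollResult(self._done / self._steps, None if finished else ticket)


def run_to_completion(server: FakeServer, ticket: str) -> List[float]:
    """Poll until the server no longer returns a ticket to retry with."""
    seen: List[float] = []
    current: Optional[str] = ticket
    while current is not None:
        result = server.poll(current)
        seen.append(result.progress)
        current = result.next_ticket
    return seen


progress_log = run_to_completion(FakeServer(steps=4), "query-1")
```

A real client would typically wait between polls and could surface the progress values to the user, which a single blocking call cannot do.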

In C++ and Python, an experimental asynchronous GetFlightInfo call was added to the client-side API (GH-36512). ServerCallContext now exposes conveniences to send headers/trailers without having to use middleware (GH-36952). The implementation was fixed to not reject unknown field tags to enable interoperability with future versions of Flight that could add new fields (GH-36975). The CMake configuration was fixed to correctly require linking to Arrow Flight RPC when using Arrow Flight SQL (GH-37406).

In Go, the underlying generated Protobuf code is now exposed for easier low-level integrations with Flight (GH-36893).

In Java, the stateful “login” authentication APIs using the Handshake RPC are deprecated; they will not be removed, but they should not be used unless you specifically want the old behavior (GH-37722). Utilities were added to help implement basic Flight SQL services for unit testing (GH-37795).

C++ notes

Experimental APIs for exporting and importing non-CPU arrays using the C Device Data Interface have been added (GH-36488), together with an experimental API for device synchronization (GH-36103).

Initial compatibility with Emscripten without threading support has been added (GH-35176).

Compute layer

New compute functions:

  • a cumulative_mean function on numeric data (GH-36931).
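
As a rough illustration of what a cumulative mean computes, here is a pure-Python sketch. It is not the Arrow kernel: the function below is a hypothetical reimplementation that ignores nulls and any options the real compute function may accept.

```python
from itertools import accumulate


def cumulative_mean(values):
    """Return the mean of values[0..i] for every position i."""
    running_sums = accumulate(values)
    return [total / count for count, total in enumerate(running_sums, start=1)]


result = cumulative_mean([2, 4, 9])
```

Each output element is the running sum divided by the number of elements seen so far, so the output has the same length as the input.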

Improved compute functions:

  • rounding functions now work natively on integer inputs instead of casting them to floats (GH-35273);
  • the divide function now supports duration inputs (GH-36789);
  • take and filter now support sparse unions in addition to dense unions (GH-36905);
  • if_else, coalesce, choose and case_when now support duration inputs (GH-37028);
  • casting between fixed-size lists and variable-size lists is now supported (GH-20086);
  • casting from strings to dates is now supported (GH-37411);
  • mean on integer inputs now uses a floating-point representation for its intermediate sum, avoiding integer overflow on large inputs (GH-34909).
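
Two of the items above concern numeric precision, and a small pure-Python sketch may help show why they matter. None of this is Arrow code: the function names are hypothetical, int64 wraparound is simulated by hand, and the tie-breaking rule (ties to the even multiple) is an assumption chosen for the example.

```python
def round_int_to_multiple(value: int, multiple: int) -> int:
    """Round to the nearest multiple with integer arithmetic only.

    Ties go to the even multiple. `multiple` must be positive.
    """
    quotient, remainder = divmod(value, multiple)
    twice = 2 * remainder
    if twice > multiple or (twice == multiple and quotient % 2 == 1):
        quotient += 1
    return quotient * multiple


def wrap_int64(value: int) -> int:
    """Simulate two's-complement int64 wraparound."""
    return (value + 2**63) % 2**64 - 2**63


def mean_with_int64_sum(values):
    """Mean with an int64 accumulator, which can overflow silently."""
    total = 0
    for v in values:
        total = wrap_int64(total + v)
    return total / len(values)


def mean_with_float_sum(values):
    """Mean with a floating-point accumulator, which does not wrap around."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


big = 2**53 + 1                   # not exactly representable as a float64
via_float = int(float(big))       # precision is lost in the round trip
native = round_int_to_multiple(big, 1)

large = [2**62, 2**62, 2**62]     # the sum exceeds the int64 range
wrapped_mean = mean_with_int64_sum(large)
float_mean = mean_with_float_sum(large)
```

Here via_float differs from big while native does not, and wrapped_mean comes out negative while float_mean equals 2**62.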

Datasets

Support for writing encrypted Parquet datasets has been added (GH-29238).

Gandiva

Gandiva now supports linking dynamically to LLVM on non-Windows platforms (GH-37410). Previously, Gandiva would always link LLVM statically into libgandiva.

Parquet

RLE is used by default when encoding boolean values if v2 data pages are enabled (GH-36882).
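
Run-length encoding suits boolean columns because they often contain long runs of the same value. The sketch below shows only the basic idea in pure Python. It is a simplification: Parquet's RLE encoding is a hybrid of run-length encoding and bit-packing, and the function names here are hypothetical.

```python
from itertools import groupby


def run_length_encode(bits):
    """Collapse consecutive equal booleans into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(bits)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


bits = [True] * 6 + [False] * 3 + [True]
runs = run_length_encode(bits)
```

Ten values collapse into three pairs here; data that alternates on every value would gain nothing from this scheme.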

Page indexes can now be encrypted as per the specification (GH-34950).

A bug in the DELTA_BINARY_PACKED encoder leading to suboptimal column sizes was fixed (GH-37939).

Substrait

It is now possible to serialize and deserialize individual expressions using Substrait, not only full query plans (GH-33985).

Miscellaneous

A new CodecOptions class allows customizing compression parameters per-codec (GH-35287).

The environment variable AWS_ENDPOINT_URL is now respected when resolving S3 URIs (GH-36770).

Recursively listing S3 filesystem trees should now issue fewer requests, leading to improved performance (GH-34213).

Comparing a ChunkedArray to itself now behaves correctly with NaN values (GH-37515).
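
Self-comparison with NaN is easy to get wrong because NaN is not equal to itself. The sketch below shows one plausible way such a bug can arise, namely an identity shortcut that assumes an object always equals itself. This is an assumption made for illustration; the release note does not state the actual cause, and the functions are hypothetical pure-Python stand-ins operating on lists of chunks.

```python
import math


def chunks_equal(left, right):
    """Compare two chunk lists value by value; NaN never equals NaN."""
    flat_left = [v for chunk in left for v in chunk]
    flat_right = [v for chunk in right for v in chunk]
    if len(flat_left) != len(flat_right):
        return False
    return all(a == b for a, b in zip(flat_left, flat_right))


def chunks_equal_with_shortcut(left, right):
    """Flawed variant: assumes an object is always equal to itself."""
    if left is right:
        return True
    return chunks_equal(left, right)


data = [[1.0, math.nan], [3.0]]
shortcut_result = chunks_equal_with_shortcut(data, data)
elementwise_result = chunks_equal(data, data)
```

The shortcut reports equality for data compared with itself, while the element-wise comparison does not, because of the NaN value.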

The use of BMI2 instructions on x86 was incorrectly guarded. Those instructions could be executed on platforms without BMI2 support, leading to crashes (GH-37017).

C# notes

The following features have been added to the C# implementation, along with other minor features and fixes:

  • Support fixed-size lists (GH-33032)
  • Support DateOnly and TimeOnly on .NET 6.0+ (GH-34620)
  • Implement MapType (GH-35243)
  • Flight SQL implementation for C# (GH-36078)
  • Implement support for dense and sparse unions (GH-36795)

Go Notes

  • The minimum version of Go officially supported is now go1.19 instead of go1.17 (GH-37636)

Bug Fixes

Arrow

  • Documentation fixed to correctly state that the default unit for TimestampType is seconds (GH-35770)
  • Fixed leak in the Concatenate function if there is a panic that is recovered (GH-36850)
  • Ensure Binary dictionary indices are released on panic (GH-36858)
  • Fix overflow value causing invalid dates for MarshalJSON on some timestamps (GH-36935)
  • Fix leaking dictionary allocations in IPC reader (GH-36981)

Parquet

  • Fixed an issue where DeltaLengthByteArray encoding fails on certain null value scenarios (GH-36318)
  • Correctly propagate internal writer.sink.Close() errors from writer.Close() when writing a Parquet file (GH-36645)
  • Fixed a panic when writing some specific DeltaBitPacked datasets (GH-37102)
  • Proper support for Decimal256 data type in Parquet lib (GH-37419)
  • Corrected inconsistent behavior in pqarrow column chunk reader (GH-37845)
  • Rewrote and fixed ARM64 assembly for bitmap bit extractions and integer packing (GH-37712)

Enhancements

  • C Data Interface integration testing has been added and implemented (GH-37789)
  • pkg.go.dev link is fixed in the Readme (GH-37779)

Arrow

  • Added String() method to arrow.Table (GH-35296)
  • Add proper array.Null type support handling for arrow/csv writing (GH-36623)
  • Optimized GetOrInsert function for memo table handling of dictionary builders (GH-36671)
  • Made it possible to add custom functions in the compute package (GH-36936)
  • Improved performance of dictionary unifier (GH-37306)
  • Added direct access to dictionary builder indices (GH-37416)
  • Added ability to read back values from Boolean builders (GH-37465)
  • Add ValueLen function to string array (GH-37584)
  • Avoid unnecessary copying in the default go allocator (GH-37687)
  • Add SetNull(i int) to array builders (GH-37694)

Parquet

  • Metadata can now be written after writing row groups when using pqarrow.FileWriter (GH-35775)
  • MapOf and ListOf helper functions have been improved to provide clearer error messages and have better documentation (GH-36696)
  • A struct tag of parquet:"-" can now be used to skip fields when converting a struct to a Parquet schema (GH-36793)

Java notes

Java 21 is enabled and validated in CI (GH-37914).

The Gandiva module implemented a breaking change by moving Types.proto into a subfolder (GH-37893).

DefaultVectorComparators added support for LargeVarCharVector, LargeVarBinaryVector (GH-25659) and for BitVector, DateDayVector, DateMilliVector, Decimal256Vector, DecimalVector, DurationVector, IntervalDayVector, TimeMicroVector, TimeMilliVector, TimeNanoVector, TimeSecVector, TimeStampVector (GH-37701).

A bug was fixed in VectorAppender to prevent resizing the data buffer twice when appending variable-length vectors (GH-37829).

VarCharWriter added support for writing from Text and String (GH-37706). VarBinaryWriter added support for writing from byte[] and ByteBuffer (GH-37705).

The JDBC driver will now ignore username and password authentication if a token is provided (GH-37073).

A bug was fixed in the Java C-Data interface when importing a vector with an empty array (GH-37056).

A bug was fixed in the S3 file system implementation when closing the connection (GH-36069).

Arrow datasets now support Substrait ExtendedExpressions as inputs to filter and project operations (GH-34252).

JavaScript notes

  • Added support for the Duration type (GH-21815, #37341)
  • Fixed Union null bitmaps (GH-31621, #37122)

Python notes

Compatibility notes:

  • Support for Python 3.12 was added (GH-37880)
  • Support for Cython 3 was added (GH-37742)
  • PyArrow is now compatible with numpy 2.0 (GH-37574)
  • pyarrow.compute.CumulativeSumOptions has been deprecated, use pyarrow.compute.CumulativeOptions instead (GH-36240)

New features:

  • Allowing type promotion is now possible in pyarrow.concat_tables (GH-36845)
  • Support for vector function UDF was added (GH-36672)

Other improvements:

  • The default of pre_buffer is now set to True for reading Parquet when using pyarrow.dataset directly. This can give significant speed-up on filesystems like S3 and is now aligned to pyarrow.parquet.read_table interface (GH-36765)
  • Path to timezone database can now be set through the Python API (GH-35600, GH-38145)
  • pyarrow.MapScalar.as_py can now be called with a custom field name (GH-36809)
  • FixedShapeTensorType string representation now prints the type parameters (GH-35623)

Relevant bug fixes:

  • String to date cast kernel was added to fix python scalar cast regression (GH-37411)
  • Fix conversion from Python to Arrow when chunking large nested structs (GH-32439)
  • Fix segfault when passing table as argument to pyarrow.Table.filter (GH-37650)
  • The use_threads keyword was added to the group_by method on pyarrow.Table, which gets passed through to the pyarrow.acero.Declaration.to_table call. Specifying use_threads=False makes it possible to get a stable ordering of the output (GH-36709)
  • Fix printable representation for pyarrow.TimestampScalar when values are outside datetime range (GH-36323)
  • Empty dataframes with zero chunks can now be consumed by the Dataframe Interchange Protocol implementation (GH-37050)
  • Fix dtype information for categorical columns in the Dataframe Interchange Protocol implementation (GH-38034)
  • Boolean columns with bitsize 1 are now supported in from_dataframe of the Dataframe Interchange Protocol (GH-37145)

Further, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details.

The Arrow documentation is now built with an updated Pydata Sphinx Theme which includes light/dark theme, new colors from Accessible pygments themes, version switcher dropdown, search button, etc. (GH-36590, GH-32451)

R notes

This release of the R package features a substantial refactor of the package configuration, build, and installation. This change should be transparent to most users; however, package contributors can take advantage of a substantially simplified development setup: in most cases, package contributors should be able to use a pre-built nightly version of Arrow C++ in place of a local Arrow development setup. Special thanks to Jacob Wujciak-Jens for taking on this incredible refactor!

In addition to a number of bugfixes and improvements, this release includes several new features related to CSV input/output:

  • Added support for "," or other characters as a decimal point.
  • Added write_csv_dataset() to better document CSV-specific dataset writing options.
  • Ensured that the schema argument can be specified when reading a CSV dataset with partitions.
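
To illustrate what reading numbers with a non-default decimal point involves, here is a language-neutral sketch written in Python with the standard csv module. It is not the R package API: read_numbers and its delimiter and decimal_point parameters are hypothetical names chosen for this example.

```python
import csv
import io


def read_numbers(text, delimiter=";", decimal_point=","):
    """Parse every cell as a float, accepting a custom decimal point."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        [float(cell.replace(decimal_point, ".")) for cell in row]
        for row in rows
    ]


parsed = read_numbers("1,5;2,25\n3;4,0\n")
```

When the comma is the decimal point it cannot also be the field delimiter, which is why the sample input is separated by semicolons.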

For more on what’s in the 14.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

  • Add support for prepared INSERT queries (GH-37143)
  • When a prepared statement is automatically closed upon exiting a block, use the same options as when the statement was prepared (GH-37257)

C GLib

  • Support more properties of ArrowFlight::ClientOptions (GH-37141)
  • Add support for prepared INSERT queries (GH-37143)

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.