Apache Arrow 12.0.0 Release


Published 02 May 2023
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 12.0.0 release. This covers over 3 months of development work and includes 476 resolved issues with 531 commits from 97 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 11.0.0 release, Wang Mingming, Mustafa Akur and Ruihang Xia have been invited to be committers. Will Jones has joined the Project Management Committee (PMC).

Thanks for your contributions and participation in the project!

Columnar Format Notes

A first “canonical” extension type has been formalized: arrow.fixed_shape_tensor, which represents an Array where each slot contains a tensor, with all tensors having the same dimension and shape (GH-33923). It uses a Fixed-Size List layout as its storage array (https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#fixed-shape-tensor-extension).
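
To make the storage layout concrete, here is a minimal pure-Python sketch (no Arrow library involved) of how a fixed-shape tensor array can sit on top of a Fixed-Size List: each slot owns a contiguous run of prod(shape) values in one flat child array. The helper names `tensor_slot` and `element` are hypothetical, and row-major order is assumed; the optional dimension permutation allowed by the specification is ignored here.

```python
from math import prod


def tensor_slot(values, shape, i):
    """Return the flat storage of the i-th tensor in a fixed-size-list layout."""
    list_size = prod(shape)  # every slot holds exactly this many values
    start = i * list_size
    return values[start:start + list_size]


def element(values, shape, i, index):
    """Read one element of the i-th tensor, assuming row-major order."""
    flat = 0
    for dim, idx in zip(shape, index):
        if not 0 <= idx < dim:
            raise IndexError(f"index {index} out of bounds for shape {shape}")
        flat = flat * dim + idx
    return tensor_slot(values, shape, i)[flat]


# Three 2x2 tensors stored back to back in one flat child array.
values = list(range(12))
shape = (2, 2)
second = tensor_slot(values, shape, 1)      # storage of the second tensor
corner = element(values, shape, 2, (1, 0))  # row 1, column 0 of the third tensor
```

Because every tensor has the same shape, locating a slot is plain arithmetic and no per-slot offsets are needed.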

Arrow Flight RPC notes

The JDBC driver for Arrow Flight SQL has had some bugfixes, and has been refactored into a core library (which is not distributed as an uberjar with shaded names) and a driver (which is distributed as an uberjar).

The Java server builder API now offers easier access to the underlying gRPC builder.

Go now implements the Flight SQL extensions for Substrait and transaction support.

Plasma notes

Plasma has been deprecated since 10.0.0 and is removed in this release (GH-33243).

Linux packages notes

We dropped support for Ubuntu 18.04 because it has reached EOL.

C++ notes

  • Run-End Encoded Arrays have been implemented and are accessible (GH-32104)
  • The FixedShapeTensor Logical value type has been implemented using ExtensionType (GH-15483, GH-34796)
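
For readers new to run-end encoding, the idea can be sketched in a few lines of plain Python. This is an illustration of the layout only, not the Arrow C++ API: the function names are hypothetical, and it assumes that each run is described by its value plus the cumulative (exclusive) logical index at which the run ends.

```python
from bisect import bisect_right


def run_end_encode(values):
    """Encode a list as (run_ends, run_values) with cumulative, exclusive run ends."""
    run_ends, run_values = [], []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1  # extend the current run
        else:
            run_values.append(v)  # start a new run
            run_ends.append(i + 1)
    return run_ends, run_values


def run_end_decode(run_ends, run_values):
    """Expand (run_ends, run_values) back into the full list."""
    out, start = [], 0
    for end, v in zip(run_ends, run_values):
        out.extend([v] * (end - start))
        start = end
    return out


def value_at(run_ends, run_values, i):
    """Random access: binary search for the first run whose end is greater than i."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    return run_values[bisect_right(run_ends, i)]


data = ["a", "a", "a", "b", "c", "c"]
ends, vals = run_end_encode(data)
```

Storing cumulative run ends instead of run lengths is what makes the binary search in `value_at` possible, so a lookup costs O(log n) in the number of runs.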

Compute

  • New kernel to convert timestamp with timezone to wall time (GH-33143)
  • Cast kernels are now built into libarrow by default (GH-34388)

Acero

  • Acero has been moved out of libarrow into its own shared library, allowing for smaller builds of the core libarrow (GH-15280)
  • Exec nodes now can have a concept of “ordering” and will reject non-sensible plans (GH-34136)
  • New exec nodes: “pivot_longer” (GH-34266), “order_by” (GH-34248) and “fetch” (GH-34059)
  • Breaking Change: Reorder output fields of “group_by” node so that keys/segment keys come before aggregates (GH-33616)

Substrait

  • Add support for the round function GH-33588
  • Add support for the cast expression element GH-31910
  • Added API reference documentation GH-34011
  • Added an extension relation to support segmented aggregation GH-34626
  • The output of the aggregate relation now conforms to the spec GH-34786

Parquet

  • Added support for DeltaLengthByteArray encoding to the Parquet writer (GH-33024)
  • NaNs are correctly handled now for Parquet predicate push-downs (GH-18481)
  • Added support for reading Parquet page indexes (GH-33596) and writing page indexes (GH-34053)
  • Parquet writer can write columns in parallel now (GH-33655)
  • Fixed incorrect number of rows in Parquet V2 page headers (GH-34086)
  • Fixed incorrect Parquet page null_count when stats are disabled (GH-34326)
  • Added support for reading BloomFilters to the Parquet Reader (GH-34665)
  • Parquet File-writer can now add additional key-value metadata after it has been opened (GH-34888)
  • Breaking Change: The default row group size for the Arrow writer changed from 64Mi rows to 1Mi rows. GH-34280
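
To get a feel for what the new default row group size means in practice, the following back-of-the-envelope sketch counts row groups for a given number of rows. It assumes the writer splits purely on the maximum row count and that no other limit applies; `row_group_count` is a hypothetical helper, not a Parquet API.

```python
OLD_DEFAULT_ROWS = 64 * 1024 * 1024  # 64Mi rows
NEW_DEFAULT_ROWS = 1024 * 1024       # 1Mi rows


def row_group_count(num_rows, max_rows_per_group):
    """Number of row groups needed when splitting only by row count."""
    return -(-num_rows // max_rows_per_group)  # ceiling division on integers


rows = 10_000_000
old_groups = row_group_count(rows, OLD_DEFAULT_ROWS)
new_groups = row_group_count(rows, NEW_DEFAULT_ROWS)
```

Under these assumptions a table of ten million rows goes from a single row group to ten, so code that relied on one row group per file should set the row group size explicitly.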

ORC

  • Added support for the union type in ORC writer (GH-34262)
  • Fixed ORC CHAR type mapping with Arrow (GH-34823)
  • Fixed timestamp type mapping between ORC and arrow (GH-34590)

Datasets

  • Added support for reading JSON datasets (GH-33209)
  • Dataset writer now supports specifying a function callback to construct the file name in addition to the existing file name template (GH-34565)
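
One reason to want a callback next to the template is control over how the file counter is formatted. The sketch below is plain Python and not the dataset writer API: `file_name` and `zero_padded` are hypothetical names, and it assumes the callback's return value is substituted for the `{i}` placeholder of the template.

```python
def file_name(index, template="part-{i}.parquet", callback=None):
    """Build a file name from a template, optionally formatting the counter via a callback."""
    token = callback(index) if callback is not None else str(index)
    return template.replace("{i}", token)


def zero_padded(index):
    """Left-pad the counter so that names sort in numeric order."""
    return f"{index:05d}"


plain = [file_name(i) for i in (2, 10)]
padded = [file_name(i, callback=zero_padded) for i in (2, 10)]
```

With the plain counter, "part-10.parquet" sorts before "part-2.parquet"; with the padded counter the lexicographic order matches the write order.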

Filesystems

  • GcsFileSystem::OpenInputFile avoids unnecessary downloads (GH-34051)

Other changes

  • Convenience Append(std::optional...) methods have been added to array builders (GH-14863)
  • A deprecated OpenTelemetry header was removed from the Flight library (GH-34417)
  • Fixed crash in “take” kernels on ExtensionArrays with an underlying dictionary type (GH-34619)
  • Fixed bug where the C-Data bridge did not preserve nullability of map values on import (GH-34983)
  • Added support for EqualOptions to RecordBatch::Equals (GH-34968)
  • zstd dependency upgraded to v1.5.5 (GH-34899)
  • Improved handling of “logical” nulls such as with union and RunEndEncoded arrays (GH-34361)
  • Fixed incorrect handling of uncompressed body buffers in IPC reader, added IpcWriteOptions::min_space_savings for optional compression optimizations (GH-15102)
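
The idea behind a minimum space savings threshold can be sketched as follows. This is a conceptual illustration rather than the IPC writer itself: `encode_body_buffer` is a hypothetical function, zlib stands in for the codecs Arrow actually uses, and space savings is assumed to be computed as 1 - compressed_size / uncompressed_size.

```python
import zlib


def encode_body_buffer(data, min_space_savings=None):
    """Return (is_compressed, payload) for one body buffer.

    With no threshold the buffer is always compressed; otherwise the
    compressed form is kept only if it saves at least the given fraction.
    """
    if not data:
        return False, data
    compressed = zlib.compress(data)
    if min_space_savings is not None:
        savings = 1.0 - len(compressed) / len(data)
        if savings < min_space_savings:
            return False, data  # not worth it: keep the buffer uncompressed
    return True, compressed


repetitive = b"a" * 1000          # compresses very well
incompressible = bytes(range(256))  # 256 distinct bytes, nothing to exploit

kept_compressed, payload = encode_body_buffer(repetitive, min_space_savings=0.5)
kept_raw, raw_payload = encode_body_buffer(incompressible, min_space_savings=0.1)
```

A reader then has to accept streams in which compression is enabled but individual buffers are stored uncompressed, which is the case the reader fix above addresses.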

C# notes

  • Support added for importing / exporting schemas and types via the C data interface GH-34737
  • Support added for the half-float data type GH-25163
  • Schemas are now allowed to have multiple fields with the same name GH-34076
  • Added support for reading compressed IPC files GH-32240
  • Add [] operator to Schema GH-34119

Go notes

  • Run-End Encoded Arrays have been added to the Golang implementation (GH-32104, GH-32946, GH-20407, GH-32949)
  • The SQLite Flight SQL example has been improved, and a simple SQLite Flight SQL server program can now be installed using go get github.com/apache/arrow/go/v12/arrow/flight/flightsql/example/cmd/sqlite_flightsql_server (GH-33840)
  • Fixed bug causing builds to fail with the noasm build tag (GH-34044) and added a CI test run that uses the noasm tag (GH-34055)
  • Fixed issue to allow building on riscv64-freebsd (GH-34629)
  • Fixed issue preventing building on 32-bit platforms (GH-35133)

Arrow

  • Fixed bug in C-Data handling of ArrowArrayStream.get_next when handling uninitialized ArrowArrays (GH-33767)
  • Breaking Change: Added Err() method to RecordReader interface so that it can propagate errors (GH-33789)
  • Fixed potential panic in C-Data API for misusing an invalid handle (GH-33864)
  • A new cgo-based Allocator that does not depend on libarrow has been added to the memory package (GH-33901)
  • CSV Reader and Writer now support Extension type arrays (GH-34334)
  • Fixed bug preventing the reading of IPC streams/files with compression enabled but uncompressed buffers (GH-34385)
  • Added interface which can be added to an ExtensionType to allow Builders to be created via NewBuilder, enabling easy building of nested fields containing extension types (GH-34453)
  • Added utilities to perform Array diffing (GH-34790)
  • Added SetColumn method to arrow.Record (GH-34832)
  • Added GetValue method to arrow.Metadata (GH-34855)
  • Added Pow method to decimal128.Num and decimal256.Num (GH-34863)

Flight

  • Fixed bug in StreamChunks for Flight SQL to correctly propagate to the gRPC client (GH-33717)
  • Fixed issue that prevented compatibility with gRPC < v1.45 (GH-33734)
  • Added support to bind a RecordReader for supplying parameters to a Flight SQL Prepared statement (GH-33794)
  • Prepared Statement methods for FlightSQL client now allow gRPC Call Options (GH-33867)
  • FlightSQL Extensions have been implemented (for transactions and Substrait support) (GH-33935)
  • A driver compatible with database/sql for FlightSQL has been added (GH-34332)

Compute

  • “run_end_encode” and “run_end_decode” functions added to compute functions (GH-20408)
  • “unique” function added (GH-34171)

Parquet

  • pqarrow pkg now handles DICTIONARY fields natively (GH-33466)
  • Fixed rare panic when writing list of 8 structs (GH-33600)
  • Added support for pqarrow pkg to write LargeString and LargeBinary types (GH-33875)
  • Fixed bug where pqarrow.NewSchemaManifest created the wrong field type for Array Object fields (GH-34101)
  • Added support to Parquet Writer for Extension type Arrays (GH-34330)

Java notes

  • Update Java JNI modules to consider Arrow ACERO GH-34862
  • Ability to register additional GRPC services with FlightServer GH-34778
  • Allow sending custom headers/properties through Arrow Flight SQL JDBC GH-33874

JavaScript notes

No changes.

Python notes

Compatibility notes:

  • Plasma has been removed in this release (GH-33243). In addition, the deprecated serialization module in PyArrow was also removed (GH-29705). IPC (Inter-Process Communication) functionality of pyarrow or the standard library pickle should be used instead.
  • The deprecated use_async keyword has been removed from the dataset module (GH-30774)
  • Minimum Cython version to build PyArrow from source has been raised to 0.29.31 (GH-34933). In addition, PyArrow can now be compiled using Cython 3 (GH-34564).

New features:

  • A new pyarrow.acero module with initial bindings for the Acero execution engine has been added (GH-33976)
  • A new canonical extension type for fixed shape tensor data has been defined. This is exposed in PyArrow as the FixedShapeTensorType (GH-34882, GH-34956)
  • Run-End Encoded arrays binding has been implemented (GH-34686, GH-34568)
  • Method is_nan has been added to Array, ChunkedArray and Expression (GH-34154)
  • Dataframe interchange protocol has been implemented for RecordBatch (GH-33926)

Other improvements:

  • Extension arrays can now be concatenated (GH-31868)
  • get_partition_keys helper function is implemented in the dataset module to access the partitioning field’s key/value from the partition expression of a certain dataset fragment (GH-33825)
  • PyArrow Array objects can now be accepted by the pa.array() constructor (GH-34411)
  • The default row group size when writing Parquet files has been changed from 64Mi rows to 1Mi rows (GH-34280)
  • RecordBatch has the select() method implemented (GH-34359)
  • New method drop_column on the pyarrow.Table supports passing a single column as a string (GH-33377)
  • User-defined tabular functions, which are user-defined functions implemented in Python that return a stateful stream of tabular data, are now also supported (GH-32916)
  • Arrow Archery tool now includes linting of the Cython files (GH-31905)
  • Breaking Change: Reorder output fields of “group_by” node so that keys/segment keys come before aggregates (GH-33616)

Relevant bug fixes:

  • Acero can now detect and raise an error in case a join operation needs too many bytes of key data (GH-34474)
  • Fix for converting non-sequence object in pa.array() (GH-34944)
  • Fix erroneous table conversion to pandas if table includes an extension array that does not implement to_pandas_dtype (GH-34906)
  • Reading from a closed ArrayStreamBatchReader now returns invalid status instead of segfaulting (GH-34165)
  • array() now returns a pyarrow.Array rather than a pyarrow.ChunkedArray for columns with an __arrow_array__ method and only one chunk, so that converting a pandas DataFrame with a categorical column of dtype string[pyarrow] no longer fails (GH-33727)
  • Custom type mapper in to_pandas now converts index dtypes together with column dtypes (GH-34283)

R notes

  • The read_parquet() and read_feather() functions can now accept URL arguments.
  • The json_credentials argument in GcsFileSystem$create() now accepts a file path containing the appropriate authentication token.
  • The $options member of GcsFileSystem objects can now be inspected.
  • The read_csv_arrow() and read_json_arrow() functions now accept literal text input wrapped in I() to improve compatibility with readr::read_csv().
  • Nested fields can now be accessed using $ and [[ in dplyr expressions.

For more on what’s in the 12.0.0 R package, see the R changelog.

Ruby and C GLib notes

  • Flight SQL: Added support for authentication. GH-34074
  • Compute: Added GArrowRankOptions. GH-34425
  • Compute: Added GArrowFilterOptions. GH-34650
  • Compute: Added GArrowIndexOptions. GH-15286
  • Compute: Added GArrowMatchSubstringOptions. GH-15285
  • Added GArrowDenseUnionArrayBuilder. GH-21429
  • Added GArrowSparseUnionArrayBuilder. GH-21430

Ruby notes

  • Improved Arrow::Table#join API. GH-15287
  • Flight SQL: Added ArrowFlight::RecordBatchReader#each. GH-15287
  • Added Arrow::DenseUnionArray#get_value. GH-34837
  • Added Arrow::SparseUnionArray#get_value. GH-34837
  • Expression: Added support for table.slice {|slicer| slicer.column.match_substring(string)} and related shortcuts. GH-34819 GH-34951
  • Breaking change: Arrow::Table#slice with a filter removes null records. GH-34953

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.