Apache Arrow 12.0.0 Release
Published
02 May 2023
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 12.0.0 release. This covers over 3 months of development work and includes 476 resolved issues with 531 commits from 97 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 11.0.0 release, Wang Mingming, Mustafa Akur and Ruihang Xia have been invited to be committers. Will Jones have joined the Project Management Committee (PMC).
Thanks for your contributions and participation in the project!
Columnar Format Notes
A first “canonical” extension type has been formalized: arrow.fixed_shape_tensor
to
represent an Array where each slot contains a tensor, with all tensors having the same
dimension and shape, GH-33923.
This is based on a Fixed-Size List layout as storage array
(https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#fixed-shape-tensor-extension).
Arrow Flight RPC notes
The JDBC driver for Arrow Flight SQL has had some bugfixes, and has been refactored into a core library (which is not distributed as an uberjar with shaded names) and a driver (which is distributed as an uberjar).
The Java server builder API now offers easier access to the underlying gRPC builder.
Go now implements the Flight SQL extensions for Substrait and transaction support.
Plasma notes
Plasma was deprecated since 10.0.0. Plasma is removed in this release. GH-33243
Linux packages notes
We dropped support for Ubuntu 18.04 because Ubuntu 18.04 reached EOL.
C++ notes
- Run-End Encoded Arrays have been implemented and are accessible (GH-32104)
- The FixedShapeTensor Logical value type has been implemented using ExtensionType (GH-15483, GH-34796)
Compute
- New kernel to convert timestamp with timezone to wall time (GH-33143)
- Cast kernels are now built into libarrow by default (GH-34388)
Acero
- Acero has been moved out of libarrow into it’s own shared library, allowing for smaller builds of the core libarrow (GH-15280)
- Exec nodes now can have a concept of “ordering” and will reject non-sensible plans (GH-34136)
- New exec nodes: “pivot_longer” (GH-34266), “order_by” (GH-34248) and “fetch” (GH-34059)
- Breaking Change: Reorder output fields of “group_by” node so that keys/segment keys come before aggregates (GH-33616)
Substrait
- Add support for the
round
function GH-33588 - Add support for the
cast
expression element GH-31910 - Added API reference documentation GH-34011
- Added an extension relation to support segmented aggregation GH-34626
- The output of the aggregate relation now conforms to the spec GH-34786
Parquet
- Added support for DeltaLengthByteArray encoding to the Parquet writer (GH-33024)
- NaNs are correctly handled now for Parquet predicate push-downs (GH-18481)
- Added support for reading Parquet page indexes (GH-33596) and writing page indexes (GH-34053)
- Parquet writer can write columns in parallel now (GH-33655)
- Fixed incorrect number of rows in Parquet V2 page headers (GH-34086)
- Fixed incorrect Parquet page null_count when stats are disabled (GH-34326)
- Added support for reading BloomFilters to the Parquet Reader (GH-34665)
- Parquet File-writer can now add additional key-value metadata after it has been opened (GH-34888)
- Breaking Change: The default row group size for the Arrow writer changed from 64Mi rows to 1Mi rows. GH-34280
ORC
- Added support for the union type in ORC writer (GH-34262)
- Fixed ORC CHAR type mapping with Arrow (GH-34823)
- Fixed timestamp type mapping between ORC and arrow (GH-34590)
Datasets
- Added support for reading JSON datasets (GH-33209)
- Dataset writer now supports specifying a function callback to construct the file name in addition to the existing file name template (GH-34565)
Filesystems
- GcsFileSystem::OpenInputFile avoids unnecessary downloads (GH-34051)
Other changes
- Convenience Append(std::optional
...) methods have been added to array builders ([GH-14863](https://github.com/apache/arrow/issues/14863)) - A deprecated OpenTelemetry header was removed from the Flight library (GH-34417)
- Fixed crash in “take” kernels on ExtensionArrays with an underlying dictionary type (GH-34619)
- Fixed bug where the C-Data bridge did not preserve nullability of map values on import (GH-34983)
- Added support for EqualOptions to RecordBatch::Equals (GH-34968)
- zstd dependency upgraded to v1.5.5 (GH-34899)
- Improved handling of “logical” nulls such as with union and RunEndEncoded arrays (GH-34361)
- Fixed incorrect handling of uncompressed body buffers in IPC reader, added IpcWriteOptions::min_space_savings for optional compression optimizations (GH-15102)
C# notes
- Support added for importing / exporting schemas and types via the C data interface GH-34737
- Support added for the half-float data type GH-25163
- Schemas are now allowed to have multiple fields with the same name GH-34076
- Added support for reading compressed IPC files GH-32240
- Add [] operator to Schema GH-34119
Go notes
- Run-End Encoded Arrays have been added to the Golang implementation (GH-32104, GH-32946, GH-20407, GH-32949)
- The SQLite Flight SQL Example has been improved and you can now
go get
a simple SQLite Flight SQL Server mainprog usinggo get github.com/apache/arrow/go/v12/arrow/flight/flightsql/example/cmd/sqlite_flightsql_server
(GH-33840) - Fixed bug causing builds to fail with the
noasm
build tag (GH-34044) and added a CI test run that uses thenoasm
tag (GH-34055) - Fixed issue allowing building on riscv64-freebsd (GH-34629)
- Fixed issue preventing building on 32-bit platforms (GH-35133)
Arrow
- Fixed bug in C-Data handling of
ArrowArrayStream.get_next
when handling uninitializedArrowArrays
(GH-33767) - Breaking Change: Added
Err()
method toRecordReader
interface so that it can propagate errors (GH-33789) - Fixed potential panic in C-Data API for misusing an invalid handle (GH-33864)
- A new cgo-based Allocator that does not depend on libarrow has been added to the memory package (GH-33901)
- CSV Reader and Writer now support Extension type arrays (GH-34334)
- Fixed bug preventing the reading of IPC streams/files with compression enabled but uncompressed buffers (GH-34385)
- Added interface which can be added to an
ExtensionType
to allow Builders to be created viaNewBuilder
, enabling easy building of nested fields containing extension types (GH-34453) - Added utilities to perform Array diffing (GH-34790)
- Added
SetColumn
method toarrow.Record
(GH-34832) - Added
GetValue
method toarrow.Metadata
(GH-34855) - Added
Pow
method todecimal128.Num
anddecimal256.Num
(GH-34863)
Flight
- Fixed bug in
StreamChunks
for Flight SQL to correctly propagate to the gRPC client (GH-33717) - Fixed issue that prevented compatibility with gRPC < v1.45 (GH-33734)
- Added support to bind a
RecordReader
for supplying parameters to a Flight SQL Prepared statement (GH-33794) - Prepared Statement methods for FlightSQL client now allows gRPC Call Options (GH-33867)
- FlightSQL Extensions have been implemented (for transactions and Substrait support) (GH-33935)
- A driver compatible with
database/sql
for FlightSQL has been added (GH-34332)
Compute
- “run_end_encode” and “run_end_decode” functions added to compute functions (GH-20408)
- “unique” function added (GH-34171)
Parquet
pqarrow
pkg now handles DICTIONARY fields natively (GH-33466)- Fixed rare panic when writing list of 8 structs (GH-33600)
- Added support for
pqarrow
pkg to write LargeString and LargeBinary types (GH-33875) - Fixed bug where
pqarrow.NewSchemaManifest
created the wrong field type for Array Object fields (GH-34101) - Added support to Parquet Writer for Extension type Arrays (GH-34330)
Java notes
- Update Java JNI modules to consider Arrow ACERO GH-34862
- Ability to register additional GRPC services with FlightServer GH-34778
- Allow sending custom headers/properties through Arrow Flight SQL JDBC GH-33874
JavaScript notes
No changes.
Python notes
Compatibility notes:
- Plasma has been removed in this release (GH-33243). In addition, the deprecated serialization module in PyArrow was also removed (GH-29705). IPC (Inter-Process Communication) functionality of pyarrow or the standard library pickle should be used instead.
- The deprecated
use_async
keyword has been removed from the dataset module (GH-30774) - Minimum Cython version to build PyArrow from source has been raised to 0.29.31 (GH-34933). In addition, PyArrow can now be compiled using Cython 3 (GH-34564).
New features:
- A new
pyarrow.acero
module with initial bindings for the Acero execution engine has been added (GH-33976) - A new canonical extension type for fixed shaped tensor data has been defined. This is exposed in PyArrow as the FixedShapeTensorType (GH-34882, GH-34956)
- Run-End Encoded arrays binding has been implemented (GH-34686, GH-34568)
- Method
is_nan
has been added to Array, ChunkedArray and Expression (GH-34154) - Dataframe interchange protocol has been implemented for RecordBatch (GH-33926)
Other improvements:
- Extension arrays can now be concatenated (GH-31868)
get_partition_keys
helper function is implemented in thedataset
module to access the partitioning field’s key/value from the partition expression of a certain dataset fragment (GH-33825)- PyArrow Array objects can now be accepted by the
pa.array()
constructor (GH-34411) - The default row group size when writing parquet files has been changed (GH-34280)
- RecordBatch has the
select()
method implemented (GH-34359) - New method
drop_column
on thepyarrow.Table
supports passing a single column as a string (GH-33377) - User-defined tabular functions, which are a user-functions implemented in Python that return a stateful stream of tabular data, are now also supported (GH-32916)
- Arrow Archery tool now includes linting of the Cython files (GH-31905)
- Breaking Change: Reorder output fields of “group_by” node so that keys/segment keys come before aggregates (GH-33616)
Relevant bug fixes:
- Acero can now detect and raise an error in case a join operation needs too much bytes of key data (GH-34474)
- Fix for converting non-sequence object in
pa.array()
(GH-34944) - Fix erroneous table conversion to pandas if table includes an extension array that does not implement
to_pandas_dtype
(GH-34906) - Reading from a closed
ArrayStreamBatchReader
now returns invalid status instead of segfaulting (GH-34165) array()
now returnspyarrow.Array
and notpyarrow.ChunkedArray
for columns with__arrow_array__
method and only one chunk so that the conversion of pandas dataframe with categorical column of dtypestring[pyarrow]
does not fail (GH-33727)- Custom type mapper in
to_pandas
now converts index dtypes together with column dtypes (GH-34283)
R notes
- The
read_parquet()
andread_feather()
functions can now accept URL arguments. - The
json_credentials
argument inGcsFileSystem$create()
now accepts a file path containing the appropriate authentication token. - The
$options
member ofGcsFileSystem
objects can now be inspected. - The
read_csv_arrow()
andread_json_arrow()
functions now accept literal text input wrapped inI()
to improve compatability withreadr::read_csv()
. - Nested fields can now be accessed using
$
and[[
in dplyr expressions.
For more on what’s in the 12.0.0 R package, see the R changelog.
Ruby and C GLib notes
- Flight SQL: Added support for authentication. GH-34074
- Compute: Added
GArrowRankOptions
. GH-34425 - Compute: Added
GArrowFilterOptions
. GH-34650 - Compute: Added
GArrowIndexOptions
. GH-15286 - Compute: Added
GArrowMatchSubstringOptions
. GH-15285 - Added
GArrowDenseUnionArrayBuilder
. GH-21429 - Added
GArrowSparseUnionArrayBuilder
. GH-21430
Ruby notes
- Improved
Arrow::Table#join
API. GH-15287 - Flight SQL: Added
ArrowFlight::RecordBatchReader#each
. GH-15287 - Added
Arrow::DenseUnionArray#get_value
. GH-34837 - Added
Arrow::SparseUnionArray#get_value
. GH-34837 - Expression: Added support for
table.slice {|slicer| slicer.column.match_substring(string)
and related shortcuts. GH-34819 GH-34951 - Breaking change:
Arrow::Table#slice
with a filter removes null records. GH-34953
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.