Apache Arrow 11.0.0 Release

Published 25 Jan 2023
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 11.0.0 release. This covers over 3 months of development work and includes 423 resolved issues from 95 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, Jie Wen and Brent Gardner have been invited to be committers. Kun Liu has joined the Project Management Committee (PMC).

As per our newly started tradition of rotating the PMC chair once a year, Andrew Lamb was elected as the new PMC chair and VP.

Thanks for your contributions and participation in the project!

Columnar Format Notes

Arrow Flight RPC notes

In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. (#15069)

C++ notes

It is now possible to specify alignment when making allocations with a MemoryPool GH-33056
It is now possible to run an ExecPlan without using any CPU threads
Added kernel for slicing list values GH-33168
Added kernel for slicing binary arrays GH-20357
When comparing list arrays for equality the list field name is now ignored GH-30519
Add support for partitioning on columns that contain special characters GH-33448
Added a streaming reader for JSON GH-33140
Added support for incremental writes to the ORC writer GH-33047
Added support for casting decimal to string and writing decimal to CSV GH-33002
Fixed an assert in the scanner that would occur when batch_readahead was set to 0 GH-15264
Fixed bug where arrays with a null data buffer would not be accepted when imported via the C data API GH-14875
Fixed bug where arrays with a zero-case union data type would not be accepted when imported via the C data API GH-14855
Fixed bug where case_when could return incorrect values GH-33382
Fixed bug where RecordBatch::Equals was ignoring field names GH-33285

C# notes

No major changes to C#.

Go notes

Go’s benchmarks will now get added to Conbench alongside the benchmarks for other implementations GH-32983
Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server GH-15174

Arrow

Function ApproxEquals was implemented for scalar values GH-29581
UnmarshalJSON for the RecordBuilder now properly handles extra unknown fields with complex/nested values GH-31840
Decimal128 and Decimal256 type support has been added to the CSV reader GH-33111
Fixed bug in array.UnionBuilder where Len method always returned 0 GH-14775
Fixed bug for handling slices of Map arrays when marshalling to JSON and for IPC GH-14780
Fixed memory leak when compressing IPC message body buffers GH-14883
Added the ability to easily append scalar values to array builders GH-15005

Compute

Scalar binary (add/subtract/multiply/divide/etc.) and unary (abs/neg/sqrt/sign/etc.) arithmetic has been implemented for the compute package, including convenience functions such as compute.Add and compute.Divide GH-33086
Scalar boolean functions like AND/OR/XOR/etc. have been implemented for compute GH-33279
Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) GH-33308
Scalar compute functions are compatible with dictionary encoded arrays by casting them to their value types GH-33502

Parquet

Panic when decoding a delta_bit_packed encoded column has been fixed GH-33483
Fixed memory leak from Allocator in pqarrow.WriteArrowToColumn GH-14865
Fixed writer.WriteBatch to properly handle writing encrypted parquet columns and no longer silently fail, but instead propagate an error GH-14940

Java notes

Implement support for writing compressed files (#15223)
Improve performance by short-circuiting null checks when comparing non-null field types (#15106)
Several enhancements to dictionary encoding (#14891, #14902, #14874)
Extend Table to support additional vector types (#14573)
Enhance and simplify handling of allocation management by integrating C Data into allocator hierarchy (#14506)
Make ComplexCopier agnostic of specific implementation of MapWriter (#14557)
Distribute Apple M1 compatible JNI libraries via Maven Central (#14472)
Extend Table copy functionality, and support returning copies of individual vectors (#14389)

JavaScript notes

Bugfixes and dependency updates.
Arrow now requires BigInt support. GH-33681

Python notes

Compatibility notes:

PyArrow now requires pandas >= 1.0 (ARROW-18173)
The pyarrow.parquet.ParquetDataset() class now uses the new Dataset API under the hood by default (use_legacy_dataset=False). You can still pass use_legacy_dataset=True to get the legacy implementation, but this option will be removed in a future release (ARROW-16728).

New features:

Added support for the DataFrame Interchange Protocol for pyarrow.Table (GH-33346).
New kernels: list_slice() to slice each list element of a ListArray returning a new ListArray (ARROW-17960).
A new filter() method on the Dataset class as additional API to filter a Dataset before consuming it (ARROW-16616).
New sort() method for (Chunked)Array and sort_by() method for RecordBatch, providing a convenience on top of the sort_indices kernel (GH-14778), and a new Dataset.sort_by() method (GH-14975).

Other improvements:

Support for custom metadata of record batches in the IPC read and write APIs (ARROW-16430).
Support URIs and the filesystem parameter in pyarrow.parquet.ParquetFile (ARROW-18272) and pyarrow.parquet.write_metadata (ARROW-18225).
When writing a dataset to IPC using pyarrow.dataset.write_dataset(), you can now specify IPC-specific options, such as compression (ARROW-17991).
The pyarrow.array() function now allows constructing a MapArray from a sequence of dicts (in addition to a sequence of tuples) (ARROW-17832).
The struct_field() kernel now also accepts field names in addition to integer indices (ARROW-17989).
Casting to string is now supported for duration (ARROW-15822) and decimal (ARROW-17458) types, which also means those can now be written to CSV.
When writing to CSV, you can now specify the quoting style (GH-14755).
The pyarrow.ipc.read_schema() function now accepts a Message object (ARROW-18423).
The Time32Scalar, Time64Scalar, Date32Scalar and Date64Scalar classes got a .value attribute to access the underlying integer value, similar to the other date-time related scalars (ARROW-18264).
Duration type is now supported in the hash kernels like dictionary_encode (GH-15226).
Fix silent overflow when converting datetime.timedelta to duration type (ARROW-15026).

Relevant bug fixes:

NumPy conversion for ListArray is improved by taking the sliced offset into account, avoiding increased memory usage (GH-20512).
Fix writing files with multi-byte characters in file name (ARROW-18123).

R notes

map_batches() is lazy by default; it now returns a RecordBatchReader instead of a list of RecordBatch objects unless lazy = FALSE. GH-14521
Many of the vignettes and the README have been substantially reorganised, rewritten, and extended. GH-14514

For more on what’s in the 11.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

Arrow::Table#save now always returns self instead of the result of its raw_records GH-15289
Improve the GC-related crash prevention system by guarding the shared objects from GC ARROW-18161
Add Arrow::HalfFloat and raw_records support in Arrow::HalfFloatArray ARROW-18086
Support omitting join keys in Table#join GH-15084
Add support for Arrow::Table.load(uri, schema:) ARROW-15206
Add Arrow::ColumnContainable#column_names (e.g. Arrow::Table#column_names) GH-15085
Add to_arrow_chunked_array methods to support converting to Arrow::ChunkedArray ARROW-18405

C GLib

Add garrow_chunked_array_new_empty() GH-33671
Add GArrowProjectNodeOptions GH-33670
Add GADatasetHivePartitioning GH-15257
The signature of garrow_execute_plan_wait() was changed to take the error argument and to return the finished status GH-15254
Add support for half float GH-15168
Add GADatasetFinishOptions GH-15146

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.