Apache Arrow 11.0.0 Release


Published 25 Jan 2023
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 11.0.0 release. This covers over 3 months of development work and includes 423 resolved issues from 95 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, Jie Wen and Brent Gardner have been invited to be committers. Kun Liu have joined the Project Management Committee (PMC).

As per our newly started tradition of rotating the PMC chair once a year Andrew Lamb was elected as the new PMC chair and VP.

Thanks for your contributions and participation in the project!

Columnar Format Notes

Arrow Flight RPC notes

In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. (#15069)

C++ notes

  • It is now possible to specify alignment when making allocations with a MemoryPool GH-33056
  • It is now possible to run an ExecPlan without using any CPU threads
  • Added kernel for slicing list values GH-33168
  • Added kernel for slicing binary arrays GH-20357
  • When comparing list arrays for equality the list field name is now ignored GH-30519
  • Add support for partitioning on columns that contain special characters GH-33448
  • Added a streaming reader for JSON GH-33140
  • Added support for incremental writes to the ORC writer GH-33047
  • Added support for casting decimal to string and writing decimal to CSV GH-33002
  • Fixed an assert in the scanner that would occur when batch_readahead was set to 0 GH-15264
  • Fixed bug where arrays with a null data buffer would not be accepted when imported via the C data API GH-14875
  • Fixed bug where arrays with a zero-case union data type would not be accepted when imported via the C data API GH-14855
  • Fixed bug where case_when could return incorrect values GH-33382
  • Fixed bug where RecordBatch::Equals was ignoring field names GH-33285

    C# notes

No major changes to C#.

Go notes

  • Go’s benchmarks will now get added to Conbench alongside the benchmarks for other implementations GH-32983
  • Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server GH-15174

Arrow

  • Function ApproxEquals was implemented for scalar values GH-29581
  • UnmarshalJSON for the RecordBuilder now properly handles extra unknown fields with complex/nested values GH-31840
  • Decimal128 and Decimal256 type support has been added to the CSV reader GH-33111
  • Fixed bug in array.UnionBuilder where Len method always returned 0 GH-14775
  • Fixed bug for handling slices of Map arrays when marshalling to JSON and for IPC GH-14780
  • Fixed memory leak when compressing IPC message body buffers GH-14883
  • Added the ability to easily append scalar values to array builders GH-15005

Compute

  • Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package GH-33086 this includes easy functions like compute.Add and compute.Divide etc.
  • Scalar boolean functions like AND/OR/XOR/etc. have been implemented for compute GH-33279
  • Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) GH-33308
  • Scalar compute functions are compatible with dictionary encoded arrays by casting them to their value types GH-33502

Parquet

  • Panic when decoding a delta_bit_packed encoded column has been fixed GH-33483
  • Fixed memory leak from Allocator in pqarrow.WriteArrowToColumn GH-14865
  • Fixed writer.WriteBatch to properly handle writing encrypted parquet columns and no longer silently fail, but instead propagate an error GH-14940

Java notes

  • Implement support for writing compressed files (#15223)
  • Improve performance by short-circuiting null checks when comparing non null field types (#15106)
  • Several enhancements to dictionary encoding (#14891, (#14902, (#14874)
  • Extend Table to support additional vector types (#14573)
  • Enhance and simplify handling of allocation management by integrating C Data into allocator hierarchy (#14506)
  • Make ComplexCopier agnostic of specific implementation of MapWriter (#14557)
  • Distribute Apple M1 compatible JNI libraries via mavencentral (#14472)
  • Extend Table copy functionality, and support returning copies of individual vectors (#14389)

JavaScript notes

  • Bugfixes and dependency updates.
  • Arrow now requires BigInt support. GH-33681

Python notes

Compatibility notes:

  • PyArrow now requires pandas >= 1.0 (ARROW-18173)
  • The pyarrow.parquet.ParquetDataset() class now by default uses the new Dataset API under the hood (use_legacy_dataset=False). You can still pass use_legacy_dataset=True to get the legacy implementation, but this option will be removed in a next release (ARROW-16728).

New features:

  • Added support for the DataFrame Interchange Protocol for pyarrow.Table (GH-33346).
  • New kernels: list_slice() to slice each list element of a ListArray returning a new ListArray (ARROW-17960).
  • A new filter() method on the Dataset class as additional API to filter a Dataset before consuming it (ARROW-16616).
  • New sort() method for (Chunked)Array and sort_by() method for RecordBatch, providing a convenience on top of the sort_indices kernel (GH-14778), and a new Dataset.sort_by() method (GH-14975).

Other improvements:

  • Support for custom metadata of record batches in the IPC read and write APIs (ARROW-16430).
  • Support URIs and the filesystem parameter in pyarrow.parquet.ParquetFile (ARROW-18272) and pyarrow.parquet.write_metadata (ARROW-18225).
  • When writing a dataset to IPC using pyarrow.dataset.write_dataset(), you can now specify IPC specific options, such as compression (ARROW-17991)
  • The pyarrow.array() function now allows to construct a MapArray from a sequence of dicts (in addition to a sequence of tuples) (ARROW-17832).
  • The struct_field() kernel now also accepts field names in addition to integer indices (ARROW-17989).
  • Casting to string is now supported for duration (ARROW-15822) and decimal (ARROW-17458) types, which also means those can now be written to CSV.
  • When writing to CSV, you can now specify the quoting style (GH-14755).
  • The pyarrow.ipc.read_schema() function now accepts a Message object (ARROW-18423).
  • The Time32Scalar, Time64Scalar, Date32Scalar and Date64Scalar classes got a .value attribute to access the underlying integer value, similar to the other date-time related scalars (ARROW-18264)
  • Duration type is now supported in the hash kernels like dictionary_encode (GH-15226).
  • Fix silent overflow when converting datetime.timedelta to duration type (ARROW-15026).

Relevant bug fixes:

  • Numpy conversion for ListArray is improved taking into account sliced offset, avoiding increased memory usage (GH-20512
  • Fix writing files with multi-byte characters in file name (ARROW-18123).

R notes

  • map_batches() is lazy by default; it now returns a RecordBatchReader instead of a list of RecordBatch objects unless lazy = FALSE. GH-14521
  • A substantial reorganisation, rewrite of and addition to, many of the vignettes and README. GH-14514

For more on what’s in the 11.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

  • Arrow::Table#save now always returns self instead of the result of its raw_recordsGH-15289
  • Improve the GC-related crash prevention system by guarding the shared objects from GC ARROW-18161
  • Add Arrow::HalfFloat and raw_records support in Arrow::HalfFloatArray ARROW-18086
  • Support omitting join keys in Table#join GH-15084
  • Add support for Arrow::Table.load(uri, schema:) ARROW-15206
  • Add Arrow::ColumnContainable#column_names (e.g. Arrow::Table#column_names) GH-15085
  • Add to_arrow_chunked_array methods to support converting to Arrow::ChunkedArray ARROW-18405

C GLib

  • Add garrow_chunked_array_new_empty() GH-33671
  • Add GArrowProjectNodeOptions GH-33670
  • Add GADatasetHivePartitioning GH-15257
  • The signature of garrow_execute_plain_wait() was changed to take the error argument and to return the finished status GH-15254
  • Add support for half float GH-15168
  • Add GADatasetFinishOptions GH-15146

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.