Apache Arrow 11.0.0 Release
25 Jan 2023
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 11.0.0 release. This covers over 3 months of development work and includes 423 resolved issues from 95 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, Jie Wen and Brent Gardner have been invited to be committers. Kun Liu have joined the Project Management Committee (PMC).
As per our newly started tradition of rotating the PMC chair once a year Andrew Lamb was elected as the new PMC chair and VP.
Thanks for your contributions and participation in the project!
Columnar Format Notes
Arrow Flight RPC notes
In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. (#15069)
- It is now possible to specify alignment when making allocations with a MemoryPool GH-33056
- It is now possible to run an ExecPlan without using any CPU threads
- Added kernel for slicing list values GH-33168
- Added kernel for slicing binary arrays GH-20357
- When comparing list arrays for equality the list field name is now ignored GH-30519
- Add support for partitioning on columns that contain special characters GH-33448
- Added a streaming reader for JSON GH-33140
- Added support for incremental writes to the ORC writer GH-33047
- Added support for casting decimal to string and writing decimal to CSV GH-33002
- Fixed an assert in the scanner that would occur when batch_readahead was set to 0 GH-15264
- Fixed bug where arrays with a null data buffer would not be accepted when imported via the C data API GH-14875
- Fixed bug where arrays with a zero-case union data type would not be accepted when imported via the C data API GH-14855
- Fixed bug where case_when could return incorrect values GH-33382
- Fixed bug where RecordBatch::Equals was ignoring field names GH-33285
No major changes to C#.
- Go’s benchmarks will now get added to Conbench alongside the benchmarks for other implementations GH-32983
- Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server GH-15174
ApproxEqualswas implemented for scalar values GH-29581
RecordBuildernow properly handles extra unknown fields with complex/nested values GH-31840
- Decimal128 and Decimal256 type support has been added to the CSV reader GH-33111
- Fixed bug in
Lenmethod always returned 0 GH-14775
- Fixed bug for handling slices of Map arrays when marshalling to JSON and for IPC GH-14780
- Fixed memory leak when compressing IPC message body buffers GH-14883
- Added the ability to easily append scalar values to array builders GH-15005
- Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package GH-33086 this includes easy functions like
- Scalar boolean functions like AND/OR/XOR/etc. have been implemented for compute GH-33279
- Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) GH-33308
- Scalar compute functions are compatible with dictionary encoded arrays by casting them to their value types GH-33502
- Panic when decoding a delta_bit_packed encoded column has been fixed GH-33483
- Fixed memory leak from Allocator in
writer.WriteBatchto properly handle writing encrypted parquet columns and no longer silently fail, but instead propagate an error GH-14940
- Implement support for writing compressed files (#15223)
- Improve performance by short-circuiting null checks when comparing non null field types (#15106)
- Several enhancements to dictionary encoding (#14891, (#14902, (#14874)
- Extend Table to support additional vector types (#14573)
- Enhance and simplify handling of allocation management by integrating C Data into allocator hierarchy (#14506)
- Make ComplexCopier agnostic of specific implementation of MapWriter (#14557)
- Distribute Apple M1 compatible JNI libraries via mavencentral (#14472)
- Extend Table copy functionality, and support returning copies of individual vectors (#14389)
- Bugfixes and dependency updates.
- Arrow now requires BigInt support. GH-33681
- PyArrow now requires pandas >= 1.0 (ARROW-18173)
pyarrow.parquet.ParquetDataset()class now by default uses the new Dataset API under the hood (
use_legacy_dataset=False). You can still pass
use_legacy_dataset=Trueto get the legacy implementation, but this option will be removed in a next release (ARROW-16728).
- Added support for the DataFrame Interchange Protocol
- New kernels:
list_slice()to slice each list element of a ListArray returning a new ListArray (ARROW-17960).
- A new
filter()method on the Dataset class as additional API to filter a Dataset before consuming it (ARROW-16616).
sort()method for (Chunked)Array and
sort_by()method for RecordBatch, providing a convenience on top of the
sort_indiceskernel (GH-14778), and a new
- Support for custom metadata of record batches in the IPC read and write APIs (ARROW-16430).
- Support URIs and the
- When writing a dataset to IPC using
pyarrow.dataset.write_dataset(), you can now specify IPC specific options, such as compression (ARROW-17991)
pyarrow.array()function now allows to construct a MapArray from a sequence of dicts (in addition to a sequence of tuples) (ARROW-17832).
struct_field()kernel now also accepts field names in addition to integer indices (ARROW-17989).
- Casting to string is now supported for duration (ARROW-15822) and decimal (ARROW-17458) types, which also means those can now be written to CSV.
- When writing to CSV, you can now specify the quoting style (GH-14755).
pyarrow.ipc.read_schema()function now accepts a Message object (ARROW-18423).
- The Time32Scalar, Time64Scalar, Date32Scalar and Date64Scalar classes got a
.valueattribute to access the underlying integer value, similar to the other date-time related scalars (ARROW-18264)
- Duration type is now supported in the hash kernels like
- Fix silent overflow when converting
datetime.timedeltato duration type (ARROW-15026).
Relevant bug fixes:
- Numpy conversion for ListArray is improved taking into account sliced offset, avoiding increased memory usage (GH-20512
- Fix writing files with multi-byte characters in file name (ARROW-18123).
- map_batches() is lazy by default; it now returns a RecordBatchReader instead of a list of RecordBatch objects unless lazy = FALSE. GH-14521
- A substantial reorganisation, rewrite of and addition to, many of the vignettes and README. GH-14514
For more on what’s in the 11.0.0 R package, see the R changelog.
Ruby and C GLib notes
Arrow::Table#savenow always returns self instead of the result of its
- Improve the GC-related crash prevention system by guarding the shared objects from GC ARROW-18161
- Support omitting join keys in
- Add support for
to_arrow_chunked_arraymethods to support converting to
- The signature of
garrow_execute_plain_wait()was changed to take the
errorargument and to return the finished status GH-15254
- Add support for half float GH-15168
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.