Apache Arrow 11.0.0 Release
Published 25 Jan 2023
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 11.0.0 release. This covers over 3 months of development work and includes 423 resolved issues from 95 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, Jie Wen and Brent Gardner have been invited to be committers. Kun Liu has joined the Project Management Committee (PMC).
As per our newly started tradition of rotating the PMC chair once a year, Andrew Lamb was elected as the new PMC chair and VP.
Thanks for your contributions and participation in the project!
Columnar Format Notes
Arrow Flight RPC notes
In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. (#15069)
C++ notes
- It is now possible to specify alignment when making allocations with a MemoryPool GH-33056
- It is now possible to run an ExecPlan without using any CPU threads
- Added kernel for slicing list values GH-33168
- Added kernel for slicing binary arrays GH-20357
- When comparing list arrays for equality the list field name is now ignored GH-30519
- Add support for partitioning on columns that contain special characters GH-33448
- Added a streaming reader for JSON GH-33140
- Added support for incremental writes to the ORC writer GH-33047
- Added support for casting decimal to string and writing decimal to CSV GH-33002
- Fixed an assert in the scanner that would occur when batch_readahead was set to 0 GH-15264
- Fixed bug where arrays with a null data buffer would not be accepted when imported via the C data API GH-14875
- Fixed bug where arrays with a zero-case union data type would not be accepted when imported via the C data API GH-14855
- Fixed bug where case_when could return incorrect values GH-33382
- Fixed bug where RecordBatch::Equals was ignoring field names GH-33285
C# notes
No major changes to C#.
Go notes
- Go’s benchmarks will now get added to Conbench alongside the benchmarks for other implementations GH-32983
- Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server GH-15174
Arrow
- Function ApproxEquals was implemented for scalar values GH-29581
- UnmarshalJSON for the RecordBuilder now properly handles extra unknown fields with complex/nested values GH-31840
- Decimal128 and Decimal256 type support has been added to the CSV reader GH-33111
- Fixed bug in array.UnionBuilder where Len method always returned 0 GH-14775
- Fixed bug for handling slices of Map arrays when marshalling to JSON and for IPC GH-14780
- Fixed memory leak when compressing IPC message body buffers GH-14883
- Added the ability to easily append scalar values to array builders GH-15005
Compute
- Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package GH-33086. This includes easy functions like compute.Add and compute.Divide, etc.
- Scalar boolean functions like AND/OR/XOR/etc. have been implemented for compute GH-33279
- Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) GH-33308
- Scalar compute functions are compatible with dictionary encoded arrays by casting them to their value types GH-33502
Parquet
- Panic when decoding a delta_bit_packed encoded column has been fixed GH-33483
- Fixed memory leak from Allocator in pqarrow.WriteArrowToColumn GH-14865
- Fixed writer.WriteBatch to properly handle writing encrypted parquet columns and no longer silently fail, but instead propagate an error GH-14940
Java notes
- Implement support for writing compressed files (#15223)
- Improve performance by short-circuiting null checks when comparing non null field types (#15106)
- Several enhancements to dictionary encoding (#14891, #14902, #14874)
- Extend Table to support additional vector types (#14573)
- Enhance and simplify handling of allocation management by integrating C Data into allocator hierarchy (#14506)
- Make ComplexCopier agnostic of specific implementation of MapWriter (#14557)
- Distribute Apple M1 compatible JNI libraries via mavencentral (#14472)
- Extend Table copy functionality, and support returning copies of individual vectors (#14389)
JavaScript notes
- Bugfixes and dependency updates.
- Arrow now requires BigInt support. GH-33681
Python notes
Compatibility notes:
- PyArrow now requires pandas >= 1.0 (ARROW-18173)
- The pyarrow.parquet.ParquetDataset() class now by default uses the new Dataset API under the hood (use_legacy_dataset=False). You can still pass use_legacy_dataset=True to get the legacy implementation, but this option will be removed in a future release (ARROW-16728).
New features:
- Added support for the DataFrame Interchange Protocol for pyarrow.Table (GH-33346).
- New kernels: list_slice() to slice each list element of a ListArray returning a new ListArray (ARROW-17960).
- A new filter() method on the Dataset class as additional API to filter a Dataset before consuming it (ARROW-16616).
- New sort() method for (Chunked)Array and sort_by() method for RecordBatch, providing a convenience on top of the sort_indices kernel (GH-14778), and a new Dataset.sort_by() method (GH-14975).
Other improvements:
- Support for custom metadata of record batches in the IPC read and write APIs (ARROW-16430).
- Support URIs and the filesystem parameter in pyarrow.parquet.ParquetFile (ARROW-18272) and pyarrow.parquet.write_metadata (ARROW-18225).
- When writing a dataset to IPC using pyarrow.dataset.write_dataset(), you can now specify IPC specific options, such as compression (ARROW-17991).
- The pyarrow.array() function now allows constructing a MapArray from a sequence of dicts (in addition to a sequence of tuples) (ARROW-17832).
- The struct_field() kernel now also accepts field names in addition to integer indices (ARROW-17989).
- Casting to string is now supported for duration (ARROW-15822) and decimal (ARROW-17458) types, which also means those can now be written to CSV.
- When writing to CSV, you can now specify the quoting style (GH-14755).
- The pyarrow.ipc.read_schema() function now accepts a Message object (ARROW-18423).
- The Time32Scalar, Time64Scalar, Date32Scalar and Date64Scalar classes got a .value attribute to access the underlying integer value, similar to the other date-time related scalars (ARROW-18264).
- Duration type is now supported in the hash kernels like dictionary_encode (GH-15226).
- Fix silent overflow when converting datetime.timedelta to duration type (ARROW-15026).
Relevant bug fixes:
- Numpy conversion for ListArray is improved taking into account sliced offset, avoiding increased memory usage (GH-20512).
- Fix writing files with multi-byte characters in file name (ARROW-18123).
R notes
- map_batches() is lazy by default; it now returns a RecordBatchReader instead of a list of RecordBatch objects unless lazy = FALSE. GH-14521
- Many of the vignettes and the README have been substantially reorganised, rewritten, and extended. GH-14514
For more on what’s in the 11.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Arrow::Table#save now always returns self instead of the result of its raw_records GH-15289
- Improve the GC-related crash prevention system by guarding the shared objects from GC ARROW-18161
- Add Arrow::HalfFloat and raw_records support in Arrow::HalfFloatArray ARROW-18086
- Support omitting join keys in Table#join GH-15084
- Add support for Arrow::Table.load(uri, schema:) ARROW-15206
- Add Arrow::ColumnContainable#column_names (e.g. Arrow::Table#column_names) GH-15085
- Add to_arrow_chunked_array methods to support converting to Arrow::ChunkedArray ARROW-18405
C GLib
- Add garrow_chunked_array_new_empty() GH-33671
- Add GArrowProjectNodeOptions GH-33670
- Add GADatasetHivePartitioning GH-15257
- The signature of garrow_execute_plain_wait() was changed to take the error argument and to return the finished status GH-15254
- Add support for half float GH-15168
- Add GADatasetFinishOptions GH-15146
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.