Apache Arrow 13.0.0 Release
Published
24 Aug 2023
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 13.0.0 release. This covers over 3 months of development work and includes 456 resolved issues from 108 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 12.0.0 release, Marco Neumann, Gang Wu, Mehmet Ozan Kabak and Kevin Gurney have been invited to be committers. Matt Topol, Jie Wen, Ben Baumgold and Dewey Dunnington have joined the Project Management Committee (PMC).
Thanks for your contributions and participation in the project!
Columnar Format Notes
The run-end encoded layout has been added. This layout can allow data with long runs of duplicate values to be encoded and processed efficiently. Initial support has been added for C++ and Go.
C Device Data Interface
An experimental new specification, the C Device Data Interface, has been accepted for inclusion GH-34971. It builds on the existing C Data Interface to provide a runtime-agnostic zero-copy sharing mechanism for Arrow data residing on non-CPU devices.
Reference implementations of the C Device Data Interface will progressively be added to the standard Arrow libraries after the 13.0.0 release.
Arrow Flight RPC notes
Support for flagging ordered result sets to clients is now added. (GH-34852)
gRPC 1.30 is now the minimum supported version in C++/Python/R/etc. (GH-34679)
In C++, various methods now receive a full ServerCallContext
(GH-35442, GH-35377) and the context now exposes headers sent by the client (GH-35375).
C++ notes
Building
CMake 3.16 or later is now required for building Arrow C++ GH-34921.
Optimizations are not disabled anymore when the RelWithDebInfo
build type
is selected GH-35850. Furthermore, compiler flags can now properly be
customized per-build type using ARROW_C_FLAGS_DEBUG
, ARROW_CXX_FLAGS_DEBUG
and related variables GH-35870.
Acero
Handling of unaligned buffers is input nodes can be configured programmatically
or by setting the environment variable ACERO_ALIGNMENT_HANDLING
. The default
behavior is to warn when an unaligned buffer is detected GH-35498.
Compute
Several new functions have been added:
- aggregate functions “first”, “last”, “first_last” GH-34911;
- vector functions “cumulative_prod”, “cumulative_min”, “cumulative_max” GH-32190;
- vector function “pairwise_diff” GH-35786.
Sorting now works on dictionary arrays, with a much better performance than
the naive approach of sorting the decoded dictionary GH-29887. Sorting also
works on struct arrays, and nested sort keys are supported using FieldRed
GH-33206.
The check_overflow
option has been removed from CumulativeSumOptions
as
it was redundant with the availability of two different functions:
“cumulative_sum” and “cumulative_sum_checked” GH-35789.
Run-end encoded filters are efficiently supported GH-35749.
Duration types are supported with the “is_in” and “index_in” functions GH-36047. They can be multiplied with all integer types GH-36128.
“is_in” and “index_in” now cast their inputs more flexibly: they first attempt to cast the value set to the input type, then in the other direction if the former fails GH-36203.
Multiple bugs have been fixed in “utf8_slice_codeunits” when the stop
option
is omitted GH-36311.
Dataset
A custom schema can now be passed when writing a dataset GH-35730. The custom schema can alter nullability or metadata information, but is not allowed to change the datatypes written.
Filesystems
The S3 filesystem now writes files in equal-sized chunks, for compatibility with Cloudflare’s “R2” Storage GH-34363.
A long-standing issue where S3 support could crash at shutdown because of resources still being alive after S3 finalization has been fixed GH-36346. Now, attempts to use S3 resources (such as making filesystem calls) after S3 finalization should result in a clean error.
The GCS filesystem accepts a new option to set the project id GH-36227.
IPC
Nullability and metadata information for sub-fields of map types is now preserved when deserializing Arrow IPC GH-35297.
Orc
The Orc adapter now maps Arrow field metadata to Orc type attributes when writing, and vice-versa when reading GH-35304.
Parquet
It is now possible to write additional metadata while a ParquetFileWriter
is
open GH-34888.
Writing a page index can be enabled selectively per-column GH-34949. In addition, page header statistics are not written anymore if the page index is enabled for the given column GH-34375, as the information would be redundant and less efficiently accessed.
Parquet writer properties allow specifying the sorting columns GH-35331. The user is responsible for ensuring that the data written to the file actually complies with the given sorting.
CRC computation has been implemented for v2 data pages GH-35171. It was already implemented for v1 data pages.
Writing compliant nested types is now enabled by default GH-29781. This should not have any negative implication.
Attempting to load a subset of an Arrow extension type is now forbidden
GH-20385. Previously, if an extension type’s storage is nested (for example
a “Point” extension type backed by a struct<x: float64, y: float64>
),
it was possible to load selectively some of the columns of the storage type.
Substrait
Support for various functions has been added: “stddev”, “variance”, “first”, “last” (GH-35247, GH-35506).
Deserializing sorts is now supported GH-32763. However, some features, such as clustered sort direction or custom sort functions, are not implemented.
Miscellaneous
FieldRef
sports additional methods to get a flattened version of nested
fields GH-14946. Compared to their non-flattened counterparts,
the methods GetFlattened
, GetAllFlattened
, GetOneFlattened
and
GetOneOrNoneFlattened
combine a child’s null bitmap with its ancestors’
null bitmaps such as to compute the field’s overall logical validity bitmap.
In other words, given the struct array [null, {'x': null}, {'x': 5}]
,
FieldRef("x")::Get
might return [0, null, 5]
while FieldRef("y")::GetFlattened
will always return [null, null, 5]
.
Scalar::hash()
has been fixed for sliced nested arrays GH-35360.
A new floating-point to decimal conversion algorithm exhibits much better precision GH-35576.
It is now possible to cast between scalars of different list-like types GH-36309.
C# notes
Enhancements
-
The C Data Interface is now supported in the .NET Apache.Arrow library. The main entry points are
CArrowArrayImporter.ImportArray
,CArrowArrayExporter.ExportArray
,CArrowArrayStreamImporter.ImportArrayStream
, andCArrowArrayStreamExporter.ExportArrayStream
in theApache.Arrow.C
namespace. (GH-33856, GH-33857, GH-36120, and GH-35809). -
ArrowBuffer.BitmapBuilder
addsAppend(ReadOnlySpan<byte> source, int validBits)
andAppendRange(bool value, int length)
to improve performance of array concatenation (GH-32605)
Bug Fixes
- TotalBytes and TotalRecords are now being serialized in FlightInfo (GH-35267)
Go notes
Enhancements
Arrow
- Compute arithmetic functions are now available for Float16 (GH-35162)
- Float16, Large* and Fixed types are all now supported by the CSV reader/writer (GH-36105 and GH-36141)
- CSV Reader uses
AppendValueFromString
for extension types and properly reads empty values as null (GH-35188 and GH-35190) - Substrait expressions can now be executed using the Compute library (GH-35652)
- You can now read back values from Dictionary Builders before finishing the array (GH-35711)
MapType.ValueField
andMapType.ValueType
are now deprecated in favor ofMapType.Elem().(*StructType)
(GH-35909)- Multiple equality functions which have been deprecated since v9 have now been removed (Such as
array.ArraySliceEqual
in favor ofarray.SliceEqual
) (GH-36198) ValueStr
method on Timestamp arrays now includes the zone in the output (GH-36568)- BREAKING CHANGE
FixedSizeListBuilder.AppendNull
no longer requires manually appending nulls to the underlying list (GH-35482)
Flight
- FlightSQL driver supports non-prepared queries now (GH-35136)
Parquet
- Error messages in row group writer have been improved (GH-36319)
Bug Fixes
- Cross architecture build failures with v12.0.1 have been fixed (GH-36052)
Arrow
- It is now possible to build the Arrow Go lib using tinygo for building smaller WASM binaries (GH-32832)
Fields
method for Schema and StructType now returns a copy of the slice to ensure immutability (GH-35306 and GH-35866)array.ApproxEqual
for Maps now allows entries for a given element to be presented in any order (GH-35828)- Fix issues with decimal256 arrays (GH-35911, GH-35965, and GH-35975)
- StructType now allows duplicate field names correctly (GH-36014)
Flight
- Fix crash in client middleware (GH-35240)
Parquet
- Various memory leaks addressed in pqarrow package (GH-35015)
- Fixed panic for
ListOf(types)
if null (GH-35684)
Java notes
The JNI bindings for Arrow Dataset now support execute Substrait plans via the Acero query engine. (GH-34223)
Arrow packages that depend on Netty (most notably, arrow-memory-netty
, but also Arrow Flight) now require either Netty < 4.1.94.Final or Netty >= 4.1.96.Final. In Netty versions 4.1.94.Final and 4.1.95.Final, there was a breaking change in an internal API that affected Arrow; this was reverted in 4.1.96.Final (GH-36209, GH-36928)
VectorSchemaRoot#slice
now always makes a copy, including when the slice covers all rows (previously it did not make a copy in this case). This is a potentially-breaking change if your application depended on the old behavior. (GH-35275)
Debug info for allocations is no longer automatically enabled when assertions are enabled (e.g. when running unit tests). Instead, support must be explicitly enabled. This is not quite a breaking change, but may be surprising if you are used to using this information while debugging tests. However, performance should be greatly improved while running tests. (GH-34338)
Support for the upcoming Java 21 was added, though we do not yet test this in CI (GH-5053). The JNI bindings for Arrow Dataset now expose JSON support (GH-36421). Dictionary replacement is now supported when writing the IPC stream format (GH-18547).
JavaScript notes
- Updated dependencies: https://github.com/apache/arrow/pull/36032
Python notes
Compatibility notes:
- The default format version for Parquet has been bumped from 2.4 to 2.6 GH-35746. In practice, this means that nanosecond timestamps now preserve its resolution instead of being converted to microseconds.
- Support for Python 3.7 is dropped GH-34788
New features:
- Conversion to non-nano datetime64 for pandas >= 2.0 is now supported GH-33321
- Write page index is now supported GH-36284
- Bindings for reading JSON format in Dataset are added GH-34216
keys_sorted
property of MapType is now exposed GH-35112
Other improvements:
- Common python functionality between
Table
andRecordBatch
classes has been consolidated ( GH-36129, GH-35415, GH-35390, GH-34979, GH-34868, GH-31868) - Some functionality for
FixedShapeTensorType
has been improved (__reduce__
GH-36038, picklability GH-35599) - Pyarrow scalars can now be accepted in the
array
constructor GH-21761 - DataFrame Interchange Protocol implementation and usage is now documented GH-33980
- Conversion between Arrow and Pandas for map/pydict now has enhanced support GH-34729
- Usability of
pc.map_lookup
/MapLookupOptions
is improved GH-36045 zero_copy_only
keyword can now also be accepted inChunkedArray.to_numpy()
GH-34787- Python C++ codebase now has linter support in Archery and the CI GH-35485
Relevant bug fixes:
__array__
numpy conversion for Table and RecordBatch is now corrected so thatnp.asarray(pa.Table)
doesn’t return a transposed result GH-34886parquet.write_to_dataset
doesn’t create empty files for non-observed dictionary (category) values anymore GH-23870- Dataset writer now also correctly follows default Parquet version of 2.6 GH-36537
- Comparing
pyarrow.dataset.Partitioning
with other type is now correctly handled GH-36659 - Pickling of pyarrow.dataset PartitioningFactory objects is now supported GH-34884
- None schema is now disallowed in parquet writer GH-35858
pa.FixedShapeTensorArray.to_numpy_ndarray
is not failing on sliced arrays GH-35573- Halffloat type is now supported in the conversion from Arrow list to pandas GH-36168
__from_arrow__
is now also implemented forArray.to_pandas
for pandas extension data types GH-36096
R notes
New features:
open_dataset()
now works with ND-JSON files GH-35055- Calling
schema()
on multiple Arrow objects now returns the object’s schema GH-35543 - dplyr
.by
/by
argument now supported in arrow implementation of dplyr verbs GH-35667
Other improvements:
- Convenience function
arrow_array()
can be used to create Arrow Arrays GH-36381 - Convenience function
scalar()
can be used to create Arrow Scalars GH-36265 - Prevent crashed when passing data between arrow and duckdb by always calling
RecordBatchReader::ReadNext()
from DuckDB from the main R thread GH-36307 - Issue a warning for
set_io_thread_count()
withnum_threads
< 2 GH-36304 - Ensure missing grouping variables are added to the beginning of the variable list GH-36305
- CSV File reader options class objects can print the selected values GH-35955
- Schema metadata can be set as a named character vector GH-35954
- Ensure that the RStringViewer helper class does not own any Array references GH-35812
strptime()
in arrow will return a timezone-aware timestamp if%z
is part of the format string GH-35671- Column ordering when combining
group_by()
andacross()
now matches dplyr GH-35473 - Link to correct version of OpenSSL when using autobrew GH-36551
- Require cmake 3.16 in bundled build script GH-36321
For more on what’s in the 13.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
Bug fixes
- Fixed GC-related issue against random segfault in hash join GH-35819
- Fixed segfault in
CallExpression.new
GH-35915
Improvements
- FlightRPC: Added a convenient wrapper for the authentication method GH-35435
- Added empty table support in
#select_columns
GH-35681 - Added optional hash support in
Expression.try_convert
GH-35915 - Parquet: Added
Parquet::ArrowFileReader#each_row_group
GH-36008 - Added support of the automatic installation of arrow-c-glib on Conda environment GH-36287
C GLib
Bug fixes
- Parquet: Fixed GC-related bug in metadata dependencies GH-35266
- Fixed potentially GC-related issue against random segfault in hash join GH-35819
Improvements
- FlightRPC: Added support to pass
GAFlightServerCallContext
object in several methods ofGAFlightServerCustomAuthHandler
GH-35377 - FlightSQL: Added support for INSERT/UPDATE/DELETE GH-36408
- Added
GArrowRunEndEncodedDataType
GH-35417 - Added
GArrowRunEndEncodedArray
GH-35418
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.