Apache Arrow 13.0.0 Release


Published 24 Aug 2023
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 13.0.0 release. This covers over 3 months of development work and includes 456 resolved issues from 108 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 12.0.0 release, Marco Neumann, Gang Wu, Mehmet Ozan Kabak and Kevin Gurney have been invited to be committers. Matt Topol, Jie Wen, Ben Baumgold and Dewey Dunnington have joined the Project Management Committee (PMC).

Thanks for your contributions and participation in the project!

Columnar Format Notes

The run-end encoded layout has been added. This layout can allow data with long runs of duplicate values to be encoded and processed efficiently. Initial support has been added for C++ and Go.
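To make the idea concrete, here is a minimal pure-Python sketch of run-end encoding. It is not the Arrow implementation or API; the helper names (run_end_encode, run_end_decode, value_at) are hypothetical and only model the layout, in which each run stores its value together with the exclusive logical index at which the run ends.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def run_end_encode(values):
    """Return (run_ends, run_values) for a list of values."""
    run_values = []
    run_lengths = []
    for value, group in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in group))
    # Run ends are cumulative: the exclusive end index of each run.
    run_ends = list(accumulate(run_lengths))
    return run_ends, run_values


def run_end_decode(run_ends, run_values):
    """Expand the runs back into the original list."""
    out = []
    start = 0
    for end, value in zip(run_ends, run_values):
        out.extend([value] * (end - start))
        start = end
    return out


def value_at(run_ends, run_values, i):
    """Random access: binary search for the first run ending after i."""
    return run_values[bisect_right(run_ends, i)]


data = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
run_ends, run_values = run_end_encode(data)
print(run_ends, run_values)  # [3, 5, 9] ['a', 'b', 'c']
```

Because the run ends are sorted, a lookup by logical index is a binary search over the runs rather than a scan over the decoded values.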

C Device Data Interface

An experimental new specification, the C Device Data Interface, has been accepted for inclusion GH-34971. It builds on the existing C Data Interface to provide a runtime-agnostic zero-copy sharing mechanism for Arrow data residing on non-CPU devices.

Reference implementations of the C Device Data Interface will progressively be added to the standard Arrow libraries after the 13.0.0 release.

Arrow Flight RPC notes

Support for flagging ordered result sets to clients is now added. (GH-34852)

gRPC 1.30 is now the minimum supported version in C++/Python/R/etc. (GH-34679)

In C++, various methods now receive a full ServerCallContext (GH-35442, GH-35377) and the context now exposes headers sent by the client (GH-35375).

C++ notes

Building

CMake 3.16 or later is now required for building Arrow C++ GH-34921.

Optimizations are not disabled anymore when the RelWithDebInfo build type is selected GH-35850. Furthermore, compiler flags can now properly be customized per-build type using ARROW_C_FLAGS_DEBUG, ARROW_CXX_FLAGS_DEBUG and related variables GH-35870.

Acero

Handling of unaligned buffers in input nodes can be configured programmatically or by setting the environment variable ACERO_ALIGNMENT_HANDLING. The default behavior is to warn when an unaligned buffer is detected GH-35498.

Compute

Several new functions have been added:

  • aggregate functions “first”, “last”, “first_last” GH-34911;
  • vector functions “cumulative_prod”, “cumulative_min”, “cumulative_max” GH-32190;
  • vector function “pairwise_diff” GH-35786.
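As a rough illustration of what the new vector functions compute, the following pure-Python sketch models their results on null-free integer input. The helpers reuse the Arrow function names for readability but are plain-Python stand-ins, not the Arrow kernels; null handling, overflow behavior and the exact options of the real functions are not modeled here.

```python
import operator
from itertools import accumulate


def cumulative_prod(values):
    """Running product of the values seen so far."""
    return list(accumulate(values, operator.mul))


def cumulative_min(values):
    """Running minimum of the values seen so far."""
    return list(accumulate(values, min))


def cumulative_max(values):
    """Running maximum of the values seen so far."""
    return list(accumulate(values, max))


def pairwise_diff(values, period=1):
    """Difference between each element and the one `period` slots earlier.

    The first `period` outputs have no earlier partner, so they are None.
    """
    return [
        None if i < period else values[i] - values[i - period]
        for i in range(len(values))
    ]


data = [3, 1, 4, 1, 5]
print(cumulative_prod(data))   # [3, 3, 12, 12, 60]
print(cumulative_min(data))    # [3, 1, 1, 1, 1]
print(cumulative_max(data))    # [3, 3, 4, 4, 5]
print(pairwise_diff(data))     # [None, -2, 3, -3, 4]
```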

Sorting now works on dictionary arrays, with a much better performance than the naive approach of sorting the decoded dictionary GH-29887. Sorting also works on struct arrays, and nested sort keys are supported using FieldRef GH-33206.
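One way a dictionary-aware sort can avoid decoding is to rank the distinct dictionary values once and then sort rows by the rank of their index. The sketch below illustrates that idea in pure Python; it is an assumption about the general technique, not a description of the actual Arrow C++ implementation, and the function name is hypothetical. It also assumes the dictionary holds unique values.

```python
def sort_indices_dictionary(indices, dictionary):
    """Return row positions in sorted order without decoding the rows."""
    # Rank each distinct dictionary value once (assumed unique).
    order = sorted(range(len(dictionary)), key=lambda i: dictionary[i])
    rank = [0] * len(dictionary)
    for r, dict_pos in enumerate(order):
        rank[dict_pos] = r
    # Sort rows by the small integer rank of their dictionary index.
    return sorted(range(len(indices)), key=lambda row: rank[indices[row]])


dictionary = ["pear", "apple", "fig"]
indices = [0, 1, 2, 1, 0, 2]  # decodes to pear, apple, fig, apple, pear, fig

# Naive approach for comparison: decode every row, then sort the strings.
decoded = [dictionary[i] for i in indices]
naive = sorted(range(len(decoded)), key=lambda row: decoded[row])

print(sort_indices_dictionary(indices, dictionary))  # [1, 3, 2, 5, 0, 4]
```

The expensive value comparisons happen once per distinct value; the per-row work compares integers only.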

The check_overflow option has been removed from CumulativeSumOptions as it was redundant with the availability of two different functions: “cumulative_sum” and “cumulative_sum_checked” GH-35789.
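The difference between the two functions can be sketched with a toy 8-bit integer model: the unchecked variant wraps around on overflow, while the checked variant reports an error. The code below is a pure-Python illustration with hypothetical helpers named after the Arrow functions; it does not call Arrow, and the int8 width is chosen only to keep the numbers small.

```python
def wrap_int8(x):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return (x + 128) % 256 - 128


def cumulative_sum(values):
    """Running sum that silently wraps on int8 overflow."""
    out, total = [], 0
    for v in values:
        total = wrap_int8(total + v)
        out.append(total)
    return out


def cumulative_sum_checked(values):
    """Running sum that raises as soon as the int8 range is exceeded."""
    out, total = [], 0
    for v in values:
        total += v
        if not -128 <= total <= 127:
            raise OverflowError("cumulative sum overflowed int8")
        out.append(total)
    return out


data = [100, 27, 1]
print(cumulative_sum(data))  # [100, 127, -128]

try:
    cumulative_sum_checked(data)
    checked_failed = False
except OverflowError:
    checked_failed = True
print(checked_failed)  # True
```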

Run-end encoded filters are efficiently supported GH-35749.

Duration types are supported with the “is_in” and “index_in” functions GH-36047. They can be multiplied with all integer types GH-36128.

“is_in” and “index_in” now cast their inputs more flexibly: they first attempt to cast the value set to the input type, then in the other direction if the former fails GH-36203.

Multiple bugs have been fixed in “utf8_slice_codeunits” when the stop option is omitted GH-36311.

Dataset

A custom schema can now be passed when writing a dataset GH-35730. The custom schema can alter nullability or metadata information, but is not allowed to change the datatypes written.

Filesystems

The S3 filesystem now writes files in equal-sized chunks, for compatibility with Cloudflare’s “R2” Storage GH-34363.

A long-standing issue where S3 support could crash at shutdown because of resources still being alive after S3 finalization has been fixed GH-36346. Now, attempts to use S3 resources (such as making filesystem calls) after S3 finalization should result in a clean error.

The GCS filesystem accepts a new option to set the project id GH-36227.

IPC

Nullability and metadata information for sub-fields of map types is now preserved when deserializing Arrow IPC GH-35297.

Orc

The Orc adapter now maps Arrow field metadata to Orc type attributes when writing, and vice-versa when reading GH-35304.

Parquet

It is now possible to write additional metadata while a ParquetFileWriter is open GH-34888.

Writing a page index can be enabled selectively per-column GH-34949. In addition, page header statistics are not written anymore if the page index is enabled for the given column GH-34375, as the information would be redundant and less efficiently accessed.

Parquet writer properties allow specifying the sorting columns GH-35331. The user is responsible for ensuring that the data written to the file actually complies with the given sorting.

CRC computation has been implemented for v2 data pages GH-35171. It was already implemented for v1 data pages.
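For readers unfamiliar with page checksums, the sketch below shows the general write-then-verify pattern using Python's zlib.crc32. It assumes the checksum is the standard CRC-32 used by gzip and treats the page payload as opaque bytes; exactly which bytes of a data page are covered is defined by the Parquet format specification and is not modeled here. The helper names are hypothetical.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Standard CRC-32 of the page payload as an unsigned 32-bit value."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Recompute the checksum on read and compare with the stored one."""
    return page_crc(page_bytes) == stored_crc


payload = b"123456789"          # stand-in for serialized page data
stored = page_crc(payload)      # what a writer would record
corrupted = b"123456780"        # a single corrupted byte

print(hex(stored))                     # 0xcbf43926
print(verify_page(payload, stored))    # True
print(verify_page(corrupted, stored))  # False
```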

Writing compliant nested types is now enabled by default GH-29781. This should not have any negative implication.

Attempting to load a subset of an Arrow extension type is now forbidden GH-20385. Previously, if an extension type’s storage was nested (for example a “Point” extension type backed by a struct<x: float64, y: float64>), it was possible to selectively load some of the columns of the storage type.

Substrait

Support for various functions has been added: “stddev”, “variance”, “first”, “last” (GH-35247, GH-35506).

Deserializing sorts is now supported GH-32763. However, some features, such as clustered sort direction or custom sort functions, are not implemented.

Miscellaneous

FieldRef sports additional methods to get a flattened version of nested fields GH-14946. Compared to their non-flattened counterparts, the methods GetFlattened, GetAllFlattened, GetOneFlattened and GetOneOrNoneFlattened combine a child’s null bitmap with its ancestors’ null bitmaps so as to compute the field’s overall logical validity bitmap.

In other words, given the struct array [null, {'x': null}, {'x': 5}], FieldRef("x")::Get might return [0, null, 5] while FieldRef("x")::GetFlattened will always return [null, null, 5].
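The example can be modeled in a few lines of pure Python. This is a sketch of the validity-bitmap logic only, with hypothetical helper names; it does not use the Arrow C++ API. The child array keeps its own validity and whatever value happens to be stored in the slot under a null parent (0 here), and the flattened view additionally masks each slot with the parent's validity.

```python
def get(child_values, child_validity):
    """Child field as stored: only the child's own validity is applied."""
    return [v if ok else None for v, ok in zip(child_values, child_validity)]


def get_flattened(child_values, child_validity, parent_validity):
    """Child field with the parent's validity combined in (logical AND)."""
    return [
        v if (ok and parent_ok) else None
        for v, ok, parent_ok in zip(child_values, child_validity, parent_validity)
    ]


# Struct array [null, {'x': null}, {'x': 5}]
parent_validity = [False, True, True]
x_values = [0, 0, 5]               # slot 0 holds an arbitrary value
x_validity = [True, False, True]   # the child does not know its parent is null

print(get(x_values, x_validity))                             # [0, None, 5]
print(get_flattened(x_values, x_validity, parent_validity))  # [None, None, 5]
```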

Scalar::hash() has been fixed for sliced nested arrays GH-35360.

A new floating-point to decimal conversion algorithm exhibits much better precision GH-35576.

It is now possible to cast between scalars of different list-like types GH-36309.

C# notes

Enhancements

  • The C Data Interface is now supported in the .NET Apache.Arrow library. The main entry points are CArrowArrayImporter.ImportArray, CArrowArrayExporter.ExportArray, CArrowArrayStreamImporter.ImportArrayStream, and CArrowArrayStreamExporter.ExportArrayStream in the Apache.Arrow.C namespace. (GH-33856, GH-33857, GH-36120, and GH-35809).

  • ArrowBuffer.BitmapBuilder adds Append(ReadOnlySpan<byte> source, int validBits) and AppendRange(bool value, int length) to improve performance of array concatenation (GH-32605)

Bug Fixes

  • TotalBytes and TotalRecords are now being serialized in FlightInfo (GH-35267)

Go notes

Enhancements

Arrow

  • Compute arithmetic functions are now available for Float16 (GH-35162)
  • Float16, Large* and Fixed types are all now supported by the CSV reader/writer (GH-36105 and GH-36141)
  • CSV Reader uses AppendValueFromString for extension types and properly reads empty values as null (GH-35188 and GH-35190)
  • Substrait expressions can now be executed using the Compute library (GH-35652)
  • You can now read back values from Dictionary Builders before finishing the array (GH-35711)
  • MapType.ValueField and MapType.ValueType are now deprecated in favor of MapType.Elem().(*StructType) (GH-35909)
  • Multiple equality functions which have been deprecated since v9 have now been removed (Such as array.ArraySliceEqual in favor of array.SliceEqual) (GH-36198)
  • ValueStr method on Timestamp arrays now includes the zone in the output (GH-36568)
  • BREAKING CHANGE FixedSizeListBuilder.AppendNull no longer requires manually appending nulls to the underlying list (GH-35482)

Flight

  • FlightSQL driver supports non-prepared queries now (GH-35136)

Parquet

  • Error messages in row group writer have been improved (GH-36319)

Bug Fixes

  • Cross architecture build failures with v12.0.1 have been fixed (GH-36052)

Arrow

  • It is now possible to build the Arrow Go lib using tinygo for building smaller WASM binaries (GH-32832)
  • Fields method for Schema and StructType now returns a copy of the slice to ensure immutability (GH-35306 and GH-35866)
  • array.ApproxEqual for Maps now allows entries for a given element to be presented in any order (GH-35828)
  • Fix issues with decimal256 arrays (GH-35911, GH-35965, and GH-35975)
  • StructType now allows duplicate field names correctly (GH-36014)

Flight

  • Fix crash in client middleware (GH-35240)

Parquet

  • Various memory leaks addressed in pqarrow package (GH-35015)
  • Fixed panic for ListOf(types) if null (GH-35684)

Java notes

The JNI bindings for Arrow Dataset now support executing Substrait plans via the Acero query engine. (GH-34223)

Arrow packages that depend on Netty (most notably, arrow-memory-netty, but also Arrow Flight) now require either Netty < 4.1.94.Final or Netty >= 4.1.96.Final. In Netty versions 4.1.94.Final and 4.1.95.Final, there was a breaking change in an internal API that affected Arrow; this was reverted in 4.1.96.Final (GH-36209, GH-36928)
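The supported range can be expressed as a simple predicate. The sketch below is illustrative only: the function names are hypothetical, and it assumes version strings of the form "4.1.96.Final" with three numeric components.

```python
def parse_netty_version(version):
    """Turn '4.1.96.Final' into the comparable tuple (4, 1, 96)."""
    return tuple(int(part) for part in version.split(".")[:3])


def netty_supported(version):
    """True unless the version is one of the affected 4.1.94/4.1.95 releases."""
    v = parse_netty_version(version)
    return v < (4, 1, 94) or v >= (4, 1, 96)


for candidate in ["4.1.93.Final", "4.1.94.Final", "4.1.95.Final", "4.1.96.Final"]:
    print(candidate, netty_supported(candidate))
```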

VectorSchemaRoot#slice now always makes a copy, including when the slice covers all rows (previously it did not make a copy in this case). This is a potentially-breaking change if your application depended on the old behavior. (GH-35275)

Debug info for allocations is no longer automatically enabled when assertions are enabled (e.g. when running unit tests). Instead, support must be explicitly enabled. This is not quite a breaking change, but may be surprising if you are used to using this information while debugging tests. However, performance should be greatly improved while running tests. (GH-34338)

Support for the upcoming Java 21 was added, though we do not yet test this in CI (GH-5053). The JNI bindings for Arrow Dataset now expose JSON support (GH-36421). Dictionary replacement is now supported when writing the IPC stream format (GH-18547).

JavaScript notes

  • Updated dependencies: https://github.com/apache/arrow/pull/36032

Python notes

Compatibility notes:

  • The default format version for Parquet has been bumped from 2.4 to 2.6 GH-35746. In practice, this means that nanosecond timestamps now preserve their resolution instead of being converted to microseconds.
  • Support for Python 3.7 is dropped GH-34788
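To see why the Parquet format version change matters for timestamps, the following pure-Python arithmetic shows what is lost when a nanosecond value is stored at microsecond resolution. It does not use pyarrow; the timestamp is an arbitrary illustrative value and the helper name is hypothetical.

```python
NS_PER_US = 1_000


def to_microseconds(ts_ns):
    """Convert nanoseconds since the epoch to whole microseconds."""
    return ts_ns // NS_PER_US


ts_ns = 1_692_873_600_123_456_789      # arbitrary nanosecond timestamp
as_us = to_microseconds(ts_ns)         # what microsecond storage keeps
round_tripped = as_us * NS_PER_US      # reading it back as nanoseconds
lost_ns = ts_ns - round_tripped        # sub-microsecond part that is gone

print(as_us)     # 1692873600123456
print(lost_ns)   # 789
```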

New features:

  • Conversion to non-nano datetime64 for pandas >= 2.0 is now supported GH-33321
  • Write page index is now supported GH-36284
  • Bindings for reading JSON format in Dataset are added GH-34216
  • keys_sorted property of MapType is now exposed GH-35112

Other improvements:

  • Common Python functionality between Table and RecordBatch classes has been consolidated (GH-36129, GH-35415, GH-35390, GH-34979, GH-34868, GH-31868)
  • Some functionality for FixedShapeTensorType has been improved (__reduce__ GH-36038, picklability GH-35599)
  • Pyarrow scalars can now be accepted in the array constructor GH-21761
  • DataFrame Interchange Protocol implementation and usage is now documented GH-33980
  • Conversion between Arrow and Pandas for map/pydict now has enhanced support GH-34729
  • Usability of pc.map_lookup / MapLookupOptions is improved GH-36045
  • zero_copy_only keyword can now also be accepted in ChunkedArray.to_numpy() GH-34787
  • Python C++ codebase now has linter support in Archery and the CI GH-35485

Relevant bug fixes:

  • __array__ numpy conversion for Table and RecordBatch is now corrected so that np.asarray(pa.Table) doesn’t return a transposed result GH-34886
  • parquet.write_to_dataset doesn’t create empty files for non-observed dictionary (category) values anymore GH-23870
  • Dataset writer now also correctly follows default Parquet version of 2.6 GH-36537
  • Comparing pyarrow.dataset.Partitioning with other type is now correctly handled GH-36659
  • Pickling of pyarrow.dataset PartitioningFactory objects is now supported GH-34884
  • None schema is now disallowed in parquet writer GH-35858
  • pa.FixedShapeTensorArray.to_numpy_ndarray no longer fails on sliced arrays GH-35573
  • Halffloat type is now supported in the conversion from Arrow list to pandas GH-36168
  • __from_arrow__ is now also implemented for Array.to_pandas for pandas extension data types GH-36096

R notes

New features:

  • open_dataset() now works with ND-JSON files GH-35055
  • Calling schema() on more kinds of Arrow objects now returns the object’s schema GH-35543
  • dplyr .by/by argument now supported in arrow implementation of dplyr verbs GH-35667

Other improvements:

  • Convenience function arrow_array() can be used to create Arrow Arrays GH-36381
  • Convenience function scalar() can be used to create Arrow Scalars GH-36265
  • Prevent crashes when passing data between arrow and duckdb by always calling RecordBatchReader::ReadNext() from DuckDB from the main R thread GH-36307
  • Issue a warning for set_io_thread_count() with num_threads < 2 GH-36304
  • Ensure missing grouping variables are added to the beginning of the variable list GH-36305
  • CSV File reader options class objects can print the selected values GH-35955
  • Schema metadata can be set as a named character vector GH-35954
  • Ensure that the RStringViewer helper class does not own any Array references GH-35812
  • strptime() in arrow will return a timezone-aware timestamp if %z is part of the format string GH-35671
  • Column ordering when combining group_by() and across() now matches dplyr GH-35473
  • Link to correct version of OpenSSL when using autobrew GH-36551
  • Require cmake 3.16 in bundled build script GH-36321

For more on what’s in the 13.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

Bug fixes

  • Fixed a GC-related issue that caused random segfaults in hash join GH-35819
  • Fixed segfault in CallExpression.new GH-35915

Improvements

  • FlightRPC: Added a convenient wrapper for the authentication method GH-35435
  • Added empty table support in #select_columns GH-35681
  • Added optional hash support in Expression.try_convert GH-35915
  • Parquet: Added Parquet::ArrowFileReader#each_row_group GH-36008
  • Added support for automatic installation of arrow-c-glib in Conda environments GH-36287

C GLib

Bug fixes

  • Parquet: Fixed GC-related bug in metadata dependencies GH-35266
  • Fixed a potentially GC-related issue that caused random segfaults in hash join GH-35819

Improvements

  • FlightRPC: Added support to pass GAFlightServerCallContext object in several methods of GAFlightServerCustomAuthHandler GH-35377
  • FlightSQL: Added support for INSERT/UPDATE/DELETE GH-36408
  • Added GArrowRunEndEncodedDataType GH-35417
  • Added GArrowRunEndEncodedArray GH-35418

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.