Apache Arrow 16.0.0 Release


Published 20 Apr 2024
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 16.0.0 release. This covers over 3 months of development work and includes 385 resolved issues on 586 distinct commits from 119 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 15.0.0 release, Jeffrey Vo, Jay Zhan, Bryce Mecum, Joel Lubinitsky, and Sarah Gilmore have been invited to be committers. No new members have joined the Project Management Committee (PMC).

Thanks for your contributions and participation in the project!

C Data Interface notes

  • Added RegisterDeviceMemoryManager and GetDeviceMemoryManage for managing mappings between a device type and id to a memory manager (GH-40698).
  • Added RegisterCUDADevice to register CUDA devices (GH-40698).
  • Added ImportFromChunkedArray and ExportChunkedArray for handling Chunked Arrays in the C Stream Interface (GH-38717).
  • Fixed an issue where string and nested types weren’t being correctly imported with DeviceArray (GH-39769).
  • Added support for copying Arrays and RecordBatches between memory types (GH-39771).

Arrow Flight RPC notes

  • Session variable RPCs were added (GH-34865)
  • Go: cookies can be copied to another connection to reuse existing credentials (GH-39837)
  • Go: enable PollFlightInfo for Flight SQL clients/servers (GH-39574)
  • Java: the JDBC driver now tries all locations the server sends it (GH-38573)
  • Java: tweak some options to give better performance (GH-40475, GH-40039)

C++ notes

For C++ notes refer to the full changelog.

Highlights

  • Initial support for the Azure Blob Storage has been added (GH-18014).
  • Arrow C++ can now be built with Emscripten (GH-37821) which lays the foundation for running Arrow C++ under WASM runtimes and eventually PyArrow as well.
  • Arrow’s filesystem modules have been separated out into individual libraries and this change enables writing and registering custom filesystem implementations (GH-38309).
  • Conversion from Table and RecordBatch to a Tensor (not the same as tensor extension array) is being developed. Umbrella issue is created (GH-40058) and issues connected to the RecordBatch conversion are included in this release (GH-40059, GH-40357, GH-40297, GH-40060, GH-40061 and GH-40866) which means RecordBatch can now be converted to a column or row-major two-dimensional structure.

Compute

Bug Fixes

  • Fixed a potential crash when accessing the true_count property on a BooleanArray (GH-41016).

Performance improvements

  • Significantly improved performance of the take kernel on certain types of inputs (GH-40207).

Enhancements

  • Support for casting to and from half-float (float16) has been added (GH-20213).
  • Added support for residual predicates to Swiss Join implementation (GH-20339).
  • Expanded support to primitive filter implementation for all fixed-width primitive types and take filter implementation for all well-known fixed-width types (GH-39740).
  • Added support for calling the binary_slice kernel on Fixed-Size Binary Arrays (GH-39231).
  • The cast kernel now supports casting from LargeString, Binary, and LargeBinary to Dictionary (GH-39463).
  • Fields of different decimal precision can now be used together in arithmetic operations without an explicit cast beforehand. (GH-40126).

Datasets

Parquet

  • Byte stream split encoding support has been added for FIXED_LEN_BYTE_ARRAY, INT32, and INT64 which enables this encoding for half-float (float16) and fixed-width decimal (GH-39978).
  • Decoding boolean values has been made faster for a variety of cases (GH-40872).

Filesystems

New Features

  • In addition to building the individual filesystem implementations as separate modules, users can now write and register custom filesystem implementations (GH-38309).
  • A new environment variable, AWS_ENDPOINT_URL_S3, has been added which allows separately overriding the endpoint for S3 operations alone (GH-38663).

Bug Fixes

  • Fixed a bug in the S3 filesystem implementation that could cause a crash when deleting an object having duplicate forward slashes in its name (GH-38821).
  • Fixed a bug where hash_mean could silently overflow (GH-38833).

Improvements

  • The S3 implementation now sets the content-type of directory-like objects to application/x-directory to improve compatibility with other S3 tools (GH-38794).
  • Repeated S3Client initialization is now roughly an order of magnitude faster (GH-40299).
  • The MemoryPoolStats implementation has been reworked to re-order loads and stores which may be an improvement for some allocation-heavy, multi-threaded applications (GH-40783).

Substrait

  • Support has been added to Substrait for a variety of Arrow types (GH-40695).
  • substrait-cpp has been upgraded to 0.44 (GH-40695).

Development

  • Added support the mold and lld linkers for building Arrow C++ (GH-40394, GH-40400).

Miscellaneous

  • Upgraded ORC to 2.0.0 (GH-40507).
  • Upgraded zstd to 1.5.6 (GH-40837).
  • Upgraded google benchmark to 1.8.3 (GH-39863).
  • Upgraded zlib 1.3.1 (GH-39876).
  • Various ToString methods now support an optional show_metadata argument which will print metadata that may exist in nested types. (GH-39864).

C# notes

  • IPC record batch compression has been implemented GH-24834
  • Optional materialization of C# string arrays is now supported GH-41047
  • A memory leak in the C Data interface has been fixed GH-40898
  • Various other bug fixes and improvements.

Go Notes

  • The Golang Arrow and Parquet libraries now require Go 1.21+ (GH-40733)

Bug Fixes

Arrow

  • FlightSQL Driver will now properly handle concurrent result sets instead of pulling the entire result into memory (GH-40089)
  • FlightSQL driver will now correctly respect the DriverConfig.TLSEnabled field (GH-40097)
  • Fixed a panic on 32-bit architectures (GH-40672)
  • Corrected a precision loss for Decimal types when converting to JSON (GH-40693)
  • Fixed an issue with array.RecordBuilder when using a NullType column (GH-40719)

Parquet

  • Fixed panic when writing a DeltaBinaryPacked column containing only nulls (GH-35718)
  • Fixed a panic when writing a ListOf(DeltaBinaryPacked) field with no data (GH-39309)
  • Arrow DATE64 types will now be properly coerced into Parquet DATE[32-bit] logical type (GH-39456)
  • Fixed the timezone semantics for timestamp conversion from Arrow to Parquet (GH-39466)
  • Corrected an inaccuracy with RowGroupTotalCompressedBytes and RowGroupTotalBytesWritten for Parquet file writer (GH-39870)
  • Fixed the TotalCompressedBytes count when falling back to plain encoding if a dictionary is too large (GH-39921)
  • Fixed a bug when reslicing a nullable dictionary in the chunked writer (GH-39925)

Enhancements

Arrow

  • Users can now access the underlying MemoTable of a dictionary builder (GH-38988)
  • Added an option to provide a string replacer for CSV writing (GH-39552)
  • Flight: Cookies can be copied to another connection to reuse existing credentials (GH-39837)
  • Flight: enable PollFlightInfo for Flight SQL clients/servers (GH-39574)
  • Added the ability to create a PreparedStatement from persisted data and provided access for FlightSQL users to the PreparedStatement handle property (GH-39774 GH-39910)
  • FlightRPC Session management extensions have been implemented (GH-40155)

Parquet

  • Can now register new compression codecs for Parquet (GH-40113)
  • Parquet footers can be incrementally written without closing the file (GH-40630)

Java notes

  • A breaking change to support Java 9 modules has been implemented in this release. GH-39001
  • A new Float16 type has been added. GH-39680
  • Java 22 is supported. GH-40680
  • Various bug fixes and improvements.

JavaScript notes

  • Dates are now stored as TimestampMillisecond (GH-40892)
  • Vectors created from typed arrays are now correctly not nullable and null counts are now correct (GH-40852)

Python notes

Compatibility notes:

  • To ensure PyArrow compatibility with NumPy 2.0 umbrella issue has been closed GH-39532 with last issues included in 16.0.0 Arrow release (GH-41098, GH-39848 and GH-40376).
  • We no longer use internals to create Block objects and started using new pandas API with pandas version 3 GH-35081
  • Pandas compatibility code has been simplified as old pandas and Python versions are not supported anymore GH-40720
  • Deprecated pyarrow.filesystem legacy implementations have been removed GH-20127

New features:

  • Converting Arrow Table and RecordBatch to a Tensor (not the same as tensor extension array) is being developed in Arrow C++ with bindings in Python. Umbrella issue: (GH-40058). In current release the option to convert a RecordBatch to Tensor with pyarrow.RecordBatch.to_tensor(...) is added returning a row or column major tensor with an option of writing missing values as NaN in the result.
  • ListView and LargeListView array formats are now supported by PyArrow (GH-39812, GH-39855, GH-40205, GH-41039, GH-40266)
  • Binary and StringView are now supported in PyArrow (GH-39651, GH-39852, GH-40092)
  • Final support for Run-End Encoded arrays in PyArrow has been included (conversion to numpy and pandas GH-40659, construction in pa.array(...) GH-40273)
  • AsofJoinNode C++ functionality is now exposed in Python as a join_asof GH-34235
  • Minimal python bindings are added for AzureFilesystem GH-39968
  • FixedSizeTensorScalar class is added GH-37484

Other improvements:

  • Add ChunkedArray import/export to/from C GH-39984
  • pyarrow.Field and pyarrow.ChunkedArray can now be constructed from objects supporting the PyCapsule Arrow C Data Interface GH-38010
  • Requested_schema is supported in __arrow_c_stream__ implementations GH-40066
  • Add low-level bindings for exporting/importing the C Device Interface GH-39979
  • Function to download and extract timezone database on a Windows machine is added GH-37328
  • Missing methods are added to pyarrow.RecordBatch GH-30915
  • Dictionary is now also accepted in pyarrow.record_batch factory function (as in pyarrow.table) GH-40291
  • Usage of scalar legacy cast has been removed GH-40023
  • Missing byte_width attribute are added to all DataType classes GH-39277
  • FileInfo instances can now be used to construct Dataset objects GH-40142
  • Support hashing for FileMetaData and ParquetSchema GH-39780
  • force_virtual_addressing is exposed in PyArrow GH-39779

Relevant bug fixes:

  • Calling pyarrow.dataset.ParquetFileFormat.make_write_options as a class method now returns a warning GH-39440
  • ScalarMemoTableis now initiated only when deduplication is enabled which fixes large memory consumption in the other case GH-40316
  • Slicing an array backwards beyond the start doesn’t include first item (GH-38768 and GH-40642)
  • Memory leaks when creating Arrow array from Python list of dicts is fixed GH-37989
  • FixedSizeListType has not been considered as a nested type and is now added to _NESTED_TYPES GH-40171
  • max_chunksize is now validated in Table.to_batches GH-39788
  • Raising ValueError on _ensure_partitioningin Dataset is fixed GH-39579

  • Python stacktrace is now attached to errors in ConvertPyError GH-37164

R notes

  • Arrow IPC streams (i.e., write_ipc_stream) can now be written to socket connections (GH-38897)
  • The print() output for Dataset and Table objects has been improved so it now shows dimensions and truncates its output in the case of wide schemas (GH-38917)
  • Various improvements and fixes to documentation, package build, and CI systems

For more on what’s in the 16.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby

  • Added support for customizing timestamp parsers. GH-40590

C GLib

  • Added support for time zone in GArrowTimestampDataType. GH-39702
  • Added missing compute function options. GH-40402
    • GArrowSplitPatternOptions
    • GArrowStrftimeOptions
    • GArrowStrptimeOptions
    • GArrowStructFieldOptions
  • Changed documentation generator to GI-DocGen from GTK-Doc. GH-39935
  • Added GArrowTimestampParser. GH-40438
  • Added support for customizing timestamp parsers. GH-40590

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.