Apache Arrow 16.0.0 Release
Published 20 Apr 2024 by The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 16.0.0 release. This covers over 3 months of development work and includes 385 resolved issues on 586 distinct commits from 119 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 15.0.0 release, Jeffrey Vo, Jay Zhan, Bryce Mecum, Joel Lubinitsky, and Sarah Gilmore have been invited to be committers. No new members have joined the Project Management Committee (PMC).
Thanks for your contributions and participation in the project!
C Data Interface notes
- Added RegisterDeviceMemoryManager and GetDeviceMemoryManage for managing mappings between a device type and id to a memory manager (GH-40698).
- Added RegisterCUDADevice to register CUDA devices (GH-40698).
- Added ImportFromChunkedArray and ExportChunkedArray for handling Chunked Arrays in the C Stream Interface (GH-38717).
- Fixed an issue where string and nested types weren’t being correctly imported with DeviceArray (GH-39769).
- Added support for copying Arrays and RecordBatches between memory types (GH-39771).
Arrow Flight RPC notes
- Session variable RPCs were added (GH-34865)
- Go: cookies can be copied to another connection to reuse existing credentials (GH-39837)
- Go: enable PollFlightInfo for Flight SQL clients/servers (GH-39574)
- Java: the JDBC driver now tries all locations the server sends it (GH-38573)
- Java: tweak some options to give better performance (GH-40475, GH-40039)
C++ notes
For C++ notes refer to the full changelog.
Highlights
- Initial support for Azure Blob Storage has been added (GH-18014).
- Arrow C++ can now be built with Emscripten (GH-37821) which lays the foundation for running Arrow C++ under WASM runtimes and eventually PyArrow as well.
- Arrow’s filesystem modules have been separated out into individual libraries and this change enables writing and registering custom filesystem implementations (GH-38309).
- Conversion from Table and RecordBatch to a Tensor (not the same as tensor extension array) is being developed. An umbrella issue has been created (GH-40058), and the issues connected to the RecordBatch conversion are included in this release (GH-40059, GH-40357, GH-40297, GH-40060, GH-40061 and GH-40866), which means a RecordBatch can now be converted to a column-major or row-major two-dimensional structure.
Compute
Bug Fixes
- Fixed a potential crash when accessing the true_count property on a BooleanArray (GH-41016).
Performance improvements
- Significantly improved performance of the take kernel on certain types of inputs (GH-40207).
Enhancements
- Support for casting to and from half-float (float16) has been added (GH-20213).
- Added support for residual predicates to Swiss Join implementation (GH-20339).
- Expanded support to primitive filter implementation for all fixed-width primitive types and take filter implementation for all well-known fixed-width types (GH-39740).
- Added support for calling the binary_slice kernel on Fixed-Size Binary Arrays (GH-39231).
- The cast kernel now supports casting from LargeString, Binary, and LargeBinary to Dictionary (GH-39463).
- Fields of different decimal precision can now be used together in arithmetic operations without an explicit cast beforehand (GH-40126).
Datasets
- Improved backpressure handling in the Dataset Writer which can significantly reduce memory usage for some use cases (https://github.com/apache/arrow/pull/40722).
Parquet
- Byte stream split encoding support has been added for FIXED_LEN_BYTE_ARRAY, INT32, and INT64 which enables this encoding for half-float (float16) and fixed-width decimal (GH-39978).
- Decoding boolean values has been made faster for a variety of cases (GH-40872).
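To give an idea of what byte stream split encoding does with fixed-width values such as INT32, here is a minimal pure-Python sketch of the transformation as described in the Parquet format: byte 0 of every value goes into the first stream, byte 1 into the second, and so on, and the streams are concatenated. This is not the Arrow implementation, and the function names are hypothetical.

```python
import struct


def byte_stream_split_encode(values: bytes, width: int) -> bytes:
    """Scatter the bytes of fixed-width values into `width` streams."""
    if len(values) % width != 0:
        raise ValueError("input length must be a multiple of the value width")
    # Stream i holds byte i of every value; streams are concatenated.
    return b"".join(values[i::width] for i in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse of byte_stream_split_encode."""
    if len(encoded) % width != 0:
        raise ValueError("input length must be a multiple of the value width")
    count = len(encoded) // width
    streams = [encoded[i * count:(i + 1) * count] for i in range(width)]
    # Re-interleave: take one byte from each stream per value.
    return bytes(b for value in zip(*streams) for b in value)


# Three little-endian INT32 values.
raw = struct.pack("<3i", 1, 2, 3)
encoded = byte_stream_split_encode(raw, 4)

print(encoded)  # b'\x01\x02\x03' followed by nine zero bytes
assert byte_stream_split_decode(encoded, 4) == raw
```

The encoding itself does not shrink the data. It groups similar bytes together (here, all the zero high-order bytes) so that a general-purpose compressor applied afterwards tends to do better.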
Filesystems
New Features
- In addition to building the individual filesystem implementations as separate modules, users can now write and register custom filesystem implementations (GH-38309).
- A new environment variable, AWS_ENDPOINT_URL_S3, has been added which allows separately overriding the endpoint for S3 operations alone (GH-38663).
Bug Fixes
- Fixed a bug in the S3 filesystem implementation that could cause a crash when deleting an object having duplicate forward slashes in its name (GH-38821).
- Fixed a bug where hash_mean could silently overflow (GH-38833).
Improvements
- The S3 implementation now sets the content-type of directory-like objects to application/x-directory to improve compatibility with other S3 tools (GH-38794).
- Repeated S3Client initialization is now roughly an order of magnitude faster (GH-40299).
- The MemoryPoolStats implementation has been reworked to re-order loads and stores which may be an improvement for some allocation-heavy, multi-threaded applications (GH-40783).
Substrait
- Support has been added to Substrait for a variety of Arrow types (GH-40695).
- substrait-cpp has been upgraded to 0.44 (GH-40695).
Development
Miscellaneous
- Upgraded ORC to 2.0.0 (GH-40507).
- Upgraded zstd to 1.5.6 (GH-40837).
- Upgraded google benchmark to 1.8.3 (GH-39863).
- Upgraded zlib to 1.3.1 (GH-39876).
- Various ToString methods now support an optional show_metadata argument which will print metadata that may exist in nested types (GH-39864).
C# notes
- IPC record batch compression has been implemented GH-24834
- Optional materialization of C# string arrays is now supported GH-41047
- A memory leak in the C Data interface has been fixed GH-40898
- Various other bug fixes and improvements.
Go Notes
- The Golang Arrow and Parquet libraries now require Go 1.21+ (GH-40733)
Bug Fixes
Arrow
- FlightSQL Driver will now properly handle concurrent result sets instead of pulling the entire result into memory (GH-40089)
- FlightSQL driver will now correctly respect the DriverConfig.TLSEnabled field (GH-40097)
- Fixed a panic on 32-bit architectures (GH-40672)
- Corrected a precision loss for Decimal types when converting to JSON (GH-40693)
- Fixed an issue with array.RecordBuilder when using a NullType column (GH-40719)
Parquet
- Fixed panic when writing a DeltaBinaryPacked column containing only nulls (GH-35718)
- Fixed a panic when writing a ListOf(DeltaBinaryPacked) field with no data (GH-39309)
- Arrow DATE64 types will now be properly coerced into Parquet DATE[32-bit] logical type (GH-39456)
- Fixed the timezone semantics for timestamp conversion from Arrow to Parquet (GH-39466)
- Corrected an inaccuracy with RowGroupTotalCompressedBytes and RowGroupTotalBytesWritten for Parquet file writer (GH-39870)
- Fixed the TotalCompressedBytes count when falling back to plain encoding if a dictionary is too large (GH-39921)
- Fixed a bug when reslicing a nullable dictionary in the chunked writer (GH-39925)
Enhancements
Arrow
- Users can now access the underlying MemoTable of a dictionary builder (GH-38988)
- Added an option to provide a string replacer for CSV writing (GH-39552)
- Flight: Cookies can be copied to another connection to reuse existing credentials (GH-39837)
- Flight: enable PollFlightInfo for Flight SQL clients/servers (GH-39574)
- Added the ability to create a PreparedStatement from persisted data and provided access for FlightSQL users to the PreparedStatement handle property (GH-39774, GH-39910)
- FlightRPC Session management extensions have been implemented (GH-40155)
Parquet
- Can now register new compression codecs for Parquet (GH-40113)
- Parquet footers can be incrementally written without closing the file (GH-40630)
Java notes
- A breaking change to support Java 9 modules has been implemented in this release. GH-39001
- A new Float16 type has been added. GH-39680
- Java 22 is supported. GH-40680
- Various bug fixes and improvements.
JavaScript notes
- Dates are now stored as TimestampMillisecond (GH-40892)
- Vectors created from typed arrays are now correctly not nullable and null counts are now correct (GH-40852)
Python notes
Compatibility notes:
- The umbrella issue for PyArrow compatibility with NumPy 2.0 (GH-39532) has been closed, with the last related issues included in the 16.0.0 release (GH-41098, GH-39848 and GH-40376).
- With pandas version 3, PyArrow no longer uses pandas internals to create Block objects and uses the new pandas API instead GH-35081
- Pandas compatibility code has been simplified, as old pandas and Python versions are no longer supported GH-40720
- Deprecated pyarrow.filesystem legacy implementations have been removed GH-20127
New features:
- Converting Arrow Table and RecordBatch to a Tensor (not the same as tensor extension array) is being developed in Arrow C++ with bindings in Python. Umbrella issue: (GH-40058). In the current release, the option to convert a RecordBatch to a Tensor with pyarrow.RecordBatch.to_tensor(...) has been added, returning a row-major or column-major tensor with an option of writing missing values as NaN in the result.
- ListView and LargeListView array formats are now supported by PyArrow (GH-39812, GH-39855, GH-40205, GH-41039, GH-40266)
- Binary and StringView are now supported in PyArrow (GH-39651, GH-39852, GH-40092)
- Final support for Run-End Encoded arrays in PyArrow has been included (conversion to numpy and pandas GH-40659, construction in pa.array(...) GH-40273)
- AsofJoinNode C++ functionality is now exposed in Python as join_asof GH-34235
- Minimal Python bindings are added for AzureFilesystem GH-39968
- FixedSizeTensorScalar class is added GH-37484
Other improvements:
- Add ChunkedArray import/export to/from C GH-39984
- pyarrow.Field and pyarrow.ChunkedArray can now be constructed from objects supporting the PyCapsule Arrow C Data Interface GH-38010
- Requested_schema is supported in __arrow_c_stream__ implementations GH-40066
- Add low-level bindings for exporting/importing the C Device Interface GH-39979
- Function to download and extract timezone database on a Windows machine is added GH-37328
- Missing methods are added to pyarrow.RecordBatch GH-30915
- Dictionary is now also accepted in pyarrow.record_batch factory function (as in pyarrow.table) GH-40291
- Usage of scalar legacy cast has been removed GH-40023
- The missing byte_width attribute has been added to all DataType classes GH-39277
- FileInfo instances can now be used to construct Dataset objects GH-40142
- Support hashing for FileMetaData and ParquetSchema GH-39780
- force_virtual_addressing is exposed in PyArrow GH-39779
Relevant bug fixes:
- Calling pyarrow.dataset.ParquetFileFormat.make_write_options as a class method now emits a warning GH-39440
- ScalarMemoTable is now initialized only when deduplication is enabled, which fixes large memory consumption in the other case GH-40316
- Slicing an array backwards beyond the start now includes the first item (GH-38768 and GH-40642)
- A memory leak when creating an Arrow array from a Python list of dicts has been fixed GH-37989
- FixedSizeListType was not previously considered a nested type and is now added to _NESTED_TYPES GH-40171
- max_chunksize is now validated in Table.to_batches GH-39788
- Raising ValueError on _ensure_partitioning in Dataset is fixed GH-39579
- Python stacktrace is now attached to errors in ConvertPyError GH-37164
R notes
- Arrow IPC streams (i.e., write_ipc_stream) can now be written to socket connections (GH-38897)
- The print() output for Dataset and Table objects has been improved so it now shows dimensions and truncates its output in the case of wide schemas (GH-38917)
- Various improvements and fixes to documentation, package build, and CI systems
For more on what’s in the 16.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Added support for customizing timestamp parsers. GH-40590
C GLib
- Added support for time zone in GArrowTimestampDataType. GH-39702
- Added missing compute function options. GH-40402
  - GArrowSplitPatternOptions
  - GArrowStrftimeOptions
  - GArrowStrptimeOptions
  - GArrowStructFieldOptions
- Changed documentation generator to GI-DocGen from GTK-Doc. GH-39935
- Added GArrowTimestampParser. GH-40438
- Added support for customizing timestamp parsers. GH-40590
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.