Apache Arrow 14.0.0 Release
Published
01 Nov 2023
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 14.0.0 release. This covers over 3 months of development work and includes 483 resolved issues from 116 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 13.0.0 release, Metehan Yildirim and Oleks V. have been invited to be committers.
Thanks for your contributions and participation in the project!
Columnar Format Notes
Motivated by recent innovations in DuckDB and Meta’s Velox engine, new “view” data types were added to the Arrow columnar format spec:
- 16-byte StringView and BinaryView data types which enable better buffer reuse, faster “false” string comparisons (due to maintaining a prefix) and short string inlining (GH-35627).
- ListView and LargeListView types for more performant “out-of-order” building and processing of lists and better buffer reuse (GH-37876).
A VariableShapeTensorType
was added to the Arrow specification as a canonical extension type (GH-24868).
C Data Interface notes
Integration testing has been added for the C Data Interface accross Arrow implementations, ensuring mutual compatibility. (GH-37537). The C++, C# and Go implementations are covered, with Arrow Java soon to come.
Arrow Flight RPC notes
A new RPC method was added to allow polling for completion in long-running queries as an alternative to the blocking GetFlightInfo call (GH-36155). Also, app_metadata
was added to FlightInfo
and FlightEndpoint
(GH-37635).
In C++ and Python, an experimental asynchronous GetFlightInfo call was added to the client-side API (GH-36512). ServerCallContext
now exposes conveniences to send headers/trailers without having to use middleware (GH-36952). The implementation was fixed to not reject unknown field tags to enable interoperability with future versions of Flight that could add new fields (GH-36975). The CMake configuration was fixed to correctly require linking to Arrow Flight RPC when using Arrow Flight SQL (GH-37406).
In Go, the underlying generated Protobuf code is now exposed for easier low-level integrations with Flight (GH-36893).
In Java, the stateful “login” authentication APIs using the Handshake RPC are deprecated; it will not be removed, but it should not be used unless you specifically want the old behavior (GH-37722). Utilities were added to help implement basic Flight SQL services for unit testing (GH-37795).
C++ notes
Experimental APIs for exporting and importing non-CPU arrays using the C Device Data Interface have been added (GH-36488), together with an experimental API for device synchronization (GH-36103).
Initial compatibility with Emscripten without threading support has been added (GH-35176).
Compute layer
New compute functions:
- a
cumulative_mean
function on numeric data (GH-36931);
Improved compute functions:
- rounding functions now work natively on integer inputs instead of casting them to floats (GH-35273);
- the
divide
function now supports duration inputs (GH-36789); take
andfilter
now support sparse unions in addition to dense unions (GH-36905);if_else
,coalesce
,choose
andcase_when
now support duration inputs (GH-37028);- casting between fixed-size lists and variable-size lists is now supported (GH-20086);
- casting from strings to dates is now supported (GH-37411);
mean
on integer inputs now uses a floating-point representation for its intermediate sum, avoiding integer overflow on large inputs (GH-34909);
Datasets
Support for writing encrypted Parquet datasets has been added (GH-29238).
Gandiva
Gandiva now supports linking dynamically to LLVM on non-Windows platforms (GH-37410).
Previously, Gandiva would always link LLVM statically into libgandiva
.
Parquet
RLE is used by default when encoding boolean values if v2 data pages are enabled (GH-36882).
Page indexes can now be encrypted as per the specification (GH-34950).
A bug in the DELTA_BINARY_PACKED encoder leading to suboptimal column sizes was fixed (GH-37939).
Substrait
It is now possible to serialize and deserialize individual expressions using Substrait, not only full query plans (GH-33985).
Miscellaneous
A new CodecOptions
class allows customizing compression parameters per-codec (GH-35287).
The environment variable AWS_ENDPOINT_URL
is now respected when resolving S3 URIs (GH-36770).
Recursively listing S3 filesystem trees should now issue less requests, leading to improved performance (GH-34213).
Comparing a ChunkedArray
to itself now behaves correctly with NaN values (GH-37515).
The use of BMI2 instructions on x86 was incorrectly guarded. Those instructions could be executed on platforms without BMI2 support, leading to crashes (GH-37017).
C# notes
The following features have been added to the C# implementation apart from other minor ones and some fixes.
- Support fixed-size lists (GH-33032)
- Support DateOnly and TimeOnly on .NET 6.0+ (GH-34620)
- Implement MapType (GH-35243)
- Flight SQL implementation for C# (GH-36078)
- Implement support for dense and sparse unions (GH-36795)
Go Notes
- The minimum version of Go officially supported is now
go1.19
instead ofgo1.17
(GH-37636)
Bug Fixes
Arrow
- Documentation fixed to correctly state that the default unit for
TimestampType
is seconds (GH-35770) - Fixed leak in the
Concatenate
function if there is a panic that is recovered (GH-36850) - Ensure Binary dictionary indices are released on panic (GH-36858)
- Fix overflow value causing invalid dates for
MarshalJSON
on some timestamps (GH-36935) - Fix leaking dictionary allocations in IPC reader (GH-36981)
Parquet
- Fixed an issue where DeltaLengthByteArray encoding fails on certain null value scenarios (GH-36318)
- Correctly propagate internal
writer.sink.Close()
errors fromwriter.Close()
when writing a Parquet file (GH-36645) - Fixed a panic when writing some specific DeltaBitPacked datasets (GH-37102)
- Proper support for Decimal256 data type in Parquet lib (GH-37419)
- Corrected inconsistent behavior in
pqarrow
column chunk reader (GH-37845) - Rewrote and Fixed ARM64 assembly for bitmap bit extractions and integer packing (GH-37712)
Enhancements
- C Data Interface integration testing has been added and implemented (GH-37789)
- pkg.go.dev link is fixed in the Readme (GH-37779)
Arrow
- Added
String()
method toarrow.Table
(GH-35296) - Add proper
array.Null
type support handling for arrow/csv writing (GH-36623) - Optimized
GetOrInsert
function for memo table handling of dictionary builders (GH-36671) - Made it possible to add custom functions in the
compute
package (GH-36936) - Improved performance of dictionary unifier (GH-37306)
- Added direct access to dictionary builder indices (GH-37416)
- Added ability to read back values from Boolean builders (GH-37465)
- Add
ValueLen
function to string array (GH-37584) - Avoid unnecessary copying in the default go allocator (GH-37687)
- Add
SetNull(i int)
to array builders (GH-37694)
Parquet
- Parquet metadata is allowed to write metadata after writing rowgroups using
pqarrow.FileWriter
(GH-35775) MapOf
andListOf
helper functions have been improved to provide clearer error messages and have better documentation (GH-36696)- Struct tag of
parquet:"-"
will be allowed to skip fields when converting a struct to a parquet schema (GH-36793)
Java notes
Java 21 is enabled and validated in CI (GH-37914).
The Gandiva module implemented a breaking change by moving Types.proto
into a subfolder (GH-37893).
DefaultVectorComparators
added support for LargeVarCharVector
, LargeVarBinaryVector
(GH-25659) and for BitVector
, DateDayVector
, DateMilliVector
Decimal256Vector
, DecimalVector
, DurationVector
, IntervalDayVector
, TimeMicroVector
, TimeMilliVector
, TimeNanoVector
, TimeSecVector
, TimeStampVector
(GH-37701).
A bug was fixed in VectorAppender
to prevent resizing the data buffer twice when appending variable-length vectors (GH-37829).
VarCharWriter
added support for writing from Text
and String
(GH-37706). VarBinaryWriter
added support for writing from byte[]
and ByteBuffer
(GH-37705).
The JDBC driver will now ignore username and password authentication if a token is provided (GH-37073).
A bug was fixed in the Java C-Data interface when importing a vector with an empty array (GH-37056).
A bug was fixed in the S3 file system implementation when closing the connection (GH-36069).
Arrow datasets now support Substrait ExtendedExpression
s as inputs to filter and project operations (GH-34252).
JavaScript notes
- GH-21815: [JS] Add support for Duration type #37341
- GH-31621: [JS] Fix Union null bitmaps #37122
Python notes
Compatibility notes:
- Support for Python 3.12 was added (GH-37880)
- Support for Cython 3 was added (GH-37742)
- PyArrow is now compatible with numpy 2.0 (GH-37574)
pyarrow.compute.CumulativeSumOptions
has been deprecated, usepyarrow.compute.CumulativeOptions
instead (GH-36240)
New features:
- Allowing type promotion is now possible in
pyarrow.concat_tables
(GH-36845) - Support for vector function UDF was added (GH-36672)
Other improvements:
- The default of
pre_buffer
is now set toTrue
for reading Parquet when usingpyarrow.dataset
directly. This can give significant speed-up on filesystems like S3 and is now aligned topyarrow.parquet.read_table
interface (GH-36765) - Path to timezone database can now be set through python API (GH-35600, [GH-38145] (https://github.com/apache/arrow/issues/38145))
pyarrow.MapScalar.as_py
can now be called with custom field name (GH-36809)FixedShapeTensorType
string representation now prints the type parameters (GH-35623)
Relevant bug fixes:
- String to date cast kernel was added to fix python scalar cast regression (GH-37411)
- Fix conversion from Python to Arrow when chunking large nested structs (GH-32439)
- Fix segfault when passing table as argument to
pyarrow.Table.filter
(GH-37650) use_threads
keyword was added to thegroup_by
method onpyarrow.Table
which gets passed through to thepyarrow.acero.Declaration.to_table
call. Specifinguse_threads=False
allows to get stable ordering of the output (GH-36709)- Fix printable representation for
pyarrow.TimestampScalar
when values are outside datetime range (GH-36323) - Empty dataframes with zero chunks can now be consumed by the Dataframe Interchange Protocol implementation (GH-37050)
- Fix dtype information for categorical columns in the Dataframe Interchange Protocol implementation (GH-38034)
- Boolean columns with bitsize 1 are now supported in
from_dataframe
of the Dataframe Interchange Protocol (GH-37145)
Further, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details.
The Arrow documentation is now built with an updated Pydata Sphinx Theme which includes light/dark theme, new colors from Accessible pygments themes, version switcher dropdown, search button, etc. (GH-36590, GH-32451)
R notes
This release of the R package features a substantial refactor of the package configuration, build, and installation. This change should be transparent to most users; however, package contributors can take advantage of a substantially simplified development setup: in most cases, package contributors should be able to use a pre-built nightly version of Arrow C++ in place of a local Arrow development setup. Special thanks to Jacob Wujciak-Jens for taking on this incredible refactor!
In addition to a number of bugfixes and improvements, this release includes several new features related to CSV input/output:
- Added support for
,
or other characters as a decimal point. - Added
write_csv_dataset()
to better document CSV-specific dataset writing options. - Ensured that the
schema
argument can be specified when reading a CSV dataset with partitions.
For more on what’s in the 14.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Add support for prepared INSERT queries (GH-37143)
- When a prepared statement is automatically closed upon exiting a block, use the same options as when the statement was prepared (GH-37257)
C GLib
- Support more properties of
ArrowFlight::ClientOptions
(GH-37141) - Add support for prepared INSERT queries (GH-37143)
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.