Apache Arrow 14.0.0 Release
01 Nov 2023
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 14.0.0 release. This covers over 3 months of development work and includes 483 resolved issues from 116 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Since the 13.0.0 release, Metehan Yildirim and Oleks V. have been invited to be committers.
Thanks for your contributions and participation in the project!
Columnar Format Notes
Motivated by recent innovations in DuckDB and Meta’s Velox engine, new “view” data types were added to the Arrow columnar format spec:
- 16-byte StringView and BinaryView data types which enable better buffer reuse, faster “false” string comparisons (due to maintaining a prefix) and short string inlining (GH-35627).
- ListView and LargeListView types for more performant “out-of-order” building and processing of lists and better buffer reuse (GH-37876).
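To make the two view layouts concrete, here is a minimal Python sketch using only the standard library. It is not the Arrow implementation, and the helper names (encode_view, may_be_equal, list_view_to_pylist) are hypothetical. It assumes the 16-byte view holds a 4-byte length followed by either up to 12 inlined bytes or a 4-byte prefix plus a buffer index and an offset, and that a list-view stores one offset and one size per list.

```python
import struct

INLINE_LIMIT = 12  # values of up to 12 bytes are stored inside the view itself


def encode_view(data: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one string into a 16-byte little-endian view (hypothetical helper)."""
    if len(data) <= INLINE_LIMIT:
        # Short value: 4-byte length, then the bytes zero-padded to 12.
        return struct.pack("<i12s", len(data), data)
    # Long value: length, 4-byte prefix, data buffer index, offset in that buffer.
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)


def may_be_equal(view_a: bytes, view_b: bytes) -> bool:
    """Cheap pre-check: a different length or prefix means the strings differ."""
    return view_a[:8] == view_b[:8]


def list_view_to_pylist(offsets, sizes, values):
    """Materialize a list-view: item i is values[offsets[i]:offsets[i] + sizes[i]]."""
    return [values[start:start + size] for start, size in zip(offsets, sizes)]


short = encode_view(b"arrow")  # inlined, no data buffer needed
long_a = encode_view(b"columnar format", buffer_index=0, offset=0)
long_b = encode_view(b"row-based format", buffer_index=0, offset=15)
print(may_be_equal(long_a, long_b))  # False, decided without touching the data buffer

# Offsets do not have to be increasing, so lists can be written in any order.
print(list_view_to_pylist([3, 0, 0], [2, 3, 0], [10, 20, 30, 40, 50]))
# [[40, 50], [10, 20, 30], []]
```

When the first 8 bytes match, a full comparison of the underlying bytes is still required; the prefix only speeds up the "not equal" case.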
VariableShapeTensorType was added to the Arrow specification as a canonical extension type (GH-24868).
C Data Interface notes
Integration testing has been added for the C Data Interface across Arrow implementations, ensuring mutual compatibility (GH-37537). The C++, C# and Go implementations are covered, with Arrow Java soon to come.
Arrow Flight RPC notes
A new RPC method was added to allow polling for completion in long-running queries as an alternative to the blocking GetFlightInfo call (GH-36155). Also, app_metadata was added to
In C++ and Python, an experimental asynchronous GetFlightInfo call was added to the client-side API (GH-36512).
ServerCallContext now exposes conveniences to send headers/trailers without having to use middleware (GH-36952). The implementation was fixed to not reject unknown field tags to enable interoperability with future versions of Flight that could add new fields (GH-36975). The CMake configuration was fixed to correctly require linking to Arrow Flight RPC when using Arrow Flight SQL (GH-37406).
In Go, the underlying generated Protobuf code is now exposed for easier low-level integrations with Flight (GH-36893).
In Java, the stateful “login” authentication APIs using the Handshake RPC are deprecated; they will not be removed, but they should not be used unless you specifically want the old behavior (GH-37722). Utilities were added to help implement basic Flight SQL services for unit testing (GH-37795).
C++ notes
Initial compatibility with Emscripten without threading support has been added (GH-35176).
New compute functions:
- cumulative_mean function on numeric data (GH-36931).
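As an illustration of what a cumulative mean computes, here is a minimal pure-Python sketch. It only mirrors the arithmetic (the mean of all values up to and including each position); it is not the Arrow kernel, it ignores null handling, and the function below is a hypothetical stand-in rather than the library API.

```python
from itertools import accumulate


def cumulative_mean(values):
    """Return the mean of values[0..i] for every position i."""
    running_sums = accumulate(values)
    return [total / count for count, total in enumerate(running_sums, start=1)]


print(cumulative_mean([2, 4, 9]))  # [2.0, 3.0, 5.0]
```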
Improved compute functions:
- rounding functions now work natively on integer inputs instead of casting them to floats (GH-35273);
- divide function now supports duration inputs (GH-36789);
- filter now supports sparse unions in addition to dense unions (GH-36905);
- case_when now supports duration inputs (GH-37028);
- casting between fixed-size lists and variable-size lists is now supported (GH-20086);
- casting from strings to dates is now supported (GH-37411);
- mean on integer inputs now uses a floating-point representation for its intermediate sum, avoiding integer overflow on large inputs (GH-34909).
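The last item, on the intermediate sum of mean, can be illustrated with a small Python sketch. Python integers never overflow, so the sketch emulates a signed 64-bit accumulator by hand; wrap_int64 and both mean functions are hypothetical helpers written for this illustration, not Arrow code.

```python
INT64_MAX = 2**63 - 1


def wrap_int64(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


def mean_with_int64_sum(values):
    """Mean computed with a fixed-width integer accumulator (can overflow)."""
    total = 0
    for v in values:
        total = wrap_int64(total + v)
    return total / len(values)


def mean_with_float_sum(values):
    """Mean computed with a floating-point accumulator (no wraparound)."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


values = [INT64_MAX, INT64_MAX]
print(mean_with_int64_sum(values))  # -1.0, the sum wrapped around
print(mean_with_float_sum(values))  # 9.223372036854776e+18
```

The floating-point accumulator trades exactness in the lowest digits for range, which is usually the better compromise for a mean.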
Support for writing encrypted Parquet datasets has been added (GH-29238).
Gandiva now supports linking dynamically to LLVM on non-Windows platforms (GH-37410).
Previously, Gandiva would always link LLVM statically into
RLE is used by default when encoding boolean values if v2 data pages are enabled (GH-36882).
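Run-length encoding tends to pay off for booleans because long runs of identical values are common. The following is a deliberately simplified Python sketch of the idea. Parquet's actual RLE encoding is a hybrid run-length/bit-packing scheme and is not reproduced here, and the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse consecutive equal booleans into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(bits)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list of booleans."""
    return [value for value, count in runs for _ in range(count)]


bits = [True] * 5 + [False] * 3 + [True]
runs = rle_encode(bits)
print(runs)  # [(True, 5), (False, 3), (True, 1)]
```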
Page indexes can now be encrypted as per the specification (GH-34950).
A bug in the DELTA_BINARY_PACKED encoder leading to suboptimal column sizes was fixed (GH-37939).
It is now possible to serialize and deserialize individual expressions using Substrait, not only full query plans (GH-33985).
CodecOptions class allows customizing compression parameters per-codec (GH-35287).
The environment variable AWS_ENDPOINT_URL is now respected when resolving S3 URIs (GH-36770).
Recursively listing S3 filesystem trees should now issue fewer requests, leading to improved performance (GH-34213).
Comparing a ChunkedArray to itself now behaves correctly with NaN values (GH-37515).
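Self-comparison needs care because, under IEEE 754, NaN is not equal to itself. A container holding NaN should therefore not compare equal to itself when NaNs are treated as unequal. The Python sketch below illustrates the general pitfall with a hypothetical identity shortcut; it is an assumption about the class of problem, not a reproduction of the Arrow C++ code.

```python
nan = float("nan")
values = [1.0, nan]


def equals_elementwise(left, right):
    """Compare element by element; NaN is not equal to anything, even itself."""
    return len(left) == len(right) and all(a == b for a, b in zip(left, right))


def equals_with_identity_shortcut(left, right):
    """Pitfall: assume an object is always equal to itself."""
    if left is right:
        return True
    return equals_elementwise(left, right)


print(equals_with_identity_shortcut(values, values))  # True, ignores the NaN
print(equals_elementwise(values, values))  # False, consistent with NaN != NaN
```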
The use of BMI2 instructions on x86 was incorrectly guarded. Those instructions could be executed on platforms without BMI2 support, leading to crashes (GH-37017).
C# notes
The following features have been added to the C# implementation, in addition to minor improvements and fixes:
- Support fixed-size lists (GH-33032)
- Support DateOnly and TimeOnly on .NET 6.0+ (GH-34620)
- Implement MapType (GH-35243)
- Flight SQL implementation for C# (GH-36078)
- Implement support for dense and sparse unions (GH-36795)
Go notes
- The minimum version of Go officially supported is now
- Documentation fixed to correctly state that the default unit for TimestampType is seconds (GH-35770)
- Fixed a leak in the Concatenate function when a panic occurs and is recovered (GH-36850)
- Ensure Binary dictionary indices are released on panic (GH-36858)
- Fix overflow causing invalid dates in MarshalJSON for some timestamps (GH-36935)
- Fix leaking dictionary allocations in IPC reader (GH-36981)
- Fixed an issue where DeltaLengthByteArray encoding fails on certain null value scenarios (GH-36318)
- Correctly propagate the internal writer.Close() when writing a Parquet file (GH-36645)
- Fixed a panic when writing some specific DeltaBitPacked datasets (GH-37102)
- Proper support for Decimal256 data type in Parquet lib (GH-37419)
- Corrected inconsistent behavior in the pqarrow column chunk reader (GH-37845)
- Rewrote and fixed ARM64 assembly for bitmap bit extractions and integer packing (GH-37712)
- C Data Interface integration testing has been added and implemented (GH-37789)
- pkg.go.dev link is fixed in the Readme (GH-37779)
- Add proper array.Null type handling for arrow/csv writing (GH-36623)
- GetOrInsert function for memo table handling of dictionary builders (GH-36671)
- Made it possible to add custom functions in the
- Improved performance of dictionary unifier (GH-37306)
- Added direct access to dictionary builder indices (GH-37416)
- Added ability to read back values from Boolean builders (GH-37465)
- Added ValueLen function to string array (GH-37584)
- Avoid unnecessary copying in the default Go allocator (GH-37687)
- Added SetNull(i int) to array builders (GH-37694)
- Parquet metadata is allowed to write metadata after writing rowgroups using
- ListOf helper functions have been improved to provide clearer error messages and have better documentation (GH-36696)
- A struct tag of parquet:"-" can now be used to skip fields when converting a struct to a Parquet schema (GH-36793)
Java notes
Java 21 is enabled and validated in CI (GH-37914).
The Gandiva module introduced a breaking change by moving Types.proto into a subfolder (GH-37893).
DefaultVectorComparators added support for LargeVarBinaryVector (GH-25659) and for
A bug was fixed in VectorAppender to prevent resizing the data buffer twice when appending variable-length vectors (GH-37829).
The JDBC driver will now ignore username and password authentication if a token is provided (GH-37073).
A bug was fixed in the Java C-Data interface when importing a vector with an empty array (GH-37056).
A bug was fixed in the S3 file system implementation when closing the connection (GH-36069).
Arrow datasets now support Substrait ExtendedExpressions as inputs to filter and project operations (GH-34252).
JavaScript notes
- GH-21815: [JS] Add support for Duration type #37341
- GH-31621: [JS] Fix Union null bitmaps #37122
Python notes
- Support for Python 3.12 was added (GH-37880)
- Support for Cython 3 was added (GH-37742)
- PyArrow is now compatible with numpy 2.0 (GH-37574)
- pyarrow.compute.CumulativeSumOptions has been deprecated, use
- Allowing type promotion is now possible in
- Support for vector function UDF was added (GH-36672)
- The default of pre_buffer is now set to True for reading Parquet when using pyarrow.dataset directly. This can give a significant speed-up on filesystems like S3 and is now aligned to
- The path to the timezone database can now be set through the Python API (GH-35600, GH-38145)
- pyarrow.MapScalar.as_py can now be called with a custom field name (GH-36809)
- FixedShapeTensorType string representation now prints the type parameters (GH-35623)
Relevant bug fixes:
- String to date cast kernel was added to fix a Python scalar cast regression (GH-37411)
- Fix conversion from Python to Arrow when chunking large nested structs (GH-32439)
- Fix segfault when passing table as argument to
- use_threads keyword was added to the
pyarrow.Table which gets passed through to the
use_threads=False gives a stable ordering of the output (GH-36709)
- Fix printable representation for pyarrow.TimestampScalar when values are outside datetime range (GH-36323)
- Empty dataframes with zero chunks can now be consumed by the Dataframe Interchange Protocol implementation (GH-37050)
- Fix dtype information for categorical columns in the Dataframe Interchange Protocol implementation (GH-38034)
- Boolean columns with bitsize 1 are now supported in from_dataframe of the Dataframe Interchange Protocol (GH-37145)
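For readers unfamiliar with 1-bit booleans: a bit-packed boolean column stores eight values per byte instead of one byte per value. The sketch below shows the idea in plain Python, assuming least-significant-bit-first ordering as used by Arrow's bitmaps. pack_bits and unpack_bits are hypothetical helpers, not part of the interchange protocol or PyArrow.

```python
def pack_bits(values):
    """Pack booleans into bytes, 1 bit per value, least-significant bit first."""
    buffer = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            buffer[i // 8] |= 1 << (i % 8)
    return bytes(buffer)


def unpack_bits(buffer, length):
    """Read `length` booleans back from a bit-packed buffer."""
    return [bool((buffer[i // 8] >> (i % 8)) & 1) for i in range(length)]


packed = pack_bits([True, False, True])
print(packed)  # b'\x05'
print(unpack_bits(packed, 3))  # [True, False, True]
```

The length has to be carried separately, because the padding bits in the last byte are indistinguishable from False values.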
Further, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details.
The Arrow documentation is now built with an updated PyData Sphinx Theme, which includes a light/dark theme, new colors from Accessible Pygments themes, a version switcher dropdown, a search button, etc. (GH-36590, GH-32451)
R notes
This release of the R package features a substantial refactor of the package configuration, build, and installation. This change should be transparent to most users; however, package contributors can take advantage of a substantially simplified development setup: in most cases, package contributors should be able to use a pre-built nightly version of Arrow C++ in place of a local Arrow development setup. Special thanks to Jacob Wujciak-Jens for taking on this incredible refactor!
In addition to a number of bugfixes and improvements, this release includes several new features related to CSV input/output:
- Added support for "," or other characters as a decimal point.
- Added write_csv_dataset() to better document CSV-specific dataset writing options.
- Ensured that the schema argument can be specified when reading a CSV dataset with partitions.
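To show what a configurable decimal point means in practice, here is a small Python sketch using only the standard library. It is not the R package API; read_numbers and its parameters are hypothetical and merely demonstrate parsing a semicolon-delimited file whose numbers use a comma as the decimal point.

```python
import csv
import io


def read_numbers(text, delimiter=";", decimal_point=","):
    """Parse delimited text into floats, honoring a custom decimal point."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[float(cell.replace(decimal_point, ".")) for cell in row] for row in rows]


print(read_numbers("1,5;2,25\n3,0;4,75\n"))  # [[1.5, 2.25], [3.0, 4.75]]
```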
For more on what’s in the 14.0.0 R package, see the R changelog.
Ruby and C GLib notes
- Add support for prepared INSERT queries (GH-37143)
- When a prepared statement is automatically closed upon exiting a block, use the same options as when the statement was prepared (GH-37257)
- Support more properties of
- Add support for prepared INSERT queries (GH-37143)
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.