Apache Arrow 6.0.0 Release
04 Nov 2021
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 6.0.0 release. This covers over 3 months of development work and includes 572 resolved issues from 77 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Since the 5.0.0 release, Nic Crane, QP Hou, Jiayu Liu, and Matt Topol have been invited to be committers, and Neville Dipale has joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
Columnar Format Notes
A new calendar interval type consisting of Month, Day and Nanoseconds has been added to the specification. Reference implementations exist in Java, C++ and Python.
Arrow Flight RPC notes
GLib and Ruby have added bindings for Arrow Flight.
While not part of the release, work is ongoing on Arrow Flight SQL, which defines a protocol for clients to communicate with SQL databases using Arrow Flight. For those interested in the project, please reach out on the mailing list.
The month-day-nano interval type has been added (ARROW-13628).
Various APIs, including extension types and scalars, are no longer experimental (ARROW-5244).
Support for Visual Studio 2015 was dropped (ARROW-14070).
A basic in-memory query engine has been implemented and is accessible from the R bindings. Operations including filter, project, sort, equality joins, and various aggregations are supported.
The following compute functions have been added:
- aggregate functions:
- hash aggregate functions:
- scalar arithmetic functions:
- scalar string functions:
- scalar temporal functions:
- other scalar functions:
- vector functions:
In general, type support has been improved for most of the compute functions, but work here is ongoing, particularly around decimal support.
Crashes have been fixed in particular cases for value_counts (ARROW-13474, ARROW-13509, ARROW-14129).
Hash aggregations (i.e. group by) support scalar and array values (ARROW-13737, ARROW-14027).
Temporal functions are now timezone-aware (e.g. when extracting the hour of a timestamp) (ARROW-12980).
count can optionally count all values, not just null or non-null values (ARROW-13574).
fill_null has been replaced by the more general
is_null can optionally consider NaN as null (ARROW-12959).
Sorting has been optimized (ARROW-10898, ARROW-14165). Also, null values can now be sorted at either the beginning or the end (ARROW-12063).
The CSV reader can read time32 and time64 types, and will infer time32 values for columns in the format “hh:mm” and “hh:mm:ss” (ARROW-11243).
The decimal point can be customized when reading (ARROW-13421).
The streaming reader will not unintentionally infer null-typed columns when using the various skip options (ARROW-13441).
If a row has an incorrect number of columns, the row can now be skipped instead of raising an error (ARROW-12673).
quoted_strings_can_be_null applies to all column types now, not just strings (ARROW-13580). When quoting is disabled entirely, the reader now takes advantage of this to improve performance (ARROW-14150).
A CSVWriter object is now exposed, allowing for incremental writing (ARROW-11828). Dates can now be written (ARROW-12540).
The dataset writer was refactored, and now supports more options, including a limit on the number of files open at once, compatibility with the async scanner, a limit on the number of rows written per file, and control over what to do when files already exist in the target directory (ARROW-13650). Additionally, the query engine can feed into the dataset writer as a sink (ARROW-13542).
The asynchronous scanner now properly respects backpressure (ARROW-13611, ARROW-14192), as does the writer (ARROW-14191).
ORC datasets are supported (ARROW-13572) with support for column projection pushdown (ARROW-13797).
The Parquet/IPC format readers now respect the batch_size scanner option (ARROW-14024). Also, the Parquet reader now properly implements readahead for better performance (ARROW-14026).
IO and Filesystem Layer
The retry strategy of S3FileSystem can be customized (ARROW-13508). When writing to an existing bucket as a user with limited permissions, Arrow will no longer emit a spurious “Access Denied” error (ARROW-13685).
On macOS with NFS mounts, a “[errno 25] Inappropriate ioctl for device” error was fixed (ARROW-13983).
The basics of a Google Cloud Storage filesystem have been added; work is in progress for full support (ARROW-8147, ARROW-14222, ARROW-14223, ARROW-14232, ARROW-14236, ARROW-14345, ARROW-14157).
A crash was fixed when duplicate keys were present (ARROW-14109).
Written min/max and null_count statistics for dictionary types were corrected (ARROW-11634, ARROW-12513). null_count statistics for columns that contain repeated data were corrected.
file_offset for row groups was not being populated according to the specification; this issue has been corrected.
Column selection now works for repeated columns and structs of more than one level.
An error with large files when built with Thrift 0.14 was fixed (ARROW-13655).
The ParquetVersion enum was updated with more values to support finer-grained Parquet format version selection (ARROW-13794).
Writer performance was improved by avoiding repeated dynamic casts (ARROW-13965).
This release includes improved support for dictionary arrays, as well as integration testing with the other Arrow implementations for the primitive and decimal types.
- Fixed handling of the zero value for Decimal128
- Fixed various “too many releases” errors in the tests, allowing all tests to be run using the assert build tag in CI from now on, including a bug when writing slices of String, Binary or FixedWidthType arrays via ipc.Writer #11270, #11276
- Fixed builds on ARM and s390x architectures #11299, #11305
- Added Concatenate function for concatenating arrays
- Implemented Scalar values and
- Added cgo optional allocator for allocating memory using the C++ memory pool for use with the C Data import and export APIs
- Added support for Month, Day, Nano interval type #11310
- Completed Encoding package for Parquet, added Metadata package.
Version Compatibility Update
- The release process was updated to add proper git tags for Go releases, enabling module-aware version tracking (#11312). This release will therefore be correctly tagged as v6.0.0 in go.mod and pkg.go.dev, and future releases will be correctly versioned with the Go module system.
- Some dependent libraries were upgraded. In particular, grpc upgraded to 1.41.0, netty upgraded to 2.0.43, and orc upgraded to 1.7.0. (ARROW-14198) (ARROW-14049)
- Fixed the problem of appending BitVectors in batch (ARROW-13981)
- Code coverage support enabled for Java (ARROW-13859)
- Fixed the incorrect string representations for unsigned integer vectors (ARROW-13792)
- Reduced the memory consumption of JDBC adapters by reusing record batches (ARROW-13733)
- Allowed NullVectors to have distinct field names (ARROW-13645)
- Some long-deprecated APIs have been removed (ARROW-13544)
- Allowed passing empty columns for projection in Dataset (ARROW-13257)
- A Java implementation of Arrow C data interface was provided (ARROW-12965)
This release fixes builds with the latest TypeScript versions and ESM tree shaking.
Deprecation notice: in Arrow 7, we will remove the compute code from Arrow JS.
- Many new pyarrow.compute functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
- All Python functions and classes should now have documented parameters in the API reference.
- SIMD optimization is now enabled in M1 wheels
- Wheels are now built for more Python versions on M1 systems.
- PyArrow is now compatible with Python 3.10
- Creating Arrow arrays now supports more than just numpy arrays as masks
- Printing Tables now previews the values in the columns
- copy_files is now available in Python
- Datasets now support ORC files
- Sets are now supported when building arrays or converting from pandas.
- 39 bugs have been fixed.
This release adds grouped aggregation and joins in the dplyr interface, on top of the new Arrow C++ query engine. There is also support for using duckdb to query Arrow datasets. For more details, see the complete R changelog.
Ruby and C GLib notes
The updates of Red Arrow etc. consist of the following improvements:
- Red Arrow
- Support to convert an array of symbols to a dictionary array
- Add support for building and converting map
- Add support for group aggregation
- Use compute kernels for the implementation of slicers
- Support a Range and an Array of selectors in
- Separate min and max aggregators
- Support a hash slicer; a scalar value is for equality matching, and a range is for between matching
Arrow::TableConcatenateOptions and conversion from a
Arrow::Expression and conversion from
- Red Arrow Dataset
- Add support for loading from directories
- Add support for writing
- Add filter expression support in a loader
- Red Arrow Flight
The updates of Arrow GLib etc. consist of the following improvements:
- Arrow GLib
- Add the following new functions:
type_code supports in union scalar types
- Add C ABI support
GArrowCountOptions and let count functions support it
- Add support for group aggregation
GArrowSetLookupOptions for options of
GArrowVarianceOptions to specify the calculation options for variance and standard deviation kernels
- Add expressions support
- Add the following new functions:
- Arrow Dataset GLib
- Add support for writing data
- Support recursive scanning in a directory
GADatasetScannerBuilder a property, and remove
- Arrow Flight GLib
- Add DoGet support:
- New functions:
- New classes:
- New functions:
- Change the argument order of
- Add DoGet support:
- Parquet GLib
Rust continues to release minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 5.x releases.
The DataFusion and Ballista sub projects have begun releasing at their own cadence which is expected to continue in the next few weeks.
Major changes in the 6.0.0 release include support for the
array type, improved lower-level ArrayData APIs to better communicate safety, and a faster (but unstable) sorting kernel.
For additional details on the 6.0.0 Rust implementation, please see the Arrow Rust CHANGELOG