Apache Arrow 6.0.0 Release


Published 04 Nov 2021
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 6.0.0 release. This covers over 3 months of development work and includes 572 resolved issues from 77 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 5.0.0 release, Nic Crane, QP Hou, Jiayu Liu, and Matt Topol have been invited to be committers, and Neville Dipale has joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!

Columnar Format Notes

A new calendar interval type consisting of Month, Day and Nanoseconds has been added to the specification. Reference implementations existing in Java, C++ and Python.

Arrow Flight RPC notes

GLib and Ruby have added bindings for Arrow Flight.

While not part of the release, work is ongoing on Arrow Flight SQL, which defines a protocol for clients to communicate with SQL databases using Arrow Flight. For those interested in the project, please reach out on the mailing list.

C++ notes

The month-day-nano interval type has been added (ARROW-13628).

Various APIs, including extension types and scalars, are no longer experimental (ARROW-5244).

Support for Visual Studio 2015 was dropped (ARROW-14070).

Compute Layer

A basic in-memory query engine has been implemented and is accessible from the R bindings. Operations including filter, project, sort, equality joins, and various aggregations are supported.

The following compute functions have been added:

  • aggregate functions: approximate_median, count_distinct, max, min, product
  • hash aggregate functions: hash_all, hash_any, hash_approximate_median, hash_count_distinct, hash_distinct, hash_max, hash_mean, hash_min, hash_product, hash_stdev, hash_variance
  • scalar arithmetic functions: logb, round, round_to_multiple
  • scalar string functions: ascii_capitalize, ascii_swapcase, ascii_title, utf8_capitalize, utf8_swapcase, utf8_title
  • scalar temporal functions: assume_timezone, day_time_interval_between, days_between, hours_between, microseconds_between, milliseconds_between, minutes_between, month_day_nano_interval_between, month_interval_between, nanoseconds_between, quarters_between, seconds_between, strftime, us_week, week, weeks_between, years_between
  • other scalar functions: choose, list_element
  • vector functions: drop_null, select_k_unstable

In general, type support has been improved for most of the compute functions, but work here is ongoing, particularly around decimal support.

Crashes have been fixed in particular cases for take, filter, unique, and value_counts (ARROW-13474, ARROW-13509, ARROW-14129).

Hash aggregations (i.e. group by) supports scalar and array values (ARROW-13737, ARROW-14027).

Temporal functions are now timezone-aware (e.g. when extracting the hour of a timestamp) (ARROW-12980).

count can optionally count all values, not just null or non-null values (ARROW-13574).

fill_null has been replaced by the more general coalesce (ARROW-7179).

is_null can optionally consider NaN as null (ARROW-12959).

Sorting has been optimized (ARROW-10898, ARROW-14165). Also, null values can now be sorted at either the beginning or the end (ARROW-12063).

CSV

The CSV reader can read time32 and time64 types, and will infer time32 values for columns in the format “hh:mm” and “hh:mm:ss” (ARROW-11243).

The decimal point can be customized when reading (ARROW-13421).

The streaming reader will not unintentionally infer null-typed columns when using the various skip options (ARROW-13441).

If a row has an incorrect number of columns, now the row can be skipped instead of raising an error (ARROW-12673).

The option quoted_strings_can_be_null applies to all column types now, not just strings (ARROW-13580). When quoting is disabled entirely, the reader now takes advantage of this to improve performance (ARROW-14150).

A CSVWriter object is now exposed, allowing for incremental writing (ARROW-11828). Dates can now be written (ARROW-12540).

Dataset Layer

The dataset writer was refactored, and now supports more options, including a limit on the number of files open at once, compatibility with the async scanner, a limit on the number of rows written per file, and control over what to do when files already exist in the target directory (ARROW-13650). Additionally, the query engine can feed into the dataset writer as a sink (ARROW-13542).

The asynchronous scanner now properly respects backpressure (ARROW-13611, ARROW-14192), as does the writer (ARROW-14191).

ORC datasets are supported (ARROW-13572) with support for column projection pushdown (ARROW-13797).

The Parquet/IPC format readers now respect the batch_size scanner option (ARROW-14024). Also, the Parquet reader now properly implements readahead for better performance (ARROW-14026).

IO and Filesystem Layer

The retry strategy of S3FileSystem can be customized (ARROW-13508). When writing to an existing bucket as a user with limited permissions, Arrow will no longer emit a spurious “Access Denied” error (ARROW-13685).

On MacOS with NFS mounts, a “[errno 25] Inappropriate ioctl for device” error was fixed (ARROW-13983).

The basics of a Google Cloud Storage filesystem have been added; work is in progress for full support (ARROW-8147, ARROW-14222, ARROW-14223, ARROW-14232, ARROW-14236, ARROW-14345, ARROW-14157).

JSON

A crash was fixed when duplicate keys were present (ARROW-14109).

Parquet

Written min/max and null_count statistics for dictionary types were corrected (ARROW-11634, ARROW-12513). null_count statistics for columns that contain repeated data where corrected.

file_offset for row groups was not being populated according to the specification, this issue has been corrected.

Column selection now works for repeated columns and structs of more then one level.

An error with large files when built with Thrift 0.14 was fixed (ARROW-13655).

The ParquetVersion enum was updated with more values to support finer-grained Parquet format version selection (ARROW-13794).

Writer performance was improved by avoiding repeated dynamic casts (ARROW-13965).

C# notes

This release includes improved support for dictionary arrays, as well as integration testing with the other Arrow implementations for the primitive and decimal types

Go notes

Bug Fixes

  • Fixed handling of the zero value for Decimal128 FromBigInt #10796
  • Fixed various “too many releases” errors in the tests allowing all tests to be run using the assert build tag in CI from now on, including a bug when writing slices of String, Binary or FixedWidthType arrays via ipc.Writer #11270, #11276
  • Fixed builds on ARM and s390x architectures #11299, [#11305]((https://github.com/apache/arrow/pull/11305)

Enhancements

  • Added Concatenate function for concatenating arrays
  • Implemented Scalar values and MakeArrayFromScalar function [#11252]((https://github.com/apache/arrow/pull/11252)
  • Added cgo optional allocator for allocating memory using the C++ memory pool for use with the C Data import and export APIs
  • Added support for Month, Day, Nano interval type #11310
  • Completed Encoding package for Parquet, added Metadata package.

Version Compatibility Update

  • Release process updated to add proper git tags for Go release for Module aware version tracking (#11312) meaning that this release will be correctly tagged as v6.0.0 in go.mod and pkg.go.dev and future releases will be correctly versioned with the Go module system.

Java notes

  • Some dependent libraries were upgraded. In particular, grpc upgraded to 1.41.0, netty upgraded to 2.0.43, and orc upgraded to 1.7.0. (ARROW-14198) (ARROW-14049)
  • Fixed the problem of appending BitVectors in batch (ARROW-13981)
  • Code coverage support enabled for Java (ARROW-13859)
  • Fixed the incorrect string representations for unsigned integer vectors (ARROW-13792)
  • Reduced the memory consumption of JDBC adapters by reusing record batches (ARROW-13733)
  • Allowed NullVectors to have distinct field names (ARROW-13645)
  • Some APIs that have been deprecated for long have been removed (ARROW-13544)
  • Allowed passing empty columns for projection in Dataset (ARROW-13257)
  • A Java implementation of Arrow C data interface was provided (ARROW-12965)

JavaScript notes

This release fixes builds with the latest TypeScript versions and ESM tree shaking.

Deprecation notice: in Arrow 7, we will remove the compute code from Arrow JS.

Python notes

  • Many new pyarrow.compute functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
  • All Python functions and classes should now have documented parameters in the API reference.
  • SIMD optimization is now enabled in M1 wheels
  • Wheels are now built for more Python versions on M1 systems.
  • PyArrow is now compatible with Python 3.10
  • Creating Arrow arrays now supports more than just numpy arrays as masks
  • Printing Tables now previews the values in the columns
  • copy_files is now available in Python
  • Datasets now support ORC files
  • Sets are now supported when building arrays or converting from pandas.
  • 39 bugs have been fixed.

R notes

This release adds grouped aggregation and joins in the dplyr interface, on top of the new Arrow C++ query engine. There is also support for using duckdb to query Arrow datasets. For more details, see the complete R changelog.

Ruby and C GLib notes

Ruby

The updates of Red Arrow etc. consists of the following improvements:

  • Red Arrow
    • Added Arrow::RecordBatchReader
    • Support to convert an array of symbols to a dictionary array
    • Add support for building and converting map
    • Add support for group aggregation
    • Use compute kernels for the implementation of slicers
    • Support a Range and an Array of selectors in Arrow::Table#[] and Arrow::RecordBatch#[]
    • Separate min and max aggregators
    • Support a hash slicer; a scalar value is for equality matching, and a range is for between matching
    • Add Arrow::TableConcatenateOptions and conversion from a Hash for convenience
    • Add Arrow::Expression and conversion from Array and Hash for convenience
  • Red Arrow Dataset
    • Add support for loading from directories
    • Add support for writing
    • Add filter expression support in a loader
  • Red Arrow Flight
    • Add ArrowFlight::Client#do_get support

C GLib

The updates of Arrow GLib etc. consists of the following improvements:

  • Arrow GLib
    • Add the following new functions:
      • garrow_record_batch_reader_new
      • garrow_record_batch_reader_read_all
      • garrow_union_scalar_get_type_code
    • Add type_code supports in union scalar types
    • Add C ABI support
    • Add GArrowCountOptions and let count functions support it
    • Add support for group aggregation
    • Add GArrowSetLookupOptions for options of is_in and index_in kernels
    • Add GArrowVarianceOptions to specify the calculation options for variance and standard deviation kernels
    • Add GArrowFunctionDoc
    • Add GArrowTableConcatenateOptions and let garrow_table_concatenate support it
    • Add expressions support
  • Arrow Dataset GLib
    • Add support for writing data
    • Support recursive scanning in a directory
    • Add gadataset_scanner_builder_set_filter
    • Make use_async of GADatasetScannerBuilder a property, and remove gadataset_scanner_builder_use_async
  • Arrow Flight GLib
    • Add DoGet support:
      • New functions: gaflight_client_do_get and gaflight_server_do_get
      • New classes: GAFlightStreamReader, GAFlightStreamChunk, GAFlightRecordBatchReader, GAFlightDataStream, GAFlightRecordBatchStream, and GAFlightServerCallContext
    • Change the argument order of gaflight_client_list_flights function
  • Parquet GLib
    • Add gparquet_arrow_file_reader_get_n_rows function

Rust notes

Rust continues to release minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 5.x.

The DataFusion and Ballista sub projects have begun releasing at their own cadence which is expected to continue in the next few weeks.

Major changes in the 6.0.0 release include support for the MapArray array type, improved lower level ArrayData APIs to better communicate safety, and a faster (but unstable) sorting kernel.

For additional details on the 6.0.0 Rust implementation, please see the Arrow Rust CHANGELOG