Apache Arrow 7.0.0 Release


Published 08 Feb 2022
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 7.0.0 release. This covers over 3 months of development work and includes 617 resolved issues from 105 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 6.0.1 release, Rémi Dattai and Alessandro Molina have been invited to be committers. Daniël Heres and Yibo Cai have joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!

Arrow Flight RPC notes

The Flight specification has been clarified to note that schemas are expected to be IPC-encapsulated on the wire.

Documentation has been generally improved; see the Arrow Cookbook for recipes on how to use Flight in Python and R, and a new example on how to use Flight and gRPC services on the same port.

This release includes Arrow Flight SQL, a protocol for using Arrow Flight to execute queries against and fetch metadata from SQL databases. Support is included for C++ and Java (but not languages that bind to C++, like Python or R). A more detailed blog post is forthcoming (EDIT 2022/02/16: see the Flight SQL announcement). Note that development is ongoing and the specification is currently experimental.

C++ notes

A set of CMake presets has been added to ease building Arrow in a number of cases (ARROW-14678, ARROW-14714).

The arrow::BitUtil namespace has been renamed to arrow::bit_util (ARROW-13494).

Concatenation of union arrays is now supported (ARROW-4975).

StructType gained three convenience methods to add, change and remove a given field (ARROW-11424).

The Datum kind COLLECTION has been removed as it was entirely unused in the codebase (ARROW-13598).

Compute Layer

A number of compute functions have been added:

  • functions operating on strings: “binary_reverse” (ARROW-14306), “string_repeat” (ARROW-12712), “utf8_normalize” (ARROW-14205);
  • “fill_null_forward”, “fill_null_backward” (ARROW-1699);
  • “ceil_temporal”, “floor_temporal”, “round_temporal” to adjust temporal input to an integral multiple of a given unit (ARROW-14822);
  • “year_month_day” to extract the calendar components of the input (ARROW-15032);
  • “random” to generate random floating-point values between 0 and 1 (ARROW-12404);
  • “indices_nonzero” to return the indices in the input where there are non-zero, non-null values (ARROW-13035).
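
To make the semantics of a few of these functions concrete, the sketch below models them in plain Python on lists, with None standing in for null. It is an illustration only, not Arrow's implementation or API: the helper names merely mirror the compute function names, and the floor_temporal helper assumes whole-minute multiples counted from the Unix epoch in UTC.

```python
import unicodedata
from datetime import datetime, timedelta, timezone


def binary_reverse(value: bytes) -> bytes:
    """Reverse the order of the bytes in a binary value."""
    return value[::-1]


def utf8_normalize(value: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form to a string."""
    return unicodedata.normalize(form, value)


def fill_null_forward(values):
    """Replace each null with the closest preceding non-null value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # leading nulls stay null
    return filled


def fill_null_backward(values):
    """Replace each null with the closest following non-null value."""
    return fill_null_forward(values[::-1])[::-1]


def indices_nonzero(values):
    """Return the positions holding non-zero, non-null values."""
    return [i for i, v in enumerate(values) if v is not None and v != 0]


def floor_temporal(ts: datetime, multiple_minutes: int) -> datetime:
    """Round a timezone-aware timestamp down to a multiple of N minutes.

    Assumption for this sketch: multiples are counted from the Unix epoch.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    step = timedelta(minutes=multiple_minutes)
    return epoch + ((ts - epoch) // step) * step


data = [None, 1, None, None, 4, None]
forward = fill_null_forward(data)    # [None, 1, 1, 1, 4, 4]
backward = fill_null_backward(data)  # [1, 1, 4, 4, 4, None]
nonzero = indices_nonzero([0, 3, None, 0, 7])  # [1, 4]
floored = floor_temporal(datetime(2022, 2, 8, 10, 37, tzinfo=timezone.utc), 15)
```

Here floored is 10:30 UTC on the same day, because 10:37 lies between the 15-minute boundaries 10:30 and 10:45.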

Decimal data is now supported as input of the arithmetic kernels (ARROW-13130).

Dictionary data is now supported as input of the hash join execution node (ARROW-14181).

Residual predicates have been implemented in the hash join node (ARROW-13643).
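
As a rough illustration of what a residual predicate is, the plain-Python sketch below performs an inner hash join on an equality key and then applies an extra filter to each candidate pair. The function hash_join, its residual parameter, and the sample rows are hypothetical names chosen for this example; the code does not reflect how the Arrow C++ execution node is implemented.

```python
from collections import defaultdict


def hash_join(left, right, key, residual=None):
    """Inner-join two lists of dict rows on `key`.

    `residual` is an optional callable (left_row, right_row) -> bool that is
    evaluated only on pairs whose keys already match.
    """
    # Build phase: index the right-hand rows by join key.
    table = defaultdict(list)
    for row in right:
        table[row[key]].append(row)

    # Probe phase: look up each left-hand row, then apply the residual filter.
    joined = []
    for left_row in left:
        for right_row in table.get(left_row[key], []):
            if residual is None or residual(left_row, right_row):
                joined.append((left_row, right_row))
    return joined


orders = [
    {"cust": 1, "amount": 50},
    {"cust": 1, "amount": 500},
    {"cust": 2, "amount": 70},
]
limits = [{"cust": 1, "limit": 100}, {"cust": 2, "limit": 100}]

# Key equality alone matches all three orders; the residual keeps one.
over_limit = hash_join(
    orders, limits, "cust",
    residual=lambda order, limit: order["amount"] > limit["limit"],
)
```

The residual condition cannot be expressed as a key equality, which is why it is evaluated after the hash lookup rather than as part of it.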

The “list_parent_indices” function now always returns int64 data regardless of the input type (ARROW-14592).
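
For readers unfamiliar with this function, the following plain-Python sketch models what it computes: one output entry per child value, giving the index of the list that contains it. This is a model on nested Python lists, not the Arrow kernel, and it assumes that null lists (None here) contribute no child values. Python integers have no fixed width, so the int64 guarantee is not visible in this model.

```python
def list_parent_indices(lists):
    """For every child value, emit the index of the list it belongs to."""
    return [
        index
        for index, items in enumerate(lists)
        if items is not None  # assumption: null lists have no children
        for _ in items
    ]


parents = list_parent_indices([["a", "b"], [], None, ["c"]])  # [0, 0, 3]
```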

Month-day-nano interval data is now supported as input of the same functions as other interval types (ARROW-13989).

CSV

The CSV writer got additional configuration options:

  • the string representation of null values (ARROW-14905);
  • the quoting strategy: always / never / as needed (ARROW-14905);
  • the end of line character(s) (ARROW-14907).
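
The effect of these three kinds of options can be previewed with Python's standard csv module, which has comparable knobs. The sketch below is an analogy only: it does not use the Arrow CSV writer, and the helper write_rows and its parameter names are made up for this example.

```python
import csv
import io


def write_rows(rows, null_string="", quoting=csv.QUOTE_MINIMAL, eol="\n"):
    """Serialize rows to CSV text, writing None as `null_string`."""
    buffer = io.StringIO(newline="")  # keep line endings exactly as written
    writer = csv.writer(buffer, quoting=quoting, lineterminator=eol)
    for row in rows:
        writer.writerow([null_string if v is None else v for v in row])
    return buffer.getvalue()


rows = [["id", "note"], [1, "a,b"], [2, None]]

# Quote only when needed, spell nulls as NULL, end lines with "\n".
as_needed = write_rows(rows, null_string="NULL")

# Quote every field and end lines with "\r\n".
always = write_rows(rows, null_string="NULL", quoting=csv.QUOTE_ALL, eol="\r\n")
```

In the first output only the field containing a comma is quoted; in the second every field is quoted. The csv module's QUOTE_NONE constant plays the role of the "never" strategy.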

Dataset Layer

Skyhook, a dataset addition that offloads fragment scan operations to a Ceph distributed storage cluster, was contributed (ARROW-13607).

The dataset writer now exposes options min_rows_per_group and max_rows_per_group to control the size of row groups created (ARROW-14426).

IO and Filesystem Layer

A critical bug in the AWS SDK for C++ that risks losing data in S3 multipart uploads has been circumvented (ARROW-14523).

The Google Cloud Storage filesystem is now featureful enough to pass all generic filesystem tests (ARROW-14924).

The OpenAppendStream method of filesystems has been un-deprecated; however, it still cannot be implemented for all filesystem backends (ARROW-14969).

A new function arrow::fs::ResolveS3BucketRegion allows resolving the region where a particular S3 bucket resides (ARROW-15165).

The S3 filesystem now sets the Content-Type of output files to “application/octet-stream” (instead of “application/xml” previously) if not explicitly specified by the caller (ARROW-15306).

IPC

Fine-grained I/O (coalescing) is now enabled in the synchronous (ARROW-12683) and asynchronous (ARROW-14577) IPC reader.
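
The general idea behind read coalescing is to merge nearby byte ranges into fewer, larger reads. The plain-Python sketch below shows that idea under stated assumptions: the function coalesce_ranges and its hole_size_limit parameter are illustrative names for this example, and the merging rule is a simplification rather than the reader's actual policy.

```python
def coalesce_ranges(ranges, hole_size_limit):
    """Merge (offset, length) byte ranges separated by small gaps.

    Two ranges are merged when the gap between them is at most
    `hole_size_limit` bytes, trading a few wasted bytes for fewer reads.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= hole_size_limit:
                # Extend the previous range to cover this one.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Four requested buffers collapse into two reads.
reads = coalesce_ranges([(0, 100), (110, 50), (1000, 200), (160, 40)], 16)
```

With a limit of 16 bytes, the first three ranges (in offset order) are close enough to be fetched in a single read of 200 bytes, while the range at offset 1000 stays separate.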

It is now possible to set the compression level when using LZ4 compression (ARROW-9648).

ORC

The ORC adapters have been significantly improved. Many more ORC reader properties and ORC writer options are now available. In addition, API docs for both the ORC reader and the ORC writer have been generated (ARROW-11297).

Parquet

DELTA_BYTE_ARRAY-encoded data can now be read from (but not written to) bytearray columns in Parquet files (PARQUET-492).
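
DELTA_BYTE_ARRAY stores each value as the length of the prefix it shares with the previous value, plus the remaining suffix. The plain-Python sketch below shows only that reconstruction step. It assumes the prefix lengths and suffixes have already been decoded from their own encodings, and the function name decode_delta_byte_array is chosen for this example rather than taken from the Parquet C++ API.

```python
def decode_delta_byte_array(prefix_lengths, suffixes):
    """Rebuild byte strings from shared-prefix lengths and suffixes."""
    values, previous = [], b""
    for prefix_length, suffix in zip(prefix_lengths, suffixes):
        # Reuse the leading bytes of the previous value, then append the suffix.
        current = previous[:prefix_length] + suffix
        values.append(current)
        previous = current
    return values


decoded = decode_delta_byte_array(
    [0, 2, 0, 3],
    [b"axis", b"le", b"babble", b"yhood"],
)
# decoded == [b"axis", b"axle", b"babble", b"babyhood"]
```

The encoding pays off on sorted or otherwise similar strings, where long shared prefixes are stored as a single small integer.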

Go notes

Arrow

Bug Fixes

  • License lifted up a level so that it is properly detected for the github.com/apache/arrow/go/v7 module for pkg.go.dev ARROW-14728. Documentation on pkg.go.dev will look correct with complete major version handling as of the v7.0.0 release.
  • Errors from MessageReader.Message get properly surfaced by Reader.Read ARROW-14769
  • ipc.Reader properly uses the allocator it is initialized with instead of making native byte slices ARROW-14717
  • Fixed a CI issue where the CGO tests were crashing on Windows ARROW-14589
  • Various fixes for internal usages of Release and Retain to maintain proper management of reference counting.

Enhancements

  • Continuous Integration for the Go library now uses Go 1.16 as the version being tested ARROW-14985
  • ValueOffsets function added to array.String to return the entire slice of offsets ARROW-14645
  • array.Interface has been lifted to arrow.Array, array.{Record,Column,Chunked,Table} have been lifted to arrow.{Record,Column,Chunked,Table}. Interface arrow.ArrayData has been created to be used instead of array.Data. Aliases have been provided for the array package so existing code that doesn’t directly use array.Data shouldn’t be affected. The aliases will be removed in v8. ARROW-5599. The Chunked.NewSlice method has been removed and is replaced by the array.NewChunkedSlice function.
  • Arrays and Records now support marshalling to JSON via the json.Marshaler interface. Builders support adding values to them by unmarshalling from JSON via the json.Unmarshaler interface. array.FromJSON function added to create Arrays from JSON directly. ARROW-9630
  • Basic handling of field referencing and expression building similar to the C++ Compute APIs added through the new compute package in preparation for adding compute interfaces. Does not yet allow executing expressions. ARROW-14430

Parquet

Enhancements

  • Updated dependency versions ARROW-14462
  • file module added, Go Parquet library now supports full file reading and writing. ARROW-13984 ARROW-13986. Does not yet provide direct Parquet <–> Arrow conversions.
  • Internal min_max utility functions given Arm64 NEON SIMD optimized assembly, gaining a 4x - 6x performance improvement. ARROW-15536

Java notes

  • Flight SQL support is now available in the Java library, with integration tests to verify it against the C++ reference implementation.
  • GeneralOutOfPlaceVectorSorter is now available for sorting any kind of vector. In general, if dedicated sorters can be used (like FixedWidthInPlaceVectorSorter), they should be preferred as they will generally perform better.
  • log4j2 dependency was removed as it was unused and a possible vector for attacks
  • VectorSchemaRootAppender now works with BitVector

JavaScript notes

  • Major simplifications to the API. There is only a single Vector class now. See the (also much improved) docs for details.
  • Dictionary vectors created with vectorFromArray are automatically cached for better performance.
  • Better tree shaking support. Some bundles can now be only a few kb.

Python notes

  • Official support for Python 3.6 has been dropped.
  • random and indices_nonzero compute functions are now supported in Python
  • pyarrow.orc.read_table is now provided to easily read the content of ORC files to a Table.
  • pyarrow.orc.ORCFile now has a lot more properties exposed.
  • pyarrow.orc.ORCWriter and pyarrow.orc.write_table now have the writer options available.
  • pyarrow.orc now has much better API documentation.
  • Support for compute function arguments and options has been improved in general: arguments are not position-only, options can be provided as keyword arguments or not, and error reporting for wrong arguments has been improved.
  • Table now has a group_by method that allows performing aggregations on table data. The compute functions documentation has also been improved to better distinguish between standard compute functions and HASH_AGGREGATE compute functions, which can only be used for aggregations.
  • Python documentation now provides interlinking for references to parameter types and return values, making it far easier to navigate the documentation.

R notes

This release adds additional improvements to the dplyr interface, to CSV support, and to the C-Data interface to exchange data with other languages. For more details, see the complete R changelog.

Ruby and C GLib notes

Ruby

There are two new contributors: @okadakk and @simpl1g.

The updates to Red Arrow consist of the following improvements:

  • Arrow::Function#execute now accepts an instance of an Arrow::Column as its argument (ARROW-14551)
  • Arrow::Table.load now supports .arrows files to load (ARROW-15356)
  • Add support for loading an Arrow::Table from a URI in Arrow::Table.load (ARROW-14562)
  • Arrow::Table now supports joining two tables (ARROW-14531)
  • Arrow::Function#execute is now easier to use than before (ARROW-15274)
  • Arrow::SortKey#name has been renamed to Arrow::SortKey#target (ARROW-14784)
  • Add Cookbook section to documentation (ARROW-14636)
  • Support the explicit initialization of S3 API by the Arrow.s3_initialize method (ARROW-14637)
  • On macOS, stop specifying the version of openssl package explicitly when building the extension library (ARROW-14619)

C GLib

The updates to Arrow GLib etc. consist of the following improvements:

  • Add garrow_execute_plan_build_hash_join_node function, GArrowHashJoinNodeOption, and GArrowJoinType (ARROW-15288)
  • Add garrow_function_get_options_type function (ARROW-15273)
  • Add garrow_function_get_default_options function (ARROW-15267)
  • Add GArrowRoundToMultipleOptions to customize the round_to_multiple function (ARROW-15216)
  • Add garrow_function_all function to list all the functions (ARROW-15205)
    • In addition, add garrow_function_get_name, garrow_function_equal, and garrow_function_to_string functions for convenience
  • Add GArrowRoundOptions (ARROW-15204)
  • Add garrow_struct_scalar_get_value function for converting a C++ scalar value to a GLib value (ARROW-15203)
  • Add the following three interval data types (ARROW-15134)
    • GArrowMonthIntervalDataType for the interval with the month component
    • GArrowDayTimeIntervalDataType for the interval with the days and the milliseconds components
    • GArrowMonthDayNanoIntervalDataType for the interval with the months, the days, and the nanoseconds components
  • Rename GArrowSortKey::name to ::target (ARROW-14784)
  • Support the explicit initialization of S3 API by the garrow_s3_initialize function (ARROW-14637)
  • garrow_decimal128_new_string and garrow_decimal256_new_string now return errors when they get an invalid decimal string (ARROW-14530)
  • garrow_decimal128_data_type_new and garrow_decimal256_data_type_new functions now validate the given precision (ARROW-14529)

Rust notes

Rust releases minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 6.x releases.

Going forward, the Rust implementation version will start deviating from the rest of the Arrow implementations, incrementing a major version if the changes to the crate require it. We still plan a release every other week. Please see issue #1120 for more details.

Major changes in the 7.0.0 release include:

  1. Additional support for Decimal
  2. More ergonomic compute kernels that take dyn Array
  3. Union type now follows the latest Arrow standard
  4. Support for custom datetime format for inference and parsing CSV files

Another highlight is that the community continues to improve the safety of the arrow crate. The 6.4.0 release included complete data validation and has resolved all outstanding RUSTSEC issues against the crate.

For additional details on the 7.0.0 Rust implementation, please see the Arrow Rust CHANGELOG.