Apache Arrow 5.0.0 Release


Published 29 Jul 2021
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 5.0.0 release. This covers 3 months of development work and includes 684 commits from 99 distinct contributors in 2 repositories. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelogs for the apache/arrow and apache/arrow-rs repositories.

Community

Since the 4.0.0 release, Daniël Heres, Kazuaki Ishizaki, Dominik Moritz, and Weston Pace have been invited as committers to Arrow, and Benjamin Kietzman and David Li have joined the Project Management Committee (PMC). Thank you for all of your contributions!

Columnar Format Notes

Official IANA media types (MIME types) have been registered for Apache Arrow IPC protocol data, in both stream and file variants: application/vnd.apache.arrow.stream and application/vnd.apache.arrow.file.

We recommend .arrow as the IPC file format file extension and .arrows for the IPC streaming format file extension.

Arrow Flight RPC notes

The Go implementation now supports custom metadata and middleware, and has been added to integration testing.

In Python, some operations can now be interrupted via Control-C.

C++ notes

MakeArrayFromScalar now works for fixed-size binary types (ARROW-13321).

Compute layer

The following compute functions were added:

  • aggregations: index

  • scalar arithmetic and math functions: abs, abs_checked, acos, acos_checked, asin, asin_checked, atan, atan2, ceil, cos, cos_checked, floor, ln, ln_checked, log10, log10_checked, log1p, log1p_checked, log2, log2_checked, negate, negate_checked, sign, sin, sin_checked, tan, tan_checked, trunc

  • scalar bitwise functions: bit_wise_and, bit_wise_not, bit_wise_or, bit_wise_xor, shift_left, shift_left_checked, shift_right, shift_right_checked

  • scalar string functions: ascii_center, ascii_lpad, ascii_reverse, ascii_rpad, binary_join, binary_join_element_wise, binary_replace_slice, count_substring, count_substring_regex, ends_with, find_substring, find_substring_regex, match_like, split_pattern_regex, starts_with, utf8_center, utf8_lpad, utf8_replace_slice, utf8_rpad, utf8_reverse, utf8_slice_codepoints

  • scalar temporal functions: day, day_of_week, day_of_year, iso_calendar, iso_week, iso_year, hour, microsecond, millisecond, minute, month, nanosecond, quarter, second, subsecond, year

  • other scalar functions: case_when, coalesce, if_else, is_finite, is_inf, is_nan, max_element_wise, min_element_wise, make_struct

  • vector functions: replace_with_mask

Duplicates are now allowed in SetLookupOptions::value_set (ARROW-12554).

Decimal types are now supported by some basic arithmetic functions (ARROW-12074).

The take function now supports dense unions (ARROW-13005).

It is now possible to cast between dictionary types with different index types (ARROW-11673).

Sorting is now implemented for boolean input (ARROW-12016).

CSV

The streaming CSV reader can now take some advantage of multiple threads (ARROW-11889).

The CSV reader tries to make its errors more informative by adding the row number when it is known, i.e. when parallel reading is disabled (ARROW-12675).

A new option ReaderOptions::skip_rows_after_names allows skipping a number of rows after reading the column names (as opposed to ReaderOptions::skip_rows).

Quoted strings can now be treated as always non-null (ARROW-10115).

Dataset layer

The asynchronous scanner introduced in 4.0.0 has been improved: truly asynchronous readers are now implemented for the CSV, Parquet, and IPC file formats, and file-level parallelism has been added. This mode is controlled by a use_async flag that can be passed to methods which scan a dataset. Enabling this flag can significantly improve performance on filesystems with high latency or that benefit from parallel reads (e.g. S3).

A CountRows method has been added to count rows matching a predicate; where possible, this will use metadata in files instead of reading the data itself.

CSV datasets can now be written, and when reading a CSV dataset, explicit types can now be specified for a subset of columns while allowing the rest to still be inferred.

IO and Filesystem layer

The I/O thread pool size can now be adjusted at runtime (ARROW-12760). The default size remains 8 threads.

Streams can now carry auxiliary metadata, depending on the backend. This has been implemented for the S3 filesystem, where a few metadata keys such as Content-Type and ACL are supported (ARROW-11161, ARROW-12719).

The HadoopFileSystem implementation now implements the FileSystem abstraction more faithfully (ARROW-12790).

Parquet

The new LZ4_RAW compression scheme was implemented (PARQUET-1998). Unlike the legacy LZ4 compression scheme, it is defined unambiguously and should provide better portability once other Parquet implementations catch up.

Go notes

Flight

  • Flight Client and Server now support Custom Metadata through the functions flight.NewClientWithMiddleware and flight.NewServerWithMiddleware. Functions flight.NewFlightClient, flight.NewFlightServer, flight.CreateServerBearerTokenAuthInterceptors have been deprecated in favor of using the new middleware. #10633
  • Flight Client AuthHandler no longer overwrites outgoing metadata; new metadata is now appended to the existing metadata. #10297
  • Flight AppMetadata field is now exposed both for Reading and Writing via flight.Reader#LatestAppMetadata() and flight.Writer#WriteWithAppMetadata functions #10142

Other enhancements

Java notes

Highlighted improvements and fixes:

  • Improved support for extension types using a complex storage type, e.g. struct, map or union. These can now extend the ExtensionTypeVector base class.
  • Union vectors now extend AbstractContainerVector to be consistent with other vectors.
  • Guava dependency updated to 30.1.1
  • Memory leak fixed if an exception occurs when reading IPC messages from a channel. #10423
  • Flight error metadata is now propagated to the client. #10370
  • JDBC adapter now preserves nullability. #10285
  • The memory rounding policy is now respected when allocating vector buffers, which helps save memory. #10576

API compatibility changes:

  • Complex vectors now return covariant types from getObject(int). #9964

JavaScript notes

  • Tables do not extend DataFrames anymore. This enables smaller bundles. #10277
  • Arrow uses closure compiler for all UMD bundles, making them smaller. #10281
  • The npm package now comes with declaration maps for better navigation from types to source code. #10673
  • Updated dependencies and improvements to the code.

Python notes

  • Datasets can now scan files asynchronously when the use_async=True option is provided to Dataset.scanner, Dataset.to_table, or Dataset.to_batches methods. This should provide better performance in environments where I/O can be slow, such as with remote sources.
  • Arrow now provides builtin support for writing CSV files through pyarrow.csv.write_csv
  • Wheels for Apple M1 Macs are now provided.
  • Many new pyarrow.compute functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
  • It is now possible to access ORC file metadata from Python ORCFile objects
  • Building a StructArray now accepts a mask like other arrays
  • Many updates and fixes for the documentation

R notes

In this release, we’ve more than doubled the number of functions you can call on Arrow Datasets inside dplyr::filter(), mutate(), and arrange(), including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We’ve also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.

For more on what’s in the 5.0.0 R package, see the R changelog.

Ruby and C GLib notes

Initial Apache Arrow Flight support has been added, but only ListFlights is supported for now. More features will be implemented in the next major release.

Ruby

You need gobject-introspection gem 3.4.5 or later to implement your own Apache Arrow Flight server. If you only use the Apache Arrow Flight client, gobject-introspection gem 3.4.5 or later isn’t required.

Here are highlighted improvements:

  • Compute functions accept raw Ruby objects such as true, Integer, Array and String:

    require "arrow"

    add_function = Arrow::Function.find("add")
    # Long form: wrap each argument in a datum explicitly
    augend = Arrow::Int8Array.new([1, 2, 3])
    addend = Arrow::Int8Scalar.new(5)
    args = [
      Arrow::ArrayDatum.new(augend),
      Arrow::ScalarDatum.new(addend),
    ]
    add_function.execute(args).value.to_a # => [6, 7, 8]
    # Shortcut: pass raw Ruby objects and let them be converted
    add_function.execute([[1, 2, 3], 5]).value.to_a # => [6, 7, 8]
    
  • Arrow::PrimitiveArray and Arrow::Buffer can be used as MemoryView, which was added in Ruby 3.0.

There are some backward incompatible changes:

  • Arrow::CountOptions and Arrow::CountMode are removed. Use Arrow::ScalarAggregateOptions instead.

C GLib

There are some backward incompatible changes:

  • GArrowCountOptions and GArrowCountMode are removed. Use GArrowScalarAggregateOptions instead.
  • garrow_array_equal_range() requires GArrowEqualOptions.
  • The prefix in arrow-dataset-glib has changed from gad_/GAD_ to gadataset_/GADATASET_.
  • GADScanOptions, GADScanTask and GADInMemoryScanTask are removed. Use gadataset_begin_scan() or gadataset_to_table() instead.
  • GArrowCompareOptions, GArrowCompareOperator and garrow_*_array_compare() are removed. Use equal, not_equal, less_than, less_than_equal, greater_than and greater_than_equal compute functions directly instead.

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 5.0.0 release of the Rust implementation, see the Arrow Rust changelog and the Apache Arrow Rust 5.0.0 Release blog post.