Apache Arrow 8.0.0 Release
Published 15 May 2022
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 8.0.0 release. This covers over 3 months of development work and includes 586 resolved issues from 127 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 7.0.0 release, Kun Liu, Raphael Taylor-Davies, Xudong Wang, Yijie Shen and Liang-Chi Hsieh have been invited to be committers. Thanks for your contributions and participation in the project!
Arrow Flight RPC notes
Flight SQL has been extended with a method to get type metadata (ARROW-15313) and column metadata in returned schemas (ARROW-15314, ARROW-16064). New documentation is available describing Flight and Flight SQL, along with several Cookbook recipes (ARROW-14698, ARROW-16065).
The C++ libraries now support UCX as a network transport (ARROW-15706), and the APIs have been refactored to allow other transports to be implemented (ARROW-15282). UCX support is experimental and still subject to change. Many of the APIs have been refactored to use the arrow::Result type, and the original variants have been deprecated (ARROW-16032). Support for gRPC >= 1.43 has been added (ARROW-15551).
C++ notes
Compute
Arrow C++ can now optionally build with support for the experimental Substrait query representation format (ARROW-15238).
A number of compute kernels operating on temporal data have been added:
- addition, subtraction and multiplication between various temporal types;
- a new “is_dst” function to compute whether the input timestamps fall within daylight saving time (DST);
- a new “is_leap_year” function to compute whether the input timestamps fall within a leap year.
It is possible to enable a timezone database on Windows at runtime by calling the arrow::Initialize() function (ARROW-13168).
New hash aggregations are available: “hash_one” to return one value from each group (ARROW-13993), and “hash_list” to return all values from each group (ARROW-15152). Null columns are now supported on the sum, mean and product hash aggregates (ARROW-15506). Also, it is now possible to execute hash “aggregations” with only key columns (ARROW-15609).
A new compute function “map_lookup” allows looking up a given key in a map array (ARROW-15089).
New compute functions “sqrt” and “sqrt_checked” allow extracting the square root of their input (ARROW-15614).
Casting between two struct types is now possible, assuming the destination field names all exist in the source struct type (ARROW-1888, ARROW-15643).
Optional OpenTelemetry tracing has been added to kernel functions and execution plan nodes (ARROW-15061).
The CMake build option ARROW_ENGINE has been renamed to ARROW_SUBSTRAIT, to better reflect its actual effect (ARROW-16158).
CSV
It is now possible to change the field delimiter when writing a CSV file (ARROW-15672).
Dataset
The ORC dataset scanner now observes the batch size parameter (ARROW-14153).
The dataset layer now supports filename-based partitioning, where the data files are all laid out in the dataset’s base directory, their names prefixed with the partition values separated by underscore characters (ARROW-14612).
Optional OpenTelemetry tracing has been added to the dataset scanner (ARROW-15067).
Filesystem
It is possible to instantiate a Google Cloud Storage (GCS) filesystem from a URI, making GCS implicitly usable in the datasets layer (ARROW-14893). Recognized URI schemes are gs and gcs.
FileSystem::DeleteDirContents can now optionally succeed when the directory doesn’t exist (ARROW-16159).
IO
It is possible to override the number of IO threads using the environment variable ARROW_IO_THREADS (ARROW-15941).
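One way to observe the effect from Python is to start a fresh interpreter with the variable set, since it is assumed to be read when the IO thread pool is first created. The value 4 is arbitrary.

```python
import os
import subprocess
import sys

# Run a child interpreter so the variable is set before pyarrow is loaded.
env = dict(os.environ, ARROW_IO_THREADS="4")
completed = subprocess.run(
    [sys.executable, "-c", "import pyarrow as pa; print(pa.io_thread_count())"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)

io_threads = int(completed.stdout.strip())
print(io_threads)
```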
IPC
The IPC file reader and writer now allow accessing the custom metadata associated with record batches (ARROW-16131).
Miscellaneous
It is possible to enable lightweight memory checks on the standard memory pools using a dedicated environment variable ARROW_DEBUG_MEMORY_POOL (ARROW-15550). These checks are not a replacement for sophisticated checkers such as Address Sanitizer or Valgrind, but can be useful when those tools are not available.
Temporal data is now validated when doing full array validation (ARROW-10924). The validation catches values not matching the specification (for example, a time value being outside of the span of one day).
The GDB plugin now attempts to print the data of an array, in addition to its metadata (ARROW-15389). This only works for primitive datatypes.
Pretty-printing is now shorter and more customizable for nested datatypes (ARROW-14798).
C# notes
With .NET Core 2.1 reaching end-of-life in August 2021, the Apache.Arrow library has been updated to target netcoreapp3.1 and higher. It still supports netstandard1.3, so the library works on .NET Framework. But to get the best performance, using .NET Core 3.1, .NET 5, or later is recommended.
Go notes
Bug fixes
- parquet_reader / parquet_schema no longer crash (ARROW-15509)
- Base64 encoding of origin Arrow schema properly uses padding and decodes both with or without the padding in pqarrow.getOriginSchema (ARROW-15544)
- ipc.Writer no longer includes unnecessary offsets when encoding sliced arrays (ARROW-15715)
- Use base64.StdEncoding for Arrow Flight Basic Auth middleware for proper encoding/decoding (ARROW-15772)
- Fix panic during concurrent compression of ipc body buffers due to negative WaitGroup counter (ARROW-15792)
- Fix memory leak in pqarrow.NewColumnWriter with nested structures (ARROW-15946)
- ipc.FileReader no longer leaks memory when using ZSTD compression (ARROW-16163)
Enhancements
Flight RPC
- Breaking Change: the gRPC version has been updated and Flight server creation has been simplified. Flight servers must now embed flight.BaseFlightServer (ARROW-15418)
- You can now provide a full net.Listener as an alternative to just providing an address to bind to when creating a Flight server (ARROW-16082)
Parquet
- unpack_bool, sum_float64, and bitmap handling functions have been given optimized assembly implementations for Arm64 NEON (ARROW-15440, ARROW-15742, ARROW-15995)
- Go Parquet handling has been simplified to only need io.ReadSeeker instead of a ReadAtSeeker interface (ARROW-15963)
- BitSetRunReader and helper functions have been lifted to internal/bitutils package to share between arrow and parquet implementations (ARROW-15950)
- Parquet Reader now properly obeys the buffer size read property for buffered streams (ARROW-16187)
- parquet NewBufferedReader no longer panics (ARROW-16283)
Arrow
- Go Arrow library now supports Dictionary Arrays (ARROW-3039, ARROW-9378, GH-12158)
- array.ArrayEqual and array.ArrayApproxEqual have been renamed to array.Equal and array.ApproxEqual. Aliases are provided to avoid breaking existing code; they will be removed in v9. (ARROW-5598)
- Custom cpu discovery package replaced with using golang.org/x/sys/cpu (ARROW-16193)
- Breaking Change: array.Interface, array.Record, array.Table, etc. were deprecated in v7 in favor of arrow.Array, arrow.Record, etc. The deprecated aliases have been removed in v8 (ARROW-16192)
CI
- staticcheck is now run as part of CI to lint the Go code (ARROW-15296)
- Travis builds on Arm64 for Go are no longer allowed to fail (ARROW-15400)
Java notes
- Java 17 is now supported as a target and has been added to tested platforms
- When scanning datasets an ArrowReader is now returned, which makes it easier to create a VectorSchemaRoot from it.
- The Java documentation has been improved overall, with a few sections added and most sections rewritten as clearer tutorials
- Overall improvements to FlightSQL support in Java
- Java Cookbook is now available
JavaScript notes
- Tables now allow setting arbitrary symbols, which enables support for passing Arrow tables to Vega. ARROW-16209
- Arrow now supports tableFromJSON and struct vectors in vectorFromArray. ARROW-16210
- Fixed support for appending null children in a StructBuilder. ARROW-15705
Python notes
In general, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details. In addition:
- Tables and Datasets now support the join operation to perform left, right, full joins of inner or outer types. The result of the join operation will be a new table (ARROW-14293). See https://arrow.apache.org/docs/dev/python/compute.html#table-and-dataset-joins for examples.
- Additional legacy keywords and properties of the ParquetDataset class have been deprecated and will issue a warning, in favor of the pyarrow.dataset functionality (ARROW-16119).
- It is now possible to create references to nested fields in a Table or Dataset using pc.field("a", "b") (ARROW-11259).
- Docstrings in Schema, ChunkedArray, Tensor, RecordBatch, parquet and Table now include examples on how to use the methods and classes (ARROW-15367).
- Reading and writing Parquet files now supports encryption (ARROW-9947). See the docs for more details.
- Support for zoneinfo (Python 3.9+) and dateutil timezones in conversion to Arrow data structures (ARROW-5248).
- Multiple bugfixes and segfaults resolved.
R notes
This release includes:
- Support for over 20 additional lubridate and base date and time functions in Arrow dplyr queries,
- An API to allow external packages to define custom extension array types and custom conversions into Arrow types,
- Support for concatenating Arrays, RecordBatches, and Tables, including with c(), rbind() and cbind().
For more on what’s in the 8.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
- Add support for #values of MonthInterval type (ARROW-15749)
- Add support for #raw_records of MonthInterval type (ARROW-15750)
- Add support for #values of DayTimeInterval type (ARROW-15885)
- Add DayTimeIntervalArrayBuilder to support making a DayTimeIntervalArray from a Hash with :day and :millisecond keys (ARROW-15918)
- Add support for #raw_records of DayTimeInterval type (ARROW-15886)
- Add support for #values of MonthDayNanoInterval type (ARROW-15924)
  - Also add MonthDayNanoIntervalArrayBuilder to support making a MonthDayNanoIntervalArray from a Hash with :month, :day, and :nanosecond keys
- Add support for #raw_records of MonthDayNanoInterval type (ARROW-15925)
- Add Ruby-ish interfaces for Parquet::BooleanStatistics (ARROW-16251)
C GLib
- Add gaflight_client_close (ARROW-15487)
- Add GParquetFileMetadata and gparquet_arrow_file_reader_get_metadata (ARROW-16214)
- Fix GArrowGIOInputStream so that all the data is completely read (ARROW-15626)
- Add garrow_string_array_builder_append_string_len and garrow_large_string_array_builder_append_string_len (ARROW-15629)
- Add GParquetRowGroupMetadata (ARROW-16245)
- Add GParquetColumnChunkMetadata (ARROW-16250)
- Add GArrowGCSFileSystem (ARROW-16247)
- Add GParquetStatistics and its family (ARROW-16251)
  - GParquetBooleanStatistics
  - GParquetInt32Statistics
  - GParquetInt64Statistics
  - GParquetFloatStatistics
  - GParquetDoubleStatistics
  - GParquetByteArrayStatistics
  - GParquetFixedLengthByteArrayStatistics
- Add missing casts for GArrowRoundMode (ARROW-16296)
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 13.0.0 release of the Rust implementation, see the Arrow Rust changelog.