Apache Arrow 7.0.0 Release
Published 08 Feb 2022
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 7.0.0 release. This covers over 3 months of development work and includes 617 resolved issues from 105 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 6.0.1 release, Rémi Dattai and Alessandro Molina have been invited to be committers. Daniël Heres and Yibo Cai have joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
Arrow Flight RPC notes
The Flight specification has been clarified to note that schemas are expected to be IPC-encapsulated on the wire.
Documentation has been generally improved; see the Arrow Cookbook for recipes on how to use Flight in Python and R, and a new example on how to use Flight and gRPC services on the same port.
This release includes Arrow Flight SQL, a protocol for using Arrow Flight to execute queries against and fetch metadata from SQL databases. Support is included for C++ and Java (but not languages that bind to C++, like Python or R). A more detailed blog post is forthcoming (EDIT 2022/02/16: see the Flight SQL announcement). Note that development is ongoing and the specification is currently experimental.
C++ notes
A set of CMake presets has been added to ease building Arrow in a number of cases (ARROW-14678, ARROW-14714).
The arrow::BitUtil namespace has been renamed to arrow::bit_util (ARROW-13494).
Concatenation of union arrays is now supported (ARROW-4975).
StructType gained three convenience methods to add, change and remove a given field (ARROW-11424).
The Datum kind COLLECTION has been removed as it was entirely unused in the codebase (ARROW-13598).
Compute Layer
A number of compute functions have been added:
- functions operating on strings: “binary_reverse” (ARROW-14306), “string_repeat” (ARROW-12712), “utf8_normalize” (ARROW-14205);
- “fill_null_forward”, “fill_null_backward” (ARROW-1699);
- “ceil_temporal”, “floor_temporal”, “round_temporal” to adjust temporal input to an integral multiple of a given unit (ARROW-14822);
- “year_month_day” to extract the calendar components of the input (ARROW-15032);
- “random” to generate random floating-point values between 0 and 1 (ARROW-12404);
- “indices_nonzero” to return the indices in the input where there are non-zero, non-null values (ARROW-13035).
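To make the behaviour of a few of these functions concrete, here is a minimal pure-Python sketch of what “fill_null_forward”, “fill_null_backward” and “indices_nonzero” compute. It models nulls as None in a plain list and is only an illustration of the semantics; it is not Arrow's implementation and does not use the Arrow API.

```python
from typing import List, Optional


def fill_null_forward(values: List[Optional[int]]) -> List[Optional[int]]:
    """Replace each None with the most recent non-null value before it."""
    filled = []
    last = None  # nothing to carry forward until a non-null value is seen
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_null_backward(values: List[Optional[int]]) -> List[Optional[int]]:
    """Replace each None with the next non-null value after it."""
    # A backward fill is a forward fill over the reversed sequence.
    return fill_null_forward(values[::-1])[::-1]


def indices_nonzero(values: List[Optional[int]]) -> List[int]:
    """Return the positions whose value is neither null nor zero."""
    return [i for i, v in enumerate(values) if v is not None and v != 0]


data = [None, 1, None, 0, None]
forward = fill_null_forward(data)    # the leading null has no predecessor
backward = fill_null_backward(data)  # the trailing null has no successor
nonzero = indices_nonzero(data)
print(forward, backward, nonzero)
```

Note that a null with no valid neighbour in the fill direction stays null in this sketch.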
Decimal data is now supported as input of the arithmetic kernels (ARROW-13130).
Dictionary data is now supported as input of the hash join execution node (ARROW-14181).
Residual predicates have been implemented in the hash join node (ARROW-13643).
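A residual predicate is an extra condition checked on row pairs that already match on the join keys. The sketch below shows the general idea for an inner hash join in pure Python; the function name hash_join, the row layout and the sample data are hypothetical, and this is not how the Arrow execution node is written.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


def hash_join(
    left: List[Row],
    right: List[Row],
    key: str,
    residual: Callable[[Row, Row], bool],
) -> List[Tuple[Row, Row]]:
    """Inner hash join on `key`, keeping only pairs that pass `residual`."""
    # Build phase: index the right-hand rows by their key value.
    table = defaultdict(list)
    for row in right:
        table[row[key]].append(row)

    # Probe phase: look up each left row, then apply the residual predicate
    # to every key-equal candidate pair.
    joined = []
    for left_row in left:
        for right_row in table.get(left_row[key], []):
            if residual(left_row, right_row):
                joined.append((left_row, right_row))
    return joined


orders = [{"id": 1, "amount": 50}, {"id": 2, "amount": 5}]
limits = [{"id": 1, "cap": 40}, {"id": 2, "cap": 10}]

# Key equality on "id", plus a non-equality condition as the residual.
matches = hash_join(orders, limits, "id", lambda o, l: o["amount"] > l["cap"])
print(matches)
```

The residual runs only on candidates found through the hash table, so conditions that cannot be expressed as key equality do not require a full cross product.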
The “list_parent_indices” function now always returns int64 data regardless of the input type (ARROW-14592).
Month-day-nano interval data is now supported as input of the same functions as other interval types (ARROW-13989).
CSV
The CSV writer got additional configuration options:
- the string representation of null values (ARROW-14905);
- the quoting strategy: always / never / as needed (ARROW-14905);
- the end of line character(s) (ARROW-14907).
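As an analogy for these three knobs, the sketch below uses Python's standard csv module rather than the Arrow CSV writer. The helper write_csv and its parameter names are hypothetical; the names and defaults of the Arrow options may differ, so consult the Arrow API documentation for the real interface.

```python
import csv
import io


def write_csv(rows, null_string="NULL", quoting=csv.QUOTE_MINIMAL, eol="\n"):
    """Serialize rows to CSV text with a configurable null string,
    quoting strategy and end-of-line sequence."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator=eol)
    for row in rows:
        # Substitute the chosen representation for missing values.
        writer.writerow([null_string if cell is None else cell for cell in row])
    return buffer.getvalue()


rows = [["a", 1], [None, 2]]

# Quote only when needed, spell nulls as "NULL", end lines with "\n".
as_needed = write_csv(rows)

# Quote every field, spell nulls as empty strings, end lines with "\r\n".
always = write_csv(rows, null_string="", quoting=csv.QUOTE_ALL, eol="\r\n")

print(repr(as_needed))
print(repr(always))
```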
Dataset Layer
Skyhook, a dataset addition that offloads fragment scan operations to a Ceph distributed storage cluster, was contributed (ARROW-13607).
The dataset writer now exposes options min_rows_per_group and max_rows_per_group to control the size of row groups created (ARROW-14426).
IO and Filesystem Layer
A critical bug in the AWS SDK for C++ that risks losing data in S3 multipart uploads has been circumvented (ARROW-14523).
The Google Cloud Storage filesystem is now featureful enough to pass all generic filesystem tests (ARROW-14924).
The OpenAppendStream method of filesystems has been un-deprecated; however, it still cannot be implemented for all filesystem backends (ARROW-14969).
A new function arrow::fs::ResolveS3BucketRegion allows resolving the region where a particular S3 bucket resides (ARROW-15165).
The S3 filesystem now sets the Content-Type of output files to “application/octet-stream” (instead of “application/xml” previously) if not explicitly specified by the caller (ARROW-15306).
IPC
Fine-grained I/O (coalescing) is now enabled in the synchronous (ARROW-12683) and asynchronous (ARROW-14577) IPC reader.
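The general idea behind read coalescing is to merge nearby byte ranges into fewer, larger reads. The following is a simplified pure-Python sketch of that idea, not the Arrow implementation; the function coalesce and the hole_size_limit parameter as used here are illustrative assumptions.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce(ranges: List[Range], hole_size_limit: int) -> List[Range]:
    """Merge ranges whose gap is at most `hole_size_limit` bytes."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= hole_size_limit:
                # Close enough: extend the previous read to cover this range,
                # accepting that the bytes in the gap are read needlessly.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


requests = [(0, 100), (110, 50), (1000, 20)]
reads = coalesce(requests, hole_size_limit=16)
print(reads)
```

The trade-off is between the number of I/O calls, which matters most on high-latency storage, and the number of wasted bytes read from the gaps.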
It is now possible to set the compression level when using LZ4 compression (ARROW-9648).
ORC
The ORC adapters have been significantly improved. Many more properties of the ORC reader, as well as ORC writer options, are now available. Moreover, API docs for both the ORC reader and the ORC writer have been generated (ARROW-11297).
Parquet
DELTA_BYTE_ARRAY-encoded data can now be read from (but not written to) bytearray columns in Parquet files (PARQUET-492).
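In the Parquet format, DELTA_BYTE_ARRAY is an incremental (front-coded) encoding: each value is stored as the length of the prefix it shares with the previous value, plus the remaining suffix. The sketch below shows only that reconstruction step in pure Python. It assumes the prefix lengths and suffixes have already been unpacked from their own delta encodings, and the function name is hypothetical.

```python
from typing import List


def decode_front_coded(prefix_lengths: List[int], suffixes: List[bytes]) -> List[bytes]:
    """Rebuild values from (shared-prefix length, suffix) pairs."""
    values = []
    previous = b""
    for prefix_length, suffix in zip(prefix_lengths, suffixes):
        # Reuse the leading bytes of the previous value, then append the suffix.
        current = previous[:prefix_length] + suffix
        values.append(current)
        previous = current
    return values


decoded = decode_front_coded([0, 5, 3], [b"apple", b"sauce", b"ly"])
print(decoded)
```

The encoding pays off on sorted or otherwise similar values, where long prefixes repeat from one value to the next.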
Go notes
Arrow
Bug Fixes
- License lifted up a level so that it is properly detected for the github.com/apache/arrow/go/v7 module for pkg.go.dev ARROW-14728. Documentation on pkg.go.dev will look correct with complete major version handling as of the v7.0.0 release.
- Errors from MessageReader.Message get properly surfaced by Reader.Read ARROW-14769
- ipc.Reader properly uses the allocator it is initialized with instead of making native byte slices ARROW-14717
- Fixed a CI issue where the CGO tests were crashing on Windows ARROW-14589
- Various fixes for internal usages of Release and Retain to maintain proper management of reference counting.
Enhancements
- Continuous Integration for Go library now uses Go1.16 as the version being tested ARROW-14985
- ValueOffsets function added to array.String to return the entire slice of offsets ARROW-14645
- array.Interface has been lifted to arrow.Array, and array.{Record,Column,Chunked,Table} have been lifted to arrow.{Record,Column,Chunked,Table}. Interface arrow.ArrayData has been created to be used instead of array.Data. Aliases have been provided for the array package so existing code that doesn’t directly use array.Data shouldn’t be affected. The aliases will be removed in v8. ARROW-5599. The Chunked.NewSlice method has been removed and is replaced by the array.NewChunkedSlice function.
- Arrays and Records now support marshalling to JSON via the json.Marshaler interface. Builders support adding values to them by unmarshalling from JSON via the json.Unmarshaler interface. array.FromJSON function added to create Arrays from JSON directly. ARROW-9630
- Basic handling of field referencing and expression building similar to the C++ Compute APIs added through the new compute package in preparation for adding compute interfaces. Does not yet allow executing expressions. ARROW-14430
Parquet
Enhancements
- Updated dependency versions ARROW-14462
- file module added, Go Parquet library now supports full file reading and writing. ARROW-13984 ARROW-13986. Does not yet provide direct Parquet <-> Arrow conversions.
- Internal min_max utility functions given Arm64 NEON SIMD optimized assembly, gaining a 4x - 6x performance improvement. ARROW-15536
Java notes
- Flight SQL support is now available in the Java library, with integration tests to verify it against the C++ reference implementation.
- GeneralOutOfPlaceVectorSorter is now available for sorting any kind of vector. In general, if dedicated sorters can be used (like FixedWidthInPlaceVectorSorter), they should be preferred as they will generally perform better.
- log4j2 dependency was removed as it was unused and a possible vector for attacks.
- VectorSchemaRootAppender now works with BitVector.
JavaScript notes
- Major simplifications to the API. There is only a single Vector class now. See the (also much improved) docs for details.
- Dictionary vectors created with vectorFromArray are automatically cached for better performance.
- Better tree shaking support. Some bundles can now be only a few kb.
Python notes
- Official support for Python 3.6 has been dropped.
- random and indices_nonzero compute functions are now supported in Python.
- pyarrow.orc.read_table is now provided to easily read the content of ORC files to a Table.
- pyarrow.orc.ORCFile now has a lot more properties exposed.
- pyarrow.orc.ORCWriter and pyarrow.orc.write_table now have the writer options available.
- pyarrow.orc now has much better API documentation.
- Support for compute function arguments and options has been improved in general: arguments are not positional-only, options can be provided as keyword arguments or not, and error reporting for wrong arguments has been improved.
- Table now has a group_by method that allows performing aggregations on table data. The compute functions documentation has also been improved to better distinguish between standard compute functions and HASH_AGGREGATE compute functions that can only be used for aggregations.
- Python documentation now provides interlinking for references to parameter types and return values, making the documentation far easier to navigate.
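For readers unfamiliar with grouped aggregation, here is a pure-Python sketch of what a "group by key, then sum" operation computes. It does not use pyarrow; the helper group_by_sum and the sample rows are hypothetical and only illustrate the semantics.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_sum(rows: List[Dict], key: str, value: str) -> Dict:
    """Sum `value` over all rows that share the same `key`."""
    totals = defaultdict(int)
    for row in rows:
        # Each distinct key value owns one running total.
        totals[row[key]] += row[value]
    return dict(totals)


rows = [
    {"city": "Oslo", "sales": 3},
    {"city": "Rome", "sales": 5},
    {"city": "Oslo", "sales": 4},
]
totals = group_by_sum(rows, "city", "sales")
print(totals)
```

In pyarrow the equivalent call would look roughly like table.group_by("city").aggregate([("sales", "sum")]); check the pyarrow API documentation for the exact form and the list of supported aggregation functions.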
R notes
This release adds additional improvements to the dplyr interface, to CSV support, and to the C-Data interface to exchange data with other languages. For more details, see the complete R changelog.
Ruby and C GLib notes
Ruby
There are two new contributors: @okadakk and @simpl1g.
The updates to Red Arrow consist of the following improvements:
- Arrow::Function#execute now accepts an instance of an Arrow::Column as its argument (ARROW-14551)
- Arrow::Table.load now supports loading .arrows files (ARROW-15356)
- Add support for loading an Arrow::Table from a URI in Arrow::Table.load (ARROW-14562)
- Arrow::Table now supports joining two tables (ARROW-14531)
- Arrow::Function#execute is now easier to use than before (ARROW-15274)
- Arrow::SortKey#name has been renamed to Arrow::SortKey#target (ARROW-14784)
- Add Cookbook section to documentation (ARROW-14636)
- Support the explicit initialization of S3 API by the Arrow.s3_initialize method (ARROW-14637)
- On macOS, stop specifying the version of openssl package explicitly when building the extension library (ARROW-14619)
C GLib
The updates to Arrow GLib etc. consist of the following improvements:
- Add garrow_execute_plan_build_hash_join_node function, GArrowHashJoinNodeOption, and GArrowJoinType (ARROW-15288)
- Add garrow_function_get_options_type function (ARROW-15273)
- Add garrow_function_get_default_options function (ARROW-15267)
- Add GArrowRoundToMultipleOptions to customize the round_to_multiple function (ARROW-15216)
- Add garrow_function_all function to list all the functions (ARROW-15205)
  - In addition, add garrow_function_get_name, garrow_function_equal, and garrow_function_to_string functions for convenience
- Add GArrowRoundOptions (ARROW-15204)
- Add garrow_struct_scalar_get_value function for converting a C++ scalar value to a GLib value (ARROW-15203)
- Add the following three interval data types (ARROW-15134)
  - GArrowMonthIntervalDataType for the interval with the month component
  - GArrowDayTimeIntervalDataType for the interval with the days and the milliseconds components
  - GArrowMonthDayNanoIntervalDataType for the interval with the months, the days, and the nanoseconds components
- Rename GArrowSortKey::name to ::target (ARROW-14784)
- Support the explicit initialization of S3 API by the garrow_s3_initialize function (ARROW-14637)
- garrow_decimal128_new_string and garrow_decimal256_new_string now return errors when they get an invalid decimal string (ARROW-14530)
- garrow_decimal128_data_type_new and garrow_decimal256_data_type_new functions now validate the given precision (ARROW-14529)
Rust notes
Rust releases minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 6.x releases.
Going forward, the Rust implementation version will start deviating from the rest of the Arrow implementations, incrementing a major version if the changes to the crate require it. We still plan a release every other week. Please see issue #1120 for more details.
Major changes in the 7.0.0 release include:
- Additional support for Decimal
- More ergonomic compute kernels that take dyn Array
- Union type now follows the latest Arrow standard
- Support for custom datetime format for inference and parsing CSV files
Another highlight is that the community continues to improve the safety of the arrow crate. The 6.4.0 release included complete data validation and has resolved all outstanding RUSTSEC issues against the crate.
For additional details on the 7.0.0 Rust implementation, please see the Arrow Rust CHANGELOG.