Apache Arrow 7.0.0 Release
Published
08 Feb 2022
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 7.0.0 release. This covers over 3 months of development work and includes 617 resolved issues from 105 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 6.0.1 release, Rémi Dattai and Alessandro Molina have been invited to be committers. Daniël Heres and Yibo Cai have joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
Arrow Flight RPC notes
The Flight specification has been clarified to note that schemas are expected to be IPC-encapsulated on the wire.
Documentation has been generally improved; see the Arrow Cookbook for recipes on how to use Flight in Python and R, and a new example on how to use Flight and gRPC services on the same port.
This release includes Arrow Flight SQL, a protocol for using Arrow Flight to execute queries against and fetch metadata from SQL databases. Support is included for C++ and Java (but not languages that bind to C++, like Python or R). A more detailed blog post is forthcoming (EDIT 2022/02/16: see the Flight SQL announcement). Note that development is ongoing and the specification is currently experimental.
C++ notes
A set of CMake presets has been added to ease building Arrow in a number of cases (ARROW-14678, ARROW-14714).
The arrow::BitUtil namespace has been renamed to arrow::bit_util
(ARROW-13494).
Concatenation of union arrays is now supported (ARROW-4975).
StructType gained three convenience methods to add, change and remove
a given field (ARROW-11424).
The Datum kind COLLECTION has been removed as it was entirely unused
in the codebase (ARROW-13598).
Compute Layer
A number of compute functions have been added:
- functions operating on strings: "binary_reverse" (ARROW-14306), "string_repeat" (ARROW-12712), "utf8_normalize" (ARROW-14205);
- "fill_null_forward", "fill_null_backward" (ARROW-1699);
- "ceil_temporal", "floor_temporal", "round_temporal" to adjust temporal input to an integral multiple of a given unit (ARROW-14822);
- "year_month_day" to extract the calendar components of the input (ARROW-15032);
- "random" to general random floating-point values between 0 and 1 (ARROW-12404);
- "indices_nonzero" to return the indices in the input where there are non-zero, non-null values (ARROW-13035).
Decimal data is now supported as input of the arithmetic kernels (ARROW-13130).
Dictionary data is now supported as input of the hash join execution node (ARROW-14181).
Residual predicates have been implemented in the hash join node (ARROW-13643).
The "list_parent_indices" function now always returns int64 data regardless of the input type (ARROW-14592).
Month-day-nano interval data is now supported as input of the same functions as other interval types (ARROW-13989).
CSV
The CSV writer got additional configuration options:
- the string representation of null values (ARROW-14905);
- the quoting strategy: always / never / as needed (ARROW-14905);
- the end of line character(s) (ARROW-14907)
Dataset Layer
Skyhook, a dataset addition that offloads fragment scan operations to a Ceph distributed storage cluster, was contributed (ARROW-13607).
The dataset writer now exposes options min_rows_per_group and
max_rows_per_group to control the size of row groups created (ARROW-14426).
IO and Filesystem Layer
A critical bug in the AWS SDK for C++ that risks losing data in S3 multipart uploads has been circumvented (ARROW-14523).
The Google Cloud Storage filesystem is now featureful enough to pass all generic filesystem tests (ARROW-14924).
The OpenAppendStream method of filesystems has been un-deprecated; however, it still cannot be implemented for all filesystem backends (ARROW-14969).
A new function arrow::fs::ResolveS3BucketRegion allows resolving the
region where a particular S3 bucket resides (ARROW-15165).
The S3 filesystem now sets the Content-Type of output files to "application/octet-stream" (instead of "application/xml" previously) if not explicitly specified by the caller (ARROW-15306).
IPC
Fine-grained I/O (coalescing) is now enabled in the synchronous (ARROW-12683) and asynchronous (ARROW-14577) IPC reader.
It is now possible to set the compression level when using LZ4 compression (ARROW-9648).
ORC
The ORC adapters have been significantly improved. A lot more properties of the ORC reader as well as ORC writer options are now available. Moreover API docs for both the ORC reader and the ORC writer have been generated. (ARROW-11297)
Parquet
DELTA_BYTE_ARRAY-encoded data can now be read from (but not written to) bytearray columns in Parquet files (PARQUET-492).
Go notes
Arrow
Bug Fixes
- License lifted up a level so that it is properly detected for the github.com/apache/arrow/go/v7 module for pkg.go.dev ARROW-14728. Documentation on pkg.go.dev will look correct with complete major version handling as of the v7.0.0 release.
- Errors from
MessageReader.Messageget properly surfaced byReader.ReadARROW-14769 -
ipc.Readerproperly uses the allocator it is initialized with instead of making native byte slices ARROW-14717 - Fixed a CI issue where the CGO tests were crashing on windows ARROW-14589
- Various fixes for internal usages of
ReleaseandRetainto maintain proper management of reference counting.
Enhancements
- Continuous Integration for Go library now uses Go1.16 as the version being tested ARROW-14985
-
ValueOffsetsfunction added toarray.Stringto return the entire slice of offsets ARROW-14645 -
array.Interfacehas been lifted toarrow.Array,array.{Record,Column,Chunked,Table}have been lifted toarrow.{Record,Column,Chunked,Table}. Interfacearrow.ArrayDatahas been created to be used instead ofarray.Data. Aliases have been provided for thearraypackage so existing code that doesn't directly usearray.Datashouldn't be affected. The aliases will be removed in v8. ARROW-5599. TheChunked.NewSlicemethod has been removed and is replaced by thearray.NewChunkedSlicefunction. - Arrays and Records now support marshalling to JSON via the
json.Marshallerinterface. Builders support adding values to them by unmarshalling from JSON via thejson.Unmarshallerinterface.array.FromJSONfunction added to create Arrays from JSON directly. ARROW-9630 - Basic handling of field referencing and expression building similar to the C++ Compute APIs added through the new
computepackage in preparation for adding compute interfaces. Does not yet allow executing expressions. ARROW-14430
Parquet
Enhancements
- Updated dependency versions ARROW-14462
-
filemodule added, Go Parquet library now supports full file reading and writing. ARROW-13984 ARROW-13986. Does not yet provide direct Parquet <--> Arrow conversions. - Internal min_max utility functions given Arm64 NEON SIMD optimized assembly, gaining a 4x - 6x performance improvement. ARROW-15536
Java notes
- Flight SQL support is now available in the Java library, with integration tests to verify it against the C++ reference implementation.
-
GeneralOutOfPlaceVectorSorteris now available for sorting any kind of vector. In general if dedicated sorters can be used (likeFixedWidthInPlaceVectorSorter) they should preferred as they will generally perform better. -
log4j2dependency was removed as it was unused and a possible vector for attacks -
VectorSchemaRootAppendernow works withBitVector
JavaScript notes
- Major simplifications to the API. There is only a single Vector class now. See the (also much improved) docs for details.
- Dictionary vectors created with
vectorFromArrayare automatically cached for better performance. - Better tree shaking support. Some bundles can now be only a few kb.
Python notes
- Official support for Python 3.6 has been dropped.
-
randomandindices_nonzerocompute functions are now supported in Python -
pyarrow.orc.read_tableis now provided to easily read the content of ORC files to a Table. -
pyarrow.orc.ORCFilenow has a lot more properties exposed. -
pyarrow.orc.ORCWriterandpyarrow.orc.write_tablenow have the writer options available. -
pyarrow.orcnow has much better API documentation. - Support for compute functions arguments and options has been improved in general, arguments are not position only, while options can be provided as keyword args or not, and error reporting for wrong arguments has been improved.
-
Tablenow has agroup_bymethod that allows to perform aggregations on table data. The compute functions documentation has also been improved to better distinguish between standard compute functions andHASH_AGGREGATEcompute functions that can only be using for aggregations. - Python documentation now provides interlinking for references to parameter types and return values, thus making far easier to navigate the documentation.
R notes
This release adds additional improvements to the dplyr interface, to CSV support, and to the C-Data interface to exchange data with other languages. For more details, see the complete R changelog.
Ruby and C GLib notes
Ruby
There are two new contributors @okadakk and @simpl1g .
The updates of Red Arrow consists of the following improvements:
-
Arrow::Function#executenow accepts an instance of anArrow::Columnas its argument (ARROW-14551) -
Arrow::Table.loadnow supports.arrowsfiles to load (ARROW-15356) - Add support loading
Arrow::Tableby aURIinArrow::Table.load(ARROW-14562) -
Arrow::Tablenow supports to join two tables (ARROW-14531) -
Arrow::Function#executegets more easier to use than before (ARROW-15274) -
Arrow::SortKey#namehas been renamed toArrow::SortKey#target(ARROW-14784) - Add Cookbook section to documentation (ARROW-14636)
- Support the explicit initialization of S3 API by the
Arrow.s3_initializemethod (ARROW-14637) - On macOS, stop specifying the version of openssl package explicitly when building the extension library (ARROW-14619)
C GLib
The updates of Arrow GLib etc. consists of the following improvements:
- Add
garrow_execute_plan_build_hash_join_nodefunction,GArrowHashJoinNodeOption, andGArrowJoinType(ARROW-15288) - Add
garrow_function_get_options_typefunction (ARROW-15273) - Add
garrow_function_get_default_optionsfunction (ARROW-15267) - Add
GArrowRoundToMultipleOptionsto customize theround_to_multiplefunction (ARROW-15216) - Add
garrow_function_allfunction to list up all the functions (ARROW-15205)- In addition, add
garrow_function_get_name,garrow_function_equal, andgarrow_function_to_stringfunctions for convenience
- In addition, add
- Add
GArrowRoundOptions(ARROW-15204) - Add
garrow_struct_scalar_get_valuefunction for converting a C++ scalar value to a GLib value (ARROW-15203) - Add the following three interval data types (ARROW-15134)
-
GArrowMonthIntervalDataTypefor the interval with the month component -
GArrowDayTimeIntervalDataTypefor the interval with the days and the milliseconds components -
GArrowMonthDayNanoIntervalDataTypefor the interval with the months, the days, and the nanoseconds components
-
- Rename
GArrowSortKey::nameto::target(ARROW-14784) - Support the explicit initialization of S3 API by the
garrow_s3_initializefunction (ARROW-14637) -
garrow_decimal128_new_stringandgarrow_decimal256_new_stringnow returns errors when they gets a invalid decimal string (ARROW-14530) -
garrow_decimal128_data_type_newandgarrow_decimal256_data_type_newfunctions now validates the given precision (ARROW-14529)
Rust notes
Rust releases minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 6.x.
Going forward, the Rust implementation version will start deviating from the rest of the Arrow implementations, incrementing a major version if the changes to the crate require it. We still plan a release every other week. Please see issue #1120 for more details
Major changes in the 7.0.0 release include:
- Additional support for
Decimal - More ergonomic compute kernels that take
dyn Array -
Uniontype now follows the latest Arrow standard - Support for custom datetime format for inference and parsing CSV files
Another highlight is that the community continues to improve the safety of the arrow crate. The 6.4.0 release included complete data validation and has resolved all outstanding RUSTSEC issues against the crate.
For additional details on the 7.0.0 Rust implementation, please see the Arrow Rust CHANGELOG