Apache Arrow 5.0.0 Release
Published 29 Jul 2021 by The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 5.0.0 release. This covers 3 months of development work and includes 684 commits from 99 distinct contributors in 2 repositories. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelogs for the apache/arrow and apache/arrow-rs repositories.
Community
Since the 4.0.0 release, Daniël Heres, Kazuaki Ishizaki, Dominik Moritz, and Weston Pace have been invited as committers to Arrow, and Benjamin Kietzman and David Li have joined the Project Management Committee (PMC). Thank you for all of your contributions!
Columnar Format Notes
Official IANA Media types (MIME types) have been registered for Apache Arrow IPC protocol data, both stream and file variants:
- https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream
- https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file
We recommend .arrow as the file extension for the IPC file format and .arrows for the IPC streaming format.
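As a minimal sketch, here is how the same table can be written in both IPC variants from pyarrow, using the recommended extensions (file names are illustrative):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": [1, 2, 3]})

# IPC file (random-access) format -> .arrow
with ipc.new_file("data.arrow", table.schema) as writer:
    writer.write_table(table)

# IPC streaming format -> .arrows
with ipc.new_stream("data.arrows", table.schema) as writer:
    writer.write_table(table)
```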
Arrow Flight RPC notes
The Go implementation now supports custom metadata and middleware, and has been added to integration testing.
In Python, some operations can now be interrupted via Control-C.
C++ notes
MakeArrayFromScalar now works for fixed-size binary types (ARROW-13321).
Compute layer
The following compute functions were added (a short usage sketch follows the list):
- aggregations: index
- scalar arithmetic and math functions: abs, abs_checked, acos, acos_checked, asin, asin_checked, atan, atan2, ceil, cos, cos_checked, floor, ln, ln_checked, log10, log10_checked, log1p, log1p_checked, log2, log2_checked, negate, negate_checked, sign, sin, sin_checked, tan, tan_checked, trunc
- scalar bitwise functions: bit_wise_and, bit_wise_not, bit_wise_or, bit_wise_xor, shift_left, shift_left_checked, shift_right, shift_right_checked
- scalar string functions: ascii_center, ascii_lpad, ascii_reverse, ascii_rpad, binary_join, binary_join_element_wise, binary_replace_slice, count_substring, count_substring_regex, ends_with, find_substring, find_substring_regex, match_like, split_pattern_regex, starts_with, utf8_center, utf8_lpad, utf8_replace_slice, utf8_rpad, utf8_reverse, utf8_slice_codepoints
- scalar temporal functions: day, day_of_week, day_of_year, iso_calendar, iso_week, iso_year, hour, microsecond, millisecond, minute, month, nanosecond, quarter, second, subsecond, year
- other scalar functions: case_when, coalesce, if_else, is_finite, is_inf, is_nan, max_element_wise, min_element_wise, make_struct
- vector functions: replace_with_mask
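As a hedged illustration, these kernels can be invoked from Python by name through pyarrow.compute.call_function, which looks the function up in the registry (exact result types may vary):

```python
import pyarrow as pa
import pyarrow.compute as pc

# A few of the newly added kernels, looked up by name
pc.call_function("negate", [pa.array([1, -2, 3])])                       # -> [-1, 2, -3]
pc.call_function("sin", [pa.array([0.0, 1.0])])
pc.call_function("bit_wise_and", [pa.array([5, 6]), pa.array([3, 3])])   # -> [1, 2]
pc.call_function("utf8_reverse", [pa.array(["arrow", "data"])])          # -> ["worra", "atad"]
```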
Duplicates are now allowed in SetLookupOptions::value_set (ARROW-12554).
Decimal types are now supported by some basic arithmetic functions (ARROW-12074).
The take function now supports dense unions (ARROW-13005).
It is now possible to cast between dictionary types with different index types (ARROW-11673).
Sorting is now implemented for boolean input (ARROW-12016).
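A minimal Python sketch of the last two items, assuming these are exposed in your pyarrow version:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Cast between dictionary types with different index types (ARROW-11673)
dict_arr = pa.array(["a", "b", "a"]).dictionary_encode()  # dictionary<int32, string>
dict_arr.cast(pa.dictionary(pa.int64(), pa.string()))     # re-indexed with int64

# Sorting boolean input (ARROW-12016)
pc.sort_indices(pa.array([True, False, None, True]))
```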
CSV
The streaming CSV reader can now take some advantage of multiple threads (ARROW-11889).
The CSV reader tries to make its errors more informative by adding the row number when it is known, i.e. when parallel reading is disabled (ARROW-12675).
A new option ReaderOptions::skip_rows_after_names allows skipping a number of rows after reading the column names (as opposed to ReaderOptions::skip_rows).
Quoted strings can now be treated as always non-null (ARROW-10115).
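A sketch of how these options can look from Python; whether a given pyarrow version exposes skip_rows_after_names under this name is an assumption here:

```python
import pyarrow.csv as csv

table = csv.read_csv(
    "data.csv",  # illustrative file name
    # skip one row after the header row (assumed Python binding of the C++ option)
    read_options=csv.ReadOptions(skip_rows_after_names=1),
    # treat quoted strings as always non-null (ARROW-10115)
    convert_options=csv.ConvertOptions(quoted_strings_can_be_null=False),
)
```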
Dataset layer
The asynchronous scanner introduced in 4.0.0 has been improved: truly asynchronous readers are now implemented for the CSV, Parquet, and IPC file formats, and file-level parallelism has been added. This mode is controlled by a use_async flag that can be passed into methods which scan a dataset. Setting this flag to True can yield significant speedups on filesystems with high latency or parallel reads (e.g. S3).
A CountRows method has been added to count rows matching a predicate; where possible, this will use metadata in files instead of reading the data itself.
CSV datasets can now be written, and when reading a CSV dataset, explicit types can now be specified for a subset of columns while allowing the rest to still be inferred.
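A hedged Python sketch of the features above; the use_async flag is confirmed in the Python notes below, while the count_rows binding and CSV dataset writing are assumed to be exposed as shown:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/data", format="parquet")  # illustrative path
table = dataset.to_table(use_async=True)            # asynchronous scan
n = dataset.count_rows(filter=ds.field("x") > 0)    # may be answered from file metadata
ds.write_dataset(table, "out_csv", format="csv")    # writing a CSV dataset
```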
IO and Filesystem layer
The I/O thread pool size can now be adjusted at runtime (ARROW-12760). The default size remains 8 threads.
Streams can now carry auxiliary metadata, depending on the backend. This has been implemented for the S3 filesystem, where a couple of metadata keys such as Content-Type and ACL are supported (ARROW-11161, ARROW-12719).
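From Python, this could look like the following sketch (assuming the metadata argument is exposed on open_output_stream; bucket and key names are hypothetical):

```python
from pyarrow.fs import S3FileSystem

s3 = S3FileSystem()
# Attach backend metadata such as Content-Type to the written object
with s3.open_output_stream("my-bucket/greeting.txt",
                           metadata={"Content-Type": "text/plain"}) as out:
    out.write(b"hello")
```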
The HadoopFileSystem implementation now implements the FileSystem abstraction more faithfully (ARROW-12790).
Parquet
The new LZ4_RAW compression scheme was implemented (PARQUET-1998). Unlike the legacy LZ4 compression scheme, it is defined unambiguously and should provide better portability once other Parquet implementations catch up.
Go notes
Flight
- The Flight client and server now support custom metadata through the functions flight.NewClientWithMiddleware and flight.NewServerWithMiddleware. The functions flight.NewFlightClient, flight.NewFlightServer, and flight.CreateServerBearerTokenAuthInterceptors have been deprecated in favor of the new middleware. #10633
- The Flight client AuthHandler no longer overwrites outgoing metadata: new metadata is correctly appended without clobbering existing entries. #10297
- The Flight AppMetadata field is now exposed for both reading and writing via the flight.Reader#LatestAppMetadata() and flight.Writer#WriteWithAppMetadata functions. #10142
Other enhancements
- Map and Extension data types are now implemented for Arrow arrays.
- The Schema package and the first part of the Encoding package have been added to the Go Parquet implementation.
Java notes
Highlighted improvements and fixes:
- Improved support for extension types using a complex storage type, e.g. struct, map or union. These can now extend the ExtensionTypeVector base class.
- Union vectors now extend AbstractContainerVector to be consistent with other vectors.
- Guava dependency updated to 30.1.1.
- Memory leak fixed if an exception occurs when reading IPC messages from a channel. #10423
- Flight error metadata is now propagated to the client. #10370
- JDBC adapter now preserves nullability. #10285
- The memory rounding policy is respected when allocating vector buffers, which helps save memory. #10576
API compatibility changes:
- Complex vectors now return covariant types from getObject(int). #9964
JavaScript notes
- Tables no longer extend DataFrames, which enables smaller bundles. #10277
- Arrow now uses the Closure Compiler for all UMD bundles, making them smaller. #10281
- The npm package now comes with declaration maps for better navigation from types to source code. #10673
- Dependency updates and general code improvements.
Python notes
- Datasets can now scan files asynchronously when the use_async=True option is provided to the Dataset.scanner, Dataset.to_table, or Dataset.to_batches methods. This should provide better performance in environments where I/O can be slow, such as with remote sources.
- Arrow now provides builtin support for writing CSV files through pyarrow.csv.write_csv (see the sketch after this list).
- Wheels for Apple M1 Macs are now provided.
- Many new pyarrow.compute functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
- It is now possible to access ORC file metadata from Python ORCFile objects.
- Building a StructArray now accepts a mask like other arrays.
- Many updates and fixes for the documentation.
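A short sketch of the CSV-writing and StructArray items above:

```python
import pyarrow as pa
import pyarrow.csv as csv

# Builtin CSV writing
csv.write_csv(pa.table({"a": [1, 2, None]}), "out.csv")

# Building a StructArray with a mask (True marks a null struct)
pa.StructArray.from_arrays(
    [pa.array([1, 2]), pa.array(["x", "y"])],
    names=["n", "s"],
    mask=pa.array([False, True]),
)
```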
R notes
In this release, we’ve more than doubled the number of functions you can call on Arrow Datasets inside dplyr::filter(), mutate(), and arrange(), including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We’ve also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.
For more on what’s in the 5.0.0 R package, see the R changelog.
Ruby and C GLib notes
Initial Apache Arrow Flight support has been added, but only ListFlights is supported for now. More features will be implemented in the next major release.
Ruby
You need the gobject-introspection gem 3.4.5 or later to implement an Apache Arrow Flight server. If you only use the Apache Arrow Flight client, gobject-introspection gem 3.4.5 or later isn’t required.
Here are highlighted improvements:
- Compute functions accept raw Ruby objects such as true, Integer, Array and String:

```ruby
add_function = Arrow::Function.find("add")

# Not shortcut version
augend = Arrow::Int8Array.new([1, 2, 3])
addend = Arrow::Int8Scalar.new(5)
args = [
  Arrow::ArrayDatum.new(augend),
  Arrow::ScalarDatum.new(addend),
]
add_function.execute(args).value.to_a # => [6, 7, 8]

# Shortcut version
add_function.execute([[1, 2, 3], 5]).value.to_a # => [6, 7, 8]
```

- Arrow::PrimaryArray and Arrow::Buffer can be used as a MemoryView, which was added in Ruby 3.0.
There are some backward incompatible changes:
- Arrow::CountOptions and Arrow::CountMode are removed. Use Arrow::ScalarAggregateOptions instead.
C GLib
There are some backward incompatible changes:
- GArrowCountOptions and GArrowCountMode are removed. Use GArrowScalarAggregateOptions instead.
- garrow_array_equal_range() requires GArrowEqualOptions.
- The prefix in arrow-dataset-glib has changed to gadataset_/GADATASET_ from gad_/GAD_.
- GADScanOptions, GADScanTask and GADInMemoryScanTask are removed. Use gadataset_begin_scan() or gadataset_to_table() instead.
- GArrowCompareOptions, GArrowCompareOperator and garrow_*_array_compare() are removed. Use the equal, not_equal, less_than, less_than_equal, greater_than and greater_than_equal compute functions directly instead.
Rust notes
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 5.0.0 release of the Rust implementation, see the Arrow Rust changelog and the Apache Arrow Rust 5.0.0 Release blog post.