Apache Arrow 2.0.0 Release
Published 22 Oct 2020 by The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 2.0.0 release. This covers over 3 months of development work and includes 511 resolved issues from 81 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Community
Since the 1.0.0 release, Jorge Leitão has been added as a committer. Thank you for your contributions!
Columnar Format
As this is the first major release since 1.0.0, we remind everyone that we have moved to a “split” versioning system, where the Library version (now 2.0.0) evolves separately from the Format version (still 1.0.0). Major releases of the libraries may contain non-backward-compatible API changes, but they will not contain any incompatible format changes. See the Versioning and Stability page in the documentation for more details.
The columnar format metadata has been updated to permit 256-bit decimal values in addition to 128-bit decimals. This change is backward and forward compatible.
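As an illustration, here is a minimal sketch of building a 256-bit decimal array through pyarrow, assuming a library build that implements the new type (the 2.0.0 format change itself only adds the metadata; the precision, scale, and value below are arbitrary):

```python
import pyarrow as pa
from decimal import Decimal

# decimal256 allows precision beyond the 38 digits of decimal128
ty = pa.decimal256(50, 2)
arr = pa.array([Decimal("12345678901234567890123456789012345678901234.56")],
               type=ty)
print(arr.type)  # decimal256(50, 2)
```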
Arrow Flight RPC notes
For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in FlightStream and DoPut have been addressed. In C++ and Python, a deadlock has been fixed in an edge case. Additionally, when supported by gRPC, TLS verification can be disabled.
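For Python users, disabling server verification looks roughly like the sketch below; the server location is a placeholder, and the disable_server_verification option assumes a gRPC build that supports skipping verification:

```python
import pyarrow.flight as flight

# Connect over TLS but skip certificate verification (useful when testing
# against self-signed certificates). Location is a placeholder.
client = flight.FlightClient(
    "grpc+tls://localhost:8815",
    disable_server_verification=True,
)
for info in client.list_flights():
    print(info.descriptor)
```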
C++ notes
Parquet reading now fully supports round trips of arbitrarily nested data, including extension types with a nested storage type. In the process, several bugs in writing nested data and FixedSizeList were fixed. If you write data with these types, we recommend upgrading to this release and validating previously written data, as there is potential for data loss.
Datasets can now be written with partitions, including these features:
- Writing to Parquet, including control over accumulation of statistics for individual columns.
- Writing to IPC/Feather, including body buffer compression.
- Basenames of written files can be specified with a string template, allowing non-colliding writes into the same partitioned dataset.
Other notable features in the release include
- Compute kernels for standard deviation, variance, and mode
- Improvements to S3 support, including automatic region detection
- CSV reading now parses Date types and can create Dictionary types (see the Python sketch below)
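As an illustration of the CSV improvements through the Python bindings (which wrap the same C++ reader), here is a minimal sketch; the sample data is made up, the auto_dict_encode option is the assumed way to request dictionary encoding, and the date inference is assumed from the release note:

```python
import io
import pyarrow as pa
from pyarrow import csv

data = io.BytesIO(b"day,city,sales\n2020-10-22,Paris,3\n2020-10-23,Paris,5\n")

# Dictionary-encode low-cardinality string columns; ISO dates such as
# 2020-10-22 should be inferred as a Date type (or can be forced with
# convert_options=csv.ConvertOptions(column_types={"day": pa.date32()})).
table = csv.read_csv(
    data,
    convert_options=csv.ConvertOptions(auto_dict_encode=True),
)
print(table.schema)
```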
C# notes
The .NET package has added a number of new features this release.
- Full support for Struct types.
- Synchronous write APIs for ArrowStreamWriter and ArrowFileWriter. These are complementary to the existing async write APIs and can be used in situations where the async APIs can’t be used.
- The ability to use DateTime instances with Date32Array and Date64Array.
Java notes
The Java package has added a number of new features. Users can now validate vectors in a wider range of aspects, at the cost of additional processing time. In dictionary encoding, dictionary indices can be expressed as unsigned integers. A framework for data compression has been set up for IPC.
The calculation for vector capacity has been simplified, so users should experience notable performance improvements for various ‘setSafe’ methods.
Bugs in the JDBC adapter, sort algorithms, and ComplexCopier have been resolved, making them more usable.
JavaScript notes
Arrow’s build has been upgraded to TypeScript 3.9, fixing the generated .d.ts typings.
Python notes
Parquet reading now supports round trips of arbitrarily nested data, and several bugs in writing nested data and FixedSizeList have been fixed. If you write data with these types, we recommend validating previously written data (there is potential for data loss) and upgrading to 2.0.
Extension types with a nested storage type now round trip through Parquet.
The pyarrow.filesystem submodule is deprecated in favor of the new filesystem implementations in pyarrow.fs.
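For example, the new-style filesystems are created directly from pyarrow.fs (the S3 region below is a placeholder):

```python
from pyarrow import fs

# Local filesystem: list entries in the current directory
local = fs.LocalFileSystem()
print(local.get_file_info(fs.FileSelector(".", recursive=False)))

# S3 filesystem with an explicit region (placeholder value)
s3 = fs.S3FileSystem(region="us-east-1")
```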
The custom serialization functionality (pyarrow.serialize(), pyarrow.deserialize(), etc.) is deprecated. Those functions provided a Python-specific (not cross-language) serialization format which was not compatible with the standardized Arrow IPC serialization format. For arbitrary objects, you can use the standard library pickle functionality instead. For pyarrow objects, you can use the IPC serialization format through the pyarrow.ipc module.
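For instance, a table can be round-tripped through the IPC stream format roughly as follows:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Serialize to a buffer using the Arrow IPC stream format
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
buf = sink.getvalue()

# Deserialize back into a table
restored = pa.ipc.open_stream(buf).read_all()
assert restored.equals(table)
```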
The pyarrow.compute module now has complete coverage of the available C++ compute kernels in the Python API. Several new kernels have been added.
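For example, the new aggregation kernels (standard deviation, variance, and mode, per the C++ notes above) are exposed as plain functions:

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1.0, 2.0, 2.0, 3.0])

print(pc.stddev(arr))    # standard deviation
print(pc.variance(arr))  # variance
print(pc.mode(arr))      # most frequent value
```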
The pyarrow.dataset module was further improved. In addition to reading, it is now also possible to write partitioned datasets (with write_dataset()).
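A minimal sketch of writing a partitioned dataset (the output path, column names, and Hive-style partitioning scheme are placeholders chosen for illustration):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "year": [2019, 2019, 2020, 2020],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Write one Parquet directory per "year" partition under /tmp/sales_dataset
ds.write_dataset(
    table,
    "/tmp/sales_dataset",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int64())]),
                                 flavor="hive"),
)
```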
The Arrow <-> Python conversion code was refactored, fixing several bugs and corner cases.
Conversion of an array of pyarrow.MapType to pandas has been added.
Conversion of timezone-aware datetimes to and from pyarrow arrays (including via pandas) now round-trips while preserving the timezone. To restore the old behavior (e.g. for Spark), set the environment variable PYARROW_IGNORE_TIMEZONE to a truthy value (e.g. PYARROW_IGNORE_TIMEZONE=1).
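A quick illustration of the new round-trip behavior (the timestamp and timezone are arbitrary):

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(pd.to_datetime(["2020-10-22 12:00"])).dt.tz_localize("Europe/Paris")

arr = pa.Array.from_pandas(s)   # timestamp type carries the timezone
back = arr.to_pandas()          # timezone is preserved on the way back
print(arr.type, back.dt.tz)
```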
R notes
Highlights of the R release include
- Writing multi-file datasets with partitioning to Parquet or Feather
- Reading and writing directly to AWS S3, both individual files and multi-file datasets
- Bindings for Flight which use reticulate
In addition, the R package benefits from the various improvements in the C++ library listed above, including the ability to read and write Parquet files with nested struct and list types.
For more on what’s in the 2.0.0 R package, see the R changelog.
Ruby and C GLib notes
Ruby
In the Ruby bindings, Arrow::Table#save uses the number of rows as the chunk_size parameter by default when the table is saved to a Parquet file.
C GLib
The GLib binding newly supports GArrowStringDictionaryArrayBuilder and GArrowBinaryDictionaryArrayBuilder.
Moreover, the GLib binding supports new accessors for GArrowListArray and GArrowLargeListArray: get_values, get_value_offset, get_value_length, and get_value_offsets.
Rust notes
Due to the high volume of activity in the Rust subproject in this release, we’re writing a separate blog post dedicated to those changes.