Apache Arrow 2.0.0 Release
22 Oct 2020
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 2.0.0 release. This covers over 3 months of development work and includes 511 resolved issues from 81 distinct contributors. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.
Since the 1.0.0 release, Jorge Leitão has been added as a committer. Thank you for your contributions!
As this is the first major release since 1.0.0, we remind everyone that we have moved to a “split” versioning system, where the Library version (now 2.0.0) evolves separately from the Format version (still 1.0.0). Major releases of the libraries may contain non-backward-compatible API changes, but they will not contain any incompatible format changes. See the Versioning and Stability page in the documentation for more.
Columnar format notes
The columnar format metadata has been updated to permit 256-bit decimal values in addition to 128-bit decimals. This change is backward and forward compatible.
Arrow Flight RPC notes
For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in
DoPut have been addressed. In C++ and Python, a deadlock
has been fixed in an edge case. Additionally, when supported by gRPC, TLS
verification can be disabled.
C++ notes
Parquet reading now fully supports round trips of arbitrarily nested data, including extension types with a nested storage type. In the process, several bugs in writing nested data and FixedSizeList were fixed. If you write data with these types, we recommend upgrading to this release and validating old data, as there is potential for data loss.
Datasets can now be written with partitions, including these features:
- Writing to Parquet, including control over accumulation of statistics for individual columns.
- Writing to IPC/Feather, including body buffer compression.
- Basenames of written files can be specified with a string template, allowing non-colliding writes into the same partitioned dataset.
Other notable features in the release include
- Compute kernels for standard deviation, variance, and mode
- Improvements to S3 support, including automatic region detection
- CSV reading now supports parsing Date types and creating Dictionary types
.NET notes
The .NET package has added a number of new features this release:
- Full support for …
- Synchronous write APIs for ArrowFileWriter. These are complementary to the existing async write APIs and can be used in situations where the async APIs can’t be used.
- The ability to use DateTime instances with …
Java notes
The Java package has gained a number of new features. Users can now validate vectors more thoroughly, at the cost of additional time. In dictionary encoding, dictionary indices can be expressed as unsigned integers. A framework for data compression has been set up for IPC.
The calculation for vector capacity has been simplified, so users should experience notable performance improvements for various ‘setSafe’ methods.
Bugs for JDBC adapters, sort algorithms, and ComplexCopier have been resolved to make them more usable.
JavaScript notes
The JavaScript build has been upgraded to use TypeScript 3.9, fixing generated …
Python notes
Parquet reading now supports round trips of arbitrarily nested data, and several bugs in writing nested data and FixedSizeList have been fixed. If you write data with these types, we recommend validating old data (there is potential for data loss) and upgrading to 2.0. Extension types with a nested storage type now round trip through Parquet.
The pyarrow.filesystem submodule is deprecated in favor of the new filesystem …
The custom serialization functionality (pyarrow.deserialize(), etc.) is deprecated. Those functions provided a Python-specific (not cross-language) serialization format that was not compatible with the standardized Arrow IPC serialization format. For arbitrary objects, you can use the standard library … instead. For pyarrow objects, you can use the IPC serialization format through the pyarrow.ipc module.
The pyarrow.compute module now has complete coverage of the available C++ compute kernels in the Python API. Several new kernels have been added.
The pyarrow.dataset module was further improved. In addition to reading, it is now also possible to write partitioned datasets (with …).
The Arrow <-> Python conversion code was refactored, fixing several bugs and corner cases.
Conversion of arrays of pyarrow.MapType to pandas has been added.
Conversion of timezone-aware datetimes to and from pyarrow arrays, including via pandas, now round-trips preserving the timezone. To use the old behavior (e.g. for Spark), set the environment variable PYARROW_IGNORE_TIMEZONE to a truthy value.
R notes
Highlights of the R release include
- Writing multi-file datasets with partitioning to Parquet or Feather
- Reading and writing directly to AWS S3, both individual files and multi-file datasets
- Bindings for Flight which use reticulate
In addition, the R package benefits from the various improvements in the C++ library listed above, including the ability to read and write Parquet files with nested struct and list types.
For more on what’s in the 2.0.0 R package, see the R changelog.
Ruby and C GLib notes
In the Ruby bindings, Arrow::Table#save now uses the number of rows as the chunk_size parameter by default when the table is saved to a Parquet file.
The GLib bindings newly support …
Moreover, the GLib bindings support new accessors for GArrowLargeListArray: …
Rust notes
Due to the high volume of activity in the Rust subproject in this release, we’re writing a separate blog post dedicated to those changes.