Apache Arrow 0.17.0 Release


Published 21 Apr 2020
By The Apache Arrow PMC

The Apache Arrow team is pleased to announce the 0.17.0 release. This covers over 2 months of development work and includes 569 resolved issues from 79 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

Since the 0.16.0 release, two committers have joined the Project Management Committee (PMC).

Thank you for all your contributions!

Columnar Format Notes

A C-level Data Interface was designed to ease data sharing inside a single process. It allows different runtimes or libraries to share Arrow data using a well-known binary layout and metadata representation, without any copies. Third-party libraries can use the C interface to import and export the Arrow columnar format in-process without taking on any new code dependencies.

The C++ library now includes an implementation of the C Data Interface, and Python and R have bindings to that implementation.
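
For illustration, a minimal Python round trip through the interface might look like the following. This is a sketch that assumes the low-level bindings (still private, underscore-prefixed methods) and the pyarrow.cffi helper module that carries the struct definitions:

```python
import pyarrow as pa
from pyarrow.cffi import ffi  # cffi definitions of the interface structs

# Allocate the two C structs that make up the interface
c_schema = ffi.new("struct ArrowSchema*")
c_array = ffi.new("struct ArrowArray*")
schema_ptr = int(ffi.cast("uintptr_t", c_schema))
array_ptr = int(ffi.cast("uintptr_t", c_array))

# Export: pyarrow fills both structs; the consumer takes ownership
# and must call the embedded release callback when done
arr = pa.array([1, 2, None, 4])
arr._export_to_c(array_ptr, schema_ptr)

# Import: rebuild a pyarrow.Array from the structs without copying buffers
roundtripped = pa.Array._import_from_c(array_ptr, schema_ptr)
assert roundtripped.equals(arr)
```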

Arrow Flight RPC notes

  • Adopted the new DoExchange bidirectional data RPC method
  • ListFlights supports being passed a Criteria argument in Java/C++/Python, allowing applications to search for flights satisfying a given query (see the client sketch after this list).
  • Custom metadata can be attached to errors that the server sends to the client, which can be used to encode richer application-specific information.
  • A number of minor bugs were fixed, including proper handling of empty null arrays in Java and round-tripping of certain Arrow status codes in C++/Python.
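
As a minimal client-side sketch of the Criteria feature in Python (the server address and the criteria encoding below are hypothetical; the byte string is opaque to Flight and interpreted by the application):

```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:8815")

# The criteria bytes are application-defined; the server decides how
# to interpret them when filtering its flight listing
for info in client.list_flights(criteria=b"category=sales"):
    print(info.descriptor, info.total_records)
```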

C++ notes

Feather V2

The “Feather V2” format, based on the Arrow IPC file format, was developed. Feather V2 features full support for all Arrow data types and resolves the 2GB per-column limitation that the original Feather implementation hit with large amounts of string data. Feather V2 also introduces experimental IPC message compression using the LZ4 frame format or ZSTD; this will be formalized in the Arrow format later.
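
As a short sketch of the Python side, assuming the pyarrow.feather entry points (write_feather gained version and compression arguments, and read_table returns an Arrow table):

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# version=2 selects the new Arrow-IPC-based format; the compression
# argument exercises the experimental buffer compression
feather.write_feather(table, "data.feather", version=2, compression="zstd")

assert feather.read_table("data.feather").equals(table)
```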

C++ Datasets

  • Improve speed on high-latency filesystems by relaxing discovery validation
  • Better performance with Arrow IPC files using column projection
  • Add the ability to list files in FileSystemDataset
  • Add support for Parquet file reader options
  • Support dictionary columns in partition expressions
  • Fix various crashes and other issues

C++ Parquet notes

  • Support for writing nested types to the Parquet format is now complete (see the sketch after this list). The legacy code path can still be accessed through a Parquet write option in C++ and an environment variable in Python. Read support will come in a future release.
  • The BYTE_STREAM_SPLIT encoding was implemented for floating-point types. It can improve compression efficiency for floating-point data that otherwise compresses poorly.
  • Expose Parquet schema field_id as Arrow field metadata
  • Support for DataPageV2 data page format
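
As a minimal Python sketch of the write side (the column name and values are hypothetical; reading nested data back is deferred to a future release, as noted above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A list-of-struct column, a nested type that previously could not be
# written faithfully to Parquet
points = pa.array(
    [[{"x": 1, "y": 2}], [{"x": 3, "y": 4}, {"x": 5, "y": 6}]],
    type=pa.list_(pa.struct([("x", pa.int64()), ("y", pa.int64())])),
)
pq.write_table(pa.table({"points": points}), "nested.parquet")
```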

C++ build notes

  • We continued to make the core C++ library build simpler and faster. Among the improvements is the removal of the dependency on the Thrift IDL compiler at build time; while Parquet still requires the Thrift runtime C++ library, its dependencies are much lighter. We also further reduced the number of build configurations that require Boost, and when Boost does need to be bundled, we only download the components we need, reducing the size of the Boost bundle by 90%.
  • Improved support for building on ARM platforms
  • Upgraded LLVM version from 7 to 8
  • Simplified SIMD build configuration with the ARROW_SIMD_LEVEL option, which selects among no SIMD, SSE4.2, AVX2, or AVX512.
  • Fixed a number of bugs affecting compilation on aarch64 platforms

Other C++ notes

  • Many crashes on invalid input detected by OSS-Fuzz in the IPC reader and in Parquet-Arrow reading were fixed. See our recent blog post for more details.
  • A “Device” abstraction was added to simplify buffer management and movement across heterogeneous hardware configurations, e.g. CPUs and GPUs.
  • A streaming CSV reader was implemented, yielding individual RecordBatches and helping limit overall memory consumption.
  • Array casting from Decimal128 to integer types, and to Decimal128 with a different scale or precision, was added (see the sketch after this list).
  • Sparse CSF tensors are now supported.
  • When creating an Array, the null bitmap is not kept if the null count is known to be zero.
  • Compressor support for the LZ4 frame format (LZ4_FRAME) was added.
  • An event-driven interface for reading IPC streams was added.
  • Further core APIs that required passing an explicit out-parameter were migrated to Result<T>.
  • New analytics kernels for match, sort indices (argsort), and top-k were added.
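
A minimal sketch of the new Decimal128 casts as seen from Python, assuming they surface through the usual Array.cast binding:

```python
import pyarrow as pa
from decimal import Decimal

arr = pa.array([Decimal("1.23"), Decimal("45.60")], type=pa.decimal128(5, 2))

# Cast to a Decimal128 type with a different precision and scale
wider = arr.cast(pa.decimal128(10, 4))

# Cast scale-zero decimals to an integer type
ints = pa.array([Decimal("7"), Decimal("8")],
                type=pa.decimal128(5, 0)).cast(pa.int64())
```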

Java notes

  • Netty dependencies were removed from the BufferAllocator and ReferenceManager classes. In the future, we plan to move Netty-related classes to a separate module.
  • New features were provided to support efficiently appending vector/vector schema root values in batch.
  • Comparing a range of values in dense union vectors is now supported.
  • The quicksort algorithm was improved to avoid degenerating to the worst case.

Python notes

Datasets

  • Updated the pyarrow.dataset module following the changes in the C++ Datasets project. This release also adds richer documentation on the datasets module.
  • Support for the improved dataset functionality in pyarrow.parquet.read_table/ParquetDataset. To enable it, pass use_legacy_dataset=False. Among other things, this allows specifying filters on all columns, not only the partition keys (using row group statistics), and enables different partitioning schemes; see the “note” in the ParquetDataset documentation and the sketch after this list.
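
A sketch of both entry points, assuming a hypothetical hive-partitioned directory data/ containing a value column:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# New dataset API: discover the files, then materialize with a filter
dataset = ds.dataset("data/", format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("value") > 10)

# Same result through the familiar entry point; filters on non-partition
# columns can use row group statistics to skip data
table = pq.read_table("data/", use_legacy_dataset=False,
                      filters=[("value", ">", 10)])
```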

Packaging

  • Wheels for Python 3.8 are now available
  • Support for Python 2.7 has been dropped as Python 2.x reached end-of-life in January 2020.
  • Nightly wheels and conda packages are now available for testing or other development purposes; see the installation guide.

Other improvements

  • Conversion to NumPy/pandas is now supported for FixedSizeList, LargeString, and LargeBinary types (see the sketch after this list).
  • Support for sparse CSC matrices and sparse CSF tensors was added. (ARROW-7419, ARROW-7427)
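
A minimal sketch of the newly supported conversions (the column contents are illustrative):

```python
import pyarrow as pa

# FixedSizeList and LargeString columns can now be converted to pandas
fixed = pa.array([[1, 2], [3, 4]], type=pa.list_(pa.int64(), 2))
print(fixed.to_pandas())

big = pa.array(["a", "b"], type=pa.large_string())
print(big.to_pandas())
```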

R notes

Highlights include support for the Feather V2 format and the C Data Interface, both described above. Along with low-level bindings for the C interface, this release adds tooling to work with Arrow data in Python using reticulate. See vignette("python", package = "arrow") for a guide to getting started.

Installation on Linux now builds the C++ library from source by default. For a faster, richer build, set the environment variable NOT_CRAN=true. See vignette("install", package = "arrow") for details and more options.

For more on what’s in the 0.17 R package, see the R changelog.

Ruby and C GLib notes

Ruby

  • Support Ruby 2.3 again

C GLib

  • Add GArrowRecordBatchIterator
  • Add support for GArrowFilterOptions
  • Add support for Peek() to GIOInputStream
  • Add some metadata bindings to GArrowSchema
  • Add LocalFileSystem support
  • Add support for Parquet writer properties
  • Add support for MapArray
  • Add support for BooleanNode

Rust notes

  • DictionaryArray support.
  • Various improvements to code safety.
  • Filter kernel now supports temporal types.

Rust Parquet notes

  • Array reader now supports temporal types.
  • Parquet writer now supports custom meta-data key/value pairs.

Rust DataFusion notes

  • Logical plans can now reference columns by name (as well as by index) using the new UnresolvedColumn expression. There is a new optimizer rule to resolve these into column indices.
  • Scalar UDFs can now be registered with the execution context and used from logical query plans as well as from SQL. A number of math scalar functions have been implemented using this feature (sqrt, cos, sin, tan, asin, acos, atan, floor, ceil, round, trunc, abs, signum, exp, log, log2, log10).
  • Various SQL improvements, including support for SELECT * and SELECT COUNT(*), and improvements to parsing of aggregate queries.
  • Flight examples are provided, with a client that sends a SQL statement to a Flight server and receives the results.
  • The interactive SQL command-line tool now has improved documentation and better formatting of query results.

Project Operations

We’ve continued our migration of general automation toward GitHub Actions. The majority of our commit-by-commit continuous integration (CI) now runs on GitHub Actions. We are working on different solutions for using dedicated hardware as part of our CI: the Buildkite self-hosted CI/CD platform is now supported on Apache repositories, and GitHub Actions also supports self-hosted runners.