Apache Arrow 0.15.0 Release
Published
06 Oct 2019
By
The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 0.15.0 release. This covers about 3 months of development work and includes 687 resolved issues from 80 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.
About a third of issues closed (240) were classified as bug fixes, so this release brings many stability, memory use, and performance improvements over 0.14.x. We will discuss some of the language-specific improvements and new features below.
New committers
Since the 0.14.0 release, we’ve added four new committers:
In addition, Sebastien Binet and Micah Kornfield have joined the PMC.
Thank you for all your contributions!
Columnar Format Notes
The format gets new datatypes : LargeList(ARROW-4810), LargeBinary and LargeString (ARROW-750). LargeList is similar to List but with 64-bit offsets instead of 32-bit. The same relationship holds for LargeBinary and LargeString with respect to Binary and String.
Since the last major release, we have also made a significant overhaul of the columnar format documentation to be clearer and easier to follow for implementation creators.
Upcoming Columnar Format Stability and Library / Format Version Split
The Arrow community has decided to make a 1.0.0 release of the project marking formal stability of the columnar format and binary protocol, including explicit forward and backward compatibility guarantees. You can read about these guarantees in the new documentation page about versioning.
Starting with 1.0.0, we will give the columnar format and libraries separate version numbers. This will allow the library versions to evolve without creating confusion or uncertainty about whether the Arrow columnar format remains stable or not.
Columnar “Streaming Protocol” Change since 0.14.0
Since 0.14.0 we have modified the IPC “encapsulated message” format to insert 4 bytes of additional data in the message preamble to ensure that the Flatbuffers metadata starts on an aligned offset. By default, IPC streams generated by 0.15.0 and later will not be readable by library versions 0.14.1 and prior. Implementations have offered options to write messages using the now “legacy” message format.
For users who cannot upgrade to version 0.15.0 in all parts of their system, such as Apache Spark users, we recommend one of the two routes:
- If using pyarrow, set the environment variable
ARROW_PRE_0_15_IPC_FORMAT=1
when using 0.15.0 and sending data to an old library - Wait to upgrade all components simultaneously
We do not anticipate making this kind of change again in the near future and would not have made such a non-forward-compatible change unless we deemed it very important.
Arrow Flight notes
A GetFlightSchema method is added to the Flight RPC protocol (ARROW-6094). As the name suggests, it returns the schema for a given Flight descriptor on the server. This is useful for cases where the Flight locations are not immediately available, depending on the server implementation.
Flight implementations for C++ and Java now implement half-closed semantics for DoPut (ARROW-6063). The client can close the writing end of the stream to signal that it has finished sending the Flight data, but still receive the batch-specific response and its associated metadata.
C++ notes
C++ now supports the LargeList, LargeBinary and LargeString datatypes.
The Status class gains the ability to carry additional subsystem-specific data with it, under the form of an opaque StatusDetail interface (ARROW-4036). This allows, for example, to store not only an exception message coming from Python but the actual Python exception object, such as to raise it again if the Status is propagated back to Python. It can also enable the consumer of Status to inspect the subsystem-specific error, such as a finer-grained Flight error code.
DataType and Schema equality are significantly faster (ARROW-6038).
The Column class is completely removed, as it did not have a strong enough motivation for existing between ChunkedArray and RecordBatch / Table (ARROW-5893).
C++: Parquet
The 0.15 release includes many improvements to the Apache Parquet C++ internals, resulting in greatly improved read and write performance. We described the work and published some benchmarks in a recent blog post.
C++: CSV reader
The CSV reader is now more flexible in terms of how column names are chosen (ARROW-6231) and column selection (ARROW-5977).
C++: Memory Allocation Layer
Arrow now has the option to allocate memory using the mimalloc memory allocator. jemalloc is still preferred for best performance, but mimalloc is a reasonable alternative to the system allocator on Windows where jemalloc is not currently supported.
Also, we now expose explicit global functions to get a MemoryPool for each of the jemalloc allocator, mimalloc allocator and system allocator (ARROW-6292).
The vendored jemalloc version is bumped from 4.5.x to 5.2.x (ARROW-6549). Performance characteristics may differ on memory allocation-heavy workloads, though we did not notice any significant regression on our suite of micro-benchmarks (and a multi-threaded benchmark of reading a CSV file showed a 25% speedup).
C++: Filesystem layer
A FileSystem implementation to access Amazon S3-compatible filesystems is now available. It depends on the AWS SDK for C++.
C++: I/O layer
Significant improvements were made to the Arrow I/O stack.
- ARROW-6180: Add RandomAccessFile::GetStream that returns an InputStream over a fixed subset of the file.
- ARROW-6381: Improve performance of small writes with BufferOutputStream
- ARROW-2490: Streamline concurrency semantics of InputStream implementations, and add debug checks for race conditions between non-thread-safe InputStream operations.
- ARROW-6527: Add an OutputStream::Write overload that takes an owned Buffer rather than a raw memory area. This allows OutputStream implementations to safely implement delayed writing without having to copy the data.
C++: Tensors
There are three improvements of Tensor and SparseTensor in this release.
- Add Tensor::Value template function for element access
- Add EqualOptions support in Tensor::Equals function, that allows us to control the way to compare two float tensors
- Add smaller bit-width index supports in SparseTensor
C# Notes
We have fixed some bugs causing incompatibilities between C# and other Arrow implementations.
Java notes
- Added an initial version of an Avro adapter
- To improve the JDBC adapter performance, refactored consume data logic and implemented an iterator API to prevent loading all data into one vector
- Implemented subField encoding for complex type, now List and Struct vectors subField encoding is available
- Implemented visitor API for vector/range/type/approx equals compare
- Performed a lot of optimization and refactoring for DictionaryEncoder, supporting all data types and avoiding memory copy via hash table and visitor API
- Introduced Error Prone into code base to catch more potential errors earlier
- Fixed the bug where dictionary entries were required in IPC streams even when empty; readers can now also read interleaved messages
Python notes
The FileSystem API, implemented in C++, is now available in Python (ARROW-5494).
The API for extension types has been straightened and definition of custom extension types in Python is now more powerful (ARROW-5610).
Sparse tensors are now available in Python (ARROW-4453).
A potential crash when handling Python dates and datetimes was fixed (ARROW-6597).
Based on a mailing list discussion, we are looking for help with maintaining
our Python wheels. Community members have found that the wheels take up a great
deal of maintenance time, so if you or your organization depend on pip install
pyarrow
working, we would appreciate your assistance.
Ruby and C GLib notes
Ruby and C GLib continues to follow the features in the C++ project. Ruby includes the following backward incompatible changes.
- Remove Arrow::Struct and use Hash instead.
- Add Arrow::Time for Arrow::Time{32,64}DataType value.
- Arrow::Decimal128Array#get_value returns BigDecimal.
Ruby improves the performance of Arrow#values.
Rust notes
A number of core Arrow improvements were made to the Rust library.
- Add explicit SIMD vectorization for the divide kernel
- Add a feature to disable SIMD
- Use “if cfg!” pattern
- Optimizations to BooleanBufferBuilder::append_slice
- Implemented Debug trait for List/Struct/BinaryArray
Improvements related to Rust Parquet and DataFusion are detailed next.
Rust Parquet
- Implement Arrow record reader
- Add converter that is used to convert record reader’s content to arrow primitive array.
Rust DataFusion
- Preview of new query execution engine using an extensible trait-based physical execution plan that supports parallel execution using threads
- ExecutionContext now has a register_parquet convenience method for registering Parquet data sources
- Fixed bug in type coercion optimizer rule
- TableProvider.scan() now returns a thread-safe BatchIterator
- Remove use of bare trait objects (switched to using dyn syntax)
- Adds casting from unsigned to signed integer data types
R notes
A major development since the 0.14 release was the arrival of the arrow
R
package on CRAN. We wrote about this in August on the Arrow blog.
In addition to the package availability on CRAN, we also published
package documentation on the Arrow website.
The 0.15 R package includes many of the enhancements in the C++ library release, such as the Parquet performance improvements and the FileSystem API. In addition, there are a number of upgrades that make it easier to read and write data, specify types and schema, and interact with Arrow tables and record batches in R.
For more on what’s in the 0.15 R package, see the changelog.
Community Discussions Ongoing
There are a number of active discussions ongoing on the developer dev@arrow.apache.org mailing list. We look forward to hearing from the community there.