Apache Arrow 0.3.0 Release
The Apache Arrow team is pleased to announce the 0.3.0 release of the project. It is the product of an intense 10 weeks of development since the 0.2.0 release from this past February. It includes 306 resolved JIRAs from 23 contributors.
While we have added many new features to the different Arrow implementations, one of the major development focuses in 2017 has been hardening the in-memory format, type metadata, and messaging protocol to provide a stable, production-ready foundation for big data applications. We are excited to be collaborating with the Apache Spark and GeoMesa communities on utilizing Arrow for high performance IO and in-memory data processing.
See the Install Page to learn how to get the libraries for your platform.
We will be publishing more information about the Apache Arrow roadmap as we forge ahead with using Arrow to accelerate big data systems.
We are looking for more contributors from within our existing communities and from other communities (such as Go, R, or Julia) to get involved in Arrow development.
File and Streaming Format Hardening
The 0.2.0 release brought with it the first iterations of the random access and streaming Arrow wire formats. See the IPC specification for implementation details and example blog post with some use cases. These provide low-overhead, zero-copy access to Arrow record batch payloads.
In 0.3.0 we have solidified a number of small details with the binary format and improved our integration and unit testing particularly in the Java, C++, and Python libraries. Using the Google Flatbuffers project has helped with adding new features to our metadata without breaking forward compatibility.
We are not yet ready to make a firm commitment to strong forward compatibility (in case we find something needs to change) in the binary format, but we will make efforts between major releases to not make unnecessary breakages. Contributions to the website and component user and API documentation would also be most welcome.
Dictionary Encoding Support
Emilio Lahr-Vivaz from the GeoMesa project contributed Java support
for dictionary-encoded Arrow vectors. We followed up with C++ and Python
pandas.Categorical integration). We have not yet implemented
full integration tests for dictionaries (for sending this data between C++ and
Java), but hope to achieve this in the 0.4.0 Arrow release.
This common data representation technique for categorical data allows multiple record batches to share a common “dictionary”, with the values in the batches being represented as integers referencing the dictionary. This data is called “categorical” or “factor” in statistical languages, while in file formats like Apache Parquet it is strictly used for data compression.
Expanded Date, Time, and Fixed Size Types
A notable omission from the 0.2.0 release was complete and integration-tested support for the gamut of date and time types that occur in the wild. These are needed for Apache Parquet and Apache Spark integration.
- Date: 32-bit (days unit) and 64-bit (milliseconds unit)
- Time: 64-bit integer with unit (second, millisecond, microsecond, nanosecond)
- Timestamp: 64-bit integer with unit, with or without timezone
- Fixed Size Binary: Primitive values occupying certain number of bytes
- Fixed Size List: List values with constant size (no separate offsets vector)
We have additionally added experimental support for exact decimals in C++ using Boost.Multiprecision, though we have not yet hardened the Decimal memory format between the Java and C++ implementations.
C++ and Python Support on Windows
We have made many general improvements to development and packaging for general C++ and Python development. 0.3.0 is the first release to bring full C++ and Python support for Windows on Visual Studio (MSVC) 2015 and 2017. In addition to adding Appveyor continuous integration for MSVC, we have also written guides for building from source on Windows: C++ and Python.
For the first time, you can install the Arrow Python library on Windows from conda-forge:
conda install pyarrow -c conda-forge
C (GLib) Bindings, with support for Ruby, Lua, and more
Kouhei Sutou is a new Apache Arrow contributor and has contributed GLib C bindings (to the C++ libraries) for Linux. Using a C middleware framework called GObject Introspection, it is possible to use these bindings seamlessly in Ruby, Lua, Go, and other programming languages. We will probably need to publish some follow up blogs explaining how these bindings work and how to use them.
Apache Spark Integration for PySpark
We have been collaborating with the Apache Spark community on SPARK-13534
to add support for using Arrow to accelerate
PySpark. We have observed over 40x speedup from the more efficient
Using Arrow in PySpark opens the door to many other performance optimizations,
particularly around UDF evaluation (e.g.
filter operations with
Python lambda functions).
New Python Feature: Memory Views, Feather, Apache Parquet support
Arrow’s Python library
pyarrow is a Cython binding for the
libarrow_python C++ libraries, which handle inteoperability with NumPy,
pandas, and the Python standard library.
At the heart of Arrow’s C++ libraries is the
arrow::Buffer object, which is a
managed memory view supporting zero-copy reads and slices. Jeff Knupp
contributed integration between Arrow buffers and the Python buffer protocol
and memoryviews, so now code like this is possible:
In : import pyarrow as pa In : buf = pa.frombuffer(b'foobarbaz') In : buf Out: <pyarrow._io.Buffer at 0x7f6c0a84b538> In : memoryview(buf) Out: <memory at 0x7f6c0a8c5e88> In : buf.to_pybytes() Out: b'foobarbaz'
We have significantly expanded Apache Parquet support via the C++ Parquet implementation parquet-cpp. This includes support for partitioned datasets on disk or in HDFS. We added initial Arrow-powered Parquet support in the Dask project, and look forward to more collaborations with the Dask developers on distributed processing of pandas data.
With Arrow’s support for pandas maturing, we were able to merge in the Feather format implementation, which is essentially a special case of the Arrow random access format. We’ll be continuing Feather development within the Arrow codebase. For example, Feather can now read and write with Python file objects using Arrow’s Python binding layer.
We also implemented more robust support for pandas-specific data types, like
Support for Tensors and beyond in C++ Library
There has been increased interest in using Apache Arrow as a tool for zero-copy shared memory management for machine learning applications. A flagship example is the Ray project from the UC Berkeley RISELab.
Machine learning deals in additional kinds of data structures beyond what the
Arrow columnar format supports, like multidimensional arrays aka “tensors”. As
such, we implemented the
arrow::Tensor C++ type which can utilize the
rest of Arrow’s zero-copy shared memory machinery (using
managing memory lifetime). In C++ in particular, we will want to provide for
additional data structures utilizing common IO and memory management tools.
Improved Website and Developer Documentation
Since 0.2.0 we have implemented a new website stack for publishing documentation and blogs based on Jekyll. Kouhei Sutou developed a Jekyll Jupyter Notebook plugin so that we can use Jupyter to author content for the Arrow website.
On the website, we have now published API documentation for the C, C++, Java, and Python subcomponents. Within these you will find easier-to-follow developer instructions for getting started.
Thanks to all who contributed patches to this release.
$ git shortlog -sn apache-arrow-0.2.0..apache-arrow-0.3.0 119 Wes McKinney 55 Kouhei Sutou 18 Uwe L. Korn 17 Julien Le Dem 9 Phillip Cloud 6 Bryan Cutler 5 Philipp Moritz 5 Emilio Lahr-Vivaz 4 Max Risuhin 4 Johan Mabille 4 Jeff Knupp 3 Steven Phillips 3 Miki Tebeka 2 Leif Walsh 2 Jeff Reback 2 Brian Hulette 1 Tsuyoshi Ozawa 1 rvernica 1 Nong Li 1 Julien Lafaye 1 Itai Incze 1 Holden Karau 1 Deepak Majeti