Project News and Blog


Apache Arrow 0.4.0 Release

Published 23 May 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.4.0 release of the project. Coming only 17 days after the 0.3.0 release, it includes 77 resolved JIRAs with some important new features and bug fixes.

See the Install Page to learn how to get the libraries for your platform.

Expanded JavaScript Implementation

The TypeScript Arrow implementation has undergone some work since 0.3.0 and can now read a substantial portion of the Arrow streaming binary format. As this implementation develops, we will eventually want to include JS in the integration test suite along with Java and C++ to ensure wire cross-compatibility.

Python Support for Apache Parquet on Windows

With the 1.1.0 C++ release of Apache Parquet, we have enabled the pyarrow.parquet extension on Windows for Python 3.5 and 3.6. This should appear in conda-forge packages and PyPI in the near future. Developers can follow the source build instructions.
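
Once installed, basic Parquet reading and writing looks like the following. This is a minimal sketch using the present-day pyarrow.parquet API (the pa.table constructor and the file name are illustrative, not from this release):

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to a Parquet file, then read it back
table = pa.table({"x": [1, 2, 3]})
pq.write_table(table, "example.parquet")
table2 = pq.read_table("example.parquet")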

Generalizing Arrow Streams

In the 0.2.0 release, we defined the first version of the Arrow streaming binary format for low-cost messaging with columnar data. These streams presume that the message components are written as a continuous byte stream over a socket or file.

We would like to be able to support other transport protocols, like gRPC, for the message components of Arrow streams. To that end, in C++ we defined an abstract stream reader interface, for which the current contiguous streaming format is one implementation:

class RecordBatchReader {
 public:
  // The schema shared by all record batches in the stream
  virtual std::shared_ptr<Schema> schema() const = 0;

  // Reads the next record batch; sets *batch to null at end of stream
  virtual Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) = 0;
};

It would also be good to define abstract stream reader and writer interfaces in the Java implementation.

In an upcoming blog post, we will explain in more depth how Arrow streams work, but you can learn more about them by reading the IPC specification.
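
As a concrete illustration, here is a minimal sketch of writing and reading the streaming format from Python, using the present-day pyarrow IPC API (the pa.ipc module layout postdates this release):

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

# Write a stream: the schema message first, then record batch messages
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Read the stream back; this reader is one implementation of the
# abstract record batch reader interface described above
reader = pa.ipc.open_stream(sink.getvalue())
for received in reader:
    print(received.num_rows)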

C++ and Cython API for Python Extensions

As other Python libraries with C or C++ extensions use Apache Arrow, they will need to be able to return Python objects wrapping the underlying C++ objects. In this release, we have implemented a prototype C++ API which enables Python wrapper objects to be constructed from C++ extension code:

#include "arrow/python/pyarrow.h"

// import_pyarrow() must be called once before using the wrapper API;
// in recent Arrow releases it returns 0 on success, following the
// CPython capsule import convention
if (arrow::py::import_pyarrow() != 0) {
  // Error: the pyarrow module could not be imported
}

// GetData(...) is a placeholder for application code producing a RecordBatch
std::shared_ptr<arrow::RecordBatch> cpp_batch = GetData(...);
PyObject* py_batch = arrow::py::wrap_batch(cpp_batch);

This API is intended to be usable from Cython code as well:

cimport pyarrow
pyarrow.import_pyarrow()

Python Wheel Installers on macOS

With this release, pip install pyarrow works on macOS (OS X) as well as Linux. We are working on providing binary wheel installers for Windows as well.

Apache Arrow 0.3.0 Release

Published 08 May 2017
By Wes McKinney (wesm)

Translations: Japanese

The Apache Arrow team is pleased to announce the 0.3.0 release of the project. It is the product of 10 intense weeks of development since the 0.2.0 release this past February. It includes 306 resolved JIRAs from 23 contributors.

While we have added many new features to the different Arrow implementations, one of the major development focuses in 2017 has been hardening the in-memory format, type metadata, and messaging protocol to provide a stable, production-ready foundation for big data applications. We are excited to be collaborating with the Apache Spark and GeoMesa communities on utilizing Arrow for high performance IO and in-memory data processing.

See the Install Page to learn how to get the libraries for your platform.

We will be publishing more information about the Apache Arrow roadmap as we forge ahead with using Arrow to accelerate big data systems.

We are looking for more contributors from within our existing communities and from other communities (such as Go, R, or Julia) to get involved in Arrow development. Contributions to the website and to component user and API documentation would also be most welcome.

File and Streaming Format Hardening

The 0.2.0 release brought with it the first iterations of the random access and streaming Arrow wire formats. See the IPC specification for implementation details, and the example blog post for some use cases. These formats provide low-overhead, zero-copy access to Arrow record batch payloads.
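
To make the random access (file) format concrete, here is a minimal sketch using the present-day pyarrow API (the pa.ipc module layout postdates 0.3.0). Unlike a stream, the file format writes a footer that permits reading any batch by index:

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array(["a", "b"])], names=["col"])

# The file format appends a footer enabling O(1) access to each batch
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, batch.schema) as writer:
    writer.write_batch(batch)

reader = pa.ipc.open_file(sink.getvalue())
print(reader.num_record_batches)  # batch count comes from the footer
first = reader.get_batch(0)       # random access by batch index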

In 0.3.0 we have solidified a number of small details with the binary format and improved our integration and unit testing particularly in the Java, C++, and Python libraries. Using the Google Flatbuffers project has helped with adding new features to our metadata without breaking forward compatibility.

We are not yet ready to make a firm commitment to strong forward compatibility in the binary format (in case we find something needs to change), but we will make an effort between major releases to avoid unnecessary breakage.

Dictionary Encoding Support

Emilio Lahr-Vivaz from the GeoMesa project contributed Java support for dictionary-encoded Arrow vectors. We followed up with C++ and Python support (and pandas.Categorical integration). We have not yet implemented full integration tests for dictionaries (for sending this data between C++ and Java), but hope to achieve this in the 0.4.0 Arrow release.

This common data representation technique for categorical data allows multiple record batches to share a common “dictionary”, with the values in the batches being represented as integers referencing the dictionary. This data is called “categorical” or “factor” in statistical languages, while in file formats like Apache Parquet it is strictly used for data compression.
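
A small example makes the representation clear. This sketch uses the present-day Python API (Array.dictionary_encode), which may differ from the 0.3.0-era bindings:

import pyarrow as pa

values = pa.array(["red", "green", "red", "blue", "red"])
dict_array = values.dictionary_encode()

# The dictionary holds each distinct value exactly once...
print(dict_array.dictionary)  # ["red", "green", "blue"]
# ...and the data itself becomes small integer indices into it
print(dict_array.indices)     # [0, 1, 0, 2, 0]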

Expanded Date, Time, and Fixed Size Types

A notable omission from the 0.2.0 release was complete and integration-tested support for the gamut of date and time types that occur in the wild. These are needed for Apache Parquet and Apache Spark integration.

We have additionally added experimental support for exact decimals in C++ using Boost.Multiprecision, though we have not yet hardened the Decimal memory format between the Java and C++ implementations.
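
For illustration, here is how these types look in the present-day Python API (pa.decimal128 is the current name for the exact decimal type; the 0.3.0-era spelling differed):

from decimal import Decimal
import pyarrow as pa

# Exact decimals with explicit precision and scale
decimals = pa.array([Decimal("123.45"), None], type=pa.decimal128(10, 2))

# Timestamps carry a unit and an optional time zone;
# integers are interpreted as epoch values in that unit
timestamps = pa.array([1494229200000], type=pa.timestamp("ms", tz="UTC"))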

C++ and Python Support on Windows

We have made many improvements to C++ and Python development and packaging. 0.3.0 is the first release to bring full C++ and Python support for Windows on Visual Studio (MSVC) 2015 and 2017. In addition to adding Appveyor continuous integration for MSVC, we have also written guides for building from source on Windows: C++ and Python.

For the first time, you can install the Arrow Python library on Windows from conda-forge:

conda install pyarrow -c conda-forge

C (GLib) Bindings, with support for Ruby, Lua, and more

Kouhei Sutou, a new Apache Arrow contributor, has contributed GLib C bindings (to the C++ libraries) for Linux. Using a C middleware framework called GObject Introspection, it is possible to use these bindings seamlessly in Ruby, Lua, Go, and other programming languages. We will probably need to publish some follow-up blog posts explaining how these bindings work and how to use them.

Apache Spark Integration for PySpark

We have been collaborating with the Apache Spark community on SPARK-13534 to add support for using Arrow to accelerate DataFrame.toPandas in PySpark. We have observed over 40x speedup from the more efficient data serialization.

Using Arrow in PySpark opens the door to many other performance optimizations, particularly around UDF evaluation (e.g. map and filter operations with Python lambda functions).
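
For reference, in the Spark releases where SPARK-13534 eventually shipped (Spark 2.3 and later), the integration is switched on with a configuration flag. This sketch uses the Spark 2.3-era option name, which is not part of this release:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Opt in to Arrow-based columnar data transfers (Spark 2.3+ option name)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# toPandas() now serializes via Arrow instead of pickled Python rows
pdf = spark.range(1000000).toPandas()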

New Python Features: Memory Views, Feather, Apache Parquet Support

Arrow’s Python library pyarrow is a Cython binding for the libarrow and libarrow_python C++ libraries, which handle interoperability with NumPy, pandas, and the Python standard library.

At the heart of Arrow’s C++ libraries is the arrow::Buffer object, which is a managed memory view supporting zero-copy reads and slices. Jeff Knupp contributed integration between Arrow buffers and the Python buffer protocol and memoryviews, so now code like this is possible:

In [6]: import pyarrow as pa

In [7]: buf = pa.frombuffer(b'foobarbaz')

In [8]: buf
Out[8]: <pyarrow._io.Buffer at 0x7f6c0a84b538>

In [9]: memoryview(buf)
Out[9]: <memory at 0x7f6c0a8c5e88>

In [10]: buf.to_pybytes()
Out[10]: b'foobarbaz'

We have significantly expanded Apache Parquet support via the C++ Parquet implementation parquet-cpp. This includes support for partitioned datasets on disk or in HDFS. We added initial Arrow-powered Parquet support in the Dask project, and look forward to more collaborations with the Dask developers on distributed processing of pandas data.
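
As a sketch of the partitioned dataset support, using the present-day pyarrow.parquet API (the paths and column names here are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2016, 2016, 2017], "value": [1.0, 2.0, 3.0]})

# Writes one directory per partition value: dataset_root/year=2016/..., etc.
pq.write_to_dataset(table, root_path="dataset_root", partition_cols=["year"])

# Reading the root directory reassembles the partitioned dataset
combined = pq.read_table("dataset_root")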

With Arrow’s support for pandas maturing, we were able to merge in the Feather format implementation, which is essentially a special case of the Arrow random access format. We’ll be continuing Feather development within the Arrow codebase. For example, Feather can now read and write with Python file objects using Arrow’s Python binding layer.
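
For example, a Feather roundtrip through ordinary Python file objects looks roughly like this (a minimal sketch using the present-day pyarrow.feather module):

import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"a": [1, 2, 3]})

# Feather I/O through plain Python file objects
with open("data.feather", "wb") as f:
    feather.write_feather(df, f)

with open("data.feather", "rb") as f:
    df_roundtrip = feather.read_feather(f)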

We also implemented more robust support for pandas-specific data types, like DatetimeTZ and Categorical.
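
A short roundtrip shows this; the sketch below uses the present-day pandas and pyarrow APIs:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "cat": pd.Categorical(["a", "b", "a"]),
    "ts": pd.date_range("2017-01-01", periods=3, tz="US/Eastern"),
})

# Both extension dtypes survive a pandas -> Arrow -> pandas roundtrip
table = pa.Table.from_pandas(df)
print(table.to_pandas().dtypes)  # category, datetime64[ns, US/Eastern]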

Support for Tensors and beyond in C++ Library

There has been increased interest in using Apache Arrow as a tool for zero-copy shared memory management for machine learning applications. A flagship example is the Ray project from the UC Berkeley RISELab.

Machine learning deals in additional kinds of data structures beyond what the Arrow columnar format supports, like multidimensional arrays aka “tensors”. As such, we implemented the arrow::Tensor C++ type which can utilize the rest of Arrow’s zero-copy shared memory machinery (using arrow::Buffer for managing memory lifetime). In C++ in particular, we will want to provide for additional data structures utilizing common IO and memory management tools.
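
As a sketch, the present-day Python bindings expose this type as pa.Tensor; wrapping and unwrapping a NumPy array is zero-copy:

import numpy as np
import pyarrow as pa

nd = np.arange(12, dtype=np.float64).reshape(3, 4)

# Wrap the ndarray without copying; Arrow manages the memory lifetime
tensor = pa.Tensor.from_numpy(nd)
print(tensor.shape, tensor.type)  # [3, 4] double

# Convert back to NumPy, again without copying
nd2 = tensor.to_numpy()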

Start of JavaScript (TypeScript) Implementation

Brian Hulette started developing an Arrow implementation in TypeScript for use in NodeJS and browser-side applications. We are benefiting from Flatbuffers’ first-class support for JavaScript.

Improved Website and Developer Documentation

Since 0.2.0 we have implemented a new website stack for publishing documentation and blogs based on Jekyll. Kouhei Sutou developed a Jekyll Jupyter Notebook plugin so that we can use Jupyter to author content for the Arrow website.

On the website, we have now published API documentation for the C, C++, Java, and Python subcomponents. Within these you will find easier-to-follow developer instructions for getting started.

Contributors

Thanks to all who contributed patches to this release.

$ git shortlog -sn apache-arrow-0.2.0..apache-arrow-0.3.0
    119 Wes McKinney
     55 Kouhei Sutou
     18 Uwe L. Korn
     17 Julien Le Dem
      9 Phillip Cloud
      6 Bryan Cutler
      5 Philipp Moritz
      5 Emilio Lahr-Vivaz
      4 Max Risuhin
      4 Johan Mabille
      4 Jeff Knupp
      3 Steven Phillips
      3 Miki Tebeka
      2 Leif Walsh
      2 Jeff Reback
      2 Brian Hulette
      1 Tsuyoshi Ozawa
      1 rvernica
      1 Nong Li
      1 Julien Lafaye
      1 Itai Incze
      1 Holden Karau
      1 Deepak Majeti