Apache Arrow 0.4.0 Release


Published 23 May 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.4.0 release of the project. While only 17 days since the release, it includes 77 resolved JIRAs with some important new features and bug fixes.

See the Install Page to learn how to get the libraries for your platform.

Expanded JavaScript Implementation

The TypeScript Arrow implementation has undergone some work since 0.3.0 and can now read a substantial portion of the Arrow streaming binary format. As this implementation develops, we will eventually want to include JS in the integration test suite along with Java and C++ to ensure wire cross-compatibility.

Python Support for Apache Parquet on Windows

With the 1.1.0 C++ release of Apache Parquet, we have enabled the pyarrow.parquet extension on Windows for Python 3.5 and 3.6. This should appear in conda-forge packages and PyPI in the near future. Developers can follow the source build instructions.

Generalizing Arrow Streams

In the 0.2.0 release, we defined the first version of the Arrow streaming binary format for low-cost messaging with columnar data. These streams presume that the message components are written as a continuous byte stream over a socket or file.

We would like to be able to support other other transport protocols, like gRPC, for the message components of Arrow streams. To that end, in C++ we defined an abstract stream reader interface, for which the current contiguous streaming format is one implementation:

class RecordBatchReader {
 public:
  virtual std::shared_ptr<Schema> schema() const = 0;
  virtual Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) = 0;
};

It would also be good to define abstract stream reader and writer interfaces in the Java implementation.

In an upcoming blog post, we will explain in more depth how Arrow streams work, but you can learn more about them by reading the IPC specification.

C++ and Cython API for Python Extensions

As other Python libraries with C or C++ extensions use Apache Arrow, they will need to be able to return Python objects wrapping the underlying C++ objects. In this release, we have implemented a prototype C++ API which enables Python wrapper objects to be constructed from C++ extension code:

#include "arrow/python/pyarrow.h"

if (!arrow::py::import_pyarrow()) {
  // Error
}

std::shared_ptr<arrow::RecordBatch> cpp_batch = GetData(...);
PyObject* py_batch = arrow::py::wrap_batch(cpp_batch);

This API is intended to be usable from Cython code as well:

cimport pyarrow
pyarrow.import_pyarrow()

Python Wheel Installers on macOS

With this release, pip install pyarrow works on macOS (OS X) as well as Linux. We are working on providing binary wheel installers for Windows as well.