Apache Arrow 0.7.0 Release


Published 19 Sep 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.7.0 release. It includes 133 resolved JIRAs, with many new features and bug fixes across the various language implementations. The Arrow memory format has remained stable since the 0.3.x release.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We include some highlights from the release in this post.

New PMC Member: Kouhei Sutou

Since the last release we have added Kou to the Arrow Project Management Committee. He is also a PMC member of Apache Subversion, and a major contributor to many other open source projects.

As an active member of the Ruby community in Japan, Kou has been developing the GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby users to benefit from the work that’s happening in Apache Arrow.

We are excited to be collaborating with the Ruby community on shared infrastructure for in-memory analytics and data science.

Expanded JavaScript (TypeScript) Implementation

Paul Taylor from the Falcor and ReactiveX projects has worked to expand the JavaScript implementation (which is written in TypeScript), using the latest in modern JavaScript build and packaging technology. We are looking forward to building out the JS implementation and bringing it up to full functionality with the C++ and Java implementations.

We are looking for more JavaScript developers to join the project and work together to make Arrow for JS work well with many kinds of front end use cases, like real time data visualization.

Type casting for C++ and Python

As part of longer-term efforts to build an Arrow-native in-memory analytics library, we implemented a variety of type conversion functions. These functions are essential in ETL tasks when conforming one table schema to another. These are similar to the astype function in NumPy.

In [17]: import pyarrow as pa

In [18]: arr = pa.array([True, False, None, True])

In [19]: arr
Out[19]:
<pyarrow.lib.BooleanArray object at 0x7ff6fb069b88>
[
  True,
  False,
  NA,
  True
]

In [20]: arr.cast(pa.int32())
Out[20]:
<pyarrow.lib.Int32Array object at 0x7ff6fb0383b8>
[
  1,
  0,
  NA,
  1
]

Over time these will expand to support as many input-and-output type combinations as possible, with optimized conversions.
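As a further illustration, the same Array.cast API handles other combinations, such as integers to floating point. This is a minimal sketch; which conversions are implemented varies by version, so an unsupported combination will raise an error:

import pyarrow as pa

ints = pa.array([1, 2, None, 4])

# Nulls propagate through the cast unchanged
floats = ints.cast(pa.float64())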

New Arrow GPU (CUDA) Extension Library for C++

To help with GPU-related projects using Arrow, like the GPU Open Analytics Initiative, we have started a C++ add-on library to simplify Arrow memory management on CUDA-enabled graphics cards. We would like to expand this to include a library of reusable CUDA kernel functions for GPU analytics on Arrow columnar memory.

For example, we could write a record batch from CPU memory to GPU device memory like so (some error checking omitted):

#include <arrow/api.h>
#include <arrow/gpu/cuda_api.h>

using namespace arrow;

// Which GPU device to use
constexpr int kGpuNumber = 0;

gpu::CudaDeviceManager* manager;
std::shared_ptr<gpu::CudaContext> context;

gpu::CudaDeviceManager::GetInstance(&manager);
manager->GetContext(kGpuNumber, &context);

std::shared_ptr<RecordBatch> batch = GetCpuData();

// Copy the record batch into a contiguous buffer in GPU device memory
std::shared_ptr<gpu::CudaBuffer> device_serialized;
gpu::SerializeRecordBatch(*batch, context.get(), &device_serialized);

We can then “read” the GPU record batch, but the returned arrow::RecordBatch internally will contain GPU device pointers that you can use for CUDA kernel calls:

std::shared_ptr<RecordBatch> device_batch;
gpu::ReadRecordBatch(batch->schema(), device_serialized,
                     default_memory_pool(), &device_batch);

// Now run some CUDA kernels on device_batch

Decimal Integration Tests

Phillip Cloud has been working on decimal support in C++ to enable Parquet read/write support in C++ and Python, and also end-to-end testing against the Arrow Java libraries.
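As a rough sketch of what this enables from Python once the work lands, here is a hypothetical round trip of decimal data through Parquet. The type constructor is written here as pa.decimal128, as in later pyarrow releases; the exact name and the level of support may differ around 0.7.0:

from decimal import Decimal

import pyarrow as pa
import pyarrow.parquet as pq

# A column of fixed-precision decimals (precision=10, scale=2)
prices = pa.array([Decimal('19.99'), Decimal('5.25'), None],
                  type=pa.decimal128(10, 2))
table = pa.Table.from_arrays([prices], names=['price'])

# Round-trip through Parquet; the decimal logical type is preserved
pq.write_table(table, 'prices.parquet')
restored = pq.read_table('prices.parquet')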

In the upcoming releases, we hope to complete the remaining data types that need end-to-end testing between Java and C++:

  • Fixed size lists (variable-size lists already implemented)
  • Fixed size binary
  • Unions
  • Maps
  • Time intervals

Other Notable Python Changes

Some highlights of Python development outside of bug fixes and general API improvements include:

  • Simplified APIs for putting and getting arbitrary Python objects in the Plasma object store
  • High-speed, memory-efficient object serialization. This is important enough that we will likely write a dedicated blog post about it.
  • New flavor='spark' option to pyarrow.parquet.write_table to enable easy writing of Parquet files maximized for Spark compatibility
  • parquet.write_to_dataset function with support for partitioned writes (both of the Parquet writing improvements are shown in the sketch after this list)
  • Improved support for Dask filesystems
  • Improved Python usability for IPC: read and write schemas and record batches more easily. See the API docs for more about these.
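As a quick illustration of the two Parquet writing improvements above, here is a minimal sketch; the file and directory names are hypothetical, and it assumes a local filesystem as the write target:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays(
    [pa.array([1, 2, 3]), pa.array(['a', 'b', 'a'])],
    names=['value', 'key'])

# Write a single file with settings maximized for Spark compatibility
pq.write_table(table, 'data.parquet', flavor='spark')

# Write a partitioned dataset: one directory per distinct value of 'key'
pq.write_to_dataset(table, root_path='dataset_root', partition_cols=['key'])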

The Road Ahead

Upcoming Arrow releases will continue to expand the project to cover more use cases. In addition to completing end-to-end testing for all the major data types, some of us will be shifting attention to building Arrow-native in-memory analytics libraries.

We are looking for more JavaScript, R, and other programming language developers to join the project and expand the available implementations and bindings to more languages.