Apache Arrow 0.7.0 Release
19 Sep 2017
By Wes McKinney (wesm)
The Apache Arrow team is pleased to announce the 0.7.0 release. It includes 133 resolved JIRAs, with many new features and bug fixes to the various language implementations. The Arrow memory format has remained stable since the 0.3.x release.
We include some highlights from the release in this post.
New PMC Member: Kouhei Sutou
Since the last release we have added Kou to the Arrow Project Management Committee. He is also a PMC member for Apache Subversion, and a major contributor to many other open source projects.
As an active member of the Ruby community in Japan, Kou has been developing the GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby users to benefit from the work that’s happening in Apache Arrow.
We are excited to be collaborating with the Ruby community on shared infrastructure for in-memory analytics and data science.
Type casting for C++ and Python
As part of longer-term efforts to build an Arrow-native in-memory analytics
library, we implemented a variety of type conversion functions. These functions
are essential in ETL tasks when conforming one table schema to another. These
are similar to the
astype function in NumPy.
    In : import pyarrow as pa

    In : arr = pa.array([True, False, None, True])

    In : arr
    Out:
    <pyarrow.lib.BooleanArray object at 0x7ff6fb069b88>
    [
      True,
      False,
      NA,
      True
    ]

    In : arr.cast(pa.int32())
    Out:
    <pyarrow.lib.Int32Array object at 0x7ff6fb0383b8>
    [
      1,
      0,
      NA,
      1
    ]
Over time these will expand to support as many input-and-output type combinations as possible, with optimized conversions.
New Arrow GPU (CUDA) Extension Library for C++
To help with GPU-related projects using Arrow, like the GPU Open Analytics Initiative, we have started a C++ add-on library to simplify Arrow memory management on CUDA-enabled graphics cards. We would like to expand this to include a library of reusable CUDA kernel functions for GPU analytics on Arrow columnar memory.
For example, we could write a record batch from CPU memory to GPU device memory like so (some error checking omitted):
    #include <arrow/api.h>
    #include <arrow/gpu/cuda_api.h>

    using namespace arrow;

    // kGpuNumber (the device index) and GetCpuData() are assumed to be defined elsewhere
    gpu::CudaDeviceManager* manager;
    std::shared_ptr<gpu::CudaContext> context;

    // Each call returns an arrow::Status; the checks are omitted here
    gpu::CudaDeviceManager::GetInstance(&manager);
    manager->GetContext(kGpuNumber, &context);

    std::shared_ptr<RecordBatch> batch = GetCpuData();

    // Serialize the CPU record batch into GPU device memory
    std::shared_ptr<gpu::CudaBuffer> device_serialized;
    gpu::SerializeRecordBatch(*batch, context.get(), &device_serialized);
We can then “read” the GPU record batch, but the returned RecordBatch
internally will contain GPU device pointers that you can use for CUDA kernel
calls:
    std::shared_ptr<RecordBatch> device_batch;
    gpu::ReadRecordBatch(batch->schema(), device_serialized,
                         default_memory_pool(), &device_batch);

    // Now run some CUDA kernels on device_batch
Decimal Integration Tests
Phillip Cloud has been working on decimal support in C++ to enable Parquet read/write support in C++ and Python, and also end-to-end testing against the Arrow Java libraries.
In the upcoming releases, we hope to complete the remaining data types that need end-to-end testing between Java and C++:
- Fixed size lists (variable-size lists already implemented)
- Fixed size binary
- Time intervals
Other Notable Python Changes
Some highlights of Python development outside of bug fixes and general API improvements include:
- get arbitrary Python objects in Plasma objects
- High-speed, memory efficient object serialization. This is important enough that we will likely write a dedicated blog post about it.
- pyarrow.parquet.write_table support for easy writing of Parquet files maximized for Spark compatibility
- parquet.write_to_dataset function with support for partitioned writes
- Improved support for Dask filesystems
- Improved Python usability for IPC: read and write schemas and record batches more easily. See the API docs for more about these.
The Road Ahead
Upcoming Arrow releases will continue to expand the project to cover more use cases. In addition to completing end-to-end testing for all the major data types, some of us will be shifting attention to building Arrow-native in-memory analytics libraries.