Introducing the Apache Arrow C Data Interface

Published 03 May 2020
By Antoine Pitrou (apitrou)

Apache Arrow includes a cross-language, platform-independent in-memory columnar format allowing zero-copy data sharing and transfer between heterogenous runtimes and applications.

The easiest way to use the Arrow columnar format has always been to depend on one of the concrete implementations developed by the Apache Arrow community. The project codebase contains libraries for 11 different programming languages so far, and will likely grow to include more languages in the future.

However, some projects may wish to import and export the Arrow columnar format without taking on a new library dependency, such as the Arrow C++ library. We have therefore designed an alternative which exchanges data at the C level, conforming to a simple data definition. The C Data Interface carries no dependencies except a shared C ABI between binaries which use it. C ABIs are platform-wide standards which are necessarily adhered to by all compilers which generate binaries and are extremely stable, ensuring portability of libraries and executable binaries. Two libraries that utilize the C structures defined by the C Data Interface can do zero-copy data transfers at runtime without any build-time or link-time dependency requirements.

The best way to learn about the C Data Interface is to read the spec. However, we will quickly go over its strong points.

Two simple struct definitions

To interact with the C Data Interface at the C or C++ level, the only thing you have to include in your code is two struct type declarations (and a couple of #defines for constant values). Those declarations only depend on standard C types, and can simply be pasted in a header file. Other languages can also participate as long as they provide a Foreign Function Interface layer; this is the case for most modern languages, such as Python (with ctypes or cffi), Julia, Rust, Go, etc.

Zero-copy data sharing

The C Data Interface passes Arrow data buffers through memory pointers. So, by construction, it allows you to share data from one runtime to another without copying it. Since the data is in standard Arrow in-memory format, its layout is well-defined and unambiguous.

This design also restricts the C Data Interface to in-process data sharing. For interprocess communication, we recommend use of the Arrow IPC format.

Reduced marshalling

The C Data Interface stays close to the natural way of expressing Arrow-like data in C or C++. Only two aspects involve non-trivial marshalling:

the encoding of data types, using a very simple string-based language
the encoding of optional metadata, using a very simple length-prefixed format

Separate type and data representation

For applications which produce many instances of data of a single datatype (for example, as a stream of record batches), repeatedly reconstructing the datatype from its string encoding would represent unnecessary overhead. To address this use case, the C Data Interface defines two independent structures: one representing a datatype (and optional metadata), one representing a piece of data.

Lifetime handling

One common difficulty of data sharing between heterogenous runtimes is to correctly handle the lifetime of data. The C Data Interface allows the producer to define its own memory management scheme through a release callback. This is a simple function pointer which consumers will call when they are finished using the data. For example when used as a producer the Arrow C++ library passes a release callback which simply decrements a shared_ptr's reference count.

Application: passing data between R and Python

The R and Python Arrow libraries are both based on the Arrow C++ library, however their respective toolchains (mandated by the R and Python packaging standards) are ABI-incompatible. It is therefore impossible to pass data directly at the C++ level between the R and Python bindings.

Using the C Data Interface, we have circumvented this restriction and provide a zero-copy data sharing API between R and Python. It is based on the R reticulate library.

Here is an example session mixing R and Python library calls:

library(arrow)
library(reticulate)
use_virtualenv("arrow")
pa <- import("pyarrow")

# Create an array in PyArrow
a <- pa$array(c(1, 2, 3))
a

## Array
## <double>
## [
##   1,
##   2,
##   3
## ]

# Apply R methods on the PyArrow-created array:
a[a > 1]

## Array
## <double>
## [
##   2,
##   3
## ]

# Create an array in R and pass it to PyArrow
b <- Array$create(c(5, 6, 7))
a_and_b <- pa$concat_arrays(r_to_py(list(a, b)))
a_and_b

## Array
## <double>
## [
##   1,
##   2,
##   3,
##   5,
##   6,
##   7
## ]