The arrow package provides reticulate methods for passing data between R and Python in the same process. This document gives a brief overview.

Installing

To use arrow in Python, at a minimum you’ll need the pyarrow library. To install it in a virtualenv:

library(arrow)       # provides install_pyarrow()
library(reticulate)  # provides virtualenv_create()
virtualenv_create("arrow-env")
install_pyarrow("arrow-env")

If you want to install a development version of pyarrow, add nightly = TRUE:

install_pyarrow("arrow-env", nightly = TRUE)

install_pyarrow() also works with conda environments (conda_create() instead of virtualenv_create()).
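
As a minimal sketch of the conda variant, assuming conda is already installed and discoverable by reticulate, and reusing the environment name "arrow-env" purely for illustration:

```r
library(arrow)       # provides install_pyarrow()
library(reticulate)  # provides conda_create() and use_condaenv()

# Create a conda environment instead of a virtualenv
conda_create("arrow-env")

# Install pyarrow into that environment
install_pyarrow("arrow-env")

# In later sessions, activate it with use_condaenv() rather than use_virtualenv()
use_condaenv("arrow-env")
```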

For more on installing and configuring Python, see the reticulate docs.

Using

To start, load arrow and reticulate, and then import pyarrow.

library(arrow)
library(reticulate)
use_virtualenv("arrow-env")
pa <- import("pyarrow")

The package includes support for sharing Arrow Array and RecordBatch objects in-process between R and Python. For example, let’s create an Array in pyarrow.

a <- pa$array(c(1, 2, 3))
a

## Array
## <double>
## [
##   1,
##   2,
##   3
## ]

a is now an Array object in our R session, even though we created it in Python. We can apply R methods to it:

a[a > 1]

## Array
## <double>
## [
##   2,
##   3
## ]

We can send data both ways. One reason to use pyarrow from R is to take advantage of functionality that is better supported in Python than in R. For example, pyarrow has a concat_arrays function, but as of version 0.17 this function is not implemented in the arrow R package. With reticulate, we can call the Python function directly and efficiently on Arrays held in R.

b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b

## Array
## <double>
## [
##   1,
##   2,
##   3,
##   5,
##   6,
##   7,
##   8,
##   9
## ]

Now we have a single Array in R.

“Send”, however, isn’t the correct word. Internally, we’re passing pointers to the data between the R and Python interpreters running together in the same process, without copying anything. Nothing is being sent: we’re sharing and accessing the same internal Arrow memory buffers.
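
The examples above only use Arrays, but the same sharing applies to RecordBatch objects. The following is a minimal sketch, assuming the "arrow-env" virtualenv from the installation step and pyarrow 0.17 or greater. Conversion normally happens automatically when objects cross the boundary; the explicit r_to_py() and py_to_r() calls are used here only to make each crossing visible.

```r
library(arrow)
library(reticulate)
use_virtualenv("arrow-env")  # assumes the environment created during installation

# Build a RecordBatch on the R side
rb <- record_batch(x = c(1, 2, 3), y = c("a", "b", "c"))

# Hand it to Python: the result is a reference to a pyarrow RecordBatch
py_rb <- r_to_py(rb)
class(py_rb)

# Bring it back: the result is an arrow RecordBatch in R again
rb2 <- py_to_r(py_rb)
rb2
```

Both rb and rb2 refer to the same underlying Arrow buffers, so the round trip does not copy the column data.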

Troubleshooting

If you get an error like

Error in py_get_attr_impl(x, name, silent) :
  AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'

it means that the version of pyarrow you’re using is too old. Support for passing data to and from R is included in versions 0.17 and greater. Check your pyarrow version like this:

pa$`__version__`

## [1] "0.16.0"

Note that your pyarrow and arrow versions don’t need to match each other: each just needs to be 0.17 or greater.
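
The version check can also be done programmatically. The sketch below uses a hypothetical helper, pyarrow_is_supported(), which is not part of arrow or reticulate. It assumes version strings begin with dot-separated numbers and ignores any suffix (such as one a development build might carry).

```r
# Hypothetical helper (not part of arrow or reticulate): TRUE if the given
# pyarrow version string is at least the minimum that supports sharing with R.
pyarrow_is_supported <- function(version_string, minimum = "0.17.0") {
  # Keep only the leading numeric components, e.g. "0.17.0" from "0.17.0.dev123"
  release <- regmatches(
    version_string,
    regexpr("^[0-9]+(\\.[0-9]+)*", version_string)
  )
  package_version(release) >= package_version(minimum)
}

pyarrow_is_supported("0.16.0")  # FALSE: too old
pyarrow_is_supported("0.17.0")  # TRUE
```

In a live session, the argument would be pa$`__version__`, and packageVersion("arrow") gives the corresponding version of the R package.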