The arrow package provides
reticulate methods for passing data between R and Python in the same process. This document provides a brief overview.
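To use arrow in Python, at a minimum you’ll need the
pyarrow library. To install it in a virtualenv, create the environment with reticulate’s virtualenv_create() and then call install_pyarrow() with the environment name.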
If you want to install a development version of pyarrow, add
nightly = TRUE:
install_pyarrow("arrow-env", nightly = TRUE)
For more on installing and configuring Python, see the reticulate docs.
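The installation steps described above can be sketched as follows. This is a minimal sketch, assuming the virtualenv is called "arrow-env" (the name used in the nightly command) and that a Python installation is available for reticulate to build the environment from.

```r
library(arrow)       # provides install_pyarrow()
library(reticulate)  # provides virtualenv_create()

# Create a virtualenv; "arrow-env" is the name used elsewhere in this document
virtualenv_create("arrow-env")

# Install the released version of pyarrow into that environment
install_pyarrow("arrow-env")
```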
To start, load arrow and
reticulate, and then import pyarrow. The examples below refer to the imported module as pa.
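A minimal sketch of that setup, assuming pyarrow was installed into a virtualenv named "arrow-env" and that the module is bound to the name pa, as the later examples expect:

```r
library(arrow)
library(reticulate)

# Point reticulate at the environment that contains pyarrow
# (assumption: the environment is named "arrow-env")
use_virtualenv("arrow-env")

# Import the Python module; the rest of the examples call it `pa`
pa <- import("pyarrow")
```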
The package includes support for sharing Arrow Array and
RecordBatch objects in-process between R and Python. For example, let’s create an Array in pyarrow.
a <- pa$array(c(1, 2, 3))
a
## Array
## <double>
## [
##   1,
##   2,
##   3
## ]
a is now an
Array object in our R session, even though we created it in Python. We can apply R methods on it:
a[a > 1]
## Array
## <double>
## [
##   2,
##   3
## ]
We can send data both ways. One reason we might want to use
pyarrow in R is to take advantage of functionality that is better supported in Python than in R. For example,
pyarrow has a
concat_arrays function, but as of 0.17, this function is not implemented in the
arrow R package. We can call it efficiently through
reticulate.
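As a sketch of how that could look, the snippet below builds a second Array in R and passes both arrays to pyarrow's concat_arrays. The values in b are illustrative, and the setup lines are repeated only so the snippet runs on its own; it assumes pyarrow is importable from the active Python environment.

```r
library(arrow)
library(reticulate)
pa <- import("pyarrow")

# An Array created on the Python side, as in the earlier example
a <- pa$array(c(1, 2, 3))

# An Array created on the R side (illustrative values)
b <- Array$create(c(5, 6, 7, 8, 9))

# reticulate turns the R list into a Python list of pyarrow arrays,
# and the concatenated result comes back to R as an arrow Array
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b
```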
Now we have a single
Array in R.
“Send”, however, isn’t the correct word. Internally, we’re passing pointers to the data between the R and Python interpreters running together in the same process, without copying anything. Nothing is being sent: we’re sharing and accessing the same internal Arrow memory buffers.
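The same sharing applies to RecordBatch objects. The following is a minimal sketch, assuming pyarrow is importable, that uses reticulate's generic r_to_py() and py_to_r() to hand a batch to Python and back explicitly; the column names and values are made up for illustration.

```r
library(arrow)
library(reticulate)
pa <- import("pyarrow")

# Build a RecordBatch in R (illustrative columns)
batch <- record_batch(x = 1:3, y = c("a", "b", "c"))

# Hand it to Python: the result is a pyarrow RecordBatch that refers
# to the same Arrow buffers rather than to a copy
py_batch <- r_to_py(batch)
class(py_batch)

# Bring it back into R as an arrow RecordBatch
back <- py_to_r(py_batch)
back$Equals(batch)
```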
If you get an error like
Error in py_get_attr_impl(x, name, silent) :
  AttributeError: 'pyarrow.lib.DoubleArray' object has no attribute '_export_to_c'
it means that the version of
pyarrow you’re using is too old. Support for passing data to and from R is included in versions 0.17 and greater. Check your pyarrow version like this:
pa$`__version__`
## [1] "0.16.0"
Note that your pyarrow and
arrow versions don’t need to match each other: they just both need to be 0.17 or greater.
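The version requirement can also be checked programmatically. The helper below is hypothetical (it is not part of arrow or reticulate) and is only a sketch: it assumes a plain release version string such as "0.16.0", and package_version() may reject development strings that carry a non-numeric suffix.

```r
# Hypothetical helper: TRUE when a release version string meets the
# 0.17 minimum needed for passing data between R and Python
supports_r_python_sharing <- function(version) {
  package_version(version) >= "0.17"
}

supports_r_python_sharing("0.16.0")
supports_r_python_sharing("0.17.0")

# In a live session the same check could be applied to both libraries:
# supports_r_python_sharing(pa$`__version__`)
# supports_r_python_sharing(as.character(packageVersion("arrow")))
```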