On macOS and Windows, when you run install.packages("arrow"), you get a binary package that bundles Arrow’s C++ dependencies. On Linux, install.packages() retrieves a source package that has to be compiled locally, and the C++ dependencies need to be resolved as well. Generally for R packages with C++ dependencies, this requires either installing system packages, which you may not have privileges to do, or building the C++ dependencies separately, which introduces all sorts of additional ways for things to go wrong.

Our goal is to make install.packages("arrow") “just work” for as many Linux distributions, versions, and configurations as possible. This document describes how it works and the options for fine-tuning Linux installation. The intended audience for this document is arrow R package users on Linux, not developers. If you’re contributing to the Arrow project, you’ll probably want to manage your C++ installation more directly. Note also that if you use conda to manage your R environment, this document does not apply. You can conda install -c conda-forge r-arrow and you’ll get the latest official release of the R package along with any C++ dependencies.

Installation basics

Install the latest release of arrow from CRAN with

install.packages("arrow")

Daily development builds, which are not official releases, can be installed from the Ursa Labs repository:

install.packages("arrow", repos = "https://dl.bintray.com/ursalabs/arrow-r")

There currently are no daily conda builds.

You can also install the R package from a git checkout:

git clone https://github.com/apache/arrow
cd arrow/r
R CMD INSTALL .

How dependencies are resolved

In order for the arrow R package to work, it needs the Arrow C++ library. There are a number of ways you can get it: a system package; a library you’ve built yourself outside of the context of installing the R package; or, if you don’t already have it, the R package will attempt to resolve it automatically when it installs.

If you are authorized to install system packages and you’re installing a CRAN release, you may want to use the official Apache Arrow release packages corresponding to the R package version. See the Arrow project installation page to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You’ll need to install libparquet-dev on Debian and Ubuntu, or parquet-devel on CentOS. This will also automatically install the Arrow C++ library as a dependency.

When you install the arrow R package on Linux, it will first attempt to find the Arrow C++ libraries on your system using the pkg-config command. This will find either installed system packages or libraries you’ve built yourself. In order for install.packages("arrow") to work with these system packages, you’ll need to install them before installing the R package.
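
Before installing, it can help to check what pkg-config sees on your system. The following is a minimal sketch, not part of the package itself. It assumes the pkg-config module is named arrow (after the arrow.pc file that an Arrow C++ installation provides), and arrow_system_version is a hypothetical helper name.

```shell
#!/bin/sh
# Report the Arrow C++ version that pkg-config can see, or "none".
# Assumption: the pkg-config module is called "arrow" (from arrow.pc).
arrow_system_version() {
  if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists arrow; then
    pkg-config --modversion arrow
  else
    echo "none"
  fi
}

echo "Arrow C++ visible to pkg-config: $(arrow_system_version)"
```

If this reports a version, compare it with the version of the R package you are about to install. If it reports none, the installation would be expected to fall through to the download and build steps described next.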

If no Arrow C++ libraries are found on the system, the R package installation script will next attempt to download prebuilt static Arrow C++ libraries that match both your local operating system and your arrow R package version. If found, they will be downloaded and bundled when your R package compiles. For a list of supported distributions and versions, see the arrow-r-nightly project.

If no binary is found, it will download the Arrow C++ source that matches the R package version (CRAN release or nightly build) and attempt to build it locally. If no matching source bundle is found, it will also look to see if you are in a checkout of the apache/arrow git repository and thus have the C++ source there. Depending on your system, building Arrow C++ from source likely will be slow; consequently, it is designed to happen only when you run install.packages("arrow") or R CMD INSTALL but not when running R CMD check, unless you’ve set the NOT_CRAN=true environment variable.

For the mechanics of how all this works, see the R package configure script, which calls tools/linuxlibs.R. If the C++ library is built from source, inst/build_arrow_static.sh is executed. This build script is also what is used to generate the prebuilt binaries.
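
The order of the fallbacks described in this section can be summarized in a small sketch. This is purely illustrative and is not the logic of the configure script or tools/linuxlibs.R. The function name resolve_libarrow and its four yes/no arguments are hypothetical.

```shell
#!/bin/sh
# Illustrative only: mirrors the order described in the text above,
# not the real configure logic. Each argument is a hypothetical "yes"/"no"
# flag saying whether that source of Arrow C++ is available.
resolve_libarrow() {
  system_lib="$1"
  prebuilt_binary="$2"
  source_bundle="$3"
  git_checkout="$4"

  if [ "$system_lib" = yes ]; then
    echo "use system library (pkg-config)"
  elif [ "$prebuilt_binary" = yes ]; then
    echo "download prebuilt static binary"
  elif [ "$source_bundle" = yes ]; then
    echo "download source bundle and build"
  elif [ "$git_checkout" = yes ]; then
    echo "build from local git checkout"
  else
    echo "fail: no Arrow C++ available"
  fi
}

resolve_libarrow no yes yes no   # prints: download prebuilt static binary
```

The point of the sketch is only that each step is tried when the previous one finds nothing, so a stale system library will be picked up before any matching download is considered.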

Troubleshooting and additional options

The intent is that install.packages("arrow") will just work and handle all C++ dependencies, but depending on your system, you may have better results if you tune one of several parameters. Here are some known complications and ways to address them.

Using system libraries

If a system library or other installed Arrow is found but it doesn’t match the R package version (for example, you have libarrow 0.14 on your system and are installing R package 0.15.1), it is likely that the R bindings will fail to compile. Because the Apache Arrow project is under active development, it is essential that versions of the C++ and R libraries match. When install.packages("arrow") has to download the C++ libraries, the install script ensures that you fetch the C++ libraries that correspond to your R package version. However, if you are using Arrow libraries already on your system, version match isn’t guaranteed.

To fix version mismatch, you can either update your system packages to match the R package version, or set the environment variable ARROW_USE_PKG_CONFIG=FALSE to tell the configure script not to look for system Arrow packages. System packages are available corresponding to all CRAN releases but not for nightly or dev versions, so depending on the R package version you’re installing, system packages may not be an option.
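
As a sketch of the second option, the variable can be exported in the shell before R is started. The helper name install_arrow and the CRAN mirror URL are assumptions made for illustration; substitute the mirror you normally use.

```shell
#!/bin/sh
# Tell the configure script not to look for system Arrow packages.
export ARROW_USE_PKG_CONFIG=FALSE

# Wrapped in a function so that sourcing this file does not start an install.
# The repos URL is an assumption; any CRAN mirror should do.
install_arrow() {
  Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
}

# Run the installation with: install_arrow
```

The same variable could instead be set from within an R session with Sys.setenv(ARROW_USE_PKG_CONFIG = "FALSE") before calling install.packages("arrow").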

Note also that once you have a working R package installation based on system (shared) libraries, if you update your system Arrow, you’ll need to reinstall the R package to match its version.

Using a local Arrow C++ build

If you’ve built the Arrow C++ libraries locally from source but haven’t installed them where pkg-config will find them, there are a few options for telling the R package how to locate them. You can set PKG_CONFIG_PATH to /path/to/your/installation/pkgconfig (that is, PKG_CONFIG_PATH=${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/pkgconfig, if you’ve set those variables). Alternatively, you can set the INCLUDE_DIR and LIB_DIR environment variables to point to their location.
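
A minimal sketch of the PKG_CONFIG_PATH approach follows. ARROW_PREFIX and its default value are hypothetical and stand in for your CMAKE_INSTALL_PREFIX. The lib subdirectory is also an assumption, since some distributions use lib64 for CMAKE_INSTALL_LIBDIR.

```shell
#!/bin/sh
# ARROW_PREFIX is a hypothetical name for your CMAKE_INSTALL_PREFIX.
ARROW_PREFIX="${ARROW_PREFIX:-$HOME/arrow-install}"

# Assumption: CMAKE_INSTALL_LIBDIR is "lib"; adjust if yours is "lib64".
ARROW_PC_DIR="$ARROW_PREFIX/lib/pkgconfig"

# Prepend the directory, keeping any entries that are already set.
export PKG_CONFIG_PATH="$ARROW_PC_DIR${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"

echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
```

With the variable exported, start R from the same shell so that the installation inherits it.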

If the package fails to install/load with an error like this:

** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib

try setting the environment variable R_LD_LIBRARY_PATH to wherever Arrow C++ was put in make install, e.g. export R_LD_LIBRARY_PATH=/usr/local/lib, and retry installing the R package.

Using prebuilt binaries

If the R package finds and downloads a prebuilt binary of the C++ library, but then the arrow package can’t be loaded, perhaps with “undefined symbols” errors, please report an issue. This is likely a compiler mismatch and may be resolvable by setting some environment variables to instruct R to compile the packages to match the C++ library.

A workaround would be to set the environment variable LIBARROW_BINARY_DISTRO=FALSE and retry installation: this value instructs the package to build the C++ library from source instead of downloading the prebuilt binary. That should guarantee that the compiler settings match.

If a prebuilt binary wasn’t found for your operating system but you think it should have been, check the logs for a message that says *** Unable to identify current OS/version, or a message that says *** No C++ binaries found for an invalid OS. If you see either, please report an issue. You may also set the environment variable ARROW_R_DEV=TRUE for additional debug messages.

A workaround would be to set the environment variable LIBARROW_BINARY_DISTRO to a distribution-version that exists in the Ursa Labs repository. Setting LIBARROW_BINARY_DISTRO is also an option when there’s not an exact match for your OS but a similar version would work, such as if you’re on ubuntu-18.10 and there’s only a binary for ubuntu-18.04.
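
For the ubuntu-18.10 case just mentioned, the settings might look like the sketch below. Whether an ubuntu-18.04 binary is actually available for your R package version is an assumption you would need to check against the repository.

```shell
#!/bin/sh
# Request the ubuntu-18.04 binaries on a closely related system
# (for example ubuntu-18.10, which has no exact match).
export LIBARROW_BINARY_DISTRO=ubuntu-18.04

# Optional: print additional debug messages during installation.
export ARROW_R_DEV=TRUE

# To skip prebuilt binaries and build the C++ library from source instead:
# export LIBARROW_BINARY_DISTRO=FALSE
```

Export these in the shell that launches R, then retry install.packages("arrow").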

If that workaround works for you, and you believe that it should work for everyone else too, you may propose adding an entry to this lookup table. This table is checked during the installation process and tells the script to use binaries built on a different operating system/version because they’re known to work.

Building C++ from source

If building the C++ library from source fails, check the error message. The install script attempts to install any necessary build dependencies, but it’s possible that some operating systems may require additional ones. You may be able to install them and retry. Regardless, if the C++ library fails to compile, please report an issue so that we can attempt to improve the script.

Known C++ build issues

  1. m4 (build dependency for flex and bison, which are build dependencies for thrift) fails to build with a message like:
freadahead.c: In function 'freadahead':
freadahead.c:92:3: error: #error "Please port gnulib freadahead.c to your platform! Look at the definition of fflush, fread, ungetc on your system, then report this to bug-gnulib."

This has been observed on CentOS 8 (Docker image rstudio/r-base:3.6-centos8) and Fedora (rhub/fedora-clang-devel).

A solution is to install m4 using your system package manager; if that’s an option for you, you may as well just install flex and bison and avoid this build step entirely.

Summary of build environment variables

By default, these are all unset.

  • ARROW_USE_PKG_CONFIG: If set to FALSE, the configure script won’t look for Arrow libraries on your system and instead will look to download/build them. Use this if you have a version mismatch between installed system libraries and the version of the R package you’re installing.
  • LIBARROW_DOWNLOAD: If set to FALSE (case insensitive), the build script will not attempt to download C++ binary or source bundles. Use this if you’re in a checkout of the apache/arrow git repository and want to build the C++ library from the local source.
  • LIBARROW_BUILD: If set to FALSE (case insensitive), the build script will not attempt to build the C++ from source. This means you will only get a working arrow R package if a prebuilt binary is found. Use this if you want to avoid compiling the C++ library, which may be slow and resource-intensive, and ensure that you only use a prebuilt binary.
  • LIBARROW_BINARY_DISTRO: If set to FALSE (case insensitive), the script will not download a binary, but it may still download a source bundle. You may also set it to some other string, a related “distro-version” that has binaries built that work for your OS.
  • NOT_CRAN: If this variable is set to true, as the devtools package does, the build script will attempt to download and/or build the Arrow C++ library, if necessary, even when running R CMD check. Otherwise, it will only download/build C++ if you’re not running R CMD check. The purpose of this protection is to avoid expensive compilation in automated testing environments (unless you opt-in).
  • ARROW_R_DEV: If set to TRUE, more verbose messaging will be printed in the build script. This variable also is needed if you’re modifying Rcpp code in the package: see “Editing Rcpp code” in the README.
  • DEBUG_DIR: If building the C++ library from source fails (cmake), there may be messages telling you to check a log file in the build directory. However, when the library is built during R package installation, that location is a temp directory that has already been deleted. To capture those logs, set this variable to an absolute (not relative) path and the log files will be copied there. The directory will be created if it does not exist.
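
As an example of combining these variables, the sketch below prepares a shell for debugging a failed source build. The directory name arrow-build-logs is arbitrary and chosen only for illustration.

```shell
#!/bin/sh
# DEBUG_DIR must be an absolute path, so anchor it to the current directory.
# The directory name is arbitrary; it is created by the build if missing.
export DEBUG_DIR="$(pwd)/arrow-build-logs"

# Print more verbose messages from the build script.
export ARROW_R_DEV=TRUE

echo "Build logs will be copied to: $DEBUG_DIR"
```

After exporting these, rerun the installation from the same shell and inspect the copied log files if the build fails again.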

Contributing

As mentioned above, please report an issue if you encounter ways to improve this. If you find that your Linux distribution or version is not supported, we welcome the contribution of Docker images (hosted on Docker Hub) that we can use in our continuous integration. These Docker images should be minimal, containing only R and the dependencies it requires. (For reference, see the images that R-hub uses.)

You can test the arrow R package installation using the docker-compose setup included in the apache/arrow git repository. For example,

R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r

installs the arrow R package, including the C++ source build, on the rhub/ubuntu-gcc-release image.