In most cases, install.packages("arrow")
should just work. There are things you can do to make the installation faster, documented in this article. If for some reason installation does not work, set the environment variable ARROW_R_DEV=true
, retry, and share the logs with us.
Background
The Apache Arrow project is implemented in multiple languages, and the R package depends on the Arrow C++ library (referred to from here on as libarrow). This means that when you install arrow, you need both the R and C++ versions. If you install arrow from CRAN on a machine running Windows or macOS, when you call install.packages("arrow")
, a precompiled binary containing both the R package and libarrow will be downloaded. However, CRAN does not host R package binaries for Linux, and so you must choose from one of the alternative approaches.
This article outlines the recommend approaches to installing arrow on Linux, starting from the simplest and least customizable to the most complex but with more flexibility to customize your installation.
The primary audience for this document is arrow R package users on Linux, and not Arrow developers. Additional resources for developers are listed at the end of this article.
System dependencies
The arrow package is designed to work with very minimal system requirements, but there are a few things to note.
Compilers
As of version 10.0.0, arrow requires a C++17 compiler to build. For gcc
, this generally means version 7 or newer. Most contemporary Linux distributions have a new enough compiler; however, CentOS 7 is a notable exception, as it ships with gcc 4.8.
If you are on CentOS 7, to build arrow you will need to install a newer devtoolset
, and you’ll need to update R’s Makevars to define the CXX17
variables. This script installs devtoolset-8
and configures R to be able to use C++17:
#!/usr/bin/env bash
yum install -y centos-release-scl
yum install -y devtoolset-8
# Optional: also install cloud storage dependencies, as described below
yum install -y libcurl-devel openssl-devel
source /opt/rh/devtoolset-8/enable
if [ ! `R CMD config CXX17` ]; then
mkdir -p ~/.R
echo "CC = $(which gcc) -fPIC" >> ~/.R/Makevars
echo "CXX17 = $(which g++) -fPIC" >> ~/.R/Makevars
echo "CXX17STD = -std=c++17" >> ~/.R/Makevars
echo "CXX17FLAGS = ${CXX11FLAGS}" >> ~/.R/Makevars
fi
Note that the C++17 compiler is only required at build time. You don’t need to enable the devtoolset every time you load the package. What’s more, if you install a binary package from RStudio Package Manager (see method 1a below), you do not need to set up any of this. Likewise, if you R CMD INSTALL --build
arrow on a CentOS machine with the newer compilers, you can take the binary package it produces and install it on any other CentOS machine without those compilers.
Libraries
Optional support for reading from cloud storage–AWS S3 and Google Cloud Storage (GCS)–requires additional system dependencies:
- CURL: install
libcurl-devel
(rpm) orlibcurl4-openssl-dev
(deb) - OpenSSL >= 1.0.2: install
openssl-devel
(rpm) orlibssl-dev
(deb)
The prebuilt binaries come with S3 and GCS support enabled, so you will need to meet these system requirements in order to use them. If you’re building everything from source, the install script will check for the presence of these dependencies and turn off S3 and GCS support in the build if the prerequisites are not met–installation will succeed but without S3 or GCS functionality. If afterwards you install the missing system requirements, you’ll need to reinstall the package in order to enable S3 and GCS support.
Install release version (easy way)
On macOS and Windows, when you run install.packages("arrow")
and install arrow from CRAN, you get an R binary package that contains a precompiled version of libarrow. Installing binaries is much easier that installing from source, but CRAN does not host binaries for Linux. This means that the default behaviour when you run install.packages()
on Linux is to retrieve the source version of the R package and compile both the R package and libarrow from source. We’ll talk about this scenario in the next section (the “less easy” way), but first we’ll suggest two faster alternatives that are usually much easier.
Binary R package with libarrow binary via RSPM/conda
If you want a quicker installation process, and by default a more fully-featured build, you could install arrow from RStudio’s public package manager, which hosts binaries for both Windows and Linux.
For example, if you are using Ubuntu 20.04 (Focal):
options(
HTTPUserAgent =
sprintf(
"R/%s R (%s)",
getRversion(),
paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
)
)
install.packages("arrow", repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")
Note that the User Agent header must be specified as in the example above. Please check the RStudio Package Manager: Admin Guide for more details.
For other Linux distributions, to get the relevant URL, you can visit the RSPM site, click on ‘binary’, and select your preferred distribution.
Similarly, if you use conda
to manage your R environment, you can get the latest official release of the R package including libarrow via:
# Using the --strict-channel-priority flag on `conda install` causes very long
# solve times, so we add it directly to the config
conda config --set channel_priority strict
conda install -c conda-forge r-arrow
R source package with libarrow binary
Another way of achieving faster installation with all key features enabled is to use static libarrow binaries we host. These are used automatically on many Linux distributions (x86_64 architecture only), according to the allowlist. If your distribution isn’t in the list, you can opt-in by setting the NOT_CRAN
environment variable before you call install.packages()
:
Sys.setenv("NOT_CRAN" = "true")
install.packages("arrow")
This installs the source version of the R package, but during the installation process will check for compatible libarrow binaries that we host and use those if available. If no binary is available or can’t be found, then this option falls back onto method 2 below (full source build), but setting the environment variable results in a more fully-featured build than default.
The libarrow binaries include support for AWS S3 and GCS, so they require the libcurl and openssl libraries installed separately, as noted above. If you don’t have these installed, the libarrow binary won’t be used, and you will fall back to the full source build (with S3 and GCS support disabled).
If the internet access of your computer doesn’t allow downloading the libarrow binaries (e.g. if access is limited to CRAN), you can first identify the right source and version by trying to install on the offline computer:
Sys.setenv("NOT_CRAN" = "true", "LIBARROW_BUILD" = FALSE, "ARROW_R_DEV" = TRUE)
install.packages("arrow")
# This will fail if no internet access, but will print the binaries URL
Then you can obtain the libarrow binaries (using a computer with internet access) and transfer the zip file to the target computer. Now you just have to tell the installer to use that pre-downloaded file:
# Watchout: release numbers of the pre-downloaded libarrow must match CRAN!
Sys.setenv("ARROW_DOWNLOADED_BINARIES" = "/path/to/downloaded/libarrow.zip")
install.packages("arrow")
Install release version (less easy)
The “less easy” way to install arrow is to install both the R package and the underlying Arrow C++ library (libarrow) from source. This method is somewhat more difficult because compiling and installing R packages with C++ dependencies generally requires installing system packages, which you may not have privileges to do, and/or building the C++ dependencies separately, which introduces all sorts of additional ways for things to go wrong.
Installing from the full source build of arrow, compiling both C++ and R bindings, will handle most of the dependency management for you, but it is much slower than using binaries. However, if using binaries isn’t an option for you,or you wish to customize your Linux installation, the instructions in this section explain how to do that.
Basic configuration
If you wish to install libarrow from source instead of looking for pre-compiled binaries, you can set the LIBARROW_BINARY
variable.
Sys.setenv("LIBARROW_BINARY" = FALSE)
By default, this is set to TRUE
, and so libarrow will only be built from source if this environment variable is set to FALSE
or no compatible binary for your OS can be found.
When compiling libarrow from source, you have the power to really fine-tune which features to install. You can set the environment variable LIBARROW_MINIMAL
to FALSE
to enable a more full-featured build including S3 support and alternative memory allocators.
Sys.setenv("LIBARROW_MINIMAL" = FALSE)
By default this variable is unset, which builds many commonly used features such as Parquet support but disables some features that are more costly to build, like S3 and GCS support. If set to TRUE
, a trimmed-down version of arrow is installed with all optional features disabled.
Note that in this guide, you will have seen us mention the environment variable NOT_CRAN
- this is a convenience variable, which when set to TRUE
, automatically sets LIBARROW_MINIMAL
to FALSE
and LIBARROW_BINARY
to TRUE
.
Building libarrow from source requires more time and resources than installing a binary. We recommend that you set the environment variable ARROW_R_DEV
to TRUE
for more verbose output during the installation process if anything goes wrong.
Sys.setenv("ARROW_R_DEV" = TRUE)
Once you have set these variables, call install.packages()
to install arrow using this configuration.
install.packages("arrow")
The section below discusses environment variables you can set before calling install.packages("arrow")
to build from source and customise your configuration.
Handling libarrow dependencies
When you build libarrow from source, its dependencies will be automatically downloaded. The environment variable ARROW_DEPENDENCY_SOURCE
controls whether the libarrow installation also downloads or installs all dependencies (when set to BUNDLED
), uses only system-installed dependencies (when set to SYSTEM
) or checks system-installed dependencies first and only installs dependencies which aren’t already present (when set to AUTO
, the default).
These dependencies vary by platform; however, if you wish to install these yourself prior to libarrow installation, we recommend that you take a look at the docker file for whichever of our CI builds (the ones ending in “cpp” are for building Arrow’s C++ libraries, aka libarrow) corresponds most closely to your setup. This will contain the most up-to-date information about dependencies and minimum versions.
If downloading dependencies at build time is not an option, as when building on a system that is disconnected or behind a firewall, there are a few options. See “Offline builds” below.
Dependencies for S3 and GCS support
Support for working with data in S3 and GCS is not enabled in the default source build, and it has additional system requirements as described above. To enable it, set the environment variable LIBARROW_MINIMAL=false
or NOT_CRAN=true
to choose the full-featured build, or more selectively set ARROW_S3=ON
and/or ARROW_GCS=ON
.
When either feature is enabled, the install script will check for the presence of the required dependencies, and if the prerequisites are met, it will turn off S3 and GCS support–installation will succeed but without S3 or GCS functionality. If afterwards you install the missing system requirements, you’ll need to reinstall the package in order to enable S3 and GCS support.
Advanced configuration
In this section, we describe how to fine-tune your installation at a more granular level.
libarrow configuration
Some features are optional when you build Arrow from source - you can configure whether these components are built via the use of environment variables. The names of the environment variables which control these features and their default values are shown below.
Name | Description | Default Value |
---|---|---|
ARROW_S3 |
S3 support (if dependencies are met)* | OFF |
ARROW_GCS |
GCS support (if dependencies are met)* | OFF |
ARROW_JEMALLOC |
The jemalloc memory allocator |
ON |
ARROW_MIMALLOC |
The mimalloc memory allocator |
ON |
ARROW_PARQUET |
ON |
|
ARROW_DATASET |
ON |
|
ARROW_JSON |
The JSON parsing library | ON |
ARROW_WITH_RE2 |
The RE2 regular expression library, used in some string compute functions | ON |
ARROW_WITH_UTF8PROC |
The UTF8Proc string library, used in many other string compute functions | ON |
ARROW_WITH_BROTLI |
Compression algorithm | ON |
ARROW_WITH_BZ2 |
Compression algorithm | ON |
ARROW_WITH_LZ4 |
Compression algorithm | ON |
ARROW_WITH_SNAPPY |
Compression algorithm | ON |
ARROW_WITH_ZLIB |
Compression algorithm | ON |
ARROW_WITH_ZSTD |
Compression algorithm | ON |
R package configuration
There are a number of other variables that affect the configure
script and the bundled build script. All boolean variables are case-insensitive.
Name | Description | Default |
---|---|---|
LIBARROW_BUILD |
Allow building from source | true |
LIBARROW_BINARY |
Try to install libarrow binary instead of building from source |
(unset) |
LIBARROW_DOWNLOAD |
Set to false to explicitly forbid fetching a libarrow binary |
(unset) |
LIBARROW_MINIMAL |
Build with minimal features enabled | (unset) |
NOT_CRAN |
Set LIBARROW_BINARY=true and LIBARROW_MINIMAL=false
|
false |
ARROW_R_DEV |
More verbose messaging and regenerates some code | false |
ARROW_USE_PKG_CONFIG |
Use pkg-config to search for libarrow install |
true |
LIBARROW_DEBUG_DIR |
Directory to save source build logs | (unset) |
CMAKE |
Alternative CMake path | (unset) |
See below for more in-depth explanations of these environment variables.
-
LIBARROW_BINARY
: By default on many distributions, or if explicitly set totrue
, the script will determine whether there is a prebuilt libarrow that will work with your system. You can set it tofalse
to skip this option altogether, or you can specify a string “distro-version” that corresponds to a binary that is available, to override what this function may discover by default. Possible values are: “linux-openssl-1.0”, “linux-openssl-1.1”, “linux-openssl-3.0”. -
LIBARROW_BUILD
: If set tofalse
, the build script will not attempt to build the C++ from source. This means you will only get a working arrow R package if a prebuilt binary is found. Use this if you want to avoid compiling the C++ library, which may be slow and resource-intensive, and ensure that you only use a prebuilt binary. -
LIBARROW_MINIMAL
: If set tofalse
, the build script will enable some optional features, including S3 support and additional alternative memory allocators. This will increase the source build time but results in a more fully functional library. If set totrue
turns off Parquet, Datasets, compression libraries, and other optional features. This is not commonly used but may be helpful if needing to compile on a platform that does not support these features, e.g. Solaris. -
NOT_CRAN
: If this variable is set totrue
, as thedevtools
package does, the build script will setLIBARROW_BINARY=true
andLIBARROW_MINIMAL=false
unless those environment variables are already set. This provides for a more complete and fast installation experience for users who already haveNOT_CRAN=true
as part of their workflow, without requiring additional environment variables to be set. -
ARROW_R_DEV
: If set totrue
, more verbose messaging will be printed in the build script.arrow::install_arrow(verbose = TRUE)
sets this. This variable also is needed if you’re modifying C++ code in the package: see the developer guide article. -
ARROW_USE_PKG_CONFIG
: If set tofalse
, the configure script won’t look for Arrow libraries on your system and instead will look to download/build them. Use this if you have a version mismatch between installed system libraries and the version of the R package you’re installing. -
LIBARROW_DEBUG_DIR
: If the C++ library building from source fails (cmake
), there may be messages telling you to check some log file in the build directory. However, when the library is built during R package installation, that location is in a temp directory that is already deleted. To capture those logs, set this variable to an absolute (not relative) path and the log files will be copied there. The directory will be created if it does not exist. -
CMAKE
: When building the C++ library from source, you can specify a/path/to/cmake
to use a different version than whatever is found on the$PATH
.
Using install_arrow()
The previous instructions are useful for a fresh arrow installation, but arrow provides the function install_arrow()
. There are three common use cases for this function:
- You have arrow installed and want to upgrade to a different version
- You want to try to reinstall and fix issues with Linux C++ binaries
- You want to install a development build
Examples of using install_arrow()
are shown below:
install_arrow() # latest release
install_arrow(nightly = TRUE) # install development version
install_arrow(verbose = TRUE) # verbose output to debug install errors
Although this function is part of the arrow package, it is also available as a standalone script, so you can access it without first installing the package:
source("https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R")
Notes:
-
install_arrow()
does not require environment variables to be set in order to satisfy C++ dependencies. - unlike packages like
tensorflow
,blogdown
, and others that require external dependencies, you do not need to runinstall_arrow()
after a successful arrow installation.
Offline installation
The install-arrow.R
file mentioned in the previous section includes a function called create_package_with_all_dependencies()
. Normally, when installing on a computer with internet access, the build process will download third-party dependencies as needed. This function provides a way to download them in advance, which can be useful when installing Arrow on a computer without internet access. The process is as follows:
Step 1. Using a computer with internet access, download dependencies:
-
Install the arrow package or source the script directly using the following command:
source("https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R")
-
Use the
create_package_with_all_dependencies()
function to create the installation bundle:create_package_with_all_dependencies("my_arrow_pkg.tar.gz")
Copy the newly created
my_arrow_pkg.tar.gz
file to the computer without internet access
Step 2. On the computer without internet access, install the prepared package:
-
Install the arrow package from the copied file:
install.packages( "my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo") )
This installation will build from source, so
cmake
must be available Run
arrow_info()
to check installed capabilities
Notes:
arrow can be installed on a computer without internet access without using this function, but many useful features will be disabled, as they depend on third-party components. More precisely,
arrow::arrow_info()$capabilities()
will beFALSE
for every capability.If you are using binary packages you shouldn’t need to this function. You can download the appropriate binary from your package repository, transfer that to the offline computer, and install that.
If you’re using RStudio Package Manager on Linux (RSPM), and you want to make a source bundle with this function, make sure to set the first repository in
options("repos")
to be a mirror that contains source packages. That is, the repository needs to be something other than the RSPM binary mirror URLs.
Offline installation (alternative)
A second method for offline installation is a little more hands-on. Follow these steps if you wish to try it:
- Download the dependency files (
cpp/thirdparty/download_dependencies.sh
may be helpful) - Copy the directory of dependencies to the offline computer
- Create the environment variable
ARROW_THIRDPARTY_DEPENDENCY_DIR
on the offline computer, pointing to the copied directory. - Install the arrow package as usual.
For offline installation using libarrow binaries, see Method 1b above.
Troubleshooting
The intent is that install.packages("arrow")
will just work and handle all C++ dependencies, but depending on your system, you may have better results if you tune one of several parameters. Here are some known complications and ways to address them.
Package failed to build C++ dependencies
If you see a message like
------------------------- NOTE ---------------------------
There was an issue preparing the Arrow C++ libraries.
See https://arrow.apache.org/docs/r/articles/install.html
---------------------------------------------------------
in the output when the package fails to install, that means that installation failed to retrieve or build the libarrow version compatible with the current version of the R package.
Please check the “Known installation issues” below to see if any apply, and if none apply, set the environment variable ARROW_R_DEV=TRUE
for more verbose output and try installing again. Then, please report an issue and include the full installation output.
Using system libraries
If a system library or other installed Arrow is found but it doesn’t match the R package version (for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0), it is likely that the R bindings will fail to compile. Because the Apache Arrow project is under active development, it is essential that versions of libarrow and the R package matches. When install.packages("arrow")
has to download libarrow, the install script ensures that you fetch the libarrow version that corresponds to your R package version. However, if you are using a version of libarrow already on your system, version match isn’t guaranteed.
To fix version mismatch, you can either update your libarrow system packages to match the R package version, or set the environment variable ARROW_USE_PKG_CONFIG=FALSE
to tell the configure script not to look for system version of libarrow. (The latter is the default of install_arrow()
.) System libarrow versions are available corresponding to all CRAN releases but not for nightly or dev versions, so depending on the R package version you’re installing, system libarrow version may not be an option.
Note also that once you have a working R package installation based on system (shared) libraries, if you update your system libarrow installation, you’ll need to reinstall the R package to match its version. Similarly, if you’re using libarrow system libraries, running update.packages()
after a new release of the arrow package will likely fail unless you first update the libarrow system packages.
Using prebuilt binaries
If the R package finds and downloads a prebuilt binary of libarrow, but then the arrow package can’t be loaded, perhaps with “undefined symbols” errors, please report an issue. This is likely a compiler mismatch and may be resolvable by setting some environment variables to instruct R to compile the packages to match libarrow.
A workaround would be to set the environment variable LIBARROW_BINARY=FALSE
and retry installation: this value instructs the package to build libarrow from source instead of downloading the prebuilt binary. That should guarantee that the compiler settings match.
If a prebuilt libarrow binary wasn’t found for your operating system but you think it should have been, please report an issue and share the console output. You may also set the environment variable ARROW_R_DEV=TRUE
for additional debug messages.
Building libarrow from source
If building libarrow from source fails, check the error message. (If you don’t see an error message, only the ----- NOTE -----
, set the environment variable ARROW_R_DEV=TRUE
to increase verbosity and retry installation.) The install script should work everywhere, so if libarrow fails to compile, please report an issue so that we can improve the script.
Contributing
We are constantly working to make the installation process as painless as possible. If you find ways to improve the process, please report an issue so that we can document it. Similarly, if you find that your Linux distribution or version is not supported, we would welcome the contribution of Docker images (hosted on Docker Hub) that we can use in our continuous integration and hopefully improve our coverage. If you do contribute a Docker image, it should be as minimal as possible, containing only R and the dependencies it requires. For reference, see the images that R-hub uses.
You can test the arrow R package installation using the docker-compose
setup included in the apache/arrow
git repository. For example,
R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker-compose build r
R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker-compose run r
installs the arrow R package, including libarrow, on the rhub/ubuntu-release image.
Further reading
- To learn about installing development versions, see the article on installing nightly builds.
- If you’re contributing to the Arrow project, see the Arrow R developers guide for resources to help you on set up your development environment.
- Arrow developers may also wish to read a more detailed discussion of the code run during the installation process, described in the install details article.