The Arrow R package is unique compared to other R packages that you may have contributed to because it builds on top of the large and feature-rich Arrow C++ implementation. Because the R package integrates tightly with Arrow C++, it typically requires a dedicated copy of the library (i.e., it is usually not possible to link to a system version of libarrow during development).
Option 1: Using nightly libarrow binaries
On Linux, macOS, and Windows you can use the same workflow you might
use for another package that contains compiled code (e.g.,
R CMD INSTALL .
from a terminal,
devtools::load_all()
from an R prompt, or
Install & Restart
from RStudio). If the
arrow/r/libarrow
directory is not populated, the configure
script will attempt to download the latest nightly libarrow binary,
extract it to the arrow/r/libarrow
directory (macOS, Linux)
or arrow/r/windows
directory (Windows), and continue
building the R package as usual.
Most of the time, you won’t need to update your version of libarrow
because the R package rarely changes with updates to the C++ library;
however, if you start to get errors when rebuilding the R package, you
may have to remove the libarrow
directory (macOS, Linux) or
windows
directory (Windows) and do a “clean” rebuild. You
can do this from a terminal with
R CMD INSTALL . --preclean
, from RStudio using the “Clean
and Install” option from “Build” tab, or using make clean
if you are using the Makefile
located in the root of the R
package.
Option 2: Use a local Arrow C++ development build
If you need to alter both libarrow and the R package code, or if you can’t get a binary version of the latest libarrow elsewhere, you’ll need to build it from source. This section discusses how to set up a C++ libarrow build configured to work with the R package. For more general resources, see the Arrow C++ developer guide.
There are five major steps to the process.
Step 1 - Install dependencies
When building libarrow, by default, system dependencies will be used if suitable versions are found. If system dependencies are not present, libarrow will build them during its own build process. The only dependencies that you need to install outside of the build process are cmake (for configuring the build) and openssl if you are building with S3 support.
For a faster build, you may choose to pre-install more C++ library dependencies (such as lz4, zstd, etc.) on the system so that they don’t need to be built from source in the libarrow build.
Step 2 - Configure the libarrow build
We recommend that you configure libarrow to be built to a user-level
directory rather than a system directory for your development work. This
is so that the development version you are using doesn’t overwrite a
released version of libarrow you may already have installed, and so that
you are also able work with more than one version of libarrow (by using
different ARROW_HOME
directories for the different
versions).
In the example below, libarrow is installed to a directory called
dist
that has the same parent directory as the arrow
checkout. Your installation of the Arrow R package can point to any
directory with any name, though we recommend not placing it
inside of the arrow git checkout directory as unwanted changes could
stop it working properly.
export ARROW_HOME=$(pwd)/dist
mkdir $ARROW_HOME
Special instructions on Linux: You will need to set
LD_LIBRARY_PATH
to the lib
directory that is
under where you set $ARROW_HOME
, before launching R and
using arrow. One way to do this is to add it to your profile (we use
~/.bash_profile
here, but you might need to put this in a
different file depending on your setup, e.g. if you use a shell other
than bash
). On macOS you do not need to do this because the
macOS shared library paths are hardcoded to their locations during build
time.
export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH
echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile
Start by navigating in a terminal to the arrow repository. You will
need to create a directory into which the C++ build will put its
contents. We recommend that you make a build
directory
inside of the cpp
directory of the Arrow git repository (it
is git-ignored, so you won’t accidentally check it in). Next, change
directories to be inside cpp/build
:
pushd arrow
mkdir -p cpp/build
pushd cpp/build
You’ll first call cmake
to configure the build and then
make install
. For the R package, you’ll need to enable
several features in libarrow using -D
flags:
Linux / Mac OS
cmake \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_EXTRA_ERROR_CONTEXT=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_INSTALL_NAME_RPATH=OFF \
-DARROW_JEMALLOC=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
..
..
refers to the C++ source directory: you’re in
cpp/build
and the source is in cpp
.
Enabling more Arrow features
To enable optional features including: S3 support, an alternative
memory allocator, and additional compression libraries, add some or all
of these flags to your call to cmake
(the trailing
\
makes them easier to paste into a bash shell on a new
line):
-DARROW_GCS=ON \
-DARROW_MIMALLOC=ON \
-DARROW_S3=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZSTD=ON \
Other flags that may be useful:
-DBoost_SOURCE=BUNDLED
and-DThrift_SOURCE=BUNDLED
, for example, or any other dependency*_SOURCE
, if you have a system version of a C++ dependency that doesn’t work correctly with Arrow. This tells the build to compile its own version of the dependency from source.-DCMAKE_BUILD_TYPE=debug
or-DCMAKE_BUILD_TYPE=relwithdebinfo
can be useful for debugging. You probably don’t want to do this generally because a debug build is much slower at runtime than the defaultrelease
build.-DARROW_BUILD_STATIC=ON
and-DARROW_BUILD_SHARED=OFF
if you want to use static libraries instead of dynamic libraries. With static libraries there isn’t a risk of the R package linking to the wrong library, but it does mean if you change the C++ code you have to recompile both the C++ libraries and the R package. Compilers typically will link to static libraries only if the dynamic ones are not present, which is why we need to set-DARROW_BUILD_SHARED=OFF
. If you are switching after compiling and installing previously, you may need to remove the.dll
or.so
files from$ARROW_HOME/dist/bin
.
Note cmake
is particularly sensitive to
whitespacing, if you see errors, check that you don’t have any errant
whitespace.
Step 3 - Building libarrow
You can add -j#
at the end of the command here too to
speed up compilation by running in parallel (where #
is the
number of cores you have available).
cmake --build . --target install -j8
Step 4 - Build the Arrow R package
Once you’ve built libarrow, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout:
popd # To go back to the root directory of the project, from cpp/build
pushd r
R -e "install.packages('remotes'); remotes::install_deps(dependencies = TRUE)"
R CMD INSTALL --no-multiarch .
The --no-multiarch
flag makes it only compile on the
“main” architecture. This will compile for the architecture that the R
in your path corresponds to. If you compile on one architecture and then
switch to another, make sure to pass the --preclean
flag so
that the R package code is recompiled for the new architecture.
Otherwise, you may see errors like
LoadLibrary failure: %1 is not a valid Win32 application
.
Compilation flags
If you need to set any compilation flags while building the C++
extensions, you can use the ARROW_R_CXXFLAGS
environment
variable. For example, if you are using perf
to profile the
R extensions, you may need to set
export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
Recompiling the C++ code
With the setup described here, you should not need to rebuild the
Arrow library or even the C++ source in the R package as you iterate and
work on the R package. The only time those should need to be rebuilt is
if you have changed the C++ in the R package (and even then,
R CMD INSTALL .
should only need to recompile the files
that have changed) or if the libarrow C++ has changed and there
is a mismatch between libarrow and the R package. If you find yourself
rebuilding either or both each time you install the package or run
tests, something is probably wrong with your set up.
For a full build: a cmake
command with all of the
R-relevant optional dependencies turned on. Development with other
languages might require different flags as well. For example, to develop
Python, you would need to also add -DARROW_PYTHON=ON
(though all of the other flags used for Python are already included
here).
cmake \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_EXTRA_ERROR_CONTEXT=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_GCS=ON \
-DARROW_INSTALL_NAME_RPATH=OFF \
-DARROW_JEMALLOC=ON \
-DARROW_JSON=ON \
-DARROW_MIMALLOC=ON \
-DARROW_PARQUET=ON \
-DARROW_S3=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
..
Installing a version of the R package with a specific git reference
If you need an arrow installation from a specific repository or git reference, on most platforms except Windows, you can run:
remotes::install_github("apache/arrow/r", build = FALSE)
The build = FALSE
argument is important so that the
installation can access the C++ source in the cpp/
directory in apache/arrow
.
As with other installation methods, setting the environment variables
LIBARROW_MINIMAL=false
and ARROW_R_DEV=true
will provide a more full-featured version of Arrow and provide more
verbose output, respectively.
For example, to install from the (fictional) branch
bugfix
from apache/arrow
you could run:
Sys.setenv(LIBARROW_MINIMAL="false")
remotes::install_github("apache/arrow/r@bugfix", build = FALSE)
Developers may wish to use this method of installing a specific commit separate from another Arrow development environment or system installation (e.g. we use this in arrowbench to install development versions of libarrow isolated from the system install). If you already have libarrow installed system-wide, you may need to set some additional variables in order to isolate this build from your system libraries:
Setting the environment variable
FORCE_BUNDLED_BUILD
totrue
will skip thepkg-config
search for libarrow and attempt to build from the same source at the repository+ref given.You may also need to set the Makevars
CPPFLAGS
andLDFLAGS
to""
in order to prevent the installation process from attempting to link to already installed system versions of libarrow. One way to do this temporarily is wrapping yourremotes::install_github()
call like so:
withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...))
Summary of environment variables
- See the user-facing article on installation for a large number of environment variables that determine how the build works and what features get built.
-
ARROW_OFFLINE_BUILD
: When set totrue
, the build script will not download prebuilt the C++ library binary or, if needed,cmake
. It will turn off any features that require a download, unless they’re available inARROW_THIRDPARTY_DEPENDENCY_DIR
or thetools/thirdparty_download/
subfolder.create_package_with_all_dependencies()
creates that subfolder.
Troubleshooting
Note that after any change to libarrow, you must reinstall it and run
make clean
or git clean -fdx .
to remove any
cached object code in the r/src/
directory before
reinstalling the R package. This is only necessary if you make changes
to libarrow source; you do not need to manually purge object files if
you are only editing R or C++ code inside r/
.
Arrow library - R package mismatches
If libarrow and the R package have diverged, you will see errors like:
Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
Expected in: flat namespace
in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
Error: loading failed
Execution halted
ERROR: loading failed
To resolve this, try rebuilding the Arrow library.
Multiple versions of libarrow
If you are installing from a user-level directory, and you already have a previous installation of libarrow in a system directory, you get you may get errors like the following when you install the R package:
Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib
Referenced from: /usr/local/lib/libparquet.400.dylib
Reason: image not found
If this happens, you need to make sure that you don’t let R link to your system library when building arrow. You can do this a number of different ways:
- Setting the
MAKEFLAGS
environment variable to"LDFLAGS="
(see below for an example) this is the recommended way to accomplish this - Using {withr}’s
with_makevars(list(LDFLAGS = ""), ...)
- adding
LDFLAGS=
to your~/.R/Makevars
file (the least recommended way, though it is a common debugging approach suggested online)
MAKEFLAGS="LDFLAGS=" R CMD INSTALL .
rpath
issues
If the package fails to install/load with an error like this:
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
ensure that -DARROW_INSTALL_NAME_RPATH=OFF
was passed
(this is important on macOS to prevent problems at link time and is a
no-op on other platforms). Alternatively, try setting the environment
variable R_LD_LIBRARY_PATH
to wherever Arrow C++ was put in
make install
,
e.g. export R_LD_LIBRARY_PATH=/usr/local/lib
, and retry
installing the R package.
When installing from source, if the R and C++ library versions do not match, installation may fail. If you’ve previously installed the libraries and want to upgrade the R package, you’ll need to update the Arrow C++ library first.
For any other build/configuration challenges, see the C++ developer guide.
Other installation issues
There are a number of scripts that are triggered when the arrow R package is installed. For package users who are not interacting with the underlying code, these should all just work without configuration and pull in the most complete pieces (e.g. official binaries that we host). However, knowing about these scripts can help package developers troubleshoot if things go wrong in them or things go wrong in an install. See the article on R package installation for more information.