If you’re looking to contribute to arrow, this vignette can help you set up a development environment that will enable you to write code and run tests locally.
This document is intended only for developers of Apache Arrow or the Arrow R package. R package users do not need to do any of this setup. If you’re looking for how to install Arrow, see the instructions in the readme.
This document is a work in progress and will grow and change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don’t become stale!), but custom configurations might conflict with these instructions and there are differences of opinion across developers about how to set up development environments like this.
We welcome any feedback you have about things that are confusing or additions you would like to see here - please report an issue if you have any suggestions or requests.
Windows and macOS users who wish to contribute to the R package and don’t need to alter libarrow (Arrow’s C++ library) may be able to obtain a recent version of the library without building from source.
On Linux, you can download a .zip file containing libarrow from the nightly repository.
To see what nightlies are available, you can use arrow’s (or any other S3 client’s) S3 listing functionality to see what is in the bucket s3://arrow-r-nightly/libarrow/bin:
library(arrow)
# List the nightly libarrow binaries available in the bucket
nightly <- s3_bucket("arrow-r-nightly")
nightly$ls("libarrow/bin")
Version numbers in that repository correspond to dates.
You’ll need to create a libarrow directory inside the R package directory and unzip the zip file containing the compiled libarrow binary files into it.
On macOS, you can install libarrow using Homebrew:
On Windows, you can download a .zip file containing libarrow from the nightly repository.
To see what nightlies are available, you can use arrow’s (or any other S3 client’s) S3 listing functionality to see what is in the bucket s3://arrow-r-nightly/libarrow/bin:
library(arrow)
# List the nightly libarrow binaries available in the bucket
nightly <- s3_bucket("arrow-r-nightly")
nightly$ls("libarrow/bin")
Version numbers in that repository correspond to dates.
You can set the RWINLIB_LOCAL environment variable to point to the zip file containing libarrow before installing the arrow R package.
If you need to alter both libarrow and the R package code, or if you can’t get a binary version of the latest libarrow elsewhere, you’ll need to build it from source. This section discusses how to set up a C++ libarrow build configured to work with the R package. For more general resources, see the Arrow C++ developer guide.
There are five major steps to the process.
When building libarrow, by default, system dependencies will be used if suitable versions are found. If system dependencies are not present, libarrow will build them during its own build process. The only dependencies that you need to install outside of the build process are cmake (for configuring the build) and openssl if you are building with S3 support.
For a faster build, you may choose to pre-install more C++ library dependencies (such as lz4, zstd, etc.) on the system so that they don’t need to be built from source in the libarrow build.
We recommend that you configure libarrow to be built to a user-level directory rather than a system directory for your development work. This is so that the development version you are using doesn’t overwrite a released version of libarrow you may already have installed, and so that you are also able to work with more than one version of libarrow (by using different ARROW_HOME directories for the different versions).
In the example below, libarrow is installed to a directory called dist that has the same parent directory as the arrow checkout. Your installation of the Arrow R package can point to any directory with any name, though we recommend not placing it inside of the arrow git checkout directory as unwanted changes could stop it working properly.
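A minimal sketch of that layout, assuming the shell’s current directory is the parent of the arrow checkout (the directory name dist is only the convention used in this document):

```shell
# Assumption: the current directory is the parent of the arrow git checkout.
export ARROW_HOME="$(pwd)/dist"
# Create the install directory if it doesn't exist yet.
mkdir -p "$ARROW_HOME"
```

Using a different ARROW_HOME value per libarrow version keeps the installs separate.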
Special instructions on Linux: You will need to set LD_LIBRARY_PATH to the lib directory that is under where you set $ARROW_HOME, before launching R and using arrow. One way to do this is to add it to your profile (we use ~/.bash_profile here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than bash). On macOS you do not need to do this because the macOS shared library paths are hardcoded to their locations during build time.
export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH
echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile
Start by navigating in a terminal to the arrow repository. You will need to create a directory into which the C++ build will put its contents. We recommend that you make a build directory inside of the cpp directory of the Arrow git repository (it is git-ignored, so you won’t accidentally check it in). Next, change directories to be inside cpp/build:
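These steps might look like the following, assuming you start at the root of the arrow checkout. pushd is used rather than cd so that a later popd returns to the starting directory:

```shell
# Assumption: the current directory is the root of the arrow git checkout.
mkdir -p cpp/build
# pushd remembers the starting directory so that popd can return to it later.
pushd cpp/build
```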
You’ll first call cmake to configure the build and then make install. For the R package, you’ll need to enable several features in libarrow using -D flags:
cmake \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_EXTRA_ERROR_CONTEXT=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_INSTALL_NAME_RPATH=OFF \
-DARROW_JEMALLOC=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
..
The trailing .. refers to the C++ source directory: you’re in cpp/build, and the source is in cpp.
To enable optional features, including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags to your call to cmake (the trailing \ makes them easier to paste into a bash shell on a new line):
-DARROW_MIMALLOC=ON \
-DARROW_S3=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZSTD=ON \
Other flags that may be useful:
-DBoost_SOURCE=BUNDLED and -DThrift_SOURCE=BUNDLED, for example, or any other dependency *_SOURCE, if you have a system version of a C++ dependency that doesn’t work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
-DCMAKE_BUILD_TYPE=debug or -DCMAKE_BUILD_TYPE=relwithdebinfo can be useful for debugging. You probably don’t want to do this generally because a debug build is much slower at runtime than the default release build.
Note: cmake is particularly sensitive to whitespace. If you see errors, check that you don’t have any errant whitespace.
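One common form of errant whitespace is a space or tab after a trailing \, which stops the shell from treating the backslash as a line continuation. As a rough sketch, if you keep your cmake call in a script, grep can point at such lines. The temporary script below is hypothetical and is created only for illustration:

```shell
# Hypothetical script holding a cmake call; its second line has a stray
# space after the trailing backslash.
script=$(mktemp)
printf '%s\n' 'cmake \' '  -DARROW_CSV=ON \ ' '  -DARROW_JSON=ON \' '  ..' > "$script"

# Print (with line numbers) every line where whitespace follows the final backslash.
bad_lines=$(grep -nE '\\[[:space:]]+$' "$script" || true)
echo "$bad_lines"

rm -f "$script"
```

An empty result means no line in the script has whitespace after its trailing backslash.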
You can add -j# between make and install to speed up compilation by running in parallel (where # is the number of cores you have available).
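For instance, the core count can be looked up rather than typed by hand. This is a sketch that assumes getconf is available (it is on most Linux and macOS systems) and that it is run from cpp/build after cmake has generated a Makefile:

```shell
# Number of online cores; fall back to 2 if getconf cannot report it.
ncores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)

# Only build if cmake has already generated a Makefile in this directory.
if [ -f Makefile ]; then
  make -j"$ncores" install
fi
```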
Once you’ve built libarrow, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout:
popd # To go back to the root directory of the project, from cpp/build
pushd r
R -e 'install.packages("remotes"); remotes::install_deps(dependencies = TRUE)'
R CMD INSTALL .
If you need to set any compilation flags while building the C++ extensions, you can use the ARROW_R_CXXFLAGS environment variable. For example, if you are using perf to profile the R extensions, you may need to set additional flags through this variable.
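As an illustration, perf’s frame-pointer-based call graphs generally need code compiled with frame pointers kept. A sketch, assuming that is the flag you need (the exact flags depend on your compiler and profiler):

```shell
# Assumption: keep frame pointers so perf can unwind call stacks.
export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
# Then reinstall the package from the r/ directory, e.g. with: R CMD INSTALL .
```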
With the setup described here, you should not need to rebuild the Arrow library or even the C++ source in the R package as you iterate and work on the R package. The only time those should need to be rebuilt is if you have changed the C++ in the R package (and even then, R CMD INSTALL . should only need to recompile the files that have changed) or if the libarrow C++ has changed and there is a mismatch between libarrow and the R package. If you find yourself rebuilding either or both each time you install the package or run tests, something is probably wrong with your setup.
Below is a cmake command with all of the R-relevant optional dependencies turned on. Development with other languages might require different flags as well. For example, to develop Python, you would need to also add -DARROW_PYTHON=ON (though all of the other flags used for Python are already included here).
cmake \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_EXTRA_ERROR_CONTEXT=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_INSTALL_NAME_RPATH=OFF \
-DARROW_JEMALLOC=ON \
-DARROW_JSON=ON \
-DARROW_MIMALLOC=ON \
-DARROW_PARQUET=ON \
-DARROW_S3=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
..
If you need an arrow installation from a specific repository or git reference, on most platforms except Windows, you can run:
remotes::install_github("apache/arrow/r", build = FALSE)
The build = FALSE argument is important so that the installation can access the C++ source in the cpp/ directory in apache/arrow.
As with other installation methods, setting the environment variables LIBARROW_MINIMAL=false and ARROW_R_DEV=true will provide a more full-featured version of Arrow and provide more verbose output, respectively.
For example, to install from the (fictional) branch bugfix from apache/arrow you could run:
Sys.setenv(LIBARROW_MINIMAL="false")
remotes::install_github("apache/arrow/r@bugfix", build = FALSE)
Developers may wish to use this method of installing a specific commit separate from another Arrow development environment or system installation (e.g. we use this in arrowbench to install development versions of libarrow isolated from the system install). If you already have libarrow installed system-wide, you may need to set some additional variables in order to isolate this build from your system libraries:
Setting the environment variable FORCE_BUNDLED_BUILD to true will skip the pkg-config search for libarrow and attempt to build from the same source at the repository+ref given.
You may also need to set the Makevars CPPFLAGS and LDFLAGS to "" in order to prevent the installation process from attempting to link to already installed system versions of libarrow. One way to do this temporarily is wrapping your remotes::install_github() call like so:
withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...))
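The environment variables described in this section can also be exported in the shell before starting R, instead of being set with Sys.setenv(). A minimal sketch, using only the variables named above; whether you need all three depends on your setup:

```shell
# Build a more full-featured libarrow rather than the minimal one.
export LIBARROW_MINIMAL=false
# Produce more verbose installation output.
export ARROW_R_DEV=true
# Skip the pkg-config search for a system libarrow and build from source.
export FORCE_BUNDLED_BUILD=true
# R sessions started from this shell inherit these settings.
```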
The arrow/r directory contains a Makefile to help with some common tasks from the command line (e.g. make test, make doc, make clean, etc.).
The R documentation uses the @examplesIf tag introduced in roxygen2 version 7.1.1.9001, which hasn’t yet been released on CRAN at the time of writing. If you are making changes which require updating the documentation, please install the development version of roxygen2 from GitHub.
remotes::install_github("r-lib/roxygen2")
You can use devtools::document() and pkgdown::build_site() to rebuild the documentation and preview the results.
# Update roxygen documentation
devtools::document()
# To preview the documentation website
pkgdown::build_site(preview=TRUE)
The R code in the package follows the tidyverse style. On PR submission (and on pushes) our CI will run linting and will flag possible errors on the pull request with annotations.
To run lintr locally, install the lintr package (note: we currently use a fork that includes fixes not yet accepted upstream; see how lintr is installed in the file ci/docker/linux-apt-lint.dockerfile for the current status) and then run
lintr::lint_package("arrow/r")
You can automatically change the formatting of the code in the package using the styler package. There are two ways to do this:
Use the comment bot to do this automatically with the command @github-actions autotune on a PR, and commit it back to the branch.
Run the styler locally either via Makefile commands:
or in R:
# note the two excluded files which should not be styled
styler::style_pkg(exclude_files = c("tests/testthat/latin1.R", "data-raw/codegen.R"))
The styler package will fix many styling errors, though not all lintr errors are automatically fixable with styler. The list of files we intentionally do not style is in r/.styler_excludes.R.
The arrow package uses some customized tools on top of cpp11 to prepare its C++ code in src/. This is because there are some features that are only enabled and built conditionally during build time. If you change C++ code in the R package, you will need to set the ARROW_R_DEV environment variable to true (optionally, add it to your ~/.Renviron file to persist across sessions) so that the data-raw/codegen.R file is used for code generation. The Makefile commands also handle this automatically.
We use Google C++ style in our C++ code. The easiest way to accomplish this is to use an editor/IDE that formats your code for you. Many popular editors/IDEs have support for running clang-format on C++ files when you save them. Installing/enabling the appropriate plugin may save you much frustration.
Check for style errors with
Fix any style issues before committing with
The lint script requires Python 3 and clang-format-8. If the command isn’t found, you can explicitly provide the path to it like:
On macOS, you can get this by installing LLVM via Homebrew and running the script as:
Note that the lint script requires Python 3 and the Python dependencies (note that cmake_format is pinned to a specific version):
Tests can be run either using devtools::test() or the Makefile alternative.
# Run the test suite, optionally filtering file names
devtools::test(filter="^regexp$")
# or the Makefile alternative from the arrow/r directory in a shell:
make test file=regexp
Some tests are conditionally enabled based on the availability of certain features in the package build (S3 support, compression libraries, etc.). Others are generally skipped by default but can be enabled with environment variables or other settings:
All tests are skipped on Linux if the package builds without the C++ libarrow. To make the build fail if libarrow is not available (as in, to test that the C++ build was successful), set TEST_R_WITH_ARROW=true
Some tests are disabled unless ARROW_R_DEV=true
Tests that require allocating >2GB of memory to test Large types are disabled unless ARROW_LARGE_MEMORY_TESTS=true
Integration tests against a real S3 bucket are disabled unless credentials are set in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; these are available on request
S3 tests using MinIO locally are enabled if the minio server process is found running. If you’re running MinIO with custom settings, you can set MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and MINIO_PORT to override the defaults.
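Putting a few of these switches together, a shell session for a fuller test run might be prepared like this. It is a sketch only: which variables you set depends on what you want to exercise, and the MinIO check assumes pgrep is installed and is only a rough stand-in for however the test suite itself detects the server:

```shell
# Fail instead of skipping when libarrow is missing.
export TEST_R_WITH_ARROW=true
# Enable the tests that only run in development mode.
export ARROW_R_DEV=true
# Enable the tests that allocate more than 2GB of memory.
export ARROW_LARGE_MEMORY_TESTS=true

# Report whether a local MinIO server process appears to be running.
if pgrep -f "minio server" > /dev/null 2>&1; then
  minio_status="running"
else
  minio_status="not running"
fi
echo "MinIO: $minio_status"
# Then run the tests from the r/ directory, e.g. with: make test
```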
You can run package checks by using devtools::check() and check test coverage with covr::package_coverage().
# All package checks
devtools::check()
# See test coverage statistics
covr::report()
covr::package_coverage()
For full package validation, you can run the following commands from a terminal.
R CMD build .
R CMD check arrow_*.tar.gz --as-cran
On a pull request, there are some actions you can trigger by commenting on the PR. We have additional CI checks that run nightly and can be requested on demand using an internal tool called crossbow. A few important GitHub comment commands are shown below.
@github-actions crossbow submit -g r
This runs each of the R-related CI tasks.
@github-actions crossbow submit {task-name}
See the r: group definition near the beginning of the crossbow configuration for a list of glob expression patterns that match names of items in the tasks: list below it.
TEST_OFFLINE_BUILD: When set to true, the build script will not download the prebuilt C++ library binary. It will turn off any features that require a download, unless they’re available in the tools/cpp/thirdparty/download/ subfolder of the tar.gz file. create_package_with_all_dependencies() creates that subfolder. Regardless of this flag’s value, cmake will be downloaded if it’s unavailable.
TEST_R_WITHOUT_LIBARROW: When set to true, skip tests that would require the C++ Arrow library (that is, almost everything).
Note that after any change to libarrow, you must reinstall it and run make clean or git clean -fdx . to remove any cached object code in the r/src/ directory before reinstalling the R package. This is only necessary if you make changes to libarrow source; you do not need to manually purge object files if you are only editing R or C++ code inside r/.
If libarrow and the R package have diverged, you will see errors like:
Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
Expected in: flat namespace
in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
Error: loading failed
Execution halted
ERROR: loading failed
To resolve this, try rebuilding the Arrow library.
If you are installing from a user-level directory, and you already have a previous installation of libarrow in a system directory, you may get errors like the following when you install the R package:
Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib
Referenced from: /usr/local/lib/libparquet.400.dylib
Reason: image not found
If this happens, you need to make sure that you don’t let R link to your system library when building arrow. You can do this a number of different ways:
Set the MAKEFLAGS environment variable to "LDFLAGS=" (see below for an example); this is the recommended way to accomplish this
Use withr’s with_makevars(list(LDFLAGS = ""), ...)
Add LDFLAGS= to your ~/.R/Makevars file (the least recommended way, though it is a common debugging approach suggested online)
rpath issues
If the package fails to install/load with an error like this:
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
ensure that -DARROW_INSTALL_NAME_RPATH=OFF was passed (this is important on macOS to prevent problems at link time and is a no-op on other platforms). Alternatively, try setting the environment variable R_LD_LIBRARY_PATH to wherever Arrow C++ was put in make install, e.g. export R_LD_LIBRARY_PATH=/usr/local/lib, and retry installing the R package.
When installing from source, if the R and C++ library versions do not match, installation may fail. If you’ve previously installed the libraries and want to upgrade the R package, you’ll need to update the Arrow C++ library first.
For any other build/configuration challenges, see the C++ developer guide.
There are a number of scripts that are triggered when the arrow R package is installed. For package users who are not interacting with the underlying code, these should all just work without configuration and pull in the most complete pieces (e.g. official binaries that we host). However, knowing about these scripts can help package developers troubleshoot if things go wrong in them or things go wrong in an install. See the installation vignette for more information.