This document explains which C++ features are enabled in each Arrow R package build configuration and records the decisions behind our default feature set. It is internal developer documentation, not a guide to installing the Arrow R package; for that, see the installation guide.

Overview

When the Arrow R package is installed, it needs a copy of the Arrow C++ library (libarrow). This can come from:

  1. Prebuilt binaries we host (for releases and nightlies)
  2. Source builds when binaries aren’t available or users opt out

The features available in libarrow depend on how it was built. This document covers the feature configuration for both scenarios.

Prebuilt libarrow binary configuration

We produce prebuilt libarrow binaries for macOS, Windows, and Linux. These binaries include more features than the default source build to provide users with a fully-featured experience out of the box.

Current binary feature set

| Platform | S3 | GCS | Configured in |
| --- | --- | --- | --- |
| macOS (ARM64, x86_64) | ON | ON | dev/tasks/r/github.packages.yml |
| Windows | ON | ON | ci/scripts/PKGBUILD |
| Linux (x86_64) | ON | ON | compose.yaml (ubuntu-cpp-static) |

Exceptions to our build defaults

Even though GCS defaults to OFF for source builds, we explicitly enable it in our prebuilt binaries because:

  1. Binary users expect features to “just work” - they shouldn’t need to rebuild from source to access cloud storage
  2. Build time is not a concern - we build binaries once in CI, not on user machines
  3. Parity across platforms - users get the same features regardless of OS

Feature configuration in source builds of libarrow

Source builds are controlled by r/inst/build_arrow_static.sh. The key environment variable is LIBARROW_MINIMAL:

  • LIBARROW_MINIMAL unset: Default feature set (Parquet, Dataset, JSON, common compression ON; S3/GCS/jemalloc OFF)
  • LIBARROW_MINIMAL=false: Full feature set (adds S3, jemalloc, additional compression)
  • LIBARROW_MINIMAL=true: Truly minimal (disables Parquet, Dataset, JSON, most compression, SIMD)

Features always enabled

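These features are on by default and do not depend on $ARROW_DEFAULT_PARAM. The exception is LIBARROW_MINIMAL=true, which disables Parquet, Dataset, JSON, and most compression, as described above: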

| Feature | CMake Flag | Notes |
| --- | --- | --- |
| Compute | ARROW_COMPUTE=ON | Core compute functions |
| CSV | ARROW_CSV=ON | CSV reading/writing |
| Filesystem | ARROW_FILESYSTEM=ON | Local filesystem support |
| JSON | ARROW_JSON=ON | JSON reading |
| Parquet | ARROW_PARQUET=ON | Parquet file format |
| Dataset | ARROW_DATASET=ON | Multi-file datasets |
| Acero | ARROW_ACERO=ON | Query execution engine |
| Mimalloc | ARROW_MIMALLOC=ON | Memory allocator |
| LZ4 | ARROW_WITH_LZ4=ON | LZ4 compression |
| Snappy | ARROW_WITH_SNAPPY=ON | Snappy compression |
| RE2 | ARROW_WITH_RE2=ON | Regular expressions |
| UTF8Proc | ARROW_WITH_UTF8PROC=ON | Unicode support |

Features controlled by LIBARROW_MINIMAL

When LIBARROW_MINIMAL=false, the following additional features are enabled (via $ARROW_DEFAULT_PARAM=ON):

| Feature | CMake Flag | Default |
| --- | --- | --- |
| S3 | ARROW_S3 | $ARROW_DEFAULT_PARAM |
| Jemalloc | ARROW_JEMALLOC | $ARROW_DEFAULT_PARAM |
| Brotli | ARROW_WITH_BROTLI | $ARROW_DEFAULT_PARAM |
| BZ2 | ARROW_WITH_BZ2 | $ARROW_DEFAULT_PARAM |
| Zlib | ARROW_WITH_ZLIB | $ARROW_DEFAULT_PARAM |
| Zstd | ARROW_WITH_ZSTD | $ARROW_DEFAULT_PARAM |
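
The relationship between LIBARROW_MINIMAL and $ARROW_DEFAULT_PARAM can be sketched in shell as follows. This is a simplified illustration rather than a copy of build_arrow_static.sh: the function name `resolve_default_param` is hypothetical, and the `${VAR:-default}` fallback is an assumption about how an explicitly set ARROW_* variable takes precedence over the default.

```shell
#!/bin/sh
# Hypothetical helper: map LIBARROW_MINIMAL to the value used for
# $ARROW_DEFAULT_PARAM. Only the literal string "false" turns the
# optional features on; unset or "true" leaves them off.
resolve_default_param() {
  if [ "${1:-}" = "false" ]; then
    echo "ON"
  else
    echo "OFF"
  fi
}

ARROW_DEFAULT_PARAM="$(resolve_default_param "${LIBARROW_MINIMAL:-}")"

# An explicitly set ARROW_S3 wins; otherwise fall back to the default.
ARROW_S3="${ARROW_S3:-$ARROW_DEFAULT_PARAM}"
ARROW_WITH_ZSTD="${ARROW_WITH_ZSTD:-$ARROW_DEFAULT_PARAM}"

echo "-DARROW_S3=$ARROW_S3 -DARROW_WITH_ZSTD=$ARROW_WITH_ZSTD"
```

Under this pattern, setting a single variable such as ARROW_S3=ON enables that feature without switching the whole build to the full feature set.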

Features that require explicit opt-in

GCS (Google Cloud Storage) is always off by default, even when LIBARROW_MINIMAL=false:

| Feature | CMake Flag | Default | Reason |
| --- | --- | --- | --- |
| GCS | ARROW_GCS | OFF | Build complexity, dependency size |

To enable GCS in a source build, you must explicitly set ARROW_GCS=ON.

Why is GCS off by default?

GCS was turned off by default in #48343 (December 2025) because:

  1. Building google-cloud-cpp is fragile and adds significant build time
  2. The dependency on abseil (ABSL) has caused compatibility issues
  3. Users who need GCS can still enable it explicitly

Configuration file locations

libarrow source build configuration

The main build script that controls source builds:

r/inst/build_arrow_static.sh - CMake flags and defaults. The environment variables to look for are LIBARROW_MINIMAL, ARROW_*, and ARROW_DEFAULT_PARAM.

libarrow binary build configuration

Each platform has its own configuration file:

| Platform | Config file | Key settings |
| --- | --- | --- |
| macOS | dev/tasks/r/github.packages.yml | LIBARROW_MINIMAL=false, ARROW_GCS=ON |
| Windows | ci/scripts/PKGBUILD | ARROW_GCS=ON, ARROW_S3=ON |
| Linux | compose.yaml (ubuntu-cpp-static) | LIBARROW_MINIMAL=false, ARROW_GCS=ON |

R-universe builds

R-universe builds the Arrow R package for users who want newer versions than CRAN. R-universe behavior varies by platform and architecture:

| Platform | Architecture | Build method | Features |
| --- | --- | --- | --- |
| macOS | ARM64 | Downloads prebuilt binary | Full (S3 + GCS) |
| macOS | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) |
| Windows | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) |
| Windows | ARM64 | Not supported | NA |
| Linux | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) |
| Linux | ARM64 | Builds from source | S3 only (no GCS) |

Why Linux ARM64 builds from source

We publish prebuilt Linux binaries only for the x86_64 architecture. The binary selection logic in r/tools/nixlibs.R (line 263) explicitly checks for this:

if (identical(os, "darwin") || (identical(os, "linux") && identical(arch, "x86_64"))) {

When R-universe builds on Linux ARM64 runners, no binary is available, so it falls back to building from source using build_arrow_static.sh. Since GCS defaults to OFF in that script, Linux ARM64 users don’t get GCS support.
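
The decision described above can be sketched as a small shell function. It is a hypothetical translation of the single R condition quoted above, not the actual logic in nixlibs.R; the function name `has_prebuilt_binary` and the sample os/arch pairs are illustrative only.

```shell
#!/bin/sh
# Hypothetical shell translation of the quoted R condition.
# Succeeds (exit status 0) when a prebuilt libarrow binary is expected.
has_prebuilt_binary() {
  os="$1"
  arch="$2"
  [ "$os" = "darwin" ] || { [ "$os" = "linux" ] && [ "$arch" = "x86_64" ]; }
}

for target in "darwin arm64" "linux x86_64" "linux aarch64"; do
  # Intentional word splitting: each entry is an "os arch" pair.
  set -- $target
  if has_prebuilt_binary "$1" "$2"; then
    echo "$1/$2: download prebuilt binary"
  else
    echo "$1/$2: build from source"
  fi
done
```

Any Linux architecture other than x86_64 fails the check, which is why the ARM64 runners end up on the source-build path.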

Enabling GCS for Linux ARM64

To provide full feature parity for Linux ARM64, we would need to:

  1. Add an ARM64 Linux build job to dev/tasks/r/github.packages.yml
  2. Update select_binary() in nixlibs.R to recognize linux-aarch64
  3. Add the artifact pattern to dev/tasks/tasks.yml
  4. Update the nightly upload workflow

See GH-36193 for tracking this work.

Alternatively, changing the GCS default in build_arrow_static.sh from OFF to $ARROW_DEFAULT_PARAM would enable GCS for every source build that uses the full feature set (LIBARROW_MINIMAL=false), including Linux ARM64 on R-universe.
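
The effect of that alternative can be sketched by comparing the two defaults side by side. This is an illustration only, assuming the script resolves each flag with a `${VAR:-default}` fallback; the function names `gcs_current` and `gcs_proposed` are hypothetical.

```shell
#!/bin/sh
# Current behaviour as described above: GCS stays OFF unless ARROW_GCS is set.
gcs_current() {
  echo "${ARROW_GCS:-OFF}"
}

# Proposed alternative: GCS follows $ARROW_DEFAULT_PARAM, like S3 does.
gcs_proposed() {
  echo "${ARROW_GCS:-$ARROW_DEFAULT_PARAM}"
}

# Simulate a full-feature source build with no explicit ARROW_GCS.
unset ARROW_GCS
ARROW_DEFAULT_PARAM="ON"

echo "current:  -DARROW_GCS=$(gcs_current)"
echo "proposed: -DARROW_GCS=$(gcs_proposed)"
```

In both variants an explicit ARROW_GCS=ON or ARROW_GCS=OFF still takes precedence, so the change would only affect builds that leave the variable unset.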

Checking installed features

Users can check which features are enabled in their installation:

# Show all capabilities
arrow::arrow_info()

# Check specific features
arrow::arrow_with_s3()
arrow::arrow_with_gcs()