Building Arrow C++

System setup

Arrow uses CMake as a build configuration system. We recommend building out-of-source. If you are not familiar with this terminology:

  • In-source build: cmake is invoked directly from the cpp directory. This can be inflexible when you wish to maintain multiple build environments (e.g. one for debug builds and another for release builds)

  • Out-of-source build: cmake is invoked from another directory, creating an isolated build environment that does not interact with any other build environment. For example, you could create cpp/build-debug and invoke cmake $CMAKE_ARGS .. from this directory
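
As a rough sketch of the out-of-source layout described above, the following creates two sibling build directories and prints the configure command that would be run in each. The directory names build-debug and build-release and the ARROW_CPP_DIR variable are illustrative choices, not names the build system requires.

```shell
# Sketch only: set up two isolated out-of-source build directories.
# ARROW_CPP_DIR is a hypothetical variable; it defaults to ./cpp here.
ARROW_CPP_DIR="${ARROW_CPP_DIR:-cpp}"

for build_type in Debug Release; do
    # Lower-case the build type to get build-debug / build-release
    suffix="$(printf '%s' "$build_type" | tr '[:upper:]' '[:lower:]')"
    build_dir="$ARROW_CPP_DIR/build-$suffix"
    mkdir -p "$build_dir"
    # Print the configure command rather than running it (dry run)
    echo "(cd $build_dir && cmake -DCMAKE_BUILD_TYPE=$build_type ..)"
done
```

Because each directory holds its own CMake cache, reconfiguring one should not disturb the other.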

Building requires:

  • A C++11-enabled compiler. On Linux, gcc 4.8 and higher should be sufficient. For Windows, at least Visual Studio 2015 is required.

  • CMake 3.5 or higher

  • On Linux and macOS, either make or ninja build utilities

On Ubuntu/Debian you can install the requirements with:

sudo apt-get install \
     build-essential \
     cmake

On Alpine Linux:

apk add autoconf \
        bash \
        cmake \
        g++ \
        gcc \
        make

On macOS, you can use Homebrew:

git clone https://github.com/apache/arrow.git
cd arrow
brew update && brew bundle --file=cpp/Brewfile

With vcpkg:

git clone https://github.com/apache/arrow.git
cd arrow
vcpkg install \
  --x-manifest-root cpp \
  --feature-flags=versions \
  --clean-after-build

On MSYS2:

pacman --sync --refresh --noconfirm \
  ccache \
  git \
  mingw-w64-${MSYSTEM_CARCH}-boost \
  mingw-w64-${MSYSTEM_CARCH}-brotli \
  mingw-w64-${MSYSTEM_CARCH}-cmake \
  mingw-w64-${MSYSTEM_CARCH}-gcc \
  mingw-w64-${MSYSTEM_CARCH}-gflags \
  mingw-w64-${MSYSTEM_CARCH}-glog \
  mingw-w64-${MSYSTEM_CARCH}-gtest \
  mingw-w64-${MSYSTEM_CARCH}-lz4 \
  mingw-w64-${MSYSTEM_CARCH}-protobuf \
  mingw-w64-${MSYSTEM_CARCH}-python3-numpy \
  mingw-w64-${MSYSTEM_CARCH}-rapidjson \
  mingw-w64-${MSYSTEM_CARCH}-snappy \
  mingw-w64-${MSYSTEM_CARCH}-thrift \
  mingw-w64-${MSYSTEM_CARCH}-zlib \
  mingw-w64-${MSYSTEM_CARCH}-zstd

Building

The build system uses CMAKE_BUILD_TYPE=release by default, so if this argument is omitted then a release build will be produced.

Note

You need to set more options to build on Windows. See Developing on Windows for details.

Minimal release build:

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir release
cd release
cmake ..
make

Minimal debug build with unit tests:

git clone https://github.com/apache/arrow.git
cd arrow/cpp
mkdir debug
cd debug
cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
make unittest

The unit tests are not built by default. After building, you can also invoke the unit tests using the ctest tool provided by CMake (note that the test suite depends on Python being available).
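
One possible way to call ctest is sketched below. It assumes you are in a build directory that was configured with -DARROW_BUILD_TESTS=ON; the TESTS_RAN variable is only an illustration and is not used by the build system.

```shell
# Run from the build directory (e.g. cpp/debug) after the build finishes.
# CTestTestfile.cmake is written by CMake when testing is enabled.
if [ -f CTestTestfile.cmake ] && command -v ctest >/dev/null 2>&1; then
    # Show the output of failing tests only
    ctest --output-on-failure
    TESTS_RAN=yes
else
    echo "No CTest configuration found here; configure with -DARROW_BUILD_TESTS=ON first"
    TESTS_RAN=no
fi
```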

On some Linux distributions, running the test suite might require setting an explicit locale. If you see any locale-related errors, try setting the LC_ALL environment variable (which requires the locales package or equivalent):

export LC_ALL="en_US.UTF-8"

Faster builds with Ninja

Many contributors use the Ninja build system to get faster builds. It especially speeds up incremental builds. To use ninja, pass -GNinja when calling cmake and then use the ninja command instead of make.
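
A small sketch of how a build script might choose between the two: it uses Ninja when the ninja executable is on the PATH and falls back to make otherwise. The GENERATOR and BUILD_TOOL variable names are assumptions made for this example.

```shell
# Pick the CMake generator: Ninja when the ninja executable is available,
# otherwise CMake's "Unix Makefiles" generator.
if command -v ninja >/dev/null 2>&1; then
    GENERATOR="Ninja"
    BUILD_TOOL="ninja"
else
    GENERATOR="Unix Makefiles"
    BUILD_TOOL="make"
fi

# Dry run: print the commands you would run from the build directory
echo "cmake -G \"$GENERATOR\" .."
echo "$BUILD_TOOL"
```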

Optional Components

By default, the C++ build system creates a fairly minimal build. We have several optional system components which you can opt into building by passing boolean flags to cmake.

  • -DARROW_COMPUTE=ON: Computational kernel functions and other support

  • -DARROW_CSV=ON: CSV reader module

  • -DARROW_CUDA=ON: CUDA integration for GPU development. Depends on NVIDIA CUDA toolkit. The CUDA toolchain used to build the library can be customized by using the $CUDA_HOME environment variable.

  • -DARROW_DATASET=ON: Dataset API, implies the Filesystem API

  • -DARROW_FILESYSTEM=ON: Filesystem API for accessing local and remote filesystems

  • -DARROW_FLIGHT=ON: Arrow Flight RPC system, which depends at least on gRPC

  • -DARROW_GANDIVA=ON: Gandiva expression compiler, depends on LLVM, Protocol Buffers, and re2

  • -DARROW_GANDIVA_JAVA=ON: Gandiva JNI bindings for Java

  • -DARROW_HDFS=ON: Arrow integration with libhdfs for accessing the Hadoop Filesystem

  • -DARROW_HIVESERVER2=ON: Client library for HiveServer2 database protocol

  • -DARROW_JSON=ON: JSON reader module

  • -DARROW_ORC=ON: Arrow integration with Apache ORC

  • -DARROW_PARQUET=ON: Apache Parquet libraries and Arrow integration

  • -DARROW_PLASMA=ON: Plasma Shared Memory Object Store

  • -DARROW_PLASMA_JAVA_CLIENT=ON: Build Java client for Plasma

  • -DARROW_PYTHON=ON: Arrow Python C++ integration library (required for building pyarrow). This library must be built against the same Python version for which you are building pyarrow. NumPy must also be installed. Enabling this option also enables ARROW_COMPUTE, ARROW_CSV, ARROW_DATASET, ARROW_FILESYSTEM, ARROW_HDFS, and ARROW_JSON.

  • -DARROW_S3=ON: Support for Amazon S3-compatible filesystems

  • -DARROW_WITH_BZ2=ON: Build support for BZ2 compression

  • -DARROW_WITH_ZLIB=ON: Build support for zlib (gzip) compression

  • -DARROW_WITH_LZ4=ON: Build support for lz4 compression

  • -DARROW_WITH_SNAPPY=ON: Build support for Snappy compression

  • -DARROW_WITH_ZSTD=ON: Build support for ZSTD compression

  • -DARROW_WITH_BROTLI=ON: Build support for Brotli compression
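
To illustrate how these flags combine, here is a minimal sketch that turns a list of component names into cmake arguments. The COMPONENTS variable and the particular selection are arbitrary examples; the flag names themselves come from the list above.

```shell
# Sketch: turn a list of component names into -DARROW_<NAME>=ON flags.
# COMPONENTS is a hypothetical variable holding names from the list above.
COMPONENTS="PARQUET CSV WITH_SNAPPY WITH_ZSTD"

CMAKE_ARGS=""
for component in $COMPONENTS; do
    CMAKE_ARGS="$CMAKE_ARGS -DARROW_${component}=ON"
done
CMAKE_ARGS="${CMAKE_ARGS# }"   # drop the leading space

# Dry run: print the configure command for an out-of-source build directory
echo "cmake $CMAKE_ARGS .."
```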

Some features of the core Arrow shared library can be switched off for improved build times if they are not required for your application:

  • -DARROW_IPC=ON: build the IPC extensions

Optional Targets

For development builds, you will often want to enable additional targets in order to exercise your changes, using the following cmake options.

  • -DARROW_BUILD_BENCHMARKS=ON: Build executable benchmarks.

  • -DARROW_BUILD_EXAMPLES=ON: Build examples of using the Arrow C++ API.

  • -DARROW_BUILD_INTEGRATION=ON: Build additional executables that are used to exercise protocol interoperability between the different Arrow implementations.

  • -DARROW_BUILD_UTILITIES=ON: Build executable utilities.

  • -DARROW_BUILD_TESTS=ON: Build executable unit tests.

  • -DARROW_ENABLE_TIMING_TESTS=ON: If building unit tests, enable those unit tests that rely on wall-clock timing (this flag is disabled on CI because it can make test results flaky).

  • -DARROW_FUZZING=ON: Build fuzz targets and related executables.

Optional Checks

The following special checks are available as well. They instrument the generated code in various ways so as to detect select classes of problems at runtime (for example when executing unit tests).

  • -DARROW_USE_ASAN=ON: Enable Address Sanitizer to check for memory leaks, buffer overflows or other kinds of memory management issues.

  • -DARROW_USE_TSAN=ON: Enable Thread Sanitizer to check for races in multi-threaded code.

  • -DARROW_USE_UBSAN=ON: Enable Undefined Behavior Sanitizer to check for situations which trigger C++ undefined behavior.

Some of those options are mutually incompatible, so you may have to build several times with different options if you want to exercise all of them.
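
One way to organize this is a separate build directory per sanitizer configuration, as sketched below. The grouping is an assumption: it treats ASAN and UBSAN as compatible with each other and gives TSAN its own build. The directory names are made up for the example.

```shell
# Sketch: one build directory per sanitizer configuration.
# Assumption: ASAN and UBSAN can share a build, TSAN needs its own.
SANITIZER_BUILDS=""
for config in "asan-ubsan:-DARROW_USE_ASAN=ON -DARROW_USE_UBSAN=ON" \
              "tsan:-DARROW_USE_TSAN=ON"; do
    name="${config%%:*}"     # text before the first colon
    flags="${config#*:}"     # text after the first colon
    SANITIZER_BUILDS="$SANITIZER_BUILDS build-$name"
    # Dry run: print the commands instead of executing them
    echo "mkdir -p build-$name && (cd build-$name && cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON $flags ..)"
done
```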

CMake version requirements

While we support CMake 3.5 and higher, some features require a newer version of CMake:

  • Building the benchmarks requires 3.6 or higher

  • Building zstd from source requires 3.7 or higher

  • Building Gandiva JNI bindings requires 3.11 or higher
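
The requirements above can be checked before configuring. The following is a sketch, not part of the Arrow build system: the feature labels (benchmarks, zstd-source, gandiva-jni) and both function names are invented for this example, and the comparison relies on sort -V being available.

```shell
# version_ge A B: succeeds when dotted version A is at least version B.
# Relies on `sort -V` (GNU coreutils and recent BSD/macOS sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum CMake version per feature, taken from the list above.
# The feature labels are hypothetical names used only in this sketch.
required_cmake_version() {
    case "$1" in
        benchmarks)  echo "3.6" ;;
        zstd-source) echo "3.7" ;;
        gandiva-jni) echo "3.11" ;;
        *)           echo "3.5" ;;
    esac
}

# Report on the installed CMake, if there is one.
if command -v cmake >/dev/null 2>&1; then
    installed="$(cmake --version | sed -n '1s/^cmake version //p')"
    for feature in core benchmarks zstd-source gandiva-jni; do
        needed="$(required_cmake_version "$feature")"
        if version_ge "$installed" "$needed"; then
            echo "$feature: ok (CMake $installed >= $needed)"
        else
            echo "$feature: needs CMake $needed or newer (found $installed)"
        fi
    done
fi
```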

LLVM and Clang Tools

We are currently using LLVM 8 for library builds and for other developer tools such as code formatting with clang-format. LLVM can be installed via most modern package managers (apt, yum, conda, Homebrew, vcpkg, chocolatey).

Build Dependency Management

The build system supports a number of third-party dependencies:

  • AWSSDK: for S3 support, requires system cURL even when the BUNDLED method described below is used

  • benchmark: Google benchmark, for testing

  • Boost: for cross-platform support

  • Brotli: for data compression

  • BZip2: for data compression

  • c-ares: a dependency of gRPC

  • gflags: for command line utilities (formerly Googleflags)

  • GLOG: for logging

  • gRPC: for remote procedure calls

  • GTest: Googletest, for testing

  • LLVM: a dependency of Gandiva

  • Lz4: for data compression

  • ORC: for Apache ORC format support

  • re2: for compute kernels and Gandiva, a dependency of gRPC

  • Protobuf: Google Protocol Buffers, for data serialization

  • RapidJSON: for data serialization

  • Snappy: for data compression

  • Thrift: Apache Thrift, for data serialization

  • utf8proc: for compute kernels

  • ZLIB: for data compression

  • zstd: for data compression

The CMake option ARROW_DEPENDENCY_SOURCE is a global option that instructs the build system how to resolve each dependency. There are a few options:

  • AUTO: Try to find package in the system default locations and build from source if not found

  • BUNDLED: Build the dependency automatically from source

  • SYSTEM: Find the dependency in system paths using CMake’s built-in find_package function, or using pkg-config for packages that do not have this feature

  • CONDA: Use $CONDA_PREFIX as alternative SYSTEM PATH

  • VCPKG: Find dependencies installed by vcpkg, and if not found, run vcpkg install to install them

  • BREW: Use Homebrew default paths as an alternative SYSTEM path

The default method is AUTO unless you are developing within an active conda environment (detected by presence of the $CONDA_PREFIX environment variable), in which case it is CONDA.
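
The default rule can be mirrored in a wrapper script, for instance to log which method a build will use. This sketch only restates the rule above as a shell function and is not how the build system itself is implemented; default_dependency_source and DEPENDENCY_SOURCE are hypothetical names.

```shell
# Sketch of the default described above: CONDA inside an active conda
# environment (CONDA_PREFIX set and non-empty), AUTO otherwise.
default_dependency_source() {
    if [ -n "${CONDA_PREFIX:-}" ]; then
        echo "CONDA"
    else
        echo "AUTO"
    fi
}

# An explicit choice wins; DEPENDENCY_SOURCE is a hypothetical variable.
DEPENDENCY_SOURCE="${DEPENDENCY_SOURCE:-$(default_dependency_source)}"

# Dry run: print the configure command
echo "cmake -DARROW_DEPENDENCY_SOURCE=$DEPENDENCY_SOURCE .."
```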

Individual Dependency Resolution

While -DARROW_DEPENDENCY_SOURCE=$SOURCE sets a global default for all packages, the resolution strategy can be overridden for individual packages by setting -D$PACKAGE_NAME_SOURCE=... For example, to build Protocol Buffers from source, set

-DProtobuf_SOURCE=BUNDLED

This variable is unfortunately case-sensitive; the name used for each package is listed above, but the most up-to-date listing can be found in cpp/cmake_modules/ThirdpartyToolchain.cmake.
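
As an illustration of mixing the global default with a per-package override, the sketch below resolves everything from system paths except Protocol Buffers. GLOBAL_SOURCE and OVERRIDES are names made up for this example.

```shell
# Global default for every dependency, plus per-package overrides.
# Package names are case-sensitive (Protobuf, not PROTOBUF).
GLOBAL_SOURCE="SYSTEM"
OVERRIDES="Protobuf=BUNDLED"   # space-separated NAME=METHOD pairs

CMAKE_ARGS="-DARROW_DEPENDENCY_SOURCE=$GLOBAL_SOURCE"
for override in $OVERRIDES; do
    package="${override%%=*}"        # text before the first '='
    source_kind="${override#*=}"     # text after the first '='
    CMAKE_ARGS="$CMAKE_ARGS -D${package}_SOURCE=$source_kind"
done

# Dry run: print the configure command
echo "cmake $CMAKE_ARGS .."
```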

Bundled Dependency Versions

When using the BUNDLED method to build a dependency from source, the version number from cpp/thirdparty/versions.txt is used. There is also a dependency source downloader script (see below), which can be used to set up offline builds.

When using BUNDLED for dependency resolution (and if you use either the jemalloc or mimalloc allocators, which are recommended), statically linking the Arrow libraries in a third party project is more complex. See below for instructions about how to configure your build system in this case.

Offline Builds

If you do not use the above variables to direct the Arrow build system to preinstalled dependencies, they will be built automatically by the Arrow build system. The source archive for each dependency will be downloaded via the internet, which can cause issues in environments with limited access to the internet.

To enable offline builds, you can download the source artifacts yourself and use environment variables of the form ARROW_$LIBRARY_URL to direct the build system to read from a local file rather than accessing the internet.

To make this easier for you, we have prepared a script thirdparty/download_dependencies.sh which will download the correct version of each dependency to a directory of your choosing. It will print a list of bash-style environment variable statements at the end to use for your build script.

# Download tarballs into $HOME/arrow-thirdparty
$ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty

You can then invoke CMake to create the build directory, and it will use the declared environment variables pointing to the downloaded archives instead of downloading them again (which would otherwise happen once for each build directory).
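
A possible way to carry the printed statements into a build script is sketched here. It assumes each statement appears on its own line in the form export ARROW_..._URL=...; the load_offline_env function and the arrow-offline-env.sh file name are hypothetical.

```shell
# Sketch: keep only the `export ARROW_..._URL=...` lines printed by the
# download script and load them into the current shell.
# Assumption: each statement is on its own line and starts with `export`.
load_offline_env() {
    env_file="$1"
    filtered="$(mktemp)"
    # Ignore any other output, such as comments or progress messages
    grep '^export ARROW_[A-Za-z0-9_]*_URL=' "$env_file" > "$filtered" || true
    . "$filtered"
    rm -f "$filtered"
}

# Intended use (commented out because it downloads files):
#   ./thirdparty/download_dependencies.sh "$HOME/arrow-thirdparty" > arrow-offline-env.sh
#   load_offline_env arrow-offline-env.sh
#   cmake ..
```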

Statically Linking

When -DARROW_BUILD_STATIC=ON, all build dependencies built as static libraries by the Arrow build system will be merged together to create a static library arrow_bundled_dependencies. In UNIX-like environments (Linux, macOS, MinGW), this is called libarrow_bundled_dependencies.a and on Windows with Visual Studio arrow_bundled_dependencies.lib. This “dependency bundle” library is installed in the same place as the other Arrow static libraries.

If you are using CMake, the bundled dependencies will automatically be included when linking if you use the arrow_static CMake target. In other build systems, you may need to explicitly link to the dependency bundle. We created an example CMake-based build configuration to show you a working example.

On Linux and macOS, if your application does not link to the pthread library already, you must include -pthread in your linker setup. In CMake this can be accomplished with the Threads built-in package:

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(my_target PRIVATE Threads::Threads)
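
Putting the pieces of this section together, a minimal CMakeLists.txt for a third-party project might look like the one written by the sketch below. This is not the example configuration mentioned above; my_example and main.cc are made-up names, and it assumes an installed Arrow whose CMake package files can be found by find_package.

```shell
# Sketch: write a minimal CMakeLists.txt for a hypothetical project that
# links statically against Arrow. my_example and main.cc are made-up names.
PROJECT_DIR="$(mktemp -d)"
cat > "$PROJECT_DIR/CMakeLists.txt" <<'EOF'
cmake_minimum_required(VERSION 3.5)
project(my_example CXX)

# Locate an installed Arrow (assumes its CMake package files are findable)
find_package(Arrow REQUIRED)

add_executable(my_example main.cc)

# Link the static Arrow target, plus Threads for -pthread where needed
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(my_example PRIVATE arrow_static Threads::Threads)
EOF
echo "Wrote $PROJECT_DIR/CMakeLists.txt"
```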

Extra debugging help

If you use the CMake option -DARROW_EXTRA_ERROR_CONTEXT=ON it will compile the libraries with extra debugging information on error checks inside the RETURN_NOT_OK macro. In unit tests with ASSERT_OK, this will yield error outputs like:

../src/arrow/ipc/ipc-read-write-test.cc:609: Failure
Failed
../src/arrow/ipc/metadata-internal.cc:508 code: TypeToFlatbuffer(fbb, *field.type(), &children, &layout, &type_enum, dictionary_memo, &type_offset)
../src/arrow/ipc/metadata-internal.cc:598 code: FieldToFlatbuffer(fbb, *schema.field(i), dictionary_memo, &offset)
../src/arrow/ipc/metadata-internal.cc:651 code: SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)
../src/arrow/ipc/writer.cc:697 code: WriteSchemaMessage(schema_, dictionary_memo_, &schema_fb)
../src/arrow/ipc/writer.cc:730 code: WriteSchema()
../src/arrow/ipc/writer.cc:755 code: schema_writer.Write(&dictionaries_)
../src/arrow/ipc/writer.cc:778 code: CheckStarted()
../src/arrow/ipc/ipc-read-write-test.cc:574 code: writer->WriteRecordBatch(batch)
NotImplemented: Unable to convert type: decimal(19, 4)

Deprecations and API Changes

We use the compiler definition ARROW_NO_DEPRECATED_API to disable APIs that have been deprecated. It is a good practice to compile third party applications with this flag to proactively catch and account for API changes.
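
For a third-party project built from the shell, one way to pass the definition is through the compiler flags, as in this sketch. It assumes a GCC- or Clang-style compiler, where -D introduces a preprocessor definition, and a build system that reads CXXFLAGS from the environment.

```shell
# Add the definition to the compiler flags of a third-party project.
# -D<NAME> is the GCC/Clang spelling for a preprocessor definition.
CXXFLAGS="${CXXFLAGS:-} -DARROW_NO_DEPRECATED_API"
export CXXFLAGS
echo "CXXFLAGS=$CXXFLAGS"
```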

Modular Build Targets

Since there are several major parts of the C++ project, we have provided modular CMake targets for building each library component, group of unit tests and benchmarks, and their dependencies:

  • make arrow for Arrow core libraries

  • make parquet for Parquet libraries

  • make gandiva for Gandiva (LLVM expression compiler) libraries

  • make plasma for Plasma libraries and server

Note

If you have selected Ninja as CMake generator, replace make arrow with ninja arrow, and so on.

To build the unit tests or benchmarks, add -tests or -benchmarks to the target name. So make arrow-tests will build the Arrow core unit tests. Using the -all target, e.g. parquet-all, will build everything.
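
The naming scheme can be captured in a small helper, shown here as a sketch. The arrow_target function is a hypothetical name; the component names and suffixes are the ones described above.

```shell
# arrow_target COMPONENT [KIND]: print the modular target name.
# KIND may be empty, or one of: tests, benchmarks, all.
arrow_target() {
    component="$1"
    kind="${2:-}"
    if [ -n "$kind" ]; then
        echo "$component-$kind"
    else
        echo "$component"
    fi
}

BUILD_TOOL="make"   # use ninja here if Ninja is the CMake generator

# Dry run: print the build commands for a few targets
echo "$BUILD_TOOL $(arrow_target arrow)"
echo "$BUILD_TOOL $(arrow_target arrow tests)"
echo "$BUILD_TOOL $(arrow_target parquet all)"
```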

If you wish to only build and install one or more project subcomponents, we have provided the CMake option ARROW_OPTIONAL_INSTALL to only install targets that have been built. For example, if you only wish to build the Parquet libraries, its tests, and its dependencies, you can run:

cmake .. -DARROW_PARQUET=ON \
      -DARROW_OPTIONAL_INSTALL=ON \
      -DARROW_BUILD_TESTS=ON
make parquet
make install

If you omit an explicit target when invoking make, all targets will be built.

Debugging with Xcode on macOS

Xcode is the IDE provided with macOS and can be used to develop and debug Arrow by generating an Xcode project:

cd cpp
mkdir xcode-build
cd xcode-build
cmake .. -G Xcode -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=DEBUG
open arrow.xcodeproj

This will generate a project and open it in the Xcode app. As an alternative, the command xcodebuild will perform a command-line build using the generated project. It is recommended to use the “Automatically Create Schemes” option when first launching the project. Selecting an auto-generated scheme will allow you to build and run a unit test with breakpoints enabled.