Development Guidelines#
This section provides information for developers who wish to contribute to the C++ codebase.
Note
Since most of the project’s developers work on Linux or macOS, not all features or developer tools are uniformly supported on Windows. If you are on Windows, have a look at Developing on Windows.
Compiler warning levels#
The BUILD_WARNING_LEVEL CMake option switches between sets of predetermined compiler warning levels that we use for code tidiness. For release builds, the default warning level is PRODUCTION, while for debug builds the default is CHECKIN.
When using CHECKIN for debug builds, -Werror is added when using gcc and clang, causing build failures for any warning; with MSVC, /WX is set, which has the same effect.
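For example, the warning level can be selected explicitly when configuring a build (a minimal sketch; use whatever generator and build directory layout you normally do):
$ cmake .. -DCMAKE_BUILD_TYPE=Debug -DBUILD_WARNING_LEVEL=CHECKIN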
Running unit tests#
The -DARROW_BUILD_TESTS=ON CMake option enables building of unit test executables. You can then either run them individually, by launching the desired executable, or run them all at once by launching the ctest executable (which is part of the CMake suite).
A possible invocation is something like:
$ ctest -j16 --output-on-failure
where the -j16 option runs up to 16 tests in parallel, taking advantage of multiple CPU cores and hardware threads.
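You can also run a single test executable directly, or let ctest select tests by name with a regular expression (the executable name below is only illustrative):
$ ./build/release/arrow-array-test
$ ctest -R array --output-on-failure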
Running benchmarks#
The -DARROW_BUILD_BENCHMARKS=ON CMake option enables building of benchmark executables. You can then run benchmarks individually by launching the corresponding executable from the command line, e.g.:
$ ./build/release/arrow-builder-benchmark
Note
For meaningful benchmark numbers, it is very strongly recommended to build in Release mode, so as to enable compiler optimizations.
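The benchmark executables are built with Google Benchmark, so its standard command-line options apply; for example, --benchmark_filter restricts a run to benchmarks whose names match a regular expression (the pattern below is only illustrative):
$ ./build/release/arrow-builder-benchmark --benchmark_filter=Dense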
Code Style, Linting, and CI#
This project follows Google’s C++ Style Guide with these exceptions:
We relax the line length restriction to 90 characters.
We use the NULLPTR macro (defined in src/arrow/util/macros.h) in header files instead of nullptr, to support building C++/CLI (ARROW-1134).
We relax the guide’s rules regarding structs. For public headers we should use struct only for objects that are principally simple data containers where it is OK to expose all the internal members and any methods are primarily conveniences. For private headers the rules are relaxed further and structs can be used where convenient for types that do not need access control even though they may not be simple data containers.
We prefer pointers for output and input/output parameters (the style guide recommends mutable references in some cases).
Our continuous integration builds on GitHub Actions run the unit test suites on a variety of platforms and configurations, including using Address Sanitizer and Undefined Behavior Sanitizer to check for various patterns of misbehaviour such as memory leaks. In addition, the codebase is subjected to a number of code style and code cleanliness checks.
In order to have a passing CI build, your modified Git branch must pass the following checks:
C++ builds with the project’s active version of clang without compiler warnings with -DBUILD_WARNING_LEVEL=CHECKIN. Note that there are classes of warnings (such as -Wdocumentation, see more on this below) that are not caught by gcc.
Passes various C++ (and other) style checks, verified with the lint subcommand to Archery. These can also be fixed locally by running archery lint --cpplint --clang-format --clang-tidy --fix.
CMake files pass style checks; these can be fixed by running archery lint --cmake-format --fix. This requires Python 3 and cmake_format (note: this currently does not work on Windows).
On pull requests, the “Dev / Lint” pipeline will run these checks, and report what files/lines need to be fixed, if any.
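For convenience, assuming Archery is already installed locally, the lint invocations mentioned in the checks above can be run back to back from the repository root:
$ archery lint --cpplint --clang-format --clang-tidy --fix
$ archery lint --cmake-format --fix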
In order to account for variations in the behavior of clang-format between major versions of LLVM, we pin the version of clang-format used. You can confirm the current pinned version by finding the CLANG_TOOLS variable value in .env. Note that the version must match exactly; a newer version (even a patch release) will not work. LLVM can be installed through a system package manager or a package manager like Conda or Homebrew, though note they may not offer the exact version needed. Alternatively, binaries can be downloaded directly from the LLVM website.
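As a quick check, you can read the pinned version out of that file from the repository root (a trivial sketch; it simply greps the .env file mentioned above):
$ grep CLANG_TOOLS .env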
For convenience, C++ style checks can also be run via the build system, in addition to Archery. To do so, build one or more of the targets format (for clang-format), lint_cpp_cli, lint (for cpplint), or clang-tidy. For example:
$ cmake -GNinja ../cpp ...
$ ninja format lint clang-tidy lint_cpp_cli
Depending on how you installed clang-format, the build system may not be able to find it. In that case, invoking CMake will show errors like the following:
-- clang-format 12 not found
Or if the wrong version is installed:
-- clang-format found, but version did not match "^clang-format version 12"
You can provide an explicit path to the directory containing the clang-format executable and the other tools with the environment variable $CLANG_TOOLS_PATH, or by passing -DClangTools_PATH=$PATH_TO_CLANG_TOOLS when invoking CMake. For example:
# We unpacked LLVM here:
$ ~/tools/bin/clang-format --version
clang-format version 12.0.0
# Pass the directory containing the tools to CMake
$ cmake ../cpp -DClangTools_PATH=~/tools/bin/
...snip...
-- clang-tidy found at /home/user/tools/bin/clang-tidy
-- clang-format found at /home/user/tools/bin/clang-format
...snip...
To make linting more reproducible for everyone, we provide a docker compose
target that is executable from the root of the repository:
$ docker compose run ubuntu-lint
Alternatively, on an open pull request, the comment bot can format C++ code for you (it will push a commit to the branch that can then be pulled). Just comment the following:
@github-actions autotune
Cleaning includes with include-what-you-use (IWYU)#
We occasionally use Google’s include-what-you-use tool, also known as IWYU, to remove unnecessary #include directives.
To begin using IWYU, you must first build it by following the instructions in the project’s documentation. Once the include-what-you-use executable is in your $PATH, you must run CMake with -DCMAKE_EXPORT_COMPILE_COMMANDS=ON in a new out-of-source CMake build directory like so:
mkdir -p $ARROW_ROOT/cpp/iwyu
cd $ARROW_ROOT/cpp/iwyu
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DARROW_BUILD_BENCHMARKS=ON \
-DARROW_BUILD_BENCHMARKS_REFERENCE=ON \
-DARROW_BUILD_TESTS=ON \
-DARROW_BUILD_UTILITIES=ON \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_FLIGHT=ON \
-DARROW_GANDIVA=ON \
-DARROW_HDFS=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_S3=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
..
In order for IWYU to run on the desired component in the codebase, it must be enabled by the CMake configuration flags. Once this is done, you can run IWYU on the whole codebase by running the helper iwyu.sh script:
IWYU_SH=$ARROW_ROOT/cpp/build-support/iwyu/iwyu.sh
./$IWYU_SH
Since this is very time consuming, you can check a subset of files matching some string pattern with the special “match” option:
./$IWYU_SH match $PATTERN
For example, if you wanted to do IWYU checks on all files in src/arrow/array, you could run:
./$IWYU_SH match arrow/array
Checking for ABI and API stability#
To build ABI compliance reports, you need to install the two tools abi-dumper and abi-compliance-checker.
Build Arrow C++ in Debug mode; alternatively, you can build with -Og, which also produces the necessary debug symbols but enables a bit of code optimization.
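For example, a plain Debug configuration is sufficient; if you want the -Og variant instead, one possible sketch (assuming a gcc or clang toolchain) is to pass the flag via CMAKE_CXX_FLAGS:
$ cmake .. -DCMAKE_BUILD_TYPE=Debug
# or, keeping light optimization:
$ cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS=-Og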
Once the build has finished, you can generate ABI reports using:
abi-dumper -lver 9 debug/libarrow.so -o ABI-9.dump
The above version number is freely selectable. As we want to compare versions, you should now git checkout the version you want to compare it to and re-run the above command using a different version number. Once both reports are generated, you can build a comparison report using:
abi-compliance-checker -l libarrow -d1 ABI-PY-9.dump -d2 ABI-PY-10.dump
The report is then generated in compat_reports/libarrow as HTML.
API Documentation#
We use Doxygen style comments (///) in header files for comments that we wish to show up in API documentation for classes and functions.
When using clang and building with -DBUILD_WARNING_LEVEL=CHECKIN, the -Wdocumentation flag is used, which checks for some common documentation inconsistencies, like documenting some, but not all, function parameters with \param. See the LLVM documentation warnings section for more about this.
While we publish the API documentation as part of the main Sphinx-based documentation site, you can also build the C++ API documentation anytime using Doxygen. Run the following command from the cpp/apidoc directory:
doxygen Doxyfile
This requires Doxygen to be installed.
Apache Parquet Development#
To build the C++ libraries for Apache Parquet, add the flag -DARROW_PARQUET=ON when invoking CMake.
To build Apache Parquet with encryption support, add the flag -DPARQUET_REQUIRE_ENCRYPTION=ON when invoking CMake. The Parquet libraries and unit tests can be built with the parquet make target:
make parquet
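A full configure-and-build sequence might look like the following (a sketch only; combine these flags with whatever other options your build needs):
$ cmake .. -DARROW_PARQUET=ON -DPARQUET_REQUIRE_ENCRYPTION=ON -DARROW_BUILD_TESTS=ON
$ make parquet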
On Linux and macOS, if you do not have Apache Thrift installed on your system, or you are building with -DThrift_SOURCE=BUNDLED, you must install the bison and flex packages. On Windows we handle these build dependencies automatically when building Thrift from source.
Running ctest -L unittest will run all built C++ unit tests, while ctest -L parquet will run only the Parquet unit tests. The unit tests depend on an environment variable PARQUET_TEST_DATA that points to data provided by a git submodule of the repository apache/parquet-testing:
git submodule update --init
export PARQUET_TEST_DATA=$ARROW_ROOT/cpp/submodules/parquet-testing/data
Here $ARROW_ROOT is the absolute path to the Arrow codebase.
Arrow Flight RPC#
In addition to the Arrow dependencies, Flight requires:
gRPC (>= 1.14, roughly)
Protobuf (>= 3.6, earlier versions may work)
c-ares (used by gRPC)
By default, Arrow will try to download and build these dependencies when building Flight.
The optional flight libraries and tests can be built by passing -DARROW_FLIGHT=ON:
cmake .. -DARROW_FLIGHT=ON -DARROW_BUILD_TESTS=ON
make
You can also use existing installations of the extra dependencies. When building, set the environment variables gRPC_ROOT and/or Protobuf_ROOT and/or c-ares_ROOT.
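For example (the install prefixes below are only placeholders for wherever your dependencies live):
$ export gRPC_ROOT=/opt/grpc
$ export Protobuf_ROOT=/opt/protobuf
$ cmake .. -DARROW_FLIGHT=ON -DARROW_BUILD_TESTS=ON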
We are developing against recent versions of gRPC. The grpc-cpp package available from https://conda-forge.org/ is one reliable way to obtain gRPC in a cross-platform way. You may try using system libraries for gRPC and Protobuf, but these are likely to be too old. On macOS, you can try Homebrew:
brew install grpc