This page provides general Python development guidelines and source build instructions for all platforms.
We follow a PEP8-like coding style similar to that of the pandas project.
The code must pass
flake8 (available from pip or conda) or it will fail the
build. Check for style errors before submitting your pull request with:
flake8 .
flake8 --config=.flake8.cython .
autopep8 (also available from pip or conda) can automatically
fix many of the errors reported by flake8:
autopep8 --in-place ../integration/integration_test.py
autopep8 --in-place --global-config=.flake8.cython pyarrow/table.pxi
We are using pytest to develop our unit test suite. After building the project (see below) you can run its unit tests like so:
Package requirements to run the unit tests are found in
requirements-test.txt and can be installed if needed with
pip install -r requirements-test.txt
The project has a number of custom command line options for its test suite. Some tests are disabled by default, for example. To see all the options, run
pytest pyarrow --help
and look for the “custom options” section.
We have many tests that are grouped together using pytest marks. Some of these
are disabled by default. To enable a test group, pass its flag, for example
--parquet. To disable a test group, prepend disable- to the group name,
--disable-parquet for example. To run only the unit tests for a
particular group, prepend
only- instead, for example --only-parquet.
The test groups currently include:
gandiva: tests for Gandiva expression compiler (uses LLVM)
hdfs: tests that use libhdfs or libhdfs3 to access the Hadoop filesystem
hypothesis: tests that use the
hypothesis module for generating random test cases. Note that
--hypothesis doesn’t work due to a quirk with pytest, so you have to pass
large_memory: Test requiring a large amount of system RAM
orc: Apache ORC tests
parquet: Apache Parquet tests
plasma: Plasma Object Store tests
s3: Tests for Amazon S3
tensorflow: Tests that involve TensorFlow
flight: Flight RPC tests
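The flag convention described above (a plain flag enables a group, a disable- prefix turns it off, an only- prefix narrows the run) can be illustrated with a small sketch. This is not pyarrow's actual test configuration: the function name resolve_groups and the way defaults are passed in are assumptions made for illustration, and the hypothesis group is left out because its flag behaves differently.

```python
# Hypothetical sketch of the flag convention; not pyarrow's real conftest logic.
# The hypothesis group is omitted because its flag is a special case.
GROUPS = ["gandiva", "hdfs", "large_memory", "orc", "parquet",
          "plasma", "s3", "tensorflow", "flight"]


def resolve_groups(argv, default_enabled):
    """Return the set of test groups selected by the given flags."""
    enabled = set(default_enabled)
    only = set()
    for arg in argv:
        if not arg.startswith("--"):
            continue
        name = arg[2:]
        if name.startswith("disable-") and name[len("disable-"):] in GROUPS:
            enabled.discard(name[len("disable-"):])
        elif name.startswith("only-") and name[len("only-"):] in GROUPS:
            only.add(name[len("only-"):])
        elif name in GROUPS:
            enabled.add(name)
    # Any --only-<group> flag narrows the selection to just those groups.
    return only if only else enabled
```

The sketch only models which groups are selected; how unmarked tests are treated under an only- flag is outside its scope.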
Building on Linux and macOS¶
On macOS, any modern Xcode (6.4 or higher; the current version is 10) is sufficient.
On Linux, for this guide, we require a minimum of gcc 4.8, or clang 3.7 or higher. You can check your version by running
$ gcc --version
If the system compiler is older than gcc 4.8, it can be set to a newer version
using the $CC and $CXX environment variables:
export CC=gcc-4.8
export CXX=g++-4.8
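If you would rather script the version check than read the output by eye, something along these lines may help. It is a minimal sketch: the helper names are hypothetical, and it assumes the first line of the gcc --version output carries the version number outside any parenthesized vendor string, which is common but not guaranteed for every distribution.

```python
import re

MIN_GCC = (4, 8)  # minimum gcc version required by this guide


def parse_gcc_version(version_output):
    """Extract (major, minor) from the first line of `gcc --version` output."""
    lines = version_output.splitlines()
    if not lines:
        return None
    # Drop parenthesized vendor strings such as "(Ubuntu 7.5.0-3ubuntu1)".
    cleaned = re.sub(r"\([^)]*\)", "", lines[0])
    match = re.search(r"(\d+)\.(\d+)", cleaned)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=MIN_GCC):
    """Return True if the parsed version is at least the required minimum."""
    return version is not None and version >= minimum
```

The text to parse could be obtained with subprocess.check_output(["gcc", "--version"], text=True).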
Environment Setup and Build¶
First, let’s clone the Arrow git repository:
mkdir repos
cd repos
git clone https://github.com/apache/arrow.git
You should now see
$ ls -l
total 8
drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/
Using conda to build Arrow on macOS is complicated by the
fact that the conda-forge compilers require an older macOS SDK.
Conda offers some installation instructions;
the alternative would be to use Homebrew and pip instead.
Let’s create a conda environment with all the C++ build and Python dependencies from conda-forge, targeting development for Python 3.7:
On Linux and macOS:
conda create -y -n pyarrow-dev -c conda-forge \
    --file arrow/ci/conda_env_unix.yml \
    --file arrow/ci/conda_env_cpp.yml \
    --file arrow/ci/conda_env_python.yml \
    compilers \
    python=3.7
As of January 2019, the
compilers package is needed on many Linux
distributions to use packages from conda-forge.
With this out of the way, you can now activate the conda environment
conda activate pyarrow-dev
For Windows, see the Building on Windows section below.
We need to set some environment variables to let Arrow’s build system know about our build toolchain:
If you installed Python using the Anaconda distribution or Miniconda, you cannot currently use
virtualenv to manage your development. Please follow the conda-based development
instructions instead.
On macOS, use Homebrew to install all dependencies required for building Arrow C++:
brew update && brew bundle --file=arrow/cpp/Brewfile
See here for a list of dependencies you may need.
On Debian/Ubuntu, you need the following minimal set of dependencies. All other dependencies will be automatically built by Arrow’s third-party toolchain.
$ sudo apt-get install libjemalloc-dev libboost-dev \
                       libboost-filesystem-dev \
                       libboost-system-dev \
                       libboost-regex-dev \
                       python-dev \
                       autoconf \
                       flex \
                       bison
If you are building Arrow for Python 3, install
python3-dev instead of python-dev.
On Arch Linux, you can get these dependencies via pacman.
$ sudo pacman -S jemalloc boost
Now, let’s create a Python virtualenv with all Python dependencies in the same folder as the repositories and a target installation folder:
virtualenv pyarrow
source ./pyarrow/bin/activate
pip install six numpy pandas cython pytest hypothesis
# This is the folder where we will install the Arrow libraries during
# development
mkdir dist
If your cmake version is too old on Linux, you could get a newer one via
pip install cmake.
We need to set some environment variables to let Arrow’s build system know about our build toolchain:
export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
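When the build is driven from a Python script rather than an interactive shell, the same two variables can be prepared as a dictionary and passed to subprocess calls. The following is a minimal sketch; the function name arrow_dev_env is hypothetical, and unlike the shell lines above it avoids a trailing colon when LD_LIBRARY_PATH was previously unset.

```python
import posixpath


def arrow_dev_env(workdir, environ):
    """Return a copy of environ with ARROW_HOME and LD_LIBRARY_PATH set.

    workdir is the folder that contains the dist/ installation directory.
    """
    env = dict(environ)
    dist = posixpath.join(workdir, "dist")
    env["ARROW_HOME"] = dist
    lib_dir = posixpath.join(dist, "lib")
    existing = environ.get("LD_LIBRARY_PATH", "")
    # Prepend the Arrow library directory so it takes precedence.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return env
```

A typical call would be arrow_dev_env(os.getcwd(), os.environ), with the result passed as the env argument of subprocess.run.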
Build and test¶
Now build and install the Arrow C++ libraries:
mkdir arrow/cpp/build
pushd arrow/cpp/build
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_FLIGHT=ON \
      -DARROW_GANDIVA=ON \
      -DARROW_ORC=ON \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_PLASMA=ON \
      -DARROW_BUILD_TESTS=ON \
      ..
make -j4
make install
popd
Many of these components are optional, and can be switched off by setting them to OFF:
ARROW_FLIGHT: RPC framework
ARROW_GANDIVA: LLVM-based expression compiler
ARROW_ORC: Support for Apache ORC file format
ARROW_PARQUET: Support for Apache Parquet file format
ARROW_PLASMA: Shared memory object store
If multiple versions of Python are installed in your environment, you may have
to pass additional parameters to cmake so that it can find the right
executable, headers and libraries. For example, specifying
-DPYTHON_EXECUTABLE=$VIRTUAL_ENV/bin/python (assuming that you’re in
virtualenv) enables cmake to choose the python executable which you are using.
On Linux systems with support for building on multiple architectures,
make may install libraries in the
lib64 directory by default. For
this reason we recommend passing
-DCMAKE_INSTALL_LIBDIR=lib because the
Python build scripts assume the library directory is lib.
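To confirm where an existing installation actually placed its libraries, a quick check like the one below can be run against the install prefix. This is only a sketch; installed_libdir is a hypothetical helper, and it simply looks for the two directory names mentioned above.

```python
from pathlib import Path


def installed_libdir(arrow_home):
    """Return 'lib', 'lib64', or None depending on which directory exists."""
    home = Path(arrow_home)
    # Prefer lib, since that is what the Python build scripts expect.
    for name in ("lib", "lib64"):
        if (home / name).is_dir():
            return name
    return None
```

If this reports lib64 for your $ARROW_HOME, re-running cmake with -DCMAKE_INSTALL_LIBDIR=lib and installing again should bring the layout in line with what the Python build expects.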
If you have conda installed but are not using it to manage dependencies,
and you have trouble building the C++ library, you may need to set
-DARROW_DEPENDENCY_SOURCE=AUTO or some other value
to explicitly tell CMake not to use conda.
For any other C++ build challenges, see C++ Development.
Now, build pyarrow:
pushd arrow/python
export PYARROW_WITH_FLIGHT=1
export PYARROW_WITH_GANDIVA=1
export PYARROW_WITH_ORC=1
export PYARROW_WITH_PARQUET=1
python setup.py build_ext --inplace
popd
If you did not build one of the optional components, set the corresponding
PYARROW_WITH_$COMPONENT environment variable to 0.
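Because the C++ flags and the pyarrow environment variables have to agree, it can be convenient to derive both from a single list of enabled components. The sketch below assumes a hypothetical helper named build_settings and covers only the four components for which this guide shows both an ARROW_ flag and a PYARROW_WITH_ variable.

```python
# Components that appear with both a cmake flag and a PYARROW_WITH_ variable
# in the commands above.
COMPONENTS = ("FLIGHT", "GANDIVA", "ORC", "PARQUET")


def build_settings(enabled):
    """Return (cmake_flags, env) that agree on which components are built."""
    enabled = {name.upper() for name in enabled}
    unknown = enabled - set(COMPONENTS)
    if unknown:
        raise ValueError("unknown components: %s" % sorted(unknown))
    cmake_flags = [
        "-DARROW_%s=%s" % (name, "ON" if name in enabled else "OFF")
        for name in COMPONENTS
    ]
    env = {
        "PYARROW_WITH_%s" % name: "1" if name in enabled else "0"
        for name in COMPONENTS
    }
    return cmake_flags, env
```

For example, build_settings(["parquet"]) yields cmake flags that switch everything but Parquet off, together with matching PYARROW_WITH_ values.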
Now you are ready to install test dependencies and run Unit Testing, as described above.
To build a self-contained wheel (including the Arrow and Parquet C++
libraries), one can set --bundle-arrow-cpp:
pip install wheel  # if not installed
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
       --bundle-arrow-cpp bdist_wheel
Building with CUDA support¶
The pyarrow.cuda module offers support for using Arrow platform
components with Nvidia’s CUDA-enabled GPU devices. To build with this support,
pass -DARROW_CUDA=ON when building the C++ libraries, and set the following
environment variable when building pyarrow:
Since pyarrow depends on the Arrow C++ libraries, debugging can frequently involve crossing between Python and C++ shared libraries.
Using gdb on Linux¶
To debug the C++ libraries with gdb while running the Python unit
test, first start pytest with gdb:
gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH
To set a breakpoint, use the same gdb syntax that you would when debugging a C++ unit test, for example:
(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.
Building on Windows¶
Building on Windows requires one of the following compilers to be installed:
During the setup of Build Tools ensure at least one Windows SDK is selected.
Visual Studio 2019 and its build tools are currently not supported.
We bootstrap a conda environment similar to above, but skipping some of the Linux/macOS-only packages:
First, starting from fresh clones of Apache Arrow:
git clone https://github.com/apache/arrow.git
conda create -y -n pyarrow-dev -c conda-forge ^
    --file arrow\ci\conda_env_cpp.yml ^
    --file arrow\ci\conda_env_python.yml ^
    --file arrow\ci\conda_env_gandiva.yml ^
    python=3.7
conda activate pyarrow-dev
Now, we build and install Arrow C++ libraries.
We set a number of environment variables:
the path of the installation directory of the Arrow C++ libraries as ARROW_HOME
add the path of installed DLL libraries to PATH
and choose the compiler to be used
set ARROW_HOME=%cd%\arrow-dist
set PATH=%ARROW_HOME%\bin;%PATH%
set PYARROW_CMAKE_GENERATOR=Visual Studio 15 2017 Win64
This assumes Visual Studio 2017 or its build tools are used. For Visual Studio 2015 and its build tools use the following instead:
set PYARROW_CMAKE_GENERATOR=Visual Studio 14 2015 Win64
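If the generator is chosen from a script, the two supported values can be kept in a small lookup table. The following is a sketch under the assumption that only the two Visual Studio versions named in this section are of interest; cmake_generator is a hypothetical helper name.

```python
# Generator strings taken from the commands in this section.
GENERATORS = {
    2015: "Visual Studio 14 2015 Win64",
    2017: "Visual Studio 15 2017 Win64",
}


def cmake_generator(vs_year):
    """Return the value for PYARROW_CMAKE_GENERATOR for a Visual Studio year."""
    try:
        return GENERATORS[vs_year]
    except KeyError:
        # Any other version (including 2019) is not covered by this guide.
        raise ValueError(
            "Visual Studio %s is not covered by this guide" % vs_year
        ) from None
```

The returned string is what would be assigned to PYARROW_CMAKE_GENERATOR before configuring the build.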
Let’s configure, build and install the Arrow C++ libraries:
mkdir arrow\cpp\build
pushd arrow\cpp\build
cmake -G "%PYARROW_CMAKE_GENERATOR%" ^
      -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
      -DARROW_CXXFLAGS="/WX /MP" ^
      -DARROW_GANDIVA=on ^
      -DARROW_PARQUET=on ^
      -DARROW_PYTHON=on ^
      ..
cmake --build . --target INSTALL --config Release
popd
Now, we can build pyarrow:
pushd arrow\python
set PYARROW_WITH_GANDIVA=1
set PYARROW_WITH_PARQUET=1
python setup.py build_ext --inplace
popd
For building pyarrow, the environment variables defined above also need to
be set. Remember this if you want to re-build
pyarrow after your initial build.
Then run the unit tests with:
pushd arrow\python
py.test pyarrow -v
popd
With the above instructions the Arrow C++ libraries are not bundled with the Python extension. This is recommended for development as it allows the C++ libraries to be re-built separately.
As a consequence, however,
python setup.py install will also not install
the Arrow C++ libraries. Therefore, to use
pyarrow in python, PATH
must contain the directory with the Arrow .dll-files.
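A quick way to verify this requirement before importing pyarrow is to compare the DLL directory against the entries of PATH. The sketch below uses a hypothetical helper, dll_dir_on_path, and relies on the standard ntpath module so that Windows-style paths are compared case-insensitively and without regard to trailing backslashes.

```python
import ntpath


def dll_dir_on_path(dll_dir, path_value):
    """Return True if dll_dir is one of the entries in a Windows PATH string."""
    wanted = ntpath.normcase(ntpath.normpath(dll_dir))
    # PATH entries on Windows are separated by semicolons.
    entries = [entry for entry in path_value.split(";") if entry]
    return any(
        ntpath.normcase(ntpath.normpath(entry)) == wanted for entry in entries
    )
```

With the variables set earlier in this section, the check would be dll_dir_on_path(os.path.join(os.environ["ARROW_HOME"], "bin"), os.environ["PATH"]).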
If you want to bundle the Arrow C++ libraries with pyarrow, add
--bundle-arrow-cpp as build parameter:
python setup.py build_ext --bundle-arrow-cpp
Important: If you combine --bundle-arrow-cpp with --inplace, the
Arrow C++ libraries get copied to the python source tree and are not cleared
by python setup.py clean. They remain in place and will take precedence
over any later Arrow C++ libraries contained in
PATH. This can lead to incompatibilities when
pyarrow is later built without --bundle-arrow-cpp.
Running C++ unit tests for Python integration¶
Running C++ unit tests should not be necessary for most developers. If you do
want to run them, you need to pass -DARROW_BUILD_TESTS=ON during
configuration of the Arrow C++ library build:
mkdir arrow\cpp\build
pushd arrow\cpp\build
cmake -G "%PYARROW_CMAKE_GENERATOR%" ^
      -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
      -DARROW_CXXFLAGS="/WX /MP" ^
      -DARROW_GANDIVA=on ^
      -DARROW_PARQUET=on ^
      -DARROW_PYTHON=on ^
      -DARROW_BUILD_TESTS=ON ^
      ..
cmake --build . --target INSTALL --config Release
popd
Getting arrow-python-test.exe (C++ unit tests for python integration) to
run is a bit tricky because your
%PYTHONHOME% must be configured to point
to the active conda environment:
set PYTHONHOME=%CONDA_PREFIX%
pushd arrow\cpp\build\release\Release
arrow-python-test.exe
popd
To run all tests of the Arrow C++ library, you can also run
set PYTHONHOME=%CONDA_PREFIX%
pushd arrow\cpp\build
ctest
popd
Some components are not supported yet on Windows: