Python Development#

This page provides general Python development guidelines and source build instructions for all platforms.

Coding Style#

We follow a PEP8-like coding style similar to that of the pandas project. To check for style issues, use the Archery subcommand lint:

$ pip install -e "arrow/dev/archery[lint]"

$ archery lint --python

Some of the issues can be automatically fixed by passing the --fix option:

$ archery lint --python --fix

Unit Testing#

We are using pytest to develop our unit test suite. After building the project (see below) you can run its unit tests like so:

$ python -m pytest arrow/python/pyarrow

Package requirements to run the unit tests are found in requirements-test.txt and can be installed if needed with pip install -r requirements-test.txt.

If you get import errors for pyarrow._lib or another PyArrow module when trying to run the tests, run python -m pytest arrow/python/pyarrow and check if the editable version of pyarrow was installed correctly.
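
One way to see which installation Python would pick up is to ask the import system where it finds the package. The sketch below is only an illustration: module_origin is a hypothetical helper, not part of PyArrow, and it relies solely on the standard library's importlib.util.find_spec. With an editable install, the reported path would typically point into your arrow/python checkout rather than into site-packages.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    origin = module_origin("pyarrow")
    if origin is None:
        print("pyarrow is not importable from this environment")
    else:
        print(f"pyarrow would be imported from: {origin}")
```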

The project has a number of custom command line options for its test suite. Some tests are disabled by default, for example. To see all the options, run

$ python -m pytest pyarrow --help

and look for the “custom options” section.

Test Groups#

We have many tests that are grouped together using pytest marks. Some of these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. --parquet. To disable a test group, prepend disable-, for example --disable-parquet. To run only the unit tests for a particular group, prepend only- instead, for example --only-parquet.

The test groups currently include:

gandiva: tests for Gandiva expression compiler (uses LLVM)
hdfs: tests that use libhdfs or libhdfs3 to access the Hadoop filesystem
hypothesis: tests that use the hypothesis module for generating random test cases. Note that --hypothesis doesn’t work due to a quirk with pytest, so you have to pass --enable-hypothesis
large_memory: tests requiring a large amount of system RAM
orc: Apache ORC tests
parquet: Apache Parquet tests
plasma: Plasma Object Store tests
s3: Tests for Amazon S3
tensorflow: Tests that involve TensorFlow
flight: Flight RPC tests
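
The flag naming rules above can be sketched as a small helper that assembles a pytest command line. This is a minimal illustration under stated assumptions: group_flags is a hypothetical function, not part of the PyArrow test suite, and it only encodes the conventions described in this section, including the --enable-hypothesis special case.

```python
def group_flags(enable=(), disable=(), only=()):
    """Build pytest command-line flags for PyArrow test groups."""
    flags = []
    for group in enable:
        # --hypothesis does not work, so that group needs --enable-hypothesis.
        if group == "hypothesis":
            flags.append("--enable-hypothesis")
        else:
            flags.append(f"--{group}")
    flags.extend(f"--disable-{group}" for group in disable)
    flags.extend(f"--only-{group}" for group in only)
    return flags


# Example: enable Parquet and hypothesis tests, disable Plasma tests.
args = ["python", "-m", "pytest", "pyarrow"] + group_flags(
    enable=["parquet", "hypothesis"], disable=["plasma"]
)
print(" ".join(args))
```

The resulting list could be passed to subprocess.run, or joined and pasted into a shell.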

Benchmarking#

For running the benchmarks, see Benchmarks.

Building on Linux and macOS#

System Requirements#

On macOS, any modern Xcode (6.4 or higher; the current version is 13) or the Xcode Command Line Tools (xcode-select --install) is sufficient.

On Linux, for this guide, we require a minimum of gcc 4.8 or clang 3.7. You can check your version by running

$ gcc --version

If the system compiler is older than gcc 4.8, it can be set to a newer version using the $CC and $CXX environment variables:

$ export CC=gcc-4.8
$ export CXX=g++-4.8
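
If you want to script the minimum-version check, the comparison itself is easy to express. The sketch below assumes you already have a dotted version string, for example as printed by gcc -dumpversion (which on some installations prints only the major version). Both version_tuple and meets_minimum are hypothetical helper names introduced for illustration.

```python
def version_tuple(text, width=2):
    """Turn '4.8.5' into (4, 8); pad missing components with zeros."""
    parts = [int(part) for part in text.strip().split(".")[:width]]
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(version_text, minimum=(4, 8)):
    """Return True if the version is at least the given minimum."""
    return version_tuple(version_text) >= minimum


# gcc needs at least 4.8, clang at least 3.7 for this guide.
print(meets_minimum("4.8.5"))
print(meets_minimum("3.6.2", minimum=(3, 7)))
```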

Environment Setup and Build#

First, let’s clone the Arrow git repository:

$ git clone https://github.com/apache/arrow.git

Pull in the test data and set up the environment variables:

$ pushd arrow
$ git submodule update --init
$ export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
$ export ARROW_TEST_DATA="${PWD}/testing/data"
$ popd
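
When the tests are launched from a Python script rather than from a shell, the same two variables can be derived from the checkout location. The following is a sketch, not project tooling: data_env and missing_dirs are hypothetical helpers, and the example assumes the clone lives in ./arrow. A missing directory usually suggests that the submodules have not been initialized yet.

```python
import os
from pathlib import Path


def data_env(repo_root):
    """Return the two test-data variables for an Arrow checkout at repo_root."""
    root = Path(repo_root).resolve()
    return {
        "PARQUET_TEST_DATA": str(
            root / "cpp" / "submodules" / "parquet-testing" / "data"
        ),
        "ARROW_TEST_DATA": str(root / "testing" / "data"),
    }


def missing_dirs(env):
    """List the variables whose directory does not exist yet."""
    return [name for name, path in env.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    env = data_env("arrow")  # assumes the clone lives in ./arrow
    os.environ.update(env)   # only affects this process and its children
    for name in missing_dirs(env):
        print(f"{name} points to a missing directory: {env[name]}")
```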

Using Conda#

The conda package manager allows installing build-time dependencies for Arrow C++ and PyArrow as pre-built binaries, which can make Arrow development easier and faster.

Let’s create a conda environment with all the C++ build and Python dependencies from conda-forge, targeting development for Python 3.9:

On Linux and macOS:

$ conda create -y -n pyarrow-dev -c conda-forge \
       --file arrow/ci/conda_env_unix.txt \
       --file arrow/ci/conda_env_cpp.txt \
       --file arrow/ci/conda_env_python.txt \
       --file arrow/ci/conda_env_gandiva.txt \
       compilers \
       python=3.9 \
       pandas

As of January 2019, the compilers package is needed on many Linux distributions to use packages from conda-forge.

With this out of the way, you can now activate the conda environment:

$ conda activate pyarrow-dev

For Windows, see the Building on Windows section below.

We need to set some environment variables to let Arrow’s build system know about our build toolchain:

$ export ARROW_HOME=$CONDA_PREFIX

Using system and bundled dependencies#

Warning

If you installed Python using the Anaconda distribution or Miniconda, you cannot currently use a pip-based virtual environment. Please follow the conda-based development instructions instead.

If not using conda, you must arrange for your system to provide the required build tools and dependencies. Note that if some dependencies are absent, the Arrow C++ build chain may still be able to download and compile them on the fly, but this will take a longer time than with pre-installed binaries.

On macOS, use Homebrew to install all dependencies required for building Arrow C++:

$ brew update && brew bundle --file=arrow/cpp/Brewfile

See here for a list of dependencies you may need.

On Debian/Ubuntu, you need the following minimal set of dependencies:

$ sudo apt-get install build-essential cmake python3-dev

Now, let’s create a Python virtual environment with all Python dependencies in the same folder as the repositories, and a target installation folder:

$ python3 -m venv pyarrow-dev
$ source ./pyarrow-dev/bin/activate
$ pip install -r arrow/python/requirements-build.txt

$ # This is the folder where we will install the Arrow libraries during
$ # development
$ mkdir dist

If your CMake version is too old on Linux, you could get a newer one via pip install cmake.

We need to set some environment variables to let Arrow’s build system know about our build toolchain:

$ export ARROW_HOME=$(pwd)/dist
$ export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
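
If LD_LIBRARY_PATH was previously unset, the export above leaves a trailing colon, which the dynamic loader treats as an empty entry meaning the current directory. A script that sets these variables can avoid that by only adding the separator when there is an existing value. The sketch below shows one way to do this; prepend_path is a hypothetical helper, and the dist directory name simply mirrors the commands above.

```python
import os


def prepend_path(new_dir, existing, sep=os.pathsep):
    """Put new_dir in front of an existing search path, without empty entries."""
    if not existing:
        return new_dir
    return f"{new_dir}{sep}{existing}"


if __name__ == "__main__":
    arrow_home = os.path.join(os.getcwd(), "dist")
    lib_dir = os.path.join(arrow_home, "lib")
    print("ARROW_HOME=" + arrow_home)
    print("LD_LIBRARY_PATH=" + prepend_path(
        lib_dir, os.environ.get("LD_LIBRARY_PATH", "")
    ))
```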

Build and test#

Now build the Arrow C++ libraries and install them into the directory we created above (stored in $ARROW_HOME):

$ mkdir arrow/cpp/build
$ pushd arrow/cpp/build

$ cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_BUILD_TYPE=Debug \
        -DARROW_WITH_BZ2=ON \
        -DARROW_WITH_ZLIB=ON \
        -DARROW_WITH_ZSTD=ON \
        -DARROW_WITH_LZ4=ON \
        -DARROW_WITH_SNAPPY=ON \
        -DARROW_WITH_BROTLI=ON \
        -DARROW_PARQUET=ON \
        -DPARQUET_REQUIRE_ENCRYPTION=ON \
        -DARROW_PYTHON=ON \
        -DARROW_BUILD_TESTS=ON \
        ..
$ make -j4
$ make install
$ popd

There are a number of optional components that can be switched on by adding the corresponding flag set to ON:

ARROW_CUDA: Support for CUDA-enabled GPUs
ARROW_FLIGHT: Flight RPC framework
ARROW_GANDIVA: LLVM-based expression compiler
ARROW_ORC: Support for Apache ORC file format
ARROW_PARQUET: Support for Apache Parquet file format
PARQUET_REQUIRE_ENCRYPTION: Support for Parquet Modular Encryption
ARROW_PLASMA: Shared memory object store

Anything set to ON above can also be turned off. Note that some compression libraries are recommended for full Parquet support.

You may choose between different kinds of C++ build types:

-DCMAKE_BUILD_TYPE=Release (the default) produces a build with optimizations enabled and debugging information disabled;
-DCMAKE_BUILD_TYPE=Debug produces a build with optimizations disabled and debugging information enabled;
-DCMAKE_BUILD_TYPE=RelWithDebInfo produces a build with both optimizations and debugging information enabled.
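
To make the relationship between components, build type, and the resulting command line concrete, here is a small sketch that assembles the cmake invocation from a dictionary of switches. It is an illustration only: cmake_args is a hypothetical helper, it knows nothing about which options Arrow actually accepts, and it merely formats whatever names you give it as -DNAME=ON or -DNAME=OFF.

```python
BUILD_TYPES = ("Release", "Debug", "RelWithDebInfo")


def cmake_args(install_prefix, build_type="Release", components=None,
               source_dir=".."):
    """Assemble a cmake command line as a list of arguments."""
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown build type: {build_type!r}")
    args = [
        "cmake",
        f"-DCMAKE_INSTALL_PREFIX={install_prefix}",
        f"-DCMAKE_BUILD_TYPE={build_type}",
    ]
    # Sort the switches so the generated command is reproducible.
    for name, enabled in sorted((components or {}).items()):
        args.append(f"-D{name}={'ON' if enabled else 'OFF'}")
    args.append(source_dir)
    return args


print(" ".join(cmake_args(
    "/path/to/dist",
    build_type="Debug",
    components={"ARROW_PARQUET": True, "ARROW_PLASMA": False},
)))
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting.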

Docker examples#

If you are having difficulty building the Python library from source, take a look at the python/examples/minimal_build directory, which illustrates a complete build and test from source with both the conda- and pip-based build methods.

Debugging#

Since pyarrow depends on the Arrow C++ libraries, debugging can frequently involve crossing between Python and C++ shared libraries.

Using gdb on Linux#

To debug the C++ libraries with gdb while running the Python unit tests, first start pytest with gdb:

$ gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH

To set a breakpoint, use the same gdb syntax that you would when debugging a C++ program, for example:

(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.

Building on Windows#

Building on Windows requires one of the following compilers to be installed:

Build Tools for Visual Studio 2017
Visual Studio 2017

During the setup of Build Tools, ensure at least one Windows SDK is selected.

We bootstrap a conda environment similar to the one above, skipping some of the Linux/macOS-only packages.

First, start from a fresh clone of Apache Arrow and create the environment:

$ git clone https://github.com/apache/arrow.git

$ conda create -y -n pyarrow-dev -c conda-forge ^
      --file arrow\ci\conda_env_cpp.txt ^
      --file arrow\ci\conda_env_python.txt ^
      --file arrow\ci\conda_env_gandiva.txt ^
      python=3.9
$ conda activate pyarrow-dev

Now, we build and install Arrow C++ libraries.

We set a number of environment variables:

ARROW_HOME: the path of the installation directory of the Arrow C++ libraries
PATH: extended with the path of the installed DLL libraries
PYARROW_CMAKE_GENERATOR: the CMake generator to be used

$ set ARROW_HOME=%cd%\arrow-dist
$ set PATH=%ARROW_HOME%\bin;%PATH%
$ set PYARROW_CMAKE_GENERATOR=Visual Studio 15 2017 Win64

Let’s configure, build and install the Arrow C++ libraries:

$ mkdir arrow\cpp\build
$ pushd arrow\cpp\build
$ cmake -G "%PYARROW_CMAKE_GENERATOR%" ^
      -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
      -DCMAKE_UNITY_BUILD=ON ^
      -DARROW_CXXFLAGS="/WX /MP" ^
      -DARROW_WITH_LZ4=on ^
      -DARROW_WITH_SNAPPY=on ^
      -DARROW_WITH_ZLIB=on ^
      -DARROW_WITH_ZSTD=on ^
      -DARROW_PARQUET=on ^
      -DARROW_PYTHON=on ^
      ..
$ cmake --build . --target INSTALL --config Release
$ popd

Now, we can build pyarrow:

$ pushd arrow\python
$ set PYARROW_WITH_PARQUET=1
$ python setup.py build_ext --inplace
$ popd

Note

To build pyarrow, the environment variables defined above also need to be set. Remember this if you want to rebuild pyarrow after your initial build.

Then run the unit tests with:

$ pushd arrow\python
$ python -m pytest pyarrow
$ popd

Note

With the above instructions the Arrow C++ libraries are not bundled with the Python extension. This is recommended for development as it allows the C++ libraries to be re-built separately.

As a consequence, however, python setup.py install will also not install the Arrow C++ libraries. Therefore, to use pyarrow in Python, PATH must contain the directory with the Arrow .dll files.

If you want to bundle the Arrow C++ libraries with pyarrow, add the --bundle-arrow-cpp option when building:

$ python setup.py build_ext --bundle-arrow-cpp

Important: If you combine --bundle-arrow-cpp with --inplace the Arrow C++ libraries get copied to the source tree and are not cleared by python setup.py clean. They remain in place and will take precedence over any later Arrow C++ libraries contained in PATH. This can lead to incompatibilities when pyarrow is later built without --bundle-arrow-cpp.
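
Because python setup.py clean does not remove these copies, it can help to look for them explicitly before switching back to a non-bundled build. The sketch below lists files in a package directory whose names look like Arrow C++ libraries. The file name patterns are assumptions made for illustration (the actual names depend on platform and version), and find_bundled_libraries is a hypothetical helper, not part of PyArrow.

```python
import fnmatch
from pathlib import Path

# Assumed name patterns for bundled libraries; adjust for your platform.
LIBRARY_PATTERNS = ("arrow*.dll", "parquet*.dll", "libarrow*", "libparquet*")


def find_bundled_libraries(package_dir, patterns=LIBRARY_PATTERNS):
    """Return files directly inside package_dir that match any pattern."""
    package_dir = Path(package_dir)
    return sorted(
        path
        for path in package_dir.iterdir()
        if path.is_file()
        and any(fnmatch.fnmatch(path.name, pattern) for pattern in patterns)
    )


if __name__ == "__main__":
    source_dir = Path("arrow") / "python" / "pyarrow"  # assumed checkout layout
    if source_dir.is_dir():
        for path in find_bundled_libraries(source_dir):
            print(f"possible leftover library: {path}")
```

The helper only reports candidates; deleting them is left to you, since a pattern may also match files you want to keep.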

Running C++ unit tests for Python integration#

Running C++ unit tests should not be necessary for most developers. If you do want to run them, you need to pass -DARROW_BUILD_TESTS=ON during configuration of the Arrow C++ library build:

$ mkdir arrow\cpp\build
$ pushd arrow\cpp\build
$ cmake -G "%PYARROW_CMAKE_GENERATOR%" ^
      -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
      -DARROW_CXXFLAGS="/WX /MP" ^
      -DARROW_PARQUET=on ^
      -DARROW_PYTHON=on ^
      -DARROW_BUILD_TESTS=ON ^
      ..
$ cmake --build . --target INSTALL --config Release
$ popd

Getting arrow-python-test.exe (C++ unit tests for python integration) to run is a bit tricky because your %PYTHONHOME% must be configured to point to the active conda environment:

$ set PYTHONHOME=%CONDA_PREFIX%
$ pushd arrow\cpp\build\release\Release
$ arrow-python-test.exe
$ popd

To run all tests of the Arrow C++ library, you can also run ctest:

$ set PYTHONHOME=%CONDA_PREFIX%
$ pushd arrow\cpp\build
$ ctest
$ popd

Caveats#

The Plasma component is not supported on Windows.
