Packaging and Testing with Crossbow#
The contents of the arrow/dev/tasks directory automate
Arrow packaging and integration testing.
- Packages:
  - C++ and Python conda-forge packages for Linux, macOS and Windows
  - Python Wheels for Linux, macOS and Windows
  - C++ and GLib Linux packages for multiple distributions
  - Java for Gandiva
- Integration tests:
  - Various docker tests
  - Pandas
  - Dask
  - Turbodbc
  - HDFS
  - Spark
Architecture#
Executors#
Individual jobs are executed on public CI services, currently:
- Linux: GitHub Actions, Travis CI, Azure Pipelines 
- macOS: GitHub Actions, Azure Pipelines 
- Windows: GitHub Actions, Azure Pipelines 
Queue#
Because the CI services start builds in response to pushes, jobs are
scheduled through an additional git repository, which acts as a job
queue for the tasks. Anyone can host a queue repository (usually
named <ghuser>/crossbow).
A job is a git commit on a particular git branch, containing the required
configuration files to run the requested builds (like .travis.yml,
azure-pipelines.yml, or crossbow.yml for GitHub Actions).
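To make the "job as a commit on a branch" idea concrete, the following is a minimal sketch that models a single queued job in a throwaway local repository. The branch name, file contents, and commit identity are hypothetical placeholders, and the push to GitHub that would actually trigger a build is left out.

```shell
#!/bin/sh
# Sketch: model one queued job as a commit on its own branch of a
# throwaway local repository. All names here are hypothetical.
set -eu

queue_dir=$(mktemp -d)          # stand-in for a <ghuser>/crossbow clone
branch="build-1-conda-linux"    # hypothetical job branch name

git init -q "$queue_dir"
cd "$queue_dir"

# Point HEAD at the job branch before the first commit is created.
git symbolic-ref HEAD "refs/heads/$branch"

# A placeholder CI configuration; a real job would carry a rendered file.
printf 'placeholder: true\n' > crossbow.yml
git add crossbow.yml
git -c user.name="example" -c user.email="example@example.invalid" \
    commit -q -m "Queue job on $branch"

# The job is now a single commit on that branch.
git log --oneline "$branch"
```

Pushing such a branch to a queue repository with CI enabled is what would start the build; the sketch stops before that step so it can run offline.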
Scheduler#
Crossbow handles version generation, task rendering and
submission. The tasks are defined in tasks.yml.
Install#
The following guide depends on GitHub, but theoretically any git server can be used.
If you are not using the ursacomputing/crossbow repository, complete the first two steps; otherwise proceed to step 3:
1. Create the queue repository.
2. Enable Azure Pipelines integrations for the newly created queue repository.
3. Clone either ursacomputing/crossbow, if you are using that, or the newly created repository next to the arrow repository. By default the scripts look for a crossbow clone next to the arrow directory, but this can be configured through command line arguments.
   $ git clone https://github.com/<user>/crossbow crossbow
   Important note: Crossbow only supports GitHub token based authentication. Although it overwrites repository URLs provided with the ssh protocol, it’s advisable to use the HTTPS repository URLs.
4. Create a Personal Access Token with repo and workflow permissions (other permissions are not needed).
5. Locally export the token as an environment variable:
   $ export GH_TOKEN=<token>
   or pass it to the CLI script with the --github-token argument.
6. Install Python (minimum supported version is 3.10). Miniconda is preferred; see its installation instructions.
7. Install the archery toolset containing crossbow itself:
   $ pip install -e "arrow/dev/archery[crossbow]"
8. Try running it:
   $ archery crossbow --help
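Before the first submission it can help to confirm that the setup above is in place. The sketch below assumes the default layout (a crossbow clone next to the arrow clone) and the GH_TOKEN variable named in the steps. `check_crossbow_setup` is a hypothetical helper written for this illustration and is not part of archery.

```shell
#!/bin/sh
# Sketch of a pre-flight check for the setup described above.
# check_crossbow_setup is a hypothetical helper, not part of archery.
check_crossbow_setup() {
    workdir=$1   # directory expected to contain both clones
    status=0

    # The token is expected in the GH_TOKEN environment variable.
    if [ -z "${GH_TOKEN:-}" ]; then
        echo "GH_TOKEN is not set" >&2
        status=1
    fi

    # By default the crossbow clone is expected next to the arrow clone.
    for name in arrow crossbow; do
        if [ ! -d "$workdir/$name" ]; then
            echo "missing directory: $workdir/$name" >&2
            status=1
        fi
    done

    return "$status"
}

# Example, run from inside the arrow clone:
# check_crossbow_setup .. && archery crossbow --help
```

The function reports every problem it finds instead of stopping at the first one, so a single run shows everything that still needs fixing.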
Usage#
The script does the following:
- Detects the current repository, and thus supports forks. The following snippet will build kszucs’s fork instead of the upstream apache/arrow repository.
  $ git clone https://github.com/kszucs/arrow
  $ git clone https://github.com/kszucs/crossbow
  $ cd arrow/dev/tasks
  $ archery crossbow submit --help  # show the available options
  $ archery crossbow submit conda-win conda-linux conda-osx
- Gets the HEAD commit of the currently checked out branch and generates the version number based on setuptools_scm. To build a particular branch, check it out before running the script:
  $ git checkout ARROW-<ticket number>
  $ archery crossbow submit --dry-run conda-linux conda-osx
  Note that the arrow branch must be pushed beforehand, because the script will clone the selected branch.
- Reads and renders the required build configurations with the parameters substituted.
- Creates a branch per task, prefixed with the job id. For example, to build conda recipes on Linux, it will create a new branch: crossbow@build-<id>-conda-linux.
- Pushes the modified branches to GitHub, which triggers the builds. For authentication it uses the GitHub token described in the Install section.
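The per-task branch naming in the steps above can be sketched as follows. `task_branches` is a hypothetical helper, and the `<job id>-<task>` layout is inferred from the example branch name in the list; the real implementation inside archery may differ.

```shell
#!/bin/sh
# Sketch: derive one branch name per task from a job id.
# task_branches is a hypothetical helper; the "<job id>-<task>" layout
# follows the example branch name in the text.
task_branches() {
    job_id=$1
    shift
    for task in "$@"; do
        printf '%s-%s\n' "$job_id" "$task"
    done
}

# Prints build-42-conda-linux and build-42-conda-osx, one per line.
task_branches build-42 conda-linux conda-osx
```

Because every task gets its own branch, each push can trigger an independent build on the CI service.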
Query the build status#
The submit command returns the build id, which has a corresponding branch
in the queue repository.
$ archery crossbow status <build id / branch name>
Download the build artifacts#
$ archery crossbow artifacts <build id / branch name>
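The two commands above are often run back to back, so they can be combined in a small wrapper. This is only a sketch: `report_build` and the `ARCHERY` override are hypothetical names introduced here, and the artifacts step runs only if the status command itself exits successfully, which may or may not reflect the state of the builds.

```shell
#!/bin/sh
# Sketch: check a build and then fetch its artifacts in one step.
# report_build is a hypothetical helper; ARCHERY can be overridden to
# point at a different executable.
report_build() {
    build_id=$1
    "${ARCHERY:-archery}" crossbow status "$build_id" &&
        "${ARCHERY:-archery}" crossbow artifacts "$build_id"
}

# Example (needs archery installed and a real build id):
# report_build <build id / branch name>
```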
Examples#
The submit command accepts a list of task names and/or task-group names to select which tasks to build.
Run multiple builds:
$ archery crossbow submit debian-stretch conda-linux-gcc-py37-r40
Repository: https://github.com/kszucs/arrow@tasks
Commit SHA: 810a718836bb3a8cefc053055600bdcc440e6702
Version: 0.9.1.dev48+g810a7188.d20180414
Pushed branches:
 - debian-stretch
 - conda-linux-gcc-py37-r40
Just render without applying or committing the changes:
$ archery crossbow submit --dry-run task_name
Run the conda package builds and one Linux package build:
$ archery crossbow submit --group conda centos-7
Run wheel builds:
$ archery crossbow submit --group wheel
tasks.yml defines multiple task groups, such as docker, integration
and cpp-python, for running Docker-based tests.
archery crossbow submit supports multiple options and arguments; for details,
see its help page:
$ archery crossbow submit --help
 
    