Starting a Ballista Cluster using Docker

Build Docker Images

There are no officially published Docker images, so it is currently necessary to build the images from source.

Run the following commands to clone the source repository and build the Docker image.

git clone git@github.com:apache/arrow-ballista.git -b 0.9.0
cd arrow-ballista
./dev/build-ballista-docker.sh

This will create the following images:

  • apache/arrow-ballista-scheduler:0.9.0

  • apache/arrow-ballista-executor:0.9.0

Start a Scheduler

Start a scheduler using the following syntax:

docker run --network=host \
 -d apache/arrow-ballista-scheduler:0.9.0 \
 --bind-port 50050

Run docker ps to check that the process is running:

$ docker ps
CONTAINER ID   IMAGE                                   COMMAND                  CREATED         STATUS        PORTS     NAMES
8cdea4956c97   apache/arrow-ballista-scheduler:0.9.0   "/scheduler-entrypoi…"   2 seconds ago   Up 1 second             nervous_swirles

Run docker logs CONTAINER_ID to check the output from the process:

$ docker logs 8cdea4956c97
Starting nginx to serve Ballista Scheduler web UI on port 80
2022-09-19T13:51:34.792363Z  INFO main ThreadId(01) ballista_scheduler: Ballista v0.9.0 Scheduler listening on 0.0.0.0:50050
2022-09-19T13:51:34.792395Z  INFO main ThreadId(01) ballista_scheduler: Starting Scheduler grpc server with task scheduling policy of PullStaged
2022-09-19T13:51:34.792494Z  INFO main ThreadId(01) ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
2022-09-19T13:51:34.792581Z  INFO tokio-runtime-worker ThreadId(45) ballista_core::event_loop: Starting the event loop query_stage

Start Executors

Start one or more executor processes. Each executor process will need to listen on a different port.

docker run --network=host \
  -d apache/arrow-ballista-executor:0.9.0 \
  --external-host localhost --bind-port 50051

Use docker ps to check that both the scheduler and executor(s) are now running:

$ docker ps
CONTAINER ID   IMAGE                                   COMMAND                  CREATED         STATUS         PORTS     NAMES
f0b21f6b5050   apache/arrow-ballista-executor:0.9.0    "/executor-entrypoin…"   2 seconds ago   Up 1 second              relaxed_goldberg
8cdea4956c97   apache/arrow-ballista-scheduler:0.9.0   "/scheduler-entrypoi…"   2 minutes ago   Up 2 minutes             nervous_swirles

Use docker logs CONTAINER_ID to check the output from the executor(s):

$ docker logs f0b21f6b5050
2022-09-19T13:54:10.806231Z  INFO main ThreadId(01) ballista_executor: Running with config:
2022-09-19T13:54:10.806261Z  INFO main ThreadId(01) ballista_executor: work_dir: /tmp/.tmp5BdxT2
2022-09-19T13:54:10.806265Z  INFO main ThreadId(01) ballista_executor: concurrent_tasks: 48
2022-09-19T13:54:10.807454Z  INFO tokio-runtime-worker ThreadId(49) ballista_executor: Ballista v0.9.0 Rust Executor Flight Server listening on 0.0.0.0:50051
2022-09-19T13:54:10.807467Z  INFO tokio-runtime-worker ThreadId(46) ballista_executor::execution_loop: Starting poll work loop with scheduler

Using etcd as a Backing Store

NOTE: This functionality is currently experimental

Ballista can optionally use etcd as a backing store for the scheduler. Use the following commands to launch the scheduler with this option enabled.

docker run --network=host \
  -d apache/arrow-ballista-scheduler:0.9.0 \
  --bind-port 50050 \
  --config-backend etcd \
  --etcd-urls etcd:2379

Please refer to the etcd website for installation instructions. Etcd version 3.4.9 or later is recommended.