Apache Arrow Ballista 0.9.0 Release


Published 28 Oct 2022
By The Apache Arrow PMC (pmc)

Introduction

Ballista is an Arrow-native distributed SQL query engine implemented in Rust.

Ballista 0.9.0 is now available and is the most significant release since the project was donated to Apache Arrow in 2021.

This release represents 4 weeks of work, with 66 commits from 14 contributors:

    22  Andy Grove
    12  yahoNanJing
     6  Daniël Heres
     4  Brent Gardner
     4  dependabot[bot]
     4  r.4ntix
     3  Stefan Stanciulescu
     3  mingmwang
     2  Ken Suenobu
     2  Yang Jiang
     1  Metehan Yıldırım
     1  Trent Feda
     1  askoa
     1  yangzhong

Release Highlights

The highlights below are a selection and are not exhaustive. This release includes many other bug fixes and improvements; see the complete changelog for details.

Support for Cloud Object Stores and Distributed File Systems

This is the first release of Ballista to have documented support for querying data from distributed file systems and object stores. Currently, S3 and HDFS are supported. Support for Google Cloud Storage and Azure Blob Storage is planned for the next release.

Flight SQL & JDBC Support

The Ballista scheduler now implements the Flight SQL protocol, enabling any compliant Flight SQL client to connect to and run queries against a Ballista cluster.

The Apache Arrow Flight SQL JDBC driver can be used to connect Business Intelligence tools to a Ballista cluster.

Python Bindings

It is now possible to connect to a Ballista cluster from Python and execute queries using both the DataFrame and SQL interfaces.
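As a rough sketch of what this can look like, the snippet below registers a Parquet directory and runs a SQL query against a cluster. It assumes the `ballista` Python package is installed and exposes `BallistaContext(host, port)`, `register_parquet`, and `sql` as in the Python bindings of this era; the host, port `50050`, table name `trips`, and data path are placeholders, and the helper functions are hypothetical names introduced for illustration.

```python
def count_rows_sql(table: str) -> str:
    """Build a simple row-count query for the given table name."""
    return f"SELECT COUNT(*) AS row_count FROM {table}"


def count_rows_on_cluster(host: str, port: int, table: str, parquet_path: str):
    """Run a row-count query on a Ballista cluster and return the result batches.

    Assumption: the `ballista` package provides `BallistaContext(host, port)`
    with `register_parquet` and `sql` methods. The import is kept local so
    this module can be loaded on machines without the bindings installed.
    """
    import ballista  # assumed package name for the Python bindings

    # Connect to the scheduler (host and port are placeholders).
    ctx = ballista.BallistaContext(host, port)

    # Register a Parquet dataset under a table name so SQL can reference it.
    ctx.register_parquet(table, parquet_path)

    # Build a DataFrame from SQL; execution happens when results are collected.
    df = ctx.sql(count_rows_sql(table))
    return df.collect()
```

A call such as `count_rows_on_cluster("localhost", 50050, "trips", "/mnt/bigdata/nyctaxi")` would require a running scheduler and at least one executor that can read the given path.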

Scheduler Web User Interface and REST API

The scheduler now has a web user interface for monitoring queries. It is also possible to view graphical query plans that show how the query was executed, along with metrics.

The REST API that powers the user interface can also be accessed directly.
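For example, a small script can poll the scheduler for job information using only the Python standard library. This is a minimal sketch under stated assumptions: the scheduler is assumed to listen on `localhost:50050`, the `/api/jobs` resource path is an assumption that may differ between versions, and the `Accept: application/json` header is included on the assumption that the scheduler serves REST and gRPC traffic on the same port. The function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen


def api_url(host: str, port: int, resource: str) -> str:
    """Build a scheduler REST URL; the /api/ prefix is an assumption."""
    return f"http://{host}:{port}/api/{resource.strip('/')}"


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the JSON response body."""
    # Ask explicitly for JSON (assumed to help the scheduler route the request).
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a scheduler running, `fetch_json(api_url("localhost", 50050, "jobs"))` would return whatever job data that endpoint exposes; check the user guide for the resources available in your version.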

Simplified Kubernetes Deployment

Ballista now provides a Helm chart for simplified Kubernetes deployment.

User Guide

The user guide is published at https://arrow.apache.org/ballista/ and provides deployment instructions for Docker, Docker Compose, and Kubernetes, as well as references for configuring and tuning Ballista.

Roadmap

The Ballista community is currently focused on the following tasks for the next release:

  • Add support for Azure Blob Storage and Google Cloud Storage
  • Improve benchmark performance by implementing more query optimizations
  • Improve the scheduler web user interface
  • Publish Docker images to GitHub Container Registry

The detailed list of issues planned for the 0.10.0 release can be found in the tracking issue.

Getting Involved

Ballista has a friendly community and we welcome contributions. A good place to start is to follow the instructions in the user guide, try Ballista with your own SQL queries and ETL pipelines, and file issues for any bugs or feature suggestions.