Apache Arrow Ballista 0.9.0 Release
28 Oct 2022
By The Apache Arrow PMC (pmc)
Ballista is an Arrow-native distributed SQL query engine implemented in Rust.
Ballista 0.9.0 is now available and is the most significant release since the project was donated to Apache Arrow in 2021.
This release represents 4 weeks of work, with 66 commits from 14 contributors:
22 Andy Grove
6 Daniël Heres
4 Brent Gardner
3 Stefan Stanciulescu
2 Ken Suenobu
2 Yang Jiang
1 Metehan Yıldırım
1 Trent Feda
The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes and improvements have been made; see the complete changelog for details.
Support for Cloud Object Stores and Distributed File Systems
This is the first release of Ballista to have documented support for querying data from distributed file systems and object stores. Currently, S3 and HDFS are supported. Support for Google Cloud Storage and Azure Blob Storage is planned for the next release.
Flight SQL & JDBC support
The Ballista scheduler now implements the Flight SQL protocol, enabling any compliant Flight SQL client to connect to and run queries against a Ballista cluster.
The Apache Arrow Flight SQL JDBC driver can be used to connect Business Intelligence tools to a Ballista cluster.
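As a rough illustration of what a BI tool needs in order to connect, the sketch below builds a JDBC connection URL for the Flight SQL driver. It assumes the `jdbc:arrow-flight-sql://` URL scheme and the `useEncryption` property used by the Apache Arrow Flight SQL JDBC driver, and a scheduler listening on `localhost:50050`. The host, port, and the helper name `flight_sql_jdbc_url` are illustrative and do not come from this release; check the driver and Ballista documentation for the exact URL your versions expect.

```python
from urllib.parse import urlencode


def flight_sql_jdbc_url(host: str, port: int, **params: str) -> str:
    """Build a JDBC URL for a Flight SQL endpoint.

    Hypothetical helper: the URL scheme and parameter names are assumptions
    based on the Arrow Flight SQL JDBC driver, not on Ballista itself.
    """
    base = f"jdbc:arrow-flight-sql://{host}:{port}"
    query = urlencode(params)
    # Only append the query string when driver properties were supplied.
    return f"{base}/?{query}" if query else base


# Assumed local scheduler without TLS; adjust for a real cluster.
url = flight_sql_jdbc_url("localhost", 50050, useEncryption="false")
print(url)
```

The resulting string is what would be pasted into the connection dialog of a JDBC-capable tool once the driver JAR is on its classpath.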
It is now possible to connect to a Ballista cluster from Python and execute queries using both the DataFrame and SQL interfaces.
Scheduler Web User Interface and REST API
The scheduler now has a web user interface for monitoring queries. It is also possible to view graphical query plans that show how the query was executed, along with metrics.
The REST API that powers the user interface can also be accessed directly.
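A minimal sketch of calling the REST API directly from a script follows. It assumes the scheduler is reachable at `localhost:50050` and exposes a JSON endpoint at `/api/jobs`. Both the address and the path are assumptions made for illustration, and the helper names are hypothetical; consult the user guide for the endpoints your version provides.

```python
import json
from urllib.request import urlopen


def scheduler_url(host: str, port: int, path: str) -> str:
    """Build a URL for the scheduler's HTTP interface (hypothetical helper)."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def list_jobs(host: str = "localhost", port: int = 50050):
    """Fetch job information; the /api/jobs path is an assumption."""
    return fetch_json(scheduler_url(host, port, "/api/jobs"))
```

Calling `list_jobs()` against a running scheduler should return whatever JSON that endpoint serves; the response shape is not specified here, so inspect it before relying on particular fields.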
Simplified Kubernetes Deployment
Ballista now provides a Helm chart for simplified Kubernetes deployment.
The user guide is published at https://arrow.apache.org/ballista/ and provides deployment instructions for Docker, Docker Compose, and Kubernetes, as well as references for configuring and tuning Ballista.
The Ballista community is currently focused on the following tasks for the next release:
- Support for Azure Blob Storage and Google Cloud Storage
- Improve benchmark performance by implementing more query optimizations
- Improve scheduler web user interface
- Publish Docker images to GitHub Container Registry
The detailed list of issues planned for the 0.10.0 release can be found in the tracking issue.
Ballista has a friendly community and we welcome contributions. A good place to start is to follow the instructions in the user guide, try using Ballista with your own SQL queries and ETL pipelines, and file issues for any bugs or feature suggestions.