Apache Arrow Ballista 0.5.0 Release

Published 18 Aug 2021
By The Apache Arrow PMC (pmc)

Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since the project was donated to the Apache Arrow project and includes 80 commits from 11 contributors.

git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
Andy Grove
Jiayu Liu
Andrew Lamb
Ximo Guanter
Daniël Heres
QP Hou
Jorge Leitao
Javier Goday
K.I. (Dennis) Jung
Mike Seddon
sathis

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

Performance and Scalability

Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against CSV files, performance is generally very close to DataFusion, and significantly faster in some cases due to the fact that the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet datasets is currently non ideal due to some issues (#867, #868) that we hope to resolve for the next release.

New Features

The main new features in this release are:

Ballista queries can now be executed by calling DataFrame.collect()
The shuffle mechanism has been re-implemented
Distributed hash-partitioned joins are now supported
Keda autoscaling is supported

To get started with Ballista, refer to the crate documentation.

Now that the basic functionality is in place, the focus for the next release will be to improve the performance and scalability as well as improving the documentation.

How to Get Involved

If you are interested in contributing to Ballista, we would love to have you! You can help by trying out Ballista on some of your own data and projects and filing bug reports and helping to improve the documentation, or contribute to the documentation, tests or code. A list of open issues suitable for beginners is here and the full list is here.