Apache Arrow Ballista 0.5.0 Release
18 Aug 2021
By The Apache Arrow PMC (pmc)
Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since the project was donated to the Apache Arrow project and includes 80 commits from 11 contributors.
git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler 27 Andy Grove 15 Jiayu Liu 12 Andrew Lamb 8 Ximo Guanter 6 Daniël Heres 5 QP Hou 2 Jorge Leitao 1 Javier Goday 1 K.I. (Dennis) Jung 1 Mike Seddon 1 sathis
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.
Performance and Scalability
Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against CSV files, performance is generally very close to DataFusion, and significantly faster in some cases due to the fact that the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet datasets is currently non ideal due to some issues (#867, #868) that we hope to resolve for the next release.
The main new features in this release are:
- Ballista queries can now be executed by calling DataFrame.collect()
- The shuffle mechanism has been re-implemented
- Distributed hash-partitioned joins are now supported
- Keda autoscaling is supported
To get started with Ballista, refer to the crate documentation.
Now that the basic functionality is in place, the focus for the next release will be to improve the performance and scalability as well as improving the documentation.
How to Get Involved
If you are interested in contributing to Ballista, we would love to have you! You can help by trying out Ballista on some of your own data and projects and filing bug reports and helping to improve the documentation, or contribute to the documentation, tests or code. A list of open issues suitable for beginners is here and the full list is here.