Apache Arrow DataFusion 13.0.0 Project Update
25 Oct 2022
By The Apache Arrow PMC (pmc)
DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to:
- Support SQL support
- Support DataFrame API
- Support a Domain Specific Query Language
- Easily and quickly read and process Parquet, JSON, Avro or CSV data.
- Read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
Even though DataFusion is 4 years “young,” it has seen significant community growth in the last few months and the momentum continues to accelerate.
DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a “LLVM for database and AI systems”(alternate link) with announcements such as the release of FaceBook’s Velox engine, the major investments in Acero as well as the continued popularity of Apache Calcite and other similar technologies.
While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features such as the frontend (e.g. dask-sql) or the execution engine, (e.g. Blaze), and some use many different components to build both SQL based and customized DSL based systems such as InfluxDB IOx and VegaFusion.
One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with the high quality and standardized
async ecosystem , as well as its modern dependency management system and wonderful performance.
We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion.
We have also completed the “graduation” of Ballista to its own top-level arrow-ballista repository which decouples the two projects and allows each project to move even faster.
Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
Improved Support for Cloud Object Stores
DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) “out of the box” via the object_store crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.
DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information.
In addition to numerous other small improvements, the following SQL features are now supported:
CUBEgrouping set expressions #2446
SUM DISTINCTaggregate support #2405
NOT INSubqueries by rewriting them to
- Non equality predicates in
More DDL Support
Just as it is important to query, it is also important to give users the ability to define their data sources. We have added:
- Custom / Dynamic table provider factories #3311
SHOW CREATE TABLEfor support for views #2830
Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as
- Optimizations of TopK (queries with a
OFFSETclause): #3527, #2521
- Convert cross joins to inner joins when possible #3482
- Sort preserving
- Improvements in group by and sort performance #2375
Internally the optimizer has been significantly enhanced as well.
- Casting / coercion now happens during logical planning #3185 #3636
- More sophisticated expression analysis and simplification is available
- The parquet reader can now read directly from parquet files on remote object storage #2489 #3051
- Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon).
- Support reading directly from AWS S3 and other object stores via
- Support for
- Expanded support for the
INlist and better built in coercion.
- Expanded support for date/time manipulation such as
date_binbuilt-in function , timestamp
TIMEliteral values #3010, #3110, #3034
- Binary operations (
XOR, etc): #3037 #3420
IS [NOT] UNKNOWN#3235, #3246
With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
- Complete Parquet Pushdown
- Additional date/time support
- Cost models, Nested Join Optimizations, analysis framework #128, #3843, #3845
The DataFusion 9.0.0 and 13.0.0 releases consists of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, that much of the same community helps nurture.
How to Get Involved
Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!
If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.
Check out our Communication Doc on more ways to engage with the community.
Appendix: Contributor Shoutout
To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from
git shortlog -sn 9.0.0..13.0.0 . Thank you all again!
87 Andy Grove 71 Andrew Lamb 29 Kun Liu 29 Kirk Mitchener 17 Wei-Ting Kuo 14 Yang Jiang 12 Raphael Taylor-Davies 11 Batuhan Taskaya 10 Brent Gardner 10 Remzi Yang 10 comphead 10 xudong.w 8 AssHero 7 Ruihang Xia 6 Dan Harris 6 Daniël Heres 6 Ian Alexander Joiner 6 Mike Roberts 6 askoa 4 BaymaxHWY 4 gorkem 4 jakevin 3 George Andronchik 3 Sarah Yurick 3 Stuart Carnie 2 Dalton Modlin 2 Dmitry Patsura 2 JasonLi 2 Jon Mease 2 Marco Neumann 2 yahoNanJing 1 Adilet Sarsembayev 1 Ayush Dattagupta 1 Dezhi Wu 1 Dhamotharan Sritharan 1 Eduard Karacharov 1 Francis Du 1 Harbour Zheng 1 Ismaël Mejía 1 Jack Klamer 1 Jeremy Dyer 1 Jiayu Liu 1 Kamil Konior 1 Liang-Chi Hsieh 1 Martin Grigorov 1 Matthijs Brobbel 1 Mehmet Ozan Kabak 1 Metehan Yıldırım 1 Morgan Cassels 1 Nitish Tiwari 1 Renjie Liu 1 Rito Takeuchi 1 Robert Pack 1 Thomas Cameron 1 Vrishabh 1 Xin Hao 1 Yijie Shen 1 byteink 1 kamille 1 mateuszkj 1 nvartolomei 1 yourenawo 1 Özgür Akkurt