Apache Arrow DataFusion 16.0.0 Project Update
19 Jan 2023
By The Apache Arrow PMC (pmc)
DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.
Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on Parquet files rather than first loading data into a specialized format. Some recent highlights include ClickBench and the Cloudfuse.io standalone query engines page.
DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is through open source engines such as DataFusion or Velox.
The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.
The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store that much of the same community helps support. Thank you all for your help!
Several new systems based on DataFusion were recently added:
Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:
- Up to 30% Faster Sorting and Merging using the new Row Format
- Advanced predicate pushdown, directly on Parquet, directly from object storage, enabling sub-millisecond filtering.
- IN expressions evaluation (#4057)
- Sort and partition aware optimizations (#3969 and #4691)
- Filter selectivity analysis (#3868)
Runtime Resource Limits
Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
In version 16.0.0, it is possible to limit DataFusion’s memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.
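The idea of a memory limit can be sketched as a pool that operators must ask before growing their buffers. The following is a minimal, stdlib-only illustration; `BoundedPool`, `try_grow`, and `shrink` are hypothetical names chosen for this sketch and are not DataFusion's actual API.

```rust
/// Hypothetical fixed-size memory pool (illustrative only, not DataFusion's API).
/// Operators such as sorts or group-bys would reserve bytes before buffering data.
#[derive(Debug)]
pub struct BoundedPool {
    limit: usize,
    used: usize,
}

impl BoundedPool {
    /// Create a pool that allows at most `limit` bytes to be reserved at once.
    pub fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Try to reserve `bytes`. On failure the pool is left unchanged, so the
    /// caller can fail the query (or, in a fuller design, spill to disk).
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        match self.used.checked_add(bytes) {
            Some(total) if total <= self.limit => {
                self.used = total;
                Ok(())
            }
            _ => Err(format!(
                "cannot reserve {} bytes: {} of {} bytes already in use",
                bytes, self.used, self.limit
            )),
        }
    }

    /// Return `bytes` to the pool, saturating at zero.
    pub fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }

    /// Bytes currently reserved.
    pub fn used(&self) -> usize {
        self.used
    }
}
```

With a limit of 100 bytes, a reservation of 60 succeeds and a further reservation of 50 is rejected until some memory is returned with `shrink`.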
SQL Window Functions
SQL Window Functions are useful for a variety of analyses, and DataFusion’s support for them expanded significantly:
- Custom window frames such as
... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)
- Unbounded window frames such as
... OVER (ORDER BY ... RANGE UNBOUNDED PRECEDING)
- Support for the NTILE window function (#4676)
- Support for
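To make the semantics of a custom RANGE frame concrete, here is a small stdlib-only sketch of what `SUM(v) OVER (ORDER BY k RANGE BETWEEN p PRECEDING AND f FOLLOWING)` computes. The function name `range_frame_sum` is hypothetical, and the sketch assumes the keys are already sorted ascending and that both offsets are non-negative; it is not how DataFusion implements window frames.

```rust
/// For each row i, sum the values of all rows whose ORDER BY key lies in
/// [keys[i] - preceding, keys[i] + following].
/// Assumes `keys` is sorted ascending and both offsets are >= 0.
pub fn range_frame_sum(
    keys: &[f64],
    values: &[f64],
    preceding: f64,
    following: f64,
) -> Vec<f64> {
    assert_eq!(keys.len(), values.len());
    let n = keys.len();
    let mut out = Vec::with_capacity(n);
    // The current frame is the half-open row range start..end. Because the
    // keys are sorted, both edges only ever move forward.
    let (mut start, mut end) = (0usize, 0usize);
    for i in 0..n {
        let lo = keys[i] - preceding;
        let hi = keys[i] + following;
        // Extend the right edge to include every key <= hi.
        while end < n && keys[end] <= hi {
            end += 1;
        }
        // Advance the left edge past every key < lo.
        while start < end && keys[start] < lo {
            start += 1;
        }
        out.push(values[start..end].iter().sum::<f64>());
    }
    out
}
```

Unlike a ROWS frame, the number of rows in each frame here depends on how close the neighboring key values are, not on a fixed row count.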
Joins are often the most complicated operations to handle well in analytics systems, and DataFusion 16.0.0 offers significant improvements such as:
- Cost based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and picks the build side based on available statistics and join type (LEFT, etc.) (#4219)
- Fast non column=column equijoins such as
JOIN ON a.x + 5 = b.y
- Better performance on non-equijoins (#4562)
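A join condition like `a.x + 5 = b.y` is still an equality, so it can be evaluated with a hash join once each side's key expression is computed per row. The sketch below illustrates that idea with plain slices; `hash_join_on_expr` and its closure parameters are hypothetical names for this example, and it always builds on the left input rather than choosing a build side from statistics.

```rust
use std::collections::HashMap;

/// Inner hash join where each side's join key is an arbitrary expression.
/// Returns (left_row, right_row) index pairs in probe order.
pub fn hash_join_on_expr(
    left: &[i64],
    right: &[i64],
    left_key: impl Fn(i64) -> i64,
    right_key: impl Fn(i64) -> i64,
) -> Vec<(usize, usize)> {
    // Build phase: hash the evaluated left expression to the matching row indices.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (i, &x) in left.iter().enumerate() {
        table.entry(left_key(x)).or_default().push(i);
    }
    // Probe phase: evaluate the right expression and look up matching rows.
    let mut matches = Vec::new();
    for (j, &y) in right.iter().enumerate() {
        if let Some(rows) = table.get(&right_key(y)) {
            for &i in rows {
                matches.push((i, j));
            }
        }
    }
    matches
}
```

For example, `hash_join_on_expr(&a_x, &b_y, |x| x + 5, |y| y)` pairs up the rows satisfying `a.x + 5 = b.y` without comparing every row of one side against every row of the other.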
One emerging use case for DataFusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.
With this release, DataFusion now supports the following streaming features:
- Data ingestion from infinite files such as FIFOs (#4694),
- Detection of pipeline-breaking queries in streaming use cases (#4694),
- Automatic input swapping for joins so probe side is a data stream (#4694),
- Intelligent elision of pipeline-breaking sort operations whenever possible (#4691),
- Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).
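The interaction between pipeline-breaking detection and input swapping can be illustrated with a simplified model of a hash join, in which the build side must be fully consumed before any output is produced. The types and the `plan_hash_join` function below are hypothetical names for this sketch, which assumes the left input is the build side by default and ignores join-type restrictions on swapping; it is not DataFusion's planner code.

```rust
/// Whether an input is a finite table or an infinite stream (e.g. a FIFO).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Boundedness {
    Bounded,
    Unbounded,
}

/// Planning decision for a hash join whose left input is the default build side.
#[derive(Debug, PartialEq, Eq)]
pub enum JoinPlan {
    KeepOrder,
    SwapInputs,
}

/// Decide how to run a hash join so that any infinite input is on the probe side.
/// Returns an error when the join would be pipeline-breaking in this model.
pub fn plan_hash_join(left: Boundedness, right: Boundedness) -> Result<JoinPlan, String> {
    match (left, right) {
        // The build side is finite, so it can be hashed completely and the
        // other side (finite or infinite) can be probed incrementally.
        (Boundedness::Bounded, _) => Ok(JoinPlan::KeepOrder),
        // The build side is infinite but the probe side is finite: swap them.
        (Boundedness::Unbounded, Boundedness::Bounded) => Ok(JoinPlan::SwapInputs),
        // Neither side can ever be fully consumed, so no output could be produced.
        (Boundedness::Unbounded, Boundedness::Unbounded) => Err(
            "pipeline-breaking: a hash join cannot build on an unbounded input".to_string(),
        ),
    }
}
```

Rejecting the last case at planning time is preferable to starting a query that would buffer forever.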
These are major steps forward, and we plan even more improvements over the next few releases.
Better Support for Distributed Catalogs
DataFusion 16.0.0 has enhanced support for asynchronous catalogs (#4607) to better support distributed metadata stores such as Delta.io and Apache Iceberg, which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.
Additional SQL Support
SQL support continues to improve, including some of these highlights:
- Add TPC-DS query planning regression tests #4719
- Support for
- Automatic coercions between Date and Timestamp #4726
- Support type coercion for timestamp and utf8 #4312
- Full support for time32 and time64 literal values (
- New functions, including
- Compressed CSV/JSON support #3642
The community has also invested in new sqllogic based tests to keep improving DataFusion’s quality with less effort.
Plan Serialization and Substrait
DataFusion now supports serialization of physical plans, with a custom protocol buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra.
How to Get Involved
Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!
If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.
Check out our Communication Doc for more ways to engage with the community.
Appendix: Contributor Shoutout
Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0. Thank you all!
113 Andrew Lamb
58 jakevin
46 Raphael Taylor-Davies
30 Andy Grove
19 Batuhan Taskaya
19 Remzi Yang
17 ygf11
16 Burak
16 Jeffrey
16 Marco Neumann
14 Kun Liu
12 Yang Jiang
10 mingmwang
9 Daniël Heres
9 Mustafa akur
9 comphead
9 mvanschellebeeck
9 xudong.w
7 dependabot[bot]
7 yahoNanJing
6 Brent Gardner
5 AssHero
4 Jiayu Liu
4 Wei-Ting Kuo
4 askoa
3 André Calado Coroado
3 Jie Han
3 Jon Mease
3 Metehan Yıldırım
3 Nga Tran
3 Ruihang Xia
3 baishen
2 Berkay Şahin
2 Dan Harris
2 Dongyan Zhou
2 Eduard Karacharov
2 Kikkon
2 Liang-Chi Hsieh
2 Marko Milenković
2 Martin Grigorov
2 Roman Nozdrin
2 Tim Van Wassenhove
2 r.4ntix
2 unconsolable
2 unvalley
1 Ajaya Agrawal
1 Alexander Spies
1 ArkashaJavelin
1 Artjoms Iskovs
1 BoredPerson
1 Christian Salvati
1 Creampanda
1 Data Psycho
1 Francis Du
1 Francis Le Roy
1 LFC
1 Marko Grujic
1 Matt Willian
1 Matthijs Brobbel
1 Max Burke
1 Mehmet Ozan Kabak
1 Rito Takeuchi
1 Roman Zeyde
1 Vrishabh
1 Zhang Li
1 ZuoTiJia
1 byteink
1 cfraz89
1 nbr
1 xxchan
1 yujie.zhang
1 zembunia
1 哇呜哇呜呀咦耶