Apache Arrow DataFusion 16.0.0 Project Update

Published 19 Jan 2023
By The Apache Arrow PMC (pmc)

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data-intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.

Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on Parquet files rather than first loading the data into a specialized format. Some recent highlights include ClickBench and the Cloudfuse.io standalone query engines page.

DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is through open source engines such as DataFusion or Velox.

The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.

Community Growth

We again saw significant growth in the DataFusion community since our last update. There are some interesting metrics on OSSRank.

The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store, which much of the same community helps support. Thank you all for your help!

Several new systems based on DataFusion were recently added:

Performance 🚀

Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:

Up to 30% Faster Sorting and Merging using the new Row Format
Advanced predicate pushdown, directly on parquet, directly from object storage, enabling sub-millisecond filtering
70% faster IN expression evaluation (#4057)
Sort and partition aware optimizations (#3969 and #4691)
Filter selectivity analysis (#3868)

Runtime Resource Limits

Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.

In version 16.0.0, it is possible to limit DataFusion’s memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.
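The idea behind such a limit can be illustrated with a minimal sketch. This is not DataFusion's API: `BoundedPool`, `try_grow`, `shrink`, and `ResourcesExhausted` are hypothetical names chosen for this example. It only shows the general pattern of an operator asking for memory before buffering data and receiving an error, rather than growing without bound, when the budget is used up.

```rust
use std::fmt;

/// Error returned when a reservation would exceed the pool's limit.
/// (Hypothetical type for illustration, not part of DataFusion.)
#[derive(Debug, PartialEq)]
pub struct ResourcesExhausted {
    pub requested: usize,
    pub available: usize,
}

impl fmt::Display for ResourcesExhausted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "cannot reserve {} bytes, only {} bytes available",
            self.requested, self.available
        )
    }
}

/// Hypothetical fixed-size memory budget shared by operators such as
/// sorts and group-bys.
#[derive(Debug)]
pub struct BoundedPool {
    limit: usize,
    used: usize,
}

impl BoundedPool {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Try to reserve `bytes`; fail instead of growing past the limit.
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), ResourcesExhausted> {
        // `used` never exceeds `limit`, so this subtraction cannot underflow.
        let available = self.limit - self.used;
        if bytes > available {
            return Err(ResourcesExhausted { requested: bytes, available });
        }
        self.used += bytes;
        Ok(())
    }

    /// Return `bytes` to the pool, for example once a sort has emitted its output.
    pub fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }

    pub fn used(&self) -> usize {
        self.used
    }
}
```

With a 100 byte budget, reserving 60 bytes succeeds and a further request for 50 bytes is rejected. A spilling operator would react to that error by writing buffered data to secondary storage and calling `shrink`, which is the kind of extension the paragraph above asks for help with.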

SQL Window Functions

SQL Window Functions are useful for a variety of analyses, and DataFusion’s support for them expanded significantly:

Custom window frames such as ... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)
Unbounded window frames such as ... OVER (ORDER BY ... RANGE UNBOUNDED PRECEDING)
Support for the NTILE window function (#4676)
Support for GROUPS mode (#4155)
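As a small illustration of one item in this list, the following sketch shows how NTILE is commonly defined in SQL: rows in a partition are split into `n` buckets whose sizes differ by at most one, with the larger buckets first. The function `ntile` is a hypothetical helper written for this post, not DataFusion's implementation, and it assumes the rows are already ordered.

```rust
/// Assign each of `num_rows` ordered rows to one of `n` buckets, numbered
/// from 1. Bucket sizes differ by at most one and larger buckets come first.
/// (Hypothetical helper for illustration, not DataFusion code.)
pub fn ntile(num_rows: usize, n: usize) -> Vec<usize> {
    assert!(n > 0, "NTILE requires a positive bucket count");
    let base = num_rows / n; // minimum number of rows per bucket
    let extra = num_rows % n; // the first `extra` buckets get one more row
    let mut out = Vec::with_capacity(num_rows);
    for bucket in 1..=n {
        let size = if bucket <= extra { base + 1 } else { base };
        out.extend(std::iter::repeat(bucket).take(size));
    }
    out
}
```

For ten rows and three buckets this assigns four rows to bucket 1 and three rows each to buckets 2 and 3. When there are fewer rows than buckets, the trailing buckets simply stay empty.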

Improved Joins

Joins are often the most complicated operations to handle well in analytics systems, and DataFusion 16.0.0 offers significant improvements such as:

Cost based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and picks the build side based on available statistics and join type (INNER, LEFT, etc) (#4219)
Fast non column=column equijoins such as JOIN ON a.x + 5 = b.y
Better performance on non-equijoins (#4562)
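To give a feel for the build side selection mentioned in the first item, here is a deliberately simplified sketch. It assumes that the left input is the build side of a hash join and that the only statistic available is an optional row count. `InputStats`, `JoinType`, and `choose_build_side` are hypothetical names for this example and do not reflect the actual optimizer rule.

```rust
/// Hypothetical per-input statistics; `None` means the row count is unknown.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct InputStats {
    pub num_rows: Option<usize>,
}

/// A reduced set of join types, enough to show why the type matters.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
}

impl JoinType {
    /// Swapping the inputs of a LEFT join turns it into a RIGHT join and
    /// vice versa; an INNER join is symmetric.
    pub fn swap(self) -> JoinType {
        match self {
            JoinType::Inner => JoinType::Inner,
            JoinType::Left => JoinType::Right,
            JoinType::Right => JoinType::Left,
        }
    }
}

/// Decide whether to swap the inputs so that the smaller one becomes the
/// build side (assumed here to be the left input).
/// Returns `(swapped, resulting join type)`.
pub fn choose_build_side(
    left: InputStats,
    right: InputStats,
    join_type: JoinType,
) -> (bool, JoinType) {
    match (left.num_rows, right.num_rows) {
        // Only swap when both sizes are known and the left side is larger.
        (Some(l), Some(r)) if l > r => (true, join_type.swap()),
        _ => (false, join_type),
    }
}
```

The sketch leaves the plan untouched whenever a row count is missing, which mirrors the general point that such decisions depend on the statistics that are available.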

Streaming Execution

One emerging use case for DataFusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.

With this release, DataFusion now supports the following streaming features:

Data ingestion from infinite files such as FIFOs (#4694),
Detection of pipeline-breaking queries in streaming use cases (#4694),
Automatic input swapping for joins so the probe side is a data stream (#4694),
Intelligent elision of pipeline-breaking sort operations whenever possible (#4691),
Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).

These are major steps forward, and we plan even more improvements over the next few releases.
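The notion of a pipeline-breaking query can be sketched with a toy plan tree. The `Plan` enum and `check_pipeline` function below are hypothetical and far simpler than DataFusion's physical plans. They only illustrate the underlying reasoning: an operator that must consume all of its input before producing output, such as a full sort, cannot run over an input that never ends.

```rust
/// Hypothetical, heavily simplified plan tree (not DataFusion's plan types).
#[derive(Debug)]
pub enum Plan {
    /// A source; `unbounded` is true for inputs such as FIFOs that never end.
    Scan { unbounded: bool },
    /// A filter processes rows one batch at a time and never blocks.
    Filter(Box<Plan>),
    /// A full sort must see all of its input before emitting a row.
    Sort(Box<Plan>),
}

/// Returns `Ok(true)` if the plan's output is unbounded, `Ok(false)` if it is
/// finite, and `Err` if a pipeline-breaking operator sits on an unbounded input.
pub fn check_pipeline(plan: &Plan) -> Result<bool, String> {
    match plan {
        Plan::Scan { unbounded } => Ok(*unbounded),
        Plan::Filter(input) => check_pipeline(input),
        Plan::Sort(input) => {
            if check_pipeline(input)? {
                Err("Sort over an unbounded input can never produce output".to_string())
            } else {
                Ok(false)
            }
        }
    }
}
```

A filter over an unbounded scan passes the check and is reported as unbounded, while a sort placed anywhere above that scan is rejected. A sort over a finite scan is accepted as usual.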

Better Support for Distributed Catalogs

16.0.0 has enhanced support for asynchronous catalogs (#4607) to better support distributed metadata stores such as Delta.io and Apache Iceberg, which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.

Additional SQL Support

SQL support continues to improve, including some of these highlights:

Add TPC-DS query planning regression tests #4719
Support for PREPARE statement #4490
Automatic coercions between Date and Timestamp #4726
Support type coercion for timestamp and utf8 #4312
Full support for time32 and time64 literal values (ScalarValue) #4156
New functions, including uuid() #4041, current_time #4054, current_date #4022
Compressed CSV/JSON support #3642

The community has also invested in new sqllogic based tests to keep improving DataFusion’s quality with less effort.

Plan Serialization and Substrait

DataFusion now supports serialization of physical plans, using a custom Protocol Buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra.

How to Get Involved

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc for more ways to engage with the community.

Appendix: Contributor Shoutout

Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0. Thank you all!

Andrew Lamb
jakevin
Raphael Taylor-Davies
Andy Grove
Batuhan Taskaya
Remzi Yang
ygf11
Burak
Jeffrey
Marco Neumann
Kun Liu
Yang Jiang
mingmwang
Daniël Heres
Mustafa akur
comphead
mvanschellebeeck
xudong.w
dependabot[bot]
yahoNanJing
Brent Gardner
AssHero
Jiayu Liu
Wei-Ting Kuo
askoa
André Calado Coroado
Jie Han
Jon Mease
Metehan Yıldırım
Nga Tran
Ruihang Xia
baishen
Berkay Şahin
Dan Harris
Dongyan Zhou
Eduard Karacharov
Kikkon
Liang-Chi Hsieh
Marko Milenković
Martin Grigorov
Roman Nozdrin
Tim Van Wassenhove
r.4ntix
unconsolable
unvalley
Ajaya Agrawal
Alexander Spies
ArkashaJavelin
Artjoms Iskovs
BoredPerson
Christian Salvati
Creampanda
Data Psycho
Francis Du
Francis Le Roy
LFC
Marko Grujic
Matt Willian
Matthijs Brobbel
Max Burke
Mehmet Ozan Kabak
Rito Takeuchi
Roman Zeyde
Vrishabh
Zhang Li
ZuoTiJia
byteink
cfraz89
nbr
xxchan
yujie.zhang
zembunia
哇呜哇呜呀咦耶