Apache Arrow DataFusion 16.0.0 Project Update


Published 19 Jan 2023
By The Apache Arrow PMC (pmc)

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is targeted primarily at developers creating data intensive analytics, and offers mature SQL support, a DataFrame API, and many extension points.

Systems based on DataFusion perform very well in benchmarks, especially considering they operate directly on parquet files rather than first loading into a specialized format. Some recent highlights include clickbench and the Cloudfuse.io standalone query engines page.

DataFusion is also part of a longer term trend, articulated clearly by Andy Pavlo in his 2022 Databases Retrospective. Database frameworks are proliferating and it is likely that all OLAP DBMSs and other data heavy applications, such as machine learning, will require a vectorized, highly performant query engine in the next 5 years to remain relevant. The only practical way to make such technology so widely available without many millions of dollars of investment is though open source engine such as DataFusion or Velox.

The rest of this post describes the improvements made to DataFusion over the last three months and some hints of where we are heading.

Community Growth

We again saw significant growth in the DataFusion community since our last update. There are some interesting metrics on OSSRank.

The DataFusion 16.0.0 release consists of 543 PRs from 73 distinct contributors, not including all the work that goes into dependencies such as arrow, parquet, and object_store, that much of the same community helps support. Thank you all for your help

Several new systems based on DataFusion were recently added:

Performance 🚀

Performance and efficiency are core values for DataFusion. While there is still a gap between DataFusion and the best of breed, tightly integrated systems such as DuckDB and Polars, DataFusion is closing the gap quickly. Performance highlights from the last three months:

  • Up to 30% Faster Sorting and Merging using the new Row Format
  • Advanced predicate pushdown, directly on parquet, directly from object storage, enabling sub millisecond filtering.
  • 70% faster IN expressions evaluation (#4057)
  • Sort and partition aware optimizations (#3969 and #4691)
  • Filter selectivity analysis (#3868)

Runtime Resource Limits

Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.

In version 16.0.0, it is possible to limit DataFusion’s memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See #3941 for more detail.

SQL Window Functions

SQL Window Functions are useful for a variety of analysis and DataFusion’s implementation support expanded significantly:

  • Custom window frames such as ... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)
  • Unbounded window frames such as ... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING)
  • Support for the NTILE window function (#4676)
  • Support for GROUPS mode (#4155)

Improved Joins

Joins are often the most complicated operations to handle well in analytics systems and DataFusion 16.0.0 offers significant improvements such as

  • Cost based optimizer (CBO) automatically reorders join evaluations, selects algorithms (Merge / Hash), and pick build side based on available statistics and join type (INNER, LEFT, etc) (#4219)
  • Fast non column=column equijoins such as JOIN ON a.x + 5 = b.y
  • Better performance on non-equijoins (#4562)

Streaming Execution

One emerging use case for Datafusion is as a foundation for streaming-first data platforms. An important prerequisite is support for incremental execution for queries that can be computed incrementally.

With this release, DataFusion now supports the following streaming features:

  • Data ingestion from infinite files such as FIFOs (#4694),
  • Detection of pipeline-breaking queries in streaming use cases (#4694),
  • Automatic input swapping for joins so probe side is a data stream (#4694),
  • Intelligent elision of pipeline-breaking sort operations whenever possible (#4691),
  • Incremental execution for more types of queries; e.g. queries involving finite window frames (#4777).

These are a major steps forward, and we plan even more improvements over the next few releases.

Better Support for Distributed Catalogs

16.0.0 has been enhanced support for asynchronous catalogs (#4607) to better support distributed metadata stores such as Delta.io and Apache Iceberg which require asynchronous I/O during planning to access remote catalogs. Previously, DataFusion required synchronous access to all relevant catalog information.

Additional SQL Support

SQL support continues to improve, including some of these highlights:

  • Add TPC-DS query planning regression tests #4719
  • Support for PREPARE statement #4490
  • Automatic coercions ast between Date and Timestamp #4726
  • Support type coercion for timestamp and utf8 #4312
  • Full support for time32 and time64 literal values (ScalarValue) #4156
  • New functions, incuding uuid() #4041, current_time #4054, current_date #4022
  • Compressed CSV/JSON support #3642

The community has also invested in new sqllogic based tests to keep improving DataFusion’s quality with less effort.

Plan Serialization and Substrait

DataFusion now supports serialization of physical plans, with a custom protocol buffers format. In addition, we are adding initial support for Substrait, a Cross-Language Serialization for Relational Algebra

How to Get Involved

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc on more ways to engage with the community.

Appendix: Contributor Shoutout

Here is a list of people who have contributed PRs to this project over the last three releases, derived from git shortlog -sn 13.0.0..16.0.0 . Thank you all!

   113	Andrew Lamb
    58	jakevin
    46	Raphael Taylor-Davies
    30	Andy Grove
    19	Batuhan Taskaya
    19	Remzi Yang
    17	ygf11
    16	Burak
    16	Jeffrey
    16	Marco Neumann
    14	Kun Liu
    12	Yang Jiang
    10	mingmwang
     9	Daniël Heres
     9	Mustafa akur
     9	comphead
     9	mvanschellebeeck
     9	xudong.w
     7	dependabot[bot]
     7	yahoNanJing
     6	Brent Gardner
     5	AssHero
     4	Jiayu Liu
     4	Wei-Ting Kuo
     4	askoa
     3	André Calado Coroado
     3	Jie Han
     3	Jon Mease
     3	Metehan Yıldırım
     3	Nga Tran
     3	Ruihang Xia
     3	baishen
     2	Berkay Şahin
     2	Dan Harris
     2	Dongyan Zhou
     2	Eduard Karacharov
     2	Kikkon
     2	Liang-Chi Hsieh
     2	Marko Milenković
     2	Martin Grigorov
     2	Roman Nozdrin
     2	Tim Van Wassenhove
     2	r.4ntix
     2	unconsolable
     2	unvalley
     1	Ajaya Agrawal
     1	Alexander Spies
     1	ArkashaJavelin
     1	Artjoms Iskovs
     1	BoredPerson
     1	Christian Salvati
     1	Creampanda
     1	Data Psycho
     1	Francis Du
     1	Francis Le Roy
     1	LFC
     1	Marko Grujic
     1	Matt Willian
     1	Matthijs Brobbel
     1	Max Burke
     1	Mehmet Ozan Kabak
     1	Rito Takeuchi
     1	Roman Zeyde
     1	Vrishabh
     1	Zhang Li
     1	ZuoTiJia
     1	byteink
     1	cfraz89
     1	nbr
     1	xxchan
     1	yujie.zhang
     1	zembunia
     1	哇呜哇呜呀咦耶