Apache Arrow DataFusion 13.0.0 Project Update


Published 25 Oct 2022
By The Apache Arrow PMC (pmc)

Introduction

Apache Arrow DataFusion 13.0.0 is released, and this blog contains an update on the project for the 5 months since our last update in May 2022.

DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to:

  • Support SQL support
  • Support DataFrame API
  • Support a Domain Specific Query Language
  • Easily and quickly read and process Parquet, JSON, Avro or CSV data.
  • Read from remote object stores such as AWS S3, Azure Blob Storage, GCP.

Even though DataFusion is 4 years “young,” it has seen significant community growth in the last few months and the momentum continues to accelerate.

Background

DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a “LLVM for database and AI systems”(alternate link) with announcements such as the release of FaceBook’s Velox engine, the major investments in Acero as well as the continued popularity of Apache Calcite and other similar technologies.

While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features such as the frontend (e.g. dask-sql) or the execution engine, (e.g. Blaze), and some use many different components to build both SQL based and customized DSL based systems such as InfluxDB IOx and VegaFusion.

One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with the high quality and standardized async ecosystem , as well as its modern dependency management system and wonderful performance.

Summary

We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion.

We have also completed the “graduation” of Ballista to its own top-level arrow-ballista repository which decouples the two projects and allows each project to move even faster.

Along with numerous other bug fixes and smaller improvements, here are some of the major advances:

Improved Support for Cloud Object Stores

DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) “out of the box” via the object_store crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.

Advanced SQL

DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information.

In addition to numerous other small improvements, the following SQL features are now supported:

  • ROWS, RANGE, PRECEDING and FOLLOWING in OVER clauses #3570
  • ROLLUP and CUBE grouping set expressions #2446
  • SUM DISTINCT aggregate support #2405
  • IN and NOT IN Subqueries by rewriting them to SEMI / ANTI #2421
  • Non equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591
  • Exact MEDIAN #3009
  • GROUPING SETS/CUBE/ROLLUP #2716

More DDL Support

Just as it is important to query, it is also important to give users the ability to define their data sources. We have added:

  • CREATE VIEW #2279
  • DESCRIBE <table> #2642
  • Custom / Dynamic table provider factories #3311
  • SHOW CREATE TABLE for support for views #2830

Faster Execution

Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as

  • Optimizations of TopK (queries with a LIMIT or OFFSET clause): #3527, #2521
  • Reduce left/right/full joins to inner join #2750
  • Convert cross joins to inner joins when possible #3482
  • Sort preserving SortMergeJoin #2699
  • Improvements in group by and sort performance #2375
  • Adaptive regex_replace implementation #3518

Optimizer Enhancements

Internally the optimizer has been significantly enhanced as well.

  • Casting / coercion now happens during logical planning #3185 #3636
  • More sophisticated expression analysis and simplification is available

Parquet

  • The parquet reader can now read directly from parquet files on remote object storage #2489 #3051
  • Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon).
  • Support reading directly from AWS S3 and other object stores via datafusion-cli #3631

DataType Support

  • Support for TimestampTz #3660
  • Expanded support for the Decimal type, including IN list and better built in coercion.
  • Expanded support for date/time manipulation such as date_bin built-in function , timestamp +/- interval, TIME literal values #3010, #3110, #3034
  • Binary operations (AND, XOR, etc): #3037 #3420
  • IS TRUE/FALSE and IS [NOT] UNKNOWN #3235, #3246

Upcoming Work

With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:

Community Growth

The DataFusion 9.0.0 and 13.0.0 releases consists of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, that much of the same community helps nurture.

How to Get Involved

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc on more ways to engage with the community.

Appendix: Contributor Shoutout

To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from git shortlog -sn 9.0.0..13.0.0 . Thank you all again!

    87	Andy Grove
    71	Andrew Lamb
    29	Kun Liu
    29	Kirk Mitchener
    17	Wei-Ting Kuo
    14	Yang Jiang
    12	Raphael Taylor-Davies
    11	Batuhan Taskaya
    10	Brent Gardner
    10	Remzi Yang
    10	comphead
    10	xudong.w
     8	AssHero
     7	Ruihang Xia
     6	Dan Harris
     6	Daniël Heres
     6	Ian Alexander Joiner
     6	Mike Roberts
     6	askoa
     4	BaymaxHWY
     4	gorkem
     4	jakevin
     3	George Andronchik
     3	Sarah Yurick
     3	Stuart Carnie
     2	Dalton Modlin
     2	Dmitry Patsura
     2	JasonLi
     2	Jon Mease
     2	Marco Neumann
     2	yahoNanJing
     1	Adilet Sarsembayev
     1	Ayush Dattagupta
     1	Dezhi Wu
     1	Dhamotharan Sritharan
     1	Eduard Karacharov
     1	Francis Du
     1	Harbour Zheng
     1	Ismaël Mejía
     1	Jack Klamer
     1	Jeremy Dyer
     1	Jiayu Liu
     1	Kamil Konior
     1	Liang-Chi Hsieh
     1	Martin Grigorov
     1	Matthijs Brobbel
     1	Mehmet Ozan Kabak
     1	Metehan Yıldırım
     1	Morgan Cassels
     1	Nitish Tiwari
     1	Renjie Liu
     1	Rito Takeuchi
     1	Robert Pack
     1	Thomas Cameron
     1	Vrishabh
     1	Xin Hao
     1	Yijie Shen
     1	byteink
     1	kamille
     1	mateuszkj
     1	nvartolomei
     1	yourenawo
     1	Özgür Akkurt