Apache Arrow DataFusion 13.0.0 Project Update

Published 25 Oct 2022
By The Apache Arrow PMC (pmc)

Introduction

Apache Arrow DataFusion 13.0.0 has been released, and this post provides an update on the project covering the 5 months since our last update in May 2022.

DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project to:

Support SQL
Support a DataFrame API
Support a Domain Specific Query Language
Easily and quickly read and process Parquet, JSON, Avro or CSV data
Read from remote object stores such as AWS S3, Azure Blob Storage, and Google Cloud Storage

Even though DataFusion is 4 years “young,” it has seen significant community growth in the last few months and the momentum continues to accelerate.

Background

DataFusion is used as the engine in many open source and commercial projects and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such an “LLVM for database and AI systems” with announcements such as the release of Facebook’s Velox engine, the major investments in Acero, as well as the continued popularity of Apache Calcite and other similar technologies.

While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and extension points for just about everything. Some DataFusion users use a subset of the features, such as the frontend (e.g. dask-sql) or the execution engine (e.g. Blaze), and some use many different components to build both SQL-based and customized DSL-based systems such as InfluxDB IOx and VegaFusion.

One of DataFusion’s advantages is its implementation in Rust and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the ease of parallelization with its high-quality, standardized async ecosystem to its modern dependency management system and wonderful performance.

Summary

We have increased the frequency of DataFusion releases to monthly instead of quarterly. This makes it easier for the increasing number of projects that now depend on DataFusion.

We have also completed the “graduation” of Ballista to its own top-level arrow-ballista repository, which decouples the two projects and allows each project to move even faster.

Along with numerous other bug fixes and smaller improvements, here are some of the major advances:

Improved Support for Cloud Object Stores

DataFusion now supports many major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) “out of the box” via the object_store crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.

Advanced SQL

DataFusion now supports correlated subqueries, by rewriting them as joins. See the Subquery page in the User Guide for more information.
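To illustrate the general idea of rewriting a correlated subquery as a join, here is a minimal sketch in plain Rust. It is not DataFusion's implementation: the `Order` type, both function names, and the example query are hypothetical. The query in mind is "orders whose amount exceeds the average amount for the same customer". The first function evaluates the subquery once per outer row; the second aggregates once per customer and then joins on the customer key.

```rust
use std::collections::HashMap;

// Hypothetical row type used only for this illustration.
struct Order {
    id: u32,
    customer: u32,
    amount: f64,
}

/// Correlated form: for every outer row, rescan the table to compute the
/// average amount for that row's customer.
fn correlated(orders: &[Order]) -> Vec<u32> {
    orders
        .iter()
        .filter(|o| {
            let (sum, n) = orders
                .iter()
                .filter(|i| i.customer == o.customer)
                .fold((0.0, 0u32), |(s, n), i| (s + i.amount, n + 1));
            o.amount > sum / f64::from(n)
        })
        .map(|o| o.id)
        .collect()
}

/// Decorrelated form: aggregate once (like GROUP BY customer), then join
/// each outer row to its customer's aggregate.
fn decorrelated(orders: &[Order]) -> Vec<u32> {
    let mut agg: HashMap<u32, (f64, u32)> = HashMap::new();
    for o in orders {
        let e = agg.entry(o.customer).or_insert((0.0, 0));
        e.0 += o.amount;
        e.1 += 1;
    }
    orders
        .iter()
        .filter(|o| {
            let (sum, n) = agg[&o.customer];
            o.amount > sum / f64::from(n)
        })
        .map(|o| o.id)
        .collect()
}
```

Both functions return the same rows, but the second does a single aggregation pass instead of one scan per outer row, which is the usual motivation for this kind of rewrite.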

In addition to numerous other small improvements, the following SQL features are now supported:

ROWS, RANGE, PRECEDING and FOLLOWING in OVER clauses #3570
ROLLUP and CUBE grouping set expressions #2446
SUM DISTINCT aggregate support #2405
IN and NOT IN Subqueries by rewriting them to SEMI / ANTI #2421
Non equality predicates in ON clause of LEFT, RIGHT, and FULL joins #2591
Exact MEDIAN #3009
GROUPING SETS/CUBE/ROLLUP #2716
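As a reminder of what ROLLUP and CUBE mean in terms of grouping sets, the following plain-Rust sketch expands a column list into the grouping sets each one stands for. The function names `rollup` and `cube` are hypothetical helpers for this illustration and are not DataFusion APIs.

```rust
/// ROLLUP(a, b, c) stands for the grouping sets (a, b, c), (a, b), (a), ():
/// every prefix of the column list, longest first.
fn rollup<'a>(cols: &[&'a str]) -> Vec<Vec<&'a str>> {
    (0..=cols.len()).rev().map(|n| cols[..n].to_vec()).collect()
}

/// CUBE(a, b) stands for every subset of the columns: (a, b), (a), (b), ().
/// This sketch assumes fewer than 32 columns so the bitmask fits in a u32.
fn cube<'a>(cols: &[&'a str]) -> Vec<Vec<&'a str>> {
    let n = cols.len();
    // Each bitmask selects one subset; count down from "all columns" to "none".
    (0..(1u32 << n))
        .rev()
        .map(|mask| {
            cols.iter()
                .enumerate()
                .filter(|(i, _)| (mask & (1 << (n - 1 - i))) != 0)
                .map(|(_, c)| *c)
                .collect::<Vec<&'a str>>()
        })
        .collect()
}
```

For example, `rollup(&["a", "b"])` yields three grouping sets, while `cube(&["a", "b"])` yields four because it also includes `(b)` on its own.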

More DDL Support

Just as it is important to query, it is also important to give users the ability to define their data sources. We have added:

CREATE VIEW #2279
DESCRIBE <table> #2642
Custom / Dynamic table provider factories #3311
SHOW CREATE TABLE support for views #2830

Faster Execution

Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as:

Optimizations of TopK (queries with a LIMIT or OFFSET clause): #3527, #2521
Reduce left/right/full joins to inner join #2750
Convert cross joins to inner joins when possible #3482
Sort preserving SortMergeJoin #2699
Improvements in group by and sort performance #2375
Adaptive regex_replace implementation #3518
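To give a feel for why TopK queries can be answered without sorting the whole input, here is a minimal plain-Rust sketch of one common technique, a bounded max-heap that keeps only the best `k` values seen so far. This is an illustration under our own assumptions and not necessarily what the linked PRs implement; the function name `top_k_smallest` is hypothetical.

```rust
use std::collections::BinaryHeap;

/// Return the `k` smallest values in ascending order, as a query like
/// `SELECT v FROM t ORDER BY v LIMIT k` would, without sorting all input.
fn top_k_smallest(values: &[i64], k: usize) -> Vec<i64> {
    if k == 0 {
        return Vec::new();
    }
    // Max-heap holding the k smallest values seen so far; its top is the
    // largest of them, i.e. the first candidate to evict.
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);
    for &v in values {
        if heap.len() < k {
            heap.push(v);
        } else if let Some(&max) = heap.peek() {
            if v < max {
                heap.pop();
                heap.push(v);
            }
        }
    }
    heap.into_sorted_vec()
}
```

The heap never holds more than `k` elements, so memory use depends on the limit rather than on the size of the input.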

Optimizer Enhancements

Internally the optimizer has been significantly enhanced as well.

Casting / coercion now happens during logical planning #3185 #3636
More sophisticated expression analysis and simplification is available

Parquet

The parquet reader can now read directly from parquet files on remote object storage #2489 #3051
Experimental support for “predicate pushdown” with late materialization after filtering during the scan (another blog post on this topic is coming soon).
Support reading directly from AWS S3 and other object stores via datafusion-cli #3631

DataType Support

Support for TimestampTz #3660
Expanded support for the Decimal type, including IN list and better built in coercion.
Expanded support for date/time manipulation such as the date_bin built-in function, timestamp +/- interval, and TIME literal values #3010, #3110, #3034
Binary operations (AND, XOR, etc): #3037 #3420
IS TRUE/FALSE and IS [NOT] UNKNOWN #3235, #3246
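The idea behind binning timestamps with a function like date_bin can be sketched with integer arithmetic. The snippet below is a simplified illustration and not DataFusion's code: `date_bin_sketch` is a hypothetical name, and it assumes the timestamp, stride, and origin are all plain integers in the same unit (for example, seconds).

```rust
/// Round `source` down to the start of its `stride`-wide bin, where bins
/// are aligned to `origin`. All three values share one unit (an assumption
/// of this sketch).
fn date_bin_sketch(stride: i64, source: i64, origin: i64) -> i64 {
    assert!(stride > 0, "stride must be positive");
    // rem_euclid keeps the offset non-negative, so timestamps before the
    // origin are also rounded down rather than toward the origin.
    let offset = (source - origin).rem_euclid(stride);
    source - offset
}
```

With a stride of 15 and an origin of 0, a timestamp of 37 falls into the bin starting at 30, and a timestamp of -1 falls into the bin starting at -15.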

Upcoming Work

With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:

Complete Parquet Pushdown
Additional date/time support
Cost models, Nested Join Optimizations, analysis framework #128, #3843, #3845

Community Growth

The releases between DataFusion 9.0.0 and 13.0.0 consist of 433 PRs from 64 distinct contributors. This does not count all the work that goes into our dependencies such as arrow, parquet, and object_store, which much of the same community helps nurture.

How to Get Involved

Kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!

If you are interested in contributing to DataFusion, we would love to have you join us on our journey to create the most advanced open source query engine. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc for more ways to engage with the community.

Appendix: Contributor Shoutout

To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from git shortlog -sn 9.0.0..13.0.0. Thank you all again!

Andy Grove
Andrew Lamb
Kun Liu
Kirk Mitchener
Wei-Ting Kuo
Yang Jiang
Raphael Taylor-Davies
Batuhan Taskaya
Brent Gardner
Remzi Yang
comphead
xudong.w
AssHero
Ruihang Xia
Dan Harris
Daniël Heres
Ian Alexander Joiner
Mike Roberts
askoa
BaymaxHWY
gorkem
jakevin
George Andronchik
Sarah Yurick
Stuart Carnie
Dalton Modlin
Dmitry Patsura
JasonLi
Jon Mease
Marco Neumann
yahoNanJing
Adilet Sarsembayev
Ayush Dattagupta
Dezhi Wu
Dhamotharan Sritharan
Eduard Karacharov
Francis Du
Harbour Zheng
Ismaël Mejía
Jack Klamer
Jeremy Dyer
Jiayu Liu
Kamil Konior
Liang-Chi Hsieh
Martin Grigorov
Matthijs Brobbel
Mehmet Ozan Kabak
Metehan Yıldırım
Morgan Cassels
Nitish Tiwari
Renjie Liu
Rito Takeuchi
Robert Pack
Thomas Cameron
Vrishabh
Xin Hao
Yijie Shen
byteink
kamille
mateuszkj
nvartolomei
yourenawo
Özgür Akkurt