Apache Arrow DataFusion 6.0.0 Release


Published 19 Nov 2021
By The Apache Arrow PMC (pmc)

Introduction

DataFusion is an embedded query engine which leverages the unique features of Rust and Apache Arrow to provide a system that is high performance, easy to connect, easy to embed, and high quality.
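
For example, embedding DataFusion in a Rust program and running a SQL query over a CSV file takes only a few lines. The sketch below assumes the 6.0 ExecutionContext API and an example.csv file with columns a and b; the path and column names are illustrative.

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        // Create an execution context and register a CSV file as a table;
        // the schema is inferred automatically.
        let mut ctx = ExecutionContext::new();
        ctx.register_csv("example", "example.csv", CsvReadOptions::new())
            .await?;

        // Plan a SQL query against the registered table (planning is async as of 6.0).
        let df = ctx
            .sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")
            .await?;

        // Execute the plan and print the results.
        df.show().await?;
        Ok(())
    }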

The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers 4 months of development work and includes 134 commits from the following 28 distinct contributors.

    28  Andrew Lamb
    26  Jiayu Liu
    13  xudong963
     9  rdettai
     9  QP Hou
     6  Matthew Turner
     5  Daniël Heres
     4  Guillaume Balaine
     3  Francis Du
     3  Marco Neumann
     3  Jon Mease
     3  Nga Tran
     2  Yijie Shen
     2  Ruihang Xia
     2  Liang-Chi Hsieh
     2  baishen
     2  Andy Grove
     2  Jason Tianyi Wang
     1  Nan Zhu
     1  Antoine Wendlinger
     1  Krisztián Szűcs
     1  Mike Seddon
     1  Conner Murphy
     1  Patrick More
     1  Taehoon Moon
     1  Tiphaine Ruy
     1  adsharma
     1  lichuan6

The release notes below are not exhaustive and only cover selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the complete changelog.

New Website

Befitting a growing project, DataFusion now has its own website, hosted as part of the main Apache Arrow website.

Roadmap

For the first time, the community has gathered its thoughts about where DataFusion is headed into a public Roadmap.

New Features

  • Runtime operator metrics collection framework
  • Object store abstraction for unified access to local or remote storage
  • Hive style table partitioning support for Parquet, CSV, Avro, and JSON files
  • DataFrame API support for: except, intersect, show, limit and window functions (see the sketch below)
  • SQL (several of the items below appear in the sketch after this list)
    • EXPLAIN ANALYZE with runtime metrics
    • trim ( [ LEADING | TRAILING | BOTH ] [ FROM ] string text [, characters text ] ) syntax
    • Postgres style regular expression matching operators ~, ~*, !~, and !~*
    • SQL set operators UNION, INTERSECT, and EXCEPT
    • cume_dist, percent_rank window functions
    • digest, blake2s, blake2b, blake3 crypto functions
    • HyperLogLog based approx_distinct
    • IS DISTINCT FROM and IS NOT DISTINCT FROM
    • CREATE TABLE AS SELECT
    • Accessing elements of nested Struct and List columns (e.g. SELECT struct_column['field_name'], array_column[0] FROM ...)
    • Boolean expressions in CASE statement
    • DROP TABLE
    • VALUES List
  • Support for Avro format
  • Support for ScalarValue::Struct
  • Automatic schema inference for CSV files
  • Better interactive editing support in datafusion-cli as well as psql style commands such as \d, \?, and \q
  • Generic constant evaluation and simplification framework
  • Common subexpression elimination query plan optimization rule
  • Python bindings 0.4.0 with all DataFusion 6.0.0 features
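
A few of the new SQL capabilities in action: the rough sketch below reuses an ExecutionContext as shown earlier and assumes a hypothetical users table (columns name, email, and tags) has already been registered; the admins table it creates is likewise illustrative.

    use datafusion::error::Result;
    use datafusion::prelude::*;

    /// Exercises a handful of the new SQL features against a hypothetical,
    /// already-registered `users` table.
    async fn demo_new_sql(ctx: &mut ExecutionContext) -> Result<()> {
        // Postgres style regular expression matching (~, ~*, !~, !~*).
        ctx.sql("SELECT name FROM users WHERE email ~* '@example\\.org$'")
            .await?
            .show()
            .await?;

        // HyperLogLog based approximate distinct count.
        ctx.sql("SELECT approx_distinct(email) FROM users")
            .await?
            .show()
            .await?;

        // NULL-safe comparison.
        ctx.sql("SELECT * FROM users WHERE name IS DISTINCT FROM 'unknown'")
            .await?
            .show()
            .await?;

        // Element access on a nested List column.
        ctx.sql("SELECT tags[0] FROM users").await?.show().await?;

        // CREATE TABLE AS SELECT, then inspect runtime metrics with EXPLAIN ANALYZE.
        ctx.sql("CREATE TABLE admins AS SELECT * FROM users WHERE tags[0] = 'admin'")
            .await?;
        ctx.sql("EXPLAIN ANALYZE SELECT count(*) FROM admins")
            .await?
            .show()
            .await?;

        Ok(())
    }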

With these new features, we are also now passing TPC-H queries 8, 13 and 21.

For the full list of new features with their relevant PRs, see the enhancements section in the changelog.
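
The new DataFrame API methods compose the same way. Another rough sketch, again using the hypothetical users table (here with boolean active and is_admin columns) and assuming that except and intersect take another DataFrame handle as their argument:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    /// Combines two DataFrames with the new set operations, then applies
    /// the new limit and show methods.
    async fn demo_dataframe(ctx: &mut ExecutionContext) -> Result<()> {
        // Any logical plan yields a DataFrame; SQL is used here for brevity.
        let active = ctx.sql("SELECT name FROM users WHERE active").await?;
        let admins = ctx.sql("SELECT name FROM users WHERE is_admin").await?;

        // Rows in `active` that are not in `admins`.
        active.except(admins.clone())?.limit(10)?.show().await?;

        // Rows common to both.
        active.intersect(admins)?.show().await?;

        Ok(())
    }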

Async Planning and Decoupling File Format from Table Layout

Driven by the need to support Hive style table partitioning, @rdettai introduced the following design changes to the DataFusion core.

  • The code for reading specific file formats (Parquet, Avro, CSV, and JSON) was separated from the logic that handles grouping sets of files into execution partitions.
  • The query planning process was made async.

As a result, we were able to replace the old Parquet, CSV, and JSON table providers with a single ListingTable provider.

This also sets up DataFusion and its plug-in ecosystem to support a wide range of catalogs and object store implementations. You can read more about this change in the design document and in the arrow-datafusion#1010 PR.

How to Get Involved

If you are interested in contributing to DataFusion, we would love to have you! You can help by trying out DataFusion on some of your own data and projects and filing bug reports, or by contributing to the documentation, tests, or code. A list of open issues suitable for beginners is here and the full list is here.

Check out our new Communication Doc for more ways to engage with the community.