Apache Arrow DataFusion 7.0.0 Release


Published 28 Feb 2022
By The Apache Arrow PMC (pmc)

Introduction

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out.

DataFusion’s SQL, DataFrame, and manual PlanBuilder API let users access a sophisticated query optimizer and execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of todays multicore hardware. Being written in Rust means DataFusion can offer both the safety of dynamic languages as well as the resource efficiency of a compiled language.

The Apache Arrow team is pleased to announce the DataFusion 7.0.0 release. This covers 4 months of development work and includes 195 commits from the following 37 distinct contributors.

    44  Andrew Lamb
    24  Kun Liu
    23  Jiayu Liu
    17  xudong.w
    11  Yijie Shen
     9  Matthew Turner
     7  Liang-Chi Hsieh
     5  Lin Ma
     4  Stephen Carman
     4  James Katz
     4  Dmitry Patsura
     4  QP Hou
     3  dependabot[bot]
     3  Remzi Yang
     3  Yang
     3  ic4y
     3  Daniël Heres
     2  Andy Grove
     2  Raphael Taylor-Davies
     2  Jason Tianyi Wang
     2  Dan Harris
     2  Sergey Melnychuk
     1  Nitish Tiwari
     1  Dom
     1  Eduard Karacharov
     1  Javier Goday
     1  Boaz
     1  Marko Mikulicic
     1  Max Burke
     1  Carol (Nichols || Goulding)
     1  Phillip Cloud
     1  Rich
     1  Toby Hede
     1  Will Jones
     1  r.4ntix
     1  rdettai

The following section highlights some of the improvements in this release. Of course, many other bug fixes and improvements have also been made and we refer you to the complete changelog for the full detail.

Summary

  • DataFusion Crate
    • The DataFusion crate is being split into multiple crates to decrease compilation times and improve the development experience. Initially, datafusion-common (the core DataFusion components) and datafusion-expr (DataFusion expressions, functions, and operators) have been split out. There will be additional splits after the 7.0 release.
  • Performance Improvements and Optimizations
    • Arrow’s dyn scalar kernels are now used to enable efficient operations on DictionaryArrays #1685
    • Switch from std::sync::Mutex to parking_lot::Mutex #1720
  • New Features
    • Support for memory tracking and spilling to disk
      • MemoryMananger and DiskManager #1526
      • Out of core sort #1526
      • New metrics
        • Gauge and CurrentMemoryUsage #1682
        • Spill_count and spilled_bytes #1641
    • New math functions
      • Approx_quantile #1529
      • stddev and variance (sample and population) #1525
      • corr #1561
    • Support decimal type #1394#1407#1408#1431#1483#1554#1640
    • Support for reading Parquet files with evolved schemas #1622#1709
    • Support for registering DataFrame as table #1699
    • Support for the substring function #1621
    • Support array_agg(distinct ...) #1579
    • Support sort on unprojected columns #1415
  • Additional Integration Points
    • A new public Expression simplification API #1717
  • DataFusion-Contrib
  • Arrow2

Documentation and Roadmap

We are working to consolidate the documentation into the official site. You can find more details there on topics such as the SQL status and a user guide. This is also an area we would love to get help from the broader community #1821.

To provide transparency on DataFusion’s priorities to users and developers a three month roadmap will be published at the beginning of each quarter. This can be found here here.

Upcoming Attractions

How to Get Involved

If you are interested in contributing to DataFusion, and learning about state of the art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here

Check out our new Communication Doc on more ways to engage with the community.