Apache Arrow DataFusion 26.0.0

Published 24 Jun 2023
By The Apache Arrow PMC (pmc)

It has been a whirlwind 6 months of DataFusion development since our last update: the community has grown, many features have been added, performance improved and we are discussing branching out to our own top level Apache Project.

Background

Apache Arrow DataFusion is an extensible query engine and database toolkit, written in Rust, that uses Apache Arrow as its in-memory format.

DataFusion, along with Apache Calcite, Facebook’s Velox and similar technology are part of the next generation “Deconstructed Database” architectures, where new systems are built on a foundation of fast, modular components, rather as a single tightly integrated system.

While single tightly integrated systems such as Spark, DuckDB and Pola.rs are great pieces of technology, our community believes that anyone developing new data heavy application, such as those common in machine learning in the next 5 years, will require a high performance, vectorized, query engine to remain relevant. The only practical way to gain access to such technology without investing many millions of dollars to build a new tightly integrated engine, is though open source projects like DataFusion and similar enabling technologies such as Apache Arrow and Rust.

DataFusion is targeted primarily at developers creating other data intensive analytics, and offers:

High performance, native, parallel streaming execution engine
Mature SQL support, featuring subqueries, window functions, grouping sets, and more
Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
Native DataFrame API and python bindings
Well documented source code and architecture, designed to be customized to suit downstream project needs
High quality, easy to use code released every 2 weeks to crates.io
Welcoming, open community, governed by the highly regarded and well understood Apache Software Foundation

The rest of this post highlights some of the improvements we have made to DataFusion over the last 6 months and a preview of where we are heading. You can see a list of all changes in the detailed CHANGELOG.

(Even) Better Performance

Various benchmarks show DataFusion to be quite close or even faster to the state of the art in analytic performance (at the moment this seems to be DuckDB). We continually work on improving performance (see #5546 for a list) and would love additional help in this area.

DataFusion now reads single large Parquet files significantly faster by parallelizing across multiple cores. Native speeds for reading JSON and CSV files are also up to 2.5x faster thanks to improvements upstream in arrow-rs JSON reader and CSV reader.

Also, we have integrated the arrow-rs Row Format into DataFusion resulting in up to 2-3x faster sorting and merging.

Improved Documentation and Website

Part of growing the DataFusion community is ensuring that DataFusion’s features are understood and that it is easy to contribute and participate. To that end the website has been cleaned up, the architecture guide expanded, the roadmap updated, and several overview talks created:

Apr 2023 Query Engine: recording and slides
April 2023 Logical Plan and Expressions: recording and slides
April 2023 Physical Plan and Execution: recording and slides

New Features

More Streaming, Less Memory

We have made significant progress on the streaming execution roadmap such as unbounded datasources, streaming group by, sophisticated sort and repartitioning improvements in the optimizer, and support for symmetric hash join (read more about that in the great Synnada Blog Post on the topic). Together, these features both 1) make it easier to build streaming systems using DataFusion that can incrementally generate output before (or ever) seeing the end of the input and 2) allow general queries to use less memory and generate their results faster.

We have also improved the runtime memory management system so that DataFusion now stays within its declared memory budget generate runtime errors.

DML Support (`INSERT`, `DELETE`, `UPDATE`, etc)

Part of building high performance data systems includes writing data, and DataFusion supports several features for creating new files:

INSERT INTO and SELECT ... INTO support for memory backed and CSV tables
New API for writing data into TableProviders

We are working on easier to use COPY INTO syntax, better support for writing parquet, JSON, and AVRO, and more – see our tracking epic for more details.

Timestamp and Intervals

One mark of the maturity of a SQL engine is how it handles the tricky world of timestamp, date, times and interval arithmetic. DataFusion is feature complete in this area and behaves as you would expect, supporting queries such as

SELECT now() + '1 month' FROM my_table;

We still have a long tail of date and time improvements, which we are working on as well.

Querying Structured Types (`List` and `Struct`s)

Arrow and Parquet support nested data well and DataFusion lets you easily query such Struct and List. For example, you can use DataFusion to read and query the JSON Datasets for Exploratory OLAP - Mendeley Data like this:

----------
-- Explore structured data using SQL
----------
SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10;
+---------------------------------------------------------------------------------------------------------------------------+
| delete                                                                                                                    |
+---------------------------------------------------------------------------------------------------------------------------+
| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
+---------------------------------------------------------------------------------------------------------------------------+

----------
-- Select some deeply nested fields
----------
SELECT
  delete['status']['id']['$numberLong'] as delete_id,
  delete['status']['user_id'] as delete_user_id
FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;

+--------------------+----------------+
| delete_id          | delete_user_id |
+--------------------+----------------+
| 135037425050320896 | 334902461      |
| 134703982051463168 | 405383453      |
| 134773741740765184 | 64823441       |
| 132543659655704576 | 45917834       |
| 133786431926697984 | 67229952       |
| 134619093570560002 | 182430773      |
| 134019857527214080 | 257396311      |
| 133931546469076993 | 124539548      |
| 134397743350296576 | 139836391      |
| 127833661767823360 | 244442687      |
+--------------------+----------------+

Subqueries All the Way Down

DataFusion can run many different subqueries by rewriting them to joins. It has been able to run the full suite of TPC-H queries for at least the last year, but recently we have implemented significant improvements to this logic, sufficient to run almost all queries in the TPC-DS benchmark as well.

Community and Project Growth

The six months since our last update saw significant growth in the DataFusion community. Between versions 17.0.0 and 26.0.0, DataFusion merged 711 PRs from 107 distinct contributors, not including all the work that goes into our core dependencies such as arrow, parquet, and object_store, that much of the same community helps support.

In addition, we have added 7 new committers and 1 new PMC member to the Apache Arrow project, largely focused on DataFusion, and we learned about some of the cool new systems which are using DataFusion. Given the growth of the community and interest in the project, we also clarified the mission statement and are discussing “graduate”ing DataFusion to a new top level Apache Software Foundation project.

How to Get Involved

Kudos to everyone in the community who has contributed ideas, discussions, bug reports, documentation and code. It is exciting to be innovating on the next generation of database architectures together!

If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is here.

Check out our Communication Doc for more ways to engage with the community.