Introduction
DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
Features
Feature-rich SQL support and DataFrame API
Blazingly fast, vectorized, multi-threaded, streaming execution engine.
Native support for Parquet, CSV, JSON, and Avro file formats. Support for custom file formats and non-file data sources via the TableProvider trait.
Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL, other query languages, custom plan and execution nodes, optimizer passes, and more.
Streaming, asynchronous IO directly from popular object stores, including AWS S3, Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the ObjectStore trait.
A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations, automatic join reordering, expression coercion, and more.
Permissive Apache 2.0 License, Apache Software Foundation governance
Written in Rust, a modern systems language with development productivity similar to Java or Golang, the performance of C++, and loved by programmers everywhere.
Support for Substrait for query plan serialization, making it easier to integrate DataFusion with other projects, and to pass plans across language boundaries.
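The filter pushdown mentioned in the feature list can be illustrated with a small, standard-library-only sketch. Everything in it (the `Plan` enum, `push_down_filters`, `push_filter`, and the string predicates) is hypothetical and invented for illustration; it is not DataFusion's logical plan or optimizer API. It only shows the general idea of moving a filter closer to the scan so that fewer rows need to be read, under the assumption that projections merely select columns.

```rust
// Toy logical plan for illustration only; NOT DataFusion's LogicalPlan.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Read a table, applying any pushed-down filters while scanning.
    Scan { table: String, pushed_filters: Vec<String> },
    /// Keep only the rows matching `predicate`.
    Filter { predicate: String, input: Box<Plan> },
    /// Keep only the named columns (no computed expressions in this toy).
    Project { columns: Vec<String>, input: Box<Plan> },
}

/// Rewrite the plan so that every filter ends up inside the scan beneath it.
fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        // Optimize the child first, then sink this filter into the result.
        Plan::Filter { predicate, input } => push_filter(predicate, push_down_filters(*input)),
        Plan::Project { columns, input } => Plan::Project {
            columns,
            input: Box::new(push_down_filters(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}

/// Sink a single predicate through `input` until it reaches a scan.
fn push_filter(predicate: String, input: Plan) -> Plan {
    match input {
        // The scan absorbs the predicate and can skip non-matching rows.
        Plan::Scan { table, mut pushed_filters } => {
            pushed_filters.push(predicate);
            Plan::Scan { table, pushed_filters }
        }
        // A column-only projection does not change row values, so the
        // filter can safely be evaluated below it.
        Plan::Project { columns, input } => Plan::Project {
            columns,
            input: Box::new(push_filter(predicate, *input)),
        },
        // Fallback: leave the filter where it is.
        other => Plan::Filter { predicate, input: Box::new(other) },
    }
}
```

Applied to a plan of the form `Filter(a > 10, Project([a, b], Scan(t)))`, this sketch produces `Project([a, b], Scan(t, filters = [a > 10]))`. A real optimizer also has to handle joins, aggregates, and computed expressions, which this toy deliberately ignores.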
Use Cases
DataFusion can be used without modification as an embedded SQL engine or can be customized and used as a foundation for building new systems. Here are some examples of systems built using DataFusion:
Specialized analytical database systems such as CeresDB, and more general Apache Spark-like systems such as Ballista
New query language engines such as prql-query and accelerators such as VegaFusion
Research platform for new Database Systems, such as Flock
Adding SQL support to another library, such as Dask SQL
Streaming data platforms such as Synnada
Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files, such as qv
A faster replacement for the Spark runtime, such as Blaze
By using DataFusion, these projects are free to focus on their specific features and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, execution plans, and file format support.
Known Users
Here are some of the projects known to use DataFusion:
Ballista Distributed SQL Query Engine
Blaze Spark accelerator with DataFusion at its core
CeresDB Distributed Time-Series Database
CnosDB Open Source Distributed Time Series Database
Dask SQL Distributed SQL query engine in Python
datafusion-tui Text UI for DataFusion
delta-rs Native Rust implementation of Delta Lake
GreptimeDB Open Source & Cloud Native Distributed Time Series Database
InfluxDB IOx Time Series Database
Kamu Planet-scale streaming data pipeline
Parseable Log storage and observability platform
qv Quickly view your data
bdt Boring Data Tool
Seafowl CDN-friendly analytical database
Synnada Streaming-first framework for data products
VegaFusion Server-side acceleration for the Vega visualization grammar
ZincObserve Distributed cloud native observability platform
Integrations and Extensions
There are a number of community projects that extend DataFusion or provide integrations with other systems.
Language Bindings
Integrations
Why DataFusion?
High Performance: Leveraging Rust and Arrow’s memory model, DataFusion is very fast.
Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem.
Easy to Embed: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case.
High Quality: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.