DataFusion aims to be the query engine of choice for new, fast data-centric systems such as databases, dataframe libraries, machine learning and streaming applications by leveraging the unique features of Rust and Apache Arrow.
Blazingly fast, vectorized, multi-threaded, streaming execution engine.
Native support for Parquet, CSV, JSON, and Avro file formats. Support for custom file formats and non file datasources via the
Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL, other query languages, custom plan and execution nodes, optimizer passes, and more.
Streaming, asynchronous IO directly from popular object stores, including AWS S3, Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
A state-of-the-art query optimizer with expression coercion and simplification, projection and filter pushdown, sort- and distribution-aware optimizations, automatic join reordering, and more.
Permissive Apache 2.0 License, predictable and well understood Apache Software Foundation governance.
Support for Substrait query plans, to easily pass plans across language and system boundaries.
DataFusion can be used without modification as an embedded SQL engine or can be customized and used as a foundation for building new systems.
While most current use cases are “analytic” (throughput-oriented), some components of DataFusion, such as the plan representations, are also suitable for “streaming” and “transaction” style systems (low latency).
Here are some example systems built using DataFusion:
Research platform for new Database Systems, such as Flock
SQL support to another library, such as dask sql
Streaming data platforms such as Synnada
Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as qv
Native Spark runtime replacement such as Blaze
By using DataFusion, projects are free to focus on their specific features and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, parallelized streaming execution plans, file format support, etc.
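To give a feel for what "an expression representation" and "standard optimizations" involve, here is a minimal, standard-library-only sketch of an expression tree with a small simplification pass. It is an illustration only: the `Expr` enum and `simplify` function below are hypothetical toy names and do not reflect DataFusion's actual types or optimizer rules. The toy also ignores NULLs, data types, and integer overflow, all of which a real engine has to handle.

```rust
// Toy expression tree. Hypothetical, NOT DataFusion's real expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Simplify bottom-up: fold constant sub-expressions and apply the
/// identities `x + 0 = x` and `x * 1 = x`.
/// Overflow handling is deliberately omitted in this sketch.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (simplify(*l), simplify(*r)) {
            // Both sides are constants: fold them into one literal.
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            // Adding zero leaves the other side unchanged.
            (Expr::Literal(0), e) | (e, Expr::Literal(0)) => e,
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Mul(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a * b),
            // Multiplying by one leaves the other side unchanged.
            (Expr::Literal(1), e) | (e, Expr::Literal(1)) => e,
            (l, r) => Expr::Mul(Box::new(l), Box::new(r)),
        },
        // Columns and literals are already as simple as they can be.
        other => other,
    }
}

fn main() {
    // Represents: (price * 1) + (2 + 3)
    let expr = Expr::Add(
        Box::new(Expr::Mul(
            Box::new(Expr::Column("price".to_string())),
            Box::new(Expr::Literal(1)),
        )),
        Box::new(Expr::Add(
            Box::new(Expr::Literal(2)),
            Box::new(Expr::Literal(3)),
        )),
    );
    // The pass rewrites this to the equivalent of: price + 5
    println!("{:?}", simplify(expr));
}
```

Even this toy hints at why the work is worth sharing: every rule has to be correct for every input, and a production engine needs many more rules along with type coercion and NULL semantics.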
Here are some active projects using DataFusion:
Arroyo Distributed stream processing engine in Rust
Ballista Distributed SQL Query Engine
CeresDB Distributed Time-Series Database
CnosDB Open Source Distributed Time Series Database
Dask SQL Distributed SQL query engine in Python
Exon Analysis toolkit for life-science applications
delta-rs Native Rust implementation of Delta Lake
GreptimeDB Open Source & Cloud Native Distributed Time Series Database
GlareDB Fast SQL database for querying and analyzing distributed data.
InfluxDB IOx Time Series Database
Kamu Planet-scale streaming data pipeline
Lance Modern columnar data format for ML
Parseable Log storage and observability platform
qv Quickly view your data
bdt Boring Data Tool
Restate Easily build resilient applications using distributed durable async/await
Seafowl CDN-friendly analytical database
Synnada Streaming-first framework for data products
ZincObserve Distributed cloud native observability platform
Here are some less active projects that used DataFusion:
Integrations and Extensions
There are a number of community projects that extend DataFusion or provide integrations with other systems, some of which are described below:
High Performance: Leveraging Rust and Arrow’s memory model, DataFusion is very fast.
Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem.
Easy to Embed: Allowing extension at almost any point in its design, and published regularly as a crate on crates.io, DataFusion can be integrated and tailored for your specific use case.
High Quality: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.