Using DataFusion as a library
Create a new project
$ cargo new hello_datafusion
$ cd hello_datafusion
$ tree .
.
├── Cargo.toml
└── src
    └── main.rs

1 directory, 2 files
Default Configuration
DataFusion is published on crates.io, and is well documented on docs.rs.
To get started, add the following to your Cargo.toml file:
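[dependencies]
datafusion = "11.0"
# needed as a direct dependency for the #[tokio::main] attribute used in main.rs
tokio = { version = "1.0", features = ["rt-multi-thread", "macros"] }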
Create a main function
Update the main.rs file with your first DataFusion application, based on the Example usage section:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the table
    let ctx = SessionContext::new();
    ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT * FROM test").await?;

    // execute and print results
    df.show().await?;
    Ok(())
}
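The program above needs a CSV file to read. If none is at hand, a small one can be generated with the standard library alone. The sketch below is only an illustration: the file name `hello_datafusion_sample.csv`, the columns `a` and `b`, and the helper `write_sample_csv` are made up for this example and are not part of DataFusion.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Writes a tiny CSV file (one header row, three data rows) into the
/// system temp directory and returns its path.
/// The file name and column names are hypothetical.
fn write_sample_csv() -> io::Result<PathBuf> {
    let path = std::env::temp_dir().join("hello_datafusion_sample.csv");
    fs::write(&path, "a,b\n1,2\n3,4\n5,6\n")?;
    Ok(path)
}

fn main() -> io::Result<()> {
    let path = write_sample_csv()?;
    // Use the printed path in place of <PATH_TO_YOUR_CSV_FILE>.
    println!("{}", path.display());
    Ok(())
}
```

The first line is written as a header row, on the assumption that the default `CsvReadOptions` treats the first row as column names.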
Extensibility
DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
User Defined Functions (UDFs)
User Defined Aggregate Functions (UDAFs)
User Defined Table Source (TableProvider) for tables
User Defined Optimizer passes (plan rewrites)
User Defined LogicalPlan nodes
User Defined ExecutionPlan nodes
Rust Version Compatibility
This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
Optimized Configuration
For an optimized build, several steps are required. First, add the following to your Cargo.toml. Note that the settings in the [profile.release] section will significantly increase the build time.
[dependencies]
datafusion = { version = "11.0", features = ["simd"] }
tokio = { version = "^1.0", features = ["rt-multi-thread", "macros"] }
snmalloc-rs = "0.2"
[profile.release]
lto = true
codegen-units = 1
Then, in main.rs, update the memory allocator with the below after your imports:
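use datafusion::prelude::*;

// use snmalloc as the global memory allocator
#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

// an async main needs a runtime; #[tokio::main] provides it
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    Ok(())
}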
Finally, building with the simd optimization requires the nightly toolchain (cargo nightly):
rustup toolchain install nightly
Based on the instruction set architecture you are building on, you will want to configure the target-cpu as well, ideally to native, or at least to a CPU that supports avx2.
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release
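To see whether the build machine meets the "at least avx2" bar mentioned above, the CPU can be queried at runtime with the standard library. The following is a minimal sketch, not part of DataFusion: the helpers `avx2_available` and `describe` are hypothetical names introduced here, and on non-x86 targets the check simply reports `false`.

```rust
/// Runtime AVX2 detection on x86 / x86_64 targets.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// On other architectures AVX2 does not exist, so report false.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

/// Turns the detection result into a short human-readable message.
fn describe(avx2: bool) -> &'static str {
    if avx2 {
        "AVX2 detected on this machine"
    } else {
        "AVX2 not detected on this machine"
    }
}

fn main() {
    println!("{}", describe(avx2_available()));
}
```

Run it with a plain `cargo run`. The check is only informational: `target-cpu=native` always targets whatever CPU the build machine has, so the result matters mainly when the binary will run on a different machine.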