Example Usage

This example performs some simple processing on the example.csv file.

Update Cargo.toml

Add the following to your Cargo.toml file:

datafusion = "22"
tokio = "1.0"

Run a SQL query against data stored in a CSV:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  // register the table
  let ctx = SessionContext::new();
  ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;

  // create a plan to run a SQL query
  let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;

  // execute and print results
  df.show().await?;
  Ok(())
}

Use the DataFrame API to process data stored in a CSV:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  // create the dataframe
  let ctx = SessionContext::new();
  let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;

  let df = df.filter(col("a").lt_eq(col("b")))?
           .aggregate(vec![col("a")], vec![min(col("b"))])?
           .limit(0, Some(100))?;

  // execute and print results
  df.show().await?;
  Ok(())
}

Output from both examples

+---+--------+
| a | MIN(b) |
+---+--------+
| 1 | 2      |
+---+--------+

Identifiers and Capitalization

Be aware that unquoted identifiers in SQL are effectively converted to lower case, so if a column name in your CSV file contains capital letters (e.g. Name), you must wrap it in double quotes; otherwise the examples will not work.
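The rule can be modeled in a few lines of plain Rust. This is only a simplified sketch of the behavior described above, not DataFusion's actual parser: the helper normalize_ident and the column list are hypothetical names introduced for illustration, and the header (A, b, c) is assumed from the queries in this section.

```rust
/// Simplified model of SQL identifier normalization (illustration only,
/// not part of the DataFusion API): unquoted identifiers are lower-cased,
/// double-quoted identifiers keep their exact capitalization.
fn normalize_ident(raw: &str) -> String {
    let quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if quoted {
        // Strip the surrounding quotes and collapse doubled quotes ("" -> ").
        raw[1..raw.len() - 1].replace("\"\"", "\"")
    } else {
        raw.to_lowercase()
    }
}

fn main() {
    // Assumed CSV header, chosen to match the queries in this section.
    let columns = ["A", "b", "c"];

    for ident in ["A", "\"A\"", "b"] {
        let resolved = normalize_ident(ident);
        let found = columns.contains(&resolved.as_str());
        println!("{} resolves to {} (column found: {})", ident, resolved, found);
    }
}
```

Under this model the unquoted A resolves to a, which matches no column, while the quoted "A" resolves to A and is found.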

To illustrate this behavior, consider the capitalized_example.csv file:

Run a SQL query against data stored in a CSV:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  // register the table
  let ctx = SessionContext::new();
  ctx.register_csv("example", "tests/data/capitalized_example.csv", CsvReadOptions::new()).await?;

  // create a plan to run a SQL query
  let df = ctx.sql("SELECT \"A\", MIN(b) FROM example WHERE \"A\" <= c GROUP BY \"A\" LIMIT 100").await?;

  // execute and print results
  df.show().await?;
  Ok(())
}

Use the DataFrame API to process data stored in a CSV:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  // create the dataframe
  let ctx = SessionContext::new();
  let df = ctx.read_csv("tests/data/capitalized_example.csv", CsvReadOptions::new()).await?;

  let df = df
      // col will parse the input string, hence requiring double quotes to maintain the capitalization
      .filter(col("\"A\"").lt_eq(col("c")))?
      // alternatively use ident to pass in an unqualified column name directly without parsing
      .aggregate(vec![ident("A")], vec![min(col("b"))])?
      .limit(0, Some(100))?;

  // execute and print results
  df.show().await?;
  Ok(())
}

Output from both examples

+---+--------+
| A | MIN(b) |
+---+--------+
| 2 | 1      |
| 1 | 2      |
+---+--------+

Extensibility

DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:

  • User Defined Functions (UDFs)

  • User Defined Aggregate Functions (UDAFs)

  • User Defined Table Source (TableProvider) for tables

  • User Defined Optimizer passes (plan rewrites)

  • User Defined LogicalPlan nodes

  • User Defined ExecutionPlan nodes

Rust Version Compatibility

This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.

Optimized Configuration

An optimized build requires several steps. First, add the following to your Cargo.toml. Note that the settings in the [profile.release] section will significantly increase build time.

[dependencies]
datafusion = { version = "22.0", features = ["simd"] }
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
snmalloc-rs = "0.3"

[profile.release]
lto = true
codegen-units = 1

Then, in main.rs, set the global memory allocator by adding the following after your imports:

use datafusion::prelude::*;

#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
  Ok(())
}

Finally, building with the simd optimization requires the nightly Rust toolchain:

rustup toolchain install nightly

Depending on the instruction set architecture you are building for, you will also want to configure target-cpu, ideally with native or at least avx2:

RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release
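To check whether the target-cpu setting took effect, a small standard-library program can report whether AVX2 was enabled at compile time and whether the current CPU supports it. This is a minimal sketch; avx2_support is a hypothetical helper and not part of DataFusion.

```rust
/// Returns (enabled at compile time, supported by the running CPU).
/// Hypothetical helper for illustration; not a DataFusion API.
fn avx2_support() -> (bool, bool) {
    // True only if the compiler was allowed to emit AVX2 instructions,
    // for example via -C target-cpu=native on an AVX2-capable machine.
    let compiled = cfg!(target_feature = "avx2");

    // Runtime detection is only available on x86 and x86_64 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    let runtime = std::is_x86_feature_detected!("avx2");
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    let runtime = false;

    (compiled, runtime)
}

fn main() {
    let (compiled, runtime) = avx2_support();
    println!("AVX2 enabled at compile time: {}", compiled);
    println!("AVX2 supported by this CPU:   {}", runtime);
}
```

If the first line prints false while the second prints true, the binary was built without the flag and the compiler could not assume AVX2.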