Working with Exprs¶
Expr
is short for “expression”. It is a core abstraction in DataFusion for representing a computation, and follows the standard “expression tree” abstraction found in most compilers and databases.
For example, the SQL expression a + b
would be represented as an Expr
with a BinaryExpr
variant. A BinaryExpr
has a left and right Expr
and an operator.
As another example, the SQL expression a + b * c
would be represented as an Expr
with a BinaryExpr
variant. The left Expr
would be a
and the right Expr
would be another BinaryExpr
with a left Expr
of b
and a right Expr
of c
. As a classic expression tree, this would look like:
┌────────────────────┐
│ BinaryExpr │
│ op: + │
└────────────────────┘
▲ ▲
┌───────┘ └────────────────┐
│ │
┌────────────────────┐ ┌────────────────────┐
│ Expr::Col │ │ BinaryExpr │
│ col: a │ │ op: * │
└────────────────────┘ └────────────────────┘
▲ ▲
┌────────┘ └─────────┐
│ │
┌────────────────────┐ ┌────────────────────┐
│ Expr::Col │ │ Expr::Col │
│ col: b │ │ col: c │
└────────────────────┘ └────────────────────┘
As the writer of a library, you may want to use or create Expr
s to represent computations that you want to perform. This guide will walk you through how to make your own scalar UDF as an Expr
and how to rewrite Expr
s to inline the simple UDF.
There are also executable examples for working with Expr
s:
A Scalar UDF Example¶
We’ll use a ScalarUDF
expression as our example. This necessitates implementing an actual UDF, and for ease we’ll use the same example from the adding UDFs guide.
So assuming you’ve written that function, you can use it to create an Expr
:
let add_one_udf = create_udf(
"add_one",
vec![DataType::Int64],
Arc::new(DataType::Int64),
Volatility::Immutable,
make_scalar_function(add_one), // <-- the function we wrote
);
// make the expr `add_one(5)`
let expr = add_one_udf.call(vec![lit(5)]);
// make the expr `add_one(my_column)`
let expr = add_one_udf.call(vec![col("my_column")]);
If you’d like to learn more about Expr
s, before we get into the details of creating and rewriting them, you can read the expression user-guide.
Rewriting Exprs¶
Rewriting Expressions is the process of taking an Expr
and transforming it into another Expr
. This is useful for a number of reasons, including:
Simplifying
Expr
s to make them easier to evaluateOptimizing
Expr
s to make them faster to evaluateConverting
Expr
s to other forms, e.g. converting aBinaryExpr
to aCastExpr
In our example, we’ll use rewriting to update our add_one
UDF, to be rewritten as a BinaryExpr
with a Literal
of 1. We’re effectively inlining the UDF.
Rewriting with transform
¶
To implement the inlining, we’ll need to write a function that takes an Expr
and returns a Result<Expr>
. If the expression is not to be rewritten Transformed::No
is used to wrap the original Expr
. If the expression is to be rewritten, Transformed::Yes
is used to wrap the new Expr
.
fn rewrite_add_one(expr: Expr) -> Result<Expr> {
expr.transform(&|expr| {
Ok(match expr {
Expr::ScalarUDF(scalar_fun) if scalar_fun.fun.name == "add_one" => {
let input_arg = scalar_fun.args[0].clone();
let new_expression = input_arg + lit(1i64);
Transformed::Yes(new_expression)
}
_ => Transformed::No(expr),
})
})
}
Creating an OptimizerRule
¶
In DataFusion, an OptimizerRule
is a trait that supports rewritingExpr
s that appear in various parts of the LogicalPlan
. It follows DataFusion’s general mantra of trait implementations to drive behavior.
We’ll call our rule AddOneInliner
and implement the OptimizerRule
trait. The OptimizerRule
trait has two methods:
name
- returns the name of the ruletry_optimize
- takes aLogicalPlan
and returns anOption<LogicalPlan>
. If the rule is able to optimize the plan, it returnsSome(LogicalPlan)
with the optimized plan. If the rule is not able to optimize the plan, it returnsNone
.
struct AddOneInliner {}
impl OptimizerRule for AddOneInliner {
fn name(&self) -> &str {
"add_one_inliner"
}
fn try_optimize(
&self,
plan: &LogicalPlan,
config: &dyn OptimizerConfig,
) -> Result<Option<LogicalPlan>> {
// Map over the expressions and rewrite them
let new_expressions = plan
.expressions()
.into_iter()
.map(|expr| rewrite_add_one(expr))
.collect::<Result<Vec<_>>>()?;
let inputs = plan.inputs().into_iter().cloned().collect::<Vec<_>>();
let plan = plan.with_new_exprs(&new_expressions, &inputs);
plan.map(Some)
}
}
Note the use of rewrite_add_one
which is mapped over plan.expressions()
to rewrite the expressions, then plan.with_new_exprs
is used to create a new LogicalPlan
with the rewritten expressions.
We’re almost there. Let’s just test our rule works properly.
Testing the Rule¶
Testing the rule is fairly simple, we can create a SessionState with our rule and then create a DataFrame and run a query. The logical plan will be optimized by our rule.
use datafusion::prelude::*;
let rules = Arc::new(AddOneInliner {});
let state = ctx.state().with_optimizer_rules(vec![rules]);
let ctx = SessionContext::with_state(state);
ctx.register_udf(add_one);
let sql = "SELECT add_one(1) AS added_one";
let plan = ctx.sql(sql).await?.logical_plan();
println!("{:?}", plan);
This results in the following output:
Projection: Int64(1) + Int64(1) AS added_one
EmptyRelation
I.e. the add_one
UDF has been inlined into the projection.
Conclusion¶
In this guide, we’ve seen how to create Expr
s programmatically and how to rewrite them. This is useful for simplifying and optimizing Expr
s. We’ve also seen how to test our rule to ensure it works properly.