ConceptsΒΆ

In this section, we will cover a basic example to introduce a few key concepts.

import datafusion
from datafusion import col
import pyarrow

# create a context
ctx = datafusion.SessionContext()

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
    [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
    names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])

# create a new statement
df = df.select(
    col("a") + col("b"),
    col("a") - col("b"),
)

# execute and collect the first (and only) batch
result = df.collect()[0]

The first statement group:

# create a context
ctx = datafusion.SessionContext()

creates a SessionContext, that is, the main interface for executing queries with DataFusion. It maintains the state of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:

  • Create a DataFrame from a CSV or Parquet data source.

  • Register a CSV or Parquet data source as a table that can be referenced from a SQL query.

  • Register a custom data source that can be referenced from a SQL query.

  • Execute a SQL query

The second statement group creates a DataFrame,

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
    [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
    names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])

A DataFrame refers to a (logical) set of rows that share the same column names, similar to a Pandas DataFrame. DataFrames are typically created by calling a method on SessionContext, such as read_csv, and can then be modified by calling the transformation methods, such as DataFrame.filter(), DataFrame.select(), DataFrame.aggregate(), and DataFrame.limit() to build up a query definition.

The third statement uses Expressions to build up a query definition.

df = df.select(
    col("a") + col("b"),
    col("a") - col("b"),
)

Finally the collect method converts the logical plan represented by the DataFrame into a physical plan and execute it, collecting all results into a list of RecordBatch.