AggregationΒΆ

An aggregate or aggregation is a function where the values of multiple rows are processed together to form a single summary value. For performing an aggregation, DataFusion provides the DataFrame.aggregate()

In [1]: from datafusion import SessionContext

In [2]: from datafusion import column, lit

In [3]: from datafusion import functions as f

In [4]: import random

In [5]: ctx = SessionContext()

In [6]: df = ctx.from_pydict(
   ...:     {
   ...:         "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "b": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "c": [random.randint(0, 100) for _ in range(8)],
   ...:         "d": [random.random() for _ in range(8)],
   ...:     },
   ...:     name="foo_bar"
   ...: )
   ...: 

In [7]: col_a = column("a")

In [8]: col_b = column("b")

In [9]: col_c = column("c")

In [10]: col_d = column("d")

In [11]: df.aggregate([], [f.approx_distinct(col_c), f.approx_median(col_d), f.approx_percentile_cont(col_d, lit(0.5))])
Out[11]: 
DataFrame()
+----------------------------+--------------------------+------------------------------------------------+
| APPROX_DISTINCT(foo_bar.c) | APPROX_MEDIAN(foo_bar.d) | APPROX_PERCENTILE_CONT(foo_bar.d,Float64(0.5)) |
+----------------------------+--------------------------+------------------------------------------------+
| 8                          | 0.5697794594361903       | 0.5697794594361903                             |
+----------------------------+--------------------------+------------------------------------------------+

When the group_by list is empty the aggregation is done over the whole DataFrame. For grouping the group_by list must contain at least one column

In [12]: df.aggregate([col_a], [f.sum(col_c), f.max(col_d), f.min(col_d)])
Out[12]: 
DataFrame()
+-----+----------------+---------------------+---------------------+
| a   | SUM(foo_bar.c) | MAX(foo_bar.d)      | MIN(foo_bar.d)      |
+-----+----------------+---------------------+---------------------+
| foo | 324            | 0.6771596827294076  | 0.40753646099412066 |
| bar | 37             | 0.41904078110353304 | 0.3542972021920777  |
+-----+----------------+---------------------+---------------------+

More than one column can be used for grouping

In [13]: df.aggregate([col_a, col_b], [f.sum(col_c), f.max(col_d), f.min(col_d)])
Out[13]: 
DataFrame()
+-----+-------+----------------+---------------------+---------------------+
| a   | b     | SUM(foo_bar.c) | MAX(foo_bar.d)      | MIN(foo_bar.d)      |
+-----+-------+----------------+---------------------+---------------------+
| bar | three | 8              | 0.40065129768988994 | 0.40065129768988994 |
| bar | two   | 20             | 0.41904078110353304 | 0.41904078110353304 |
| bar | one   | 9              | 0.3542972021920777  | 0.3542972021920777  |
| foo | one   | 133            | 0.6630616508067007  | 0.40753646099412066 |
| foo | two   | 96             | 0.6629873075021545  | 0.6307660910358457  |
| foo | three | 95             | 0.6771596827294076  | 0.6771596827294076  |
+-----+-------+----------------+---------------------+---------------------+