AggregationΒΆ
An aggregate or aggregation is a function where the values of multiple rows are processed together to form a single summary value.
For performing an aggregation, DataFusion provides the DataFrame.aggregate()
In [1]: from datafusion import SessionContext
In [2]: from datafusion import column, lit
In [3]: from datafusion import functions as f
In [4]: import random
In [5]: ctx = SessionContext()
In [6]: df = ctx.from_pydict(
...: {
...: "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
...: "b": ["one", "one", "two", "three", "two", "two", "one", "three"],
...: "c": [random.randint(0, 100) for _ in range(8)],
...: "d": [random.random() for _ in range(8)],
...: },
...: name="foo_bar"
...: )
...:
In [7]: col_a = column("a")
In [8]: col_b = column("b")
In [9]: col_c = column("c")
In [10]: col_d = column("d")
In [11]: df.aggregate([], [f.approx_distinct(col_c), f.approx_median(col_d), f.approx_percentile_cont(col_d, lit(0.5))])
Out[11]:
DataFrame()
+----------------------------+--------------------------+------------------------------------------------+
| APPROX_DISTINCT(foo_bar.c) | APPROX_MEDIAN(foo_bar.d) | APPROX_PERCENTILE_CONT(foo_bar.d,Float64(0.5)) |
+----------------------------+--------------------------+------------------------------------------------+
| 8 | 0.5222802737411103 | 0.5222802737411103 |
+----------------------------+--------------------------+------------------------------------------------+
When the group_by
list is empty the aggregation is done over the whole DataFrame
. For grouping
the group_by
list must contain at least one column
In [12]: df.aggregate([col_a], [f.sum(col_c), f.max(col_d), f.min(col_d)])
Out[12]:
DataFrame()
+-----+----------------+--------------------+---------------------+
| a | SUM(foo_bar.c) | MAX(foo_bar.d) | MIN(foo_bar.d) |
+-----+----------------+--------------------+---------------------+
| foo | 380 | 0.541408832006397 | 0.17630456686114293 |
| bar | 174 | 0.6815750485777792 | 0.5222802737411103 |
+-----+----------------+--------------------+---------------------+
More than one column can be used for grouping
In [13]: df.aggregate([col_a, col_b], [f.sum(col_c), f.max(col_d), f.min(col_d)])
Out[13]:
DataFrame()
+-----+-------+----------------+---------------------+---------------------+
| a | b | SUM(foo_bar.c) | MAX(foo_bar.d) | MIN(foo_bar.d) |
+-----+-------+----------------+---------------------+---------------------+
| bar | one | 12 | 0.5222802737411103 | 0.5222802737411103 |
| foo | one | 186 | 0.5253409991137165 | 0.17630456686114293 |
| bar | three | 79 | 0.6815750485777792 | 0.6815750485777792 |
| bar | two | 83 | 0.6258125086771043 | 0.6258125086771043 |
| foo | two | 170 | 0.35039735290941953 | 0.21687303184912854 |
| foo | three | 24 | 0.541408832006397 | 0.541408832006397 |
+-----+-------+----------------+---------------------+---------------------+