pyarrow.TableGroupBy

class pyarrow.TableGroupBy(table, keys, use_threads=True)

Bases: object

A grouping of columns in a table on which to perform aggregations.

Parameters:
table : pyarrow.Table

Input table to execute the aggregation on.

keys : str or list[str]

Name or names of the columns to group by.

use_threads : bool, default True

Whether to use multithreading or not. When set to True (the default), no stable ordering of the output is guaranteed.

Examples

>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])

Grouping of columns:

>>> pa.TableGroupBy(t, "keys")
<pyarrow.lib.TableGroupBy object at ...>

Perform aggregations:

>>> pa.TableGroupBy(t, "keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]

__init__(self, table, keys, use_threads=True)

Methods

__init__(self, table, keys[, use_threads])

aggregate(self, aggregations)

Perform an aggregation over the grouped columns of the table.

aggregate(self, aggregations)

Perform an aggregation over the grouped columns of the table.

Parameters:
aggregations : list[tuple(str, str)] or list[tuple(str, str, FunctionOptions)]

List of tuples, where each tuple is one aggregation specification consisting of the aggregation column name, the function name and, optionally, the aggregation function options. Pass an empty list to get a single row for each group. The column name can be a string, an empty list, or a list of column names, for unary, nullary, and n-ary aggregation functions respectively.

For the list of function names and their respective aggregation function options, see Grouped Aggregations.

Returns:
Table

Results of the aggregation functions.

Examples

>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])

Sum the column “values” over the grouped column “keys”:

>>> t.group_by("keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]

Count the rows over the grouped column “keys”:

>>> t.group_by("keys").aggregate([([], "count_all")])
pyarrow.Table
keys: string
count_all: int64
----
keys: [["a","b","c"]]
count_all: [[2,2,1]]

Do multiple aggregations:

>>> t.group_by("keys").aggregate([
...    ("values", "sum"),
...    ("keys", "count")
... ])
pyarrow.Table
keys: string
values_sum: int64
keys_count: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
keys_count: [[2,2,1]]

Count the number of non-null values for column “values” over the grouped column “keys”:

>>> import pyarrow.compute as pc
>>> t.group_by(["keys"]).aggregate([
...    ("values", "count", pc.CountOptions(mode="only_valid"))
... ])
pyarrow.Table
keys: string
values_count: int64
----
keys: [["a","b","c"]]
values_count: [[2,2,1]]

Get a single row for each group in column “keys”:

>>> t.group_by("keys").aggregate([])
pyarrow.Table
keys: string
----
keys: [["a","b","c"]]