pyarrow.TableGroupBy
- class pyarrow.TableGroupBy(table, keys, use_threads=True)
Bases: object
A grouping of columns in a table on which to perform aggregations.
- Parameters:
Examples
>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])
Grouping of columns:
>>> pa.TableGroupBy(t, "keys")
<pyarrow.lib.TableGroupBy object at ...>
Perform aggregations:
>>> pa.TableGroupBy(t, "keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
- __init__(self, table, keys, use_threads=True)
Methods
__init__(self, table, keys[, use_threads])
aggregate(self, aggregations): Perform an aggregation over the grouped columns of the table.
- aggregate(self, aggregations)
Perform an aggregation over the grouped columns of the table.
- Parameters:
- aggregations : list[tuple(str, str)] or list[tuple(str, str, FunctionOptions)]
List of tuples, where each tuple is one aggregation specification consisting of the aggregation column name, the function name, and optionally the aggregation function options. Pass an empty list to get a single row for each group. The column name can be a string, an empty list, or a list of column names, for unary, nullary, and n-ary aggregation functions respectively.
For the list of function names and their respective aggregation function options, see Grouped Aggregations.
- Returns:
Table
Results of the aggregation functions.
Examples
>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])
Sum the column “values” over the grouped column “keys”:
>>> t.group_by("keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
Count the rows over the grouped column “keys”:
>>> t.group_by("keys").aggregate([([], "count_all")])
pyarrow.Table
keys: string
count_all: int64
----
keys: [["a","b","c"]]
count_all: [[2,2,1]]
Do multiple aggregations:
>>> t.group_by("keys").aggregate([
...    ("values", "sum"),
...    ("keys", "count")
... ])
pyarrow.Table
keys: string
values_sum: int64
keys_count: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
keys_count: [[2,2,1]]
Count the number of non-null values for column “values” over the grouped column “keys”:
>>> import pyarrow.compute as pc
>>> t.group_by(["keys"]).aggregate([
...    ("values", "count", pc.CountOptions(mode="only_valid"))
... ])
pyarrow.Table
keys: string
values_count: int64
----
keys: [["a","b","c"]]
values_count: [[2,2,1]]
Get a single row for each group in column “keys”:
>>> t.group_by("keys").aggregate([])
pyarrow.Table
keys: string
----
keys: [["a","b","c"]]