Compute Functions

Arrow supports logical compute operations over inputs of possibly varying types.

The standard compute operations are provided by the pyarrow.compute module and can be used directly:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> a = pa.array([1, 1, 2, 3])
>>> pc.sum(a)
<pyarrow.Int64Scalar: 7>

The grouped aggregation functions cannot be called directly in this way: they raise an exception and must instead be used through pyarrow.Table.group_by(). See Grouped Aggregations for more details.

Standard Compute Functions

Many compute functions support both array (chunked or not) and scalar inputs, but some will mandate either. For example, sort_indices requires its first and only input to be an array.

Below are a few simple examples:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> a = pa.array([1, 1, 2, 3])
>>> b = pa.array([4, 1, 2, 8])
>>> pc.equal(a, b)
<pyarrow.lib.BooleanArray object at 0x7f686e4eef30>
[
  false,
  true,
  true,
  false
]
>>> x, y = pa.scalar(7.8), pa.scalar(9.3)
>>> pc.multiply(x, y)
<pyarrow.DoubleScalar: 72.54>

These functions can do more than just element-by-element operations. Here is an example of sorting a table:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> t = pa.table({'x':[1,2,3],'y':[3,2,1]})
>>> i = pc.sort_indices(t, sort_keys=[('y', 'ascending')])
>>> i
<pyarrow.lib.UInt64Array object at 0x7fcee5df75e8>
[
  2,
  1,
  0
]

For a complete list of the compute functions that PyArrow provides, refer to the Compute Functions reference.

Grouped Aggregations

PyArrow supports grouped aggregations over pyarrow.Table through the pyarrow.Table.group_by() method. The method will return a grouping declaration to which the hash aggregation functions can be applied:

>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])
>>> t.group_by("keys").aggregate([("values", "sum")])
pyarrow.Table
values_sum: int64
keys: string
----
values_sum: [[3,7,5]]
keys: [["a","b","c"]]

The "sum" aggregation passed to the aggregate method in the previous example is the hash_sum compute function.

Multiple aggregations can be performed at the same time by providing them to the aggregate method:

>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])
>>> t.group_by("keys").aggregate([
...    ("values", "sum"),
...    ("keys", "count")
... ])
pyarrow.Table
values_sum: int64
keys_count: int64
keys: string
----
values_sum: [[3,7,5]]
keys_count: [[2,2,1]]
keys: [["a","b","c"]]

Aggregation options can also be provided for each aggregation function. For example, CountOptions changes how null values are counted:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> table_with_nulls = pa.table([
...    pa.array(["a", "a", "a"]),
...    pa.array([1, None, None])
... ], names=["keys", "values"])
>>> table_with_nulls.group_by(["keys"]).aggregate([
...    ("values", "count", pc.CountOptions(mode="all"))
... ])
pyarrow.Table
values_count: int64
keys: string
----
values_count: [[3]]
keys: [["a"]]
>>> table_with_nulls.group_by(["keys"]).aggregate([
...    ("values", "count", pc.CountOptions(mode="only_valid"))
... ])
pyarrow.Table
values_count: int64
keys: string
----
values_count: [[1]]
keys: [["a"]]

The following is a list of all supported grouped aggregation functions, each with a short description and, where one is listed, its options class. The functions can be used with or without the "hash_" prefix.

hash_all	Whether all elements in each group evaluate to true	`ScalarAggregateOptions`
hash_any	Whether any element in each group evaluates to true	`ScalarAggregateOptions`
hash_approximate_median	Compute approximate medians of values in each group	`ScalarAggregateOptions`
hash_count	Count the number of null / non-null values in each group	`CountOptions`
hash_count_distinct	Count the distinct values in each group	`CountOptions`
hash_distinct	Keep the distinct values in each group	`CountOptions`
hash_max	Compute the maximum of values in each group	`ScalarAggregateOptions`
hash_mean	Compute the mean of values in each group	`ScalarAggregateOptions`
hash_min	Compute the minimum of values in each group	`ScalarAggregateOptions`
hash_min_max	Compute the minimum and maximum of values in each group	`ScalarAggregateOptions`
hash_product	Compute the product of values in each group	`ScalarAggregateOptions`
hash_stddev	Compute the standard deviation of values in each group
hash_sum	Sum values in each group	`ScalarAggregateOptions`
hash_tdigest	Compute approximate quantiles of values in each group	`TDigestOptions`
hash_variance	Compute the variance of values in each group
