Arrow Compute#

Apache Arrow provides compute functions to facilitate efficient and portable data processing. In this article, you will use Arrow’s compute functionality to:

  1. Calculate a sum over a column

  2. Calculate element-wise sums over two columns

  3. Search for a value in a column

Pre-requisites#

Before continuing, make sure you have:

  1. An Arrow installation, which you can set up here: Using Arrow C++ in your own project

  2. An understanding of basic Arrow data structures from Basic Arrow Data Structures

Setup#

Before running some computations, we need to fill in a couple gaps:

  1. We need to include necessary headers.

  2. A main() is needed to glue things together.

  3. We need data to play with.

Includes#

Before writing C++ code, we need some includes. We’ll get iostream for output, then import Arrow’s compute functionality:

#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>

Main()#

For our glue, we’ll use the main() pattern from the previous tutorial on data structures:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

Which, like when we used it before, is paired with a RunMain():

arrow::Status RunMain() {

Generating Tables for Computation#

Before we begin, we’ll initialize a Table with two columns to play with. We’ll use the method from Basic Arrow Data Structures, so look back there if anything’s confusing:

  // Create a couple 32-bit integer arrays.
  arrow::Int32Builder int32builder;
  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
  std::shared_ptr<arrow::Array> some_nums;
  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());

  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
  std::shared_ptr<arrow::Array> more_nums;
  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());

  // Make a table out of our pair of arrays.
  std::shared_ptr<arrow::Field> field_a, field_b;
  std::shared_ptr<arrow::Schema> schema;

  field_a = arrow::field("A", arrow::int32());
  field_b = arrow::field("B", arrow::int32());

  schema = arrow::schema({field_a, field_b});

  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);

Calculating a Sum over an Array#

Using a computation function has two general steps, which we separate here:

  1. Preparing a Datum for output

  2. Calling compute::Sum(), a convenience function for summation over an Array

  3. Retrieving and printing output

Prepare Memory for Output with Datum#

When computation is done, we need somewhere for our results to go. In Arrow, the object for such output is called Datum. This object is used to pass around inputs and outputs in compute functions, and can contain many differently-shaped Arrow data structures. We’ll need it to retrieve the output from compute functions.

  // The Datum class is what all compute functions output to, and they can take Datums
  // as inputs, as well.
  arrow::Datum sum;

Call Sum()#

Here, we’ll get our Table, which has columns “A” and “B”, and sum over column “A.” For summation, there is a convenience function, called compute::Sum(), which reduces the complexity of the compute interface. We’ll look at the more complex version for the next computation. For a given function, refer to Compute Functions to see if there is a convenience function. compute::Sum() takes in a given Array or ChunkedArray – here, we use Table::GetColumnByName() to pass in column A. Then, it outputs to a Datum. Putting that all together, we get this:

  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
  // computation won't be so simple. However, using these where possible helps
  // readability.
  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));

Get Results from Datum#

The previous step leaves us with a Datum which contains our sum. However, we cannot print it directly – its flexibility in holding arbitrary Arrow data structures means we have to retrieve our data carefully. First, to understand what’s in it, we can check which kind of data structure it is, then what kind of primitive is being held:

  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
  std::cout << "Datum kind: " << sum.ToString()
            << " content type: " << sum.type()->ToString() << std::endl;

This should report the Datum stores a Scalar with a 64-bit integer. Just to see what the value is, we can print it out like so, which yields 12891:

  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
  // you must ask for the correct type.
  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;

Now we’ve used compute::Sum() and gotten what we want out of it!

Calculating Element-Wise Array Addition with CallFunction()#

A next layer of complexity uses what compute::Sum() was helpfully hiding: compute::CallFunction(). For this example, we will explore how to use the more robust compute::CallFunction() with the “add” compute function. The pattern remains similar:

  1. Preparing a Datum for output

  2. Calling compute::CallFunction() with “add”

  3. Retrieving and printing output

Prepare Memory for Output with Datum#

Once more, we’ll need a Datum for any output we get:

  arrow::Datum element_wise_sum;

Use CallFunction() with “add”#

compute::CallFunction() takes the name of the desired function as its first argument, then the data inputs for said function as a vector in its second argument. Right now, we want an element-wise addition between columns “A” and “B”. So, we’ll ask for “add,” pass in columns “A and B”, and output to our Datum. Put this all together, and we get:

  // Get element-wise sum of both columns A and B in our Table. Note that here we use
  // CallFunction(), which takes the name of the function as the first argument.
  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
                                              "add", {table->GetColumnByName("A"),
                                                      table->GetColumnByName("B")}));

See also

Available functions for a list of other functions to go with compute::CallFunction()

Get Results from Datum#

Again, the Datum needs some careful handling. Said handling is much easier when we know what’s in it. This Datum holds a ChunkedArray with 32-bit integers, but we can print that to confirm:

  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
  std::cout << "Datum kind: " << element_wise_sum.ToString()
            << " content type: " << element_wise_sum.type()->ToString() << std::endl;

Since it’s a ChunkedArray, we request that from the DatumChunkedArray has a ChunkedArray::ToString() method, so we’ll use that to print out its contents:

  // This time, we get a ChunkedArray, not a scalar.
  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;

The output looks like this:

Datum kind: ChunkedArray content type: int32
[
  [
    75376,
    647,
    2287,
    5671,
    5092
  ]
]

Now, we’ve used compute::CallFunction(), instead of a convenience function! This enables a much wider range of available computations.

Searching for a Value with CallFunction() and Options#

One class of computations remains. compute::CallFunction() uses a vector for data inputs, but computation often needs additional arguments to function. In order to supply this, computation functions may be associated with structs where their arguments can be defined. You can check a given function to see which struct it uses here. For this example, we’ll search for a value in column “A” using the “index” compute function. This process has three steps, as opposed to the two from before:

  1. Preparing a Datum for output

  2. Preparing compute::IndexOptions

  3. Calling compute::CallFunction() with “index” and compute::IndexOptions

  4. Retrieving and printing output

Prepare Memory for Output with Datum#

We’ll need a Datum for any output we get:

  // Use an options struct to set up searching for 2223 in column A (the third item).
  arrow::Datum third_item;

Configure “index” with IndexOptions#

For this exploration, we’ll use the “index” function – this is a searching method, which returns the index of an input value. In order to pass this input value, we require an compute::IndexOptions struct. So, let’s make that struct:

  // An options struct is used in lieu of passing an arbitrary amount of arguments.
  arrow::compute::IndexOptions index_options;

In a searching function, one requires a target value. Here, we’ll use 2223, the third item in column A, and configure our struct accordingly:

  // We need an Arrow Scalar, not a raw value.
  index_options.value = arrow::MakeScalar(2223);

Use CallFunction() with “index” and IndexOptions#

To actually run the function, we use compute::CallFunction() again, this time passing our IndexOptions struct by reference as a third argument. As before, the first argument is the name of the function, and the second our data input:

  ARROW_ASSIGN_OR_RAISE(
      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
                                               &index_options));

Get Results from Datum#

One last time, let’s see what our Datum has! This will be a Scalar with a 64-bit integer, and the output will be 2:

  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
  std::cout << "Datum kind: " << third_item.ToString()
            << " content type: " << third_item.type()->ToString() << std::endl;
  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;

Ending Program#

At the end, we just return arrow::Status::OK(), so the main() knows that we’re done, and that everything’s okay, just like the preceding tutorials.

  return arrow::Status::OK();
}

With that, you’ve used compute functions which fall into the three main types – with and without convenience functions, then with an Options struct. Now you can process any Table you need to, and solve whatever data problem you have that fits into memory!

Which means that now we have to see how we can work with larger-than-memory datasets, via Arrow Datasets in the next article.

Refer to the below for a copy of the complete code:

 19// (Doc section: Includes)
 20#include <arrow/api.h>
 21#include <arrow/compute/api.h>
 22
 23#include <iostream>
 24// (Doc section: Includes)
 25
 26// (Doc section: RunMain)
 27arrow::Status RunMain() {
 28  // (Doc section: RunMain)
 29  // (Doc section: Create Tables)
 30  // Create a couple 32-bit integer arrays.
 31  arrow::Int32Builder int32builder;
 32  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
 33  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
 34  std::shared_ptr<arrow::Array> some_nums;
 35  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());
 36
 37  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
 38  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
 39  std::shared_ptr<arrow::Array> more_nums;
 40  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());
 41
 42  // Make a table out of our pair of arrays.
 43  std::shared_ptr<arrow::Field> field_a, field_b;
 44  std::shared_ptr<arrow::Schema> schema;
 45
 46  field_a = arrow::field("A", arrow::int32());
 47  field_b = arrow::field("B", arrow::int32());
 48
 49  schema = arrow::schema({field_a, field_b});
 50
 51  std::shared_ptr<arrow::Table> table;
 52  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);
 53  // (Doc section: Create Tables)
 54
 55  // (Doc section: Sum Datum Declaration)
 56  // The Datum class is what all compute functions output to, and they can take Datums
 57  // as inputs, as well.
 58  arrow::Datum sum;
 59  // (Doc section: Sum Datum Declaration)
 60  // (Doc section: Sum Call)
 61  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
 62  // computation won't be so simple. However, using these where possible helps
 63  // readability.
 64  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));
 65  // (Doc section: Sum Call)
 66  // (Doc section: Sum Datum Type)
 67  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
 68  std::cout << "Datum kind: " << sum.ToString()
 69            << " content type: " << sum.type()->ToString() << std::endl;
 70  // (Doc section: Sum Datum Type)
 71  // (Doc section: Sum Contents)
 72  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
 73  // you must ask for the correct type.
 74  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;
 75  // (Doc section: Sum Contents)
 76
 77  // (Doc section: Add Datum Declaration)
 78  arrow::Datum element_wise_sum;
 79  // (Doc section: Add Datum Declaration)
 80  // (Doc section: Add Call)
 81  // Get element-wise sum of both columns A and B in our Table. Note that here we use
 82  // CallFunction(), which takes the name of the function as the first argument.
 83  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
 84                                              "add", {table->GetColumnByName("A"),
 85                                                      table->GetColumnByName("B")}));
 86  // (Doc section: Add Call)
 87  // (Doc section: Add Datum Type)
 88  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
 89  std::cout << "Datum kind: " << element_wise_sum.ToString()
 90            << " content type: " << element_wise_sum.type()->ToString() << std::endl;
 91  // (Doc section: Add Datum Type)
 92  // (Doc section: Add Contents)
 93  // This time, we get a ChunkedArray, not a scalar.
 94  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;
 95  // (Doc section: Add Contents)
 96
 97  // (Doc section: Index Datum Declare)
 98  // Use an options struct to set up searching for 2223 in column A (the third item).
 99  arrow::Datum third_item;
100  // (Doc section: Index Datum Declare)
101  // (Doc section: IndexOptions Declare)
102  // An options struct is used in lieu of passing an arbitrary amount of arguments.
103  arrow::compute::IndexOptions index_options;
104  // (Doc section: IndexOptions Declare)
105  // (Doc section: IndexOptions Assign)
106  // We need an Arrow Scalar, not a raw value.
107  index_options.value = arrow::MakeScalar(2223);
108  // (Doc section: IndexOptions Assign)
109  // (Doc section: Index Call)
110  ARROW_ASSIGN_OR_RAISE(
111      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
112                                               &index_options));
113  // (Doc section: Index Call)
114  // (Doc section: Index Inspection)
115  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
116  std::cout << "Datum kind: " << third_item.ToString()
117            << " content type: " << third_item.type()->ToString() << std::endl;
118  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
119  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;
120  // (Doc section: Index Inspection)
121  // (Doc section: Ret)
122  return arrow::Status::OK();
123}
124// (Doc section: Ret)
125
126// (Doc section: Main)
127int main() {
128  arrow::Status st = RunMain();
129  if (!st.ok()) {
130    std::cerr << st << std::endl;
131    return 1;
132  }
133  return 0;
134}
135// (Doc section: Main)