Arrow Compute#
Apache Arrow provides compute functions to facilitate efficient and portable data processing. In this article, you will use Arrow’s compute functionality to:
Calculate a sum over a column
Calculate element-wise sums over two columns
Search for a value in a column
Pre-requisites#
Before continuing, make sure you have:
An Arrow installation, which you can set up here: Using Arrow C++ in your own project. If you’re compiling Arrow yourself, be sure you compile with the compute module enabled (i.e., -DARROW_COMPUTE=ON); see Optional Components.
An understanding of basic Arrow data structures from Basic Arrow Data Structures
Setup#
Before running some computations, we need to fill in a couple of gaps:
We need to include necessary headers.
A main() is needed to glue things together.
We need data to play with.
Includes#
Before writing C++ code, we need some includes. We’ll get iostream for output,
then import Arrow’s compute functionality:
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <iostream>
Main()#
For our glue, we’ll use the main() pattern from the previous tutorial on
data structures:
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
As before, this is paired with a RunMain():
arrow::Status RunMain() {
Generating Tables for Computation#
Before we begin, we’ll initialize a Table with two columns to play with. We’ll use
the method from Basic Arrow Data Structures, so look back
there if anything’s confusing:
// Create a couple 32-bit integer arrays.
arrow::Int32Builder int32builder;
int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
std::shared_ptr<arrow::Array> some_nums;
ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());
int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
std::shared_ptr<arrow::Array> more_nums;
ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());
// Make a table out of our pair of arrays.
std::shared_ptr<arrow::Field> field_a, field_b;
std::shared_ptr<arrow::Schema> schema;
field_a = arrow::field("A", arrow::int32());
field_b = arrow::field("B", arrow::int32());
schema = arrow::schema({field_a, field_b});
std::shared_ptr<arrow::Table> table;
table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);
Calculating a Sum over an Array#
Using a computation function has three general steps, which we separate here:
Preparing a Datum for output
Calling compute::Sum(), a convenience function for summation over an Array
Retrieving and printing output
Prepare Memory for Output with Datum#
When computation is done, we need somewhere for our results to go. In
Arrow, the object for such output is called Datum. This object is used
to pass around inputs and outputs in compute functions, and can contain
many differently-shaped Arrow data structures. We’ll need it to retrieve
the output from compute functions.
// The Datum class is what all compute functions output to, and they can take Datums
// as inputs, as well.
arrow::Datum sum;
Call Sum()#
Here, we’ll take our Table, which has columns “A” and “B”, and sum over
column “A”. For summation, there is a convenience function called
compute::Sum(), which reduces the complexity of the compute interface. We’ll look
at the more complex version for the next computation. For a given
function, refer to Compute Functions to see if there is a
convenience function. compute::Sum() takes in a given Array or ChunkedArray;
here, we use Table::GetColumnByName() to pass in column A. Then, it outputs to
a Datum. Putting that all together, we get this:
// Here, we can use arrow::compute::Sum. This is a convenience function, and the next
// computation won't be so simple. However, using these where possible helps
// readability.
ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));
Get Results from Datum#
The previous step leaves us with a Datum which contains our sum.
However, we cannot print it directly: its flexibility in holding
arbitrary Arrow data structures means we have to retrieve our data
carefully. First, to understand what’s in it, we can check which kind of
data structure it is, then what kind of primitive is being held:
// Get the kind of Datum and what it holds -- this is a Scalar, with int64.
std::cout << "Datum kind: " << sum.ToString()
<< " content type: " << sum.type()->ToString() << std::endl;
This should report that the Datum stores a Scalar with a 64-bit integer. Just
to see what the value is, we can print it out like so, which yields
12891:
// Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
// you must ask for the correct type.
std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;
Now we’ve used compute::Sum() and gotten what we want out of it!
Calculating Element-Wise Array Addition with CallFunction()#
A next layer of complexity uses what compute::Sum() was helpfully hiding:
compute::CallFunction(). For this example, we will explore how to use the more
general compute::CallFunction() with the “add” compute function. The pattern
remains similar:
Preparing a Datum for output
Calling compute::CallFunction() with “add”
Retrieving and printing output
Prepare Memory for Output with Datum#
Once more, we’ll need a Datum for any output we get:
arrow::Datum element_wise_sum;
Use CallFunction() with “add”#
compute::CallFunction() takes the name of the desired function as its first
argument, then the data inputs for said function as a vector in its
second argument. Right now, we want an element-wise addition between
columns “A” and “B”. So, we’ll ask for “add”, pass in columns “A” and “B”,
and output to our Datum. Put this all together, and we get:
// Get element-wise sum of both columns A and B in our Table. Note that here we use
// CallFunction(), which takes the name of the function as the first argument.
ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
"add", {table->GetColumnByName("A"),
table->GetColumnByName("B")}));
See also: Available functions, for a list of other functions to go with compute::CallFunction()
Get Results from Datum#
Again, the Datum needs some careful handling. Said handling is much
easier when we know what’s in it. This Datum holds a ChunkedArray with
32-bit integers, but we can print that to confirm:
// Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
std::cout << "Datum kind: " << element_wise_sum.ToString()
<< " content type: " << element_wise_sum.type()->ToString() << std::endl;
Since it’s a ChunkedArray, we request that from the Datum. ChunkedArray
has a ChunkedArray::ToString() method, so we’ll use that to print out its contents:
// This time, we get a ChunkedArray, not a scalar.
std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;
The output looks like this:
Datum kind: ChunkedArray content type: int32
[
[
75376,
647,
2287,
5671,
5092
]
]
Now, we’ve used compute::CallFunction() instead of a convenience function! This
enables a much wider range of available computations.
Searching for a Value with CallFunction() and Options#
One class of computations remains. compute::CallFunction() uses a vector for data
inputs, but computation often needs additional arguments to function. To
supply these, compute functions may be associated with
structs where their arguments can be defined. You can check a given
function to see which struct it uses here. For this example, we’ll search for a value in column “A” using
the “index” compute function. This process has four steps, as opposed
to the three from before:
Preparing a Datum for output
Preparing compute::IndexOptions
Calling compute::CallFunction() with “index” and compute::IndexOptions
Retrieving and printing output
Prepare Memory for Output with Datum#
We’ll need a Datum for any output we get:
// Use an options struct to set up searching for 2223 in column A (the third item).
arrow::Datum third_item;
Configure “index” with IndexOptions#
For this exploration, we’ll use the “index” function. This is a
searching method, which returns the index of an input value. In order to
pass this input value, we require a compute::IndexOptions struct. So, let’s make
that struct:
// An options struct is used in lieu of passing an arbitrary amount of arguments.
arrow::compute::IndexOptions index_options;
In a searching function, one requires a target value. Here, we’ll use 2223, the third item in column A, and configure our struct accordingly:
// We need an Arrow Scalar, not a raw value.
index_options.value = arrow::MakeScalar(2223);
Use CallFunction() with “index” and IndexOptions#
To actually run the function, we use compute::CallFunction() again, this time
passing a pointer to our IndexOptions struct as a third argument. As
before, the first argument is the name of the function, and the second
our data input:
ARROW_ASSIGN_OR_RAISE(
third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
&index_options));
Get Results from Datum#
One last time, let’s see what our Datum has! This will be a Scalar with
a 64-bit integer, and the output will be 2:
// Get the kind of Datum and what it holds -- this is a Scalar, with int64
std::cout << "Datum kind: " << third_item.ToString()
<< " content type: " << third_item.type()->ToString() << std::endl;
// We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;
Ending Program#
At the end, we just return arrow::Status::OK(), so the main() knows that
we’re done, and that everything’s okay, just like the preceding
tutorials.
return arrow::Status::OK();
}
With that, you’ve used compute functions which fall into the three main
types: with and without convenience functions, then with an Options
struct. Now you can process any Table you need to, and solve whatever
data problem you have that fits into memory!
That leaves larger-than-memory datasets, which we cover with Arrow Datasets in the next article.
A copy of the complete code follows:
// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: Create Tables)
  // Create a couple 32-bit integer arrays.
  arrow::Int32Builder int32builder;
  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
  std::shared_ptr<arrow::Array> some_nums;
  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());

  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
  std::shared_ptr<arrow::Array> more_nums;
  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());

  // Make a table out of our pair of arrays.
  std::shared_ptr<arrow::Field> field_a, field_b;
  std::shared_ptr<arrow::Schema> schema;

  field_a = arrow::field("A", arrow::int32());
  field_b = arrow::field("B", arrow::int32());

  schema = arrow::schema({field_a, field_b});

  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);
  // (Doc section: Create Tables)

  // (Doc section: Sum Datum Declaration)
  // The Datum class is what all compute functions output to, and they can take Datums
  // as inputs, as well.
  arrow::Datum sum;
  // (Doc section: Sum Datum Declaration)
  // (Doc section: Sum Call)
  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
  // computation won't be so simple. However, using these where possible helps
  // readability.
  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));
  // (Doc section: Sum Call)
  // (Doc section: Sum Datum Type)
  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
  std::cout << "Datum kind: " << sum.ToString()
            << " content type: " << sum.type()->ToString() << std::endl;
  // (Doc section: Sum Datum Type)
  // (Doc section: Sum Contents)
  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
  // you must ask for the correct type.
  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;
  // (Doc section: Sum Contents)

  // (Doc section: Add Datum Declaration)
  arrow::Datum element_wise_sum;
  // (Doc section: Add Datum Declaration)
  // (Doc section: Add Call)
  // Get element-wise sum of both columns A and B in our Table. Note that here we use
  // CallFunction(), which takes the name of the function as the first argument.
  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
                                              "add", {table->GetColumnByName("A"),
                                                      table->GetColumnByName("B")}));
  // (Doc section: Add Call)
  // (Doc section: Add Datum Type)
  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
  std::cout << "Datum kind: " << element_wise_sum.ToString()
            << " content type: " << element_wise_sum.type()->ToString() << std::endl;
  // (Doc section: Add Datum Type)
  // (Doc section: Add Contents)
  // This time, we get a ChunkedArray, not a scalar.
  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;
  // (Doc section: Add Contents)

  // (Doc section: Index Datum Declare)
  // Use an options struct to set up searching for 2223 in column A (the third item).
  arrow::Datum third_item;
  // (Doc section: Index Datum Declare)
  // (Doc section: IndexOptions Declare)
  // An options struct is used in lieu of passing an arbitrary amount of arguments.
  arrow::compute::IndexOptions index_options;
  // (Doc section: IndexOptions Declare)
  // (Doc section: IndexOptions Assign)
  // We need an Arrow Scalar, not a raw value.
  index_options.value = arrow::MakeScalar(2223);
  // (Doc section: IndexOptions Assign)
  // (Doc section: Index Call)
  ARROW_ASSIGN_OR_RAISE(
      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
                                               &index_options));
  // (Doc section: Index Call)
  // (Doc section: Index Inspection)
  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
  std::cout << "Datum kind: " << third_item.ToString()
            << " content type: " << third_item.type()->ToString() << std::endl;
  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;
  // (Doc section: Index Inspection)
  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)