6 Manipulating Data - Arrays

6.1 Introduction

An Arrow Array is roughly equivalent to an R vector: it represents a single column of data in which all values have the same data type.

A number of base R generic functions, for example mean(), min(), and max(), have methods that work on Arrow Arrays.

6.2 Filter by values matching a predicate or mask

You want to search for values in an Array that match a predicate condition.

6.2.1 Solution

my_values <- Array$create(c(1:5, NA))
my_values[my_values > 3]
## Array
## <int32>
## [
##   4,
##   5,
##   null
## ]

6.2.2 Discussion

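You can subset an Array with square brackets, [], just as you can an R vector. Here the comparison my_values > 3 produces a boolean mask; comparing NA with 3 yields null, so the corresponding element is kept as null in the result.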
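
As a minimal sketch of other ways to index an Array, the example below uses integer positions, an R logical vector as a mask, and the Array's Filter() method with keep_na = FALSE to drop the null produced by comparing against NA. The variable names first_two, odd_positions, and above_three are illustrative only.

```r
library(arrow)

my_values <- Array$create(c(1:5, NA))

# Positional indexing is 1-based, as in base R
first_two <- my_values[1:2]

# An R logical vector can act as the mask
odd_positions <- my_values[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]

# Filter() with keep_na = FALSE drops elements where the mask is null
above_three <- my_values$Filter(my_values > 3, keep_na = FALSE)

as.vector(above_three)
## [1] 4 5
```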

6.3 Compute the mean, min, max, etc. of an Array

You want to calculate the mean, minimum, or maximum of values in an array.

6.3.1 Solution

my_values <- Array$create(c(1:5, NA))
mean(my_values, na.rm = TRUE)
## Scalar
## 3

6.3.2 Discussion

Many base R generic functions, such as mean(), min(), and max(), have been mapped to their Arrow equivalents, so they can be called on Arrow Array objects in the same way as on R vectors. They return Arrow objects (for example, a Scalar) rather than R vectors.

If you want to use an R function which does not have an Arrow mapping, you can use as.vector() to convert Arrow objects to base R vectors.

arrow_array <- Array$create(1:100)
# get Tukey's five-number summary
fivenum(as.vector(arrow_array))
## [1]   1.0  25.5  50.5  75.5 100.0
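
The same conversion applies to the Scalar objects returned by the mapped functions. The sketch below, with illustrative variable names, computes a minimum and maximum in Arrow and converts the results to R values only when ordinary R arithmetic is needed.

```r
library(arrow)

my_values <- Array$create(c(1:5, NA))

smallest <- min(my_values, na.rm = TRUE)
largest <- max(my_values, na.rm = TRUE)

# The results are Arrow Scalars rather than R vectors
inherits(largest, "Scalar")
## [1] TRUE

# Convert to R values when they are needed in ordinary R code
as.vector(largest) - as.vector(smallest)
## [1] 4
```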

You can tell whether a function is a standard S3 generic by looking at its body: S3 generic functions call UseMethod() to determine the appropriate version of that function to use for the object.

mean
## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x564a10424388>
## <environment: namespace:base>

You can also use isS3stdGeneric() to determine if a function is an S3 generic.

isS3stdGeneric("mean")
## mean 
## TRUE

If there is an S3 generic function that you would like to use but that isn’t implemented for Arrow objects, please open an issue on the project JIRA.

6.4 Count occurrences of elements in an Array

You want to count repeated values in an Array.

6.4.1 Solution

repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
value_counts(repeated_vals)
## StructArray
## <struct<values: double, counts: int64>>
## -- is_valid: all not null
## -- child 0 type: double
##   [
##     1,
##     2,
##     3
##   ]
## -- child 1 type: int64
##   [
##     2,
##     1,
##     5
##   ]

6.4.2 Discussion

Some functions in the Arrow R package do not have base R equivalents. In other cases, the base R equivalents are not generic functions so they cannot be called directly on Arrow Array objects.

For example, the value_counts() function in the Arrow R package is loosely equivalent to the base R function table(), which is not a generic function.
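
To make the comparison concrete, here is a rough sketch that pulls the result of value_counts() into R and sets it beside table(). It assumes that as.vector() converts the StructArray into a data frame with values and counts columns; the names counts_df and base_counts are illustrative.

```r
library(arrow)

repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))

# Convert the StructArray of values and counts to an R object
counts_df <- as.vector(value_counts(repeated_vals))
counts_df$values
counts_df$counts

# The base R route: convert the Array first, then tabulate
base_counts <- table(as.vector(repeated_vals))
```

Both routes should report the same counts here; value_counts() keeps the work in Arrow, while table() needs the data converted to an R vector first.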

6.5 Apply arithmetic functions to Arrays

You want to use the various arithmetic operators on Array objects.

6.5.1 Solution

num_array <- Array$create(1:10)
num_array + 10
## Array
## <double>
## [
##   11,
##   12,
##   13,
##   14,
##   15,
##   16,
##   17,
##   18,
##   19,
##   20
## ]

6.5.2 Discussion

You get the same result if the value you add is itself an Arrow object, such as a Scalar.

num_array + Scalar$create(10)
## Array
## <double>
## [
##   11,
##   12,
##   13,
##   14,
##   15,
##   16,
##   17,
##   18,
##   19,
##   20
## ]
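
Other arithmetic operators follow the same pattern. The following sketch, with illustrative variable names, multiplies and subtracts with plain R values and adds two Arrays element-wise.

```r
library(arrow)

num_array <- Array$create(1:10)

doubled <- num_array * 2          # multiply by an R value
shifted <- num_array - 1          # subtract an R value
summed <- num_array + num_array   # element-wise sum of two Arrays

# Each result is an Arrow Array; convert to inspect it as an R vector
as.vector(summed)
```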

6.6 Call Arrow compute functions directly on Arrays

You want to call an Arrow compute function directly on an Array.

6.6.1 Solution

first_100_numbers <- Array$create(1:100)

# Calculate the variance of 1 to 100, setting the delta degrees of freedom to 0.
call_function("variance", first_100_numbers, options = list(ddof = 0))
## Scalar
## 833.25

6.6.2 Discussion

You can use call_function() to call Arrow compute functions directly on Scalar, Array, and ChunkedArray objects. The returned object will be an Arrow object.
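
As a further sketch, the example below checks that a compute function is available with list_compute_functions(), computes the sample variance (ddof = 1) for comparison with base R's var(), and calls a compute function on a ChunkedArray. The variable names are illustrative.

```r
library(arrow)

first_100_numbers <- Array$create(1:100)

# Check that a compute function is available under a given name
"variance" %in% list_compute_functions()

# Sample variance: one delta degree of freedom, matching base R's var()
sample_var <- call_function("variance", first_100_numbers, options = list(ddof = 1))
all.equal(as.vector(sample_var), var(1:100))

# call_function() also accepts a ChunkedArray
chunked <- chunked_array(1:50, 51:100)
total <- call_function("sum", chunked)
as.vector(total)
```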

6.6.3 See also

For a more in-depth discussion of Arrow compute functions, see the section on using Arrow functions in dplyr verbs in arrow.