6 Manipulating Data - Arrays
6.1 Introduction
An Arrow Array is roughly equivalent to an R vector - it can be used to represent a single column of data, with all values having the same data type.
A number of base R functions which have S3 generic methods have been implemented to work on Arrow Arrays; for example mean, min, and max.
6.2 Filter by values matching a predicate or mask
You want to search for values in an Array that match a predicate condition.
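A minimal sketch of one way to do this, assuming the arrow package is attached and that comparison operators and square-bracket indexing work on Arrays much as they do on R vectors. The names `my_values`, `mask`, and `filtered` are illustrative only:

```r
library(arrow)

# Illustrative data; the variable names are assumptions, not from the text above
my_values <- Array$create(1:5)

# Comparing an Array with a value yields a boolean Array that acts as a mask
mask <- my_values > 3

# Square-bracket indexing with the mask keeps only the matching elements
filtered <- my_values[mask]

# Convert back to a base R vector to inspect the result
as.vector(filtered)
```

Building the mask as a separate object is optional; `my_values[my_values > 3]` expresses the same filter in one step.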
6.3 Compute Mean/Min/Max, etc value of an Array
You want to calculate the mean, minimum, or maximum of values in an array.
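One possible approach, sketched under the assumption that the arrow package is attached: call the base R generics directly on the Array. The variable names here are illustrative, and `na.rm = TRUE` is used because the example data contains a missing value:

```r
library(arrow)

# Illustrative data with one missing value
my_values <- Array$create(c(1:5, NA))

# na.rm = TRUE skips the missing value, as it would for a base R vector
mean_val <- mean(my_values, na.rm = TRUE)
min_val <- min(my_values, na.rm = TRUE)
max_val <- max(my_values, na.rm = TRUE)

# Each result is an Arrow Scalar; as.vector() converts it to a plain R value
as.vector(mean_val)
```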
6.3.2 Discussion
Many base R generic functions such as mean(), min(), and max() have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way. They will return Arrow objects themselves.
If you want to use an R function which does not have an Arrow mapping, you can use as.vector() to convert Arrow objects to base R vectors.
arrow_array <- Array$create(1:100)
# get Tukey's five-number summary
fivenum(as.vector(arrow_array))
## [1] 1.0 25.5 50.5 75.5 100.0
You can tell if a function is a standard S3 generic function by looking at the body of the function. S3 generic functions call UseMethod() to determine the appropriate version of that function to use for the object.
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x564a10424388>
## <environment: namespace:base>
You can also use isS3stdGeneric() to determine if a function is an S3 generic.
isS3stdGeneric("mean")
## mean
## TRUE
If there is an S3 generic function that you would like to use but that isn’t implemented for Arrow objects, please open an issue on the project JIRA.
6.4 Count occurrences of elements in an Array
You want to count repeated values in an Array.
6.4.1 Solution
repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
value_counts(repeated_vals)
## StructArray
## <struct<values: double, counts: int64>>
## -- is_valid: all not null
## -- child 0 type: double
## [
## 1,
## 2,
## 3
## ]
## -- child 1 type: int64
## [
## 2,
## 1,
## 5
## ]
6.4.2 Discussion
Some functions in the Arrow R package do not have base R equivalents. In other cases, the base R equivalents are not generic functions so they cannot be called directly on Arrow Array objects.
For example, the value_counts() function in the Arrow R package is loosely equivalent to the base R function table(), which is not a generic function.
6.5 Apply arithmetic functions to Arrays
You want to use the various arithmetic operators on Array objects.
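A short sketch of how this might look, assuming the arrow package is attached and that the standard arithmetic operators are defined for Arrays. The names `num_array`, `plus_ten`, and `plus_ten_scalar` are illustrative:

```r
library(arrow)

# Illustrative data
num_array <- Array$create(1:10)

# Add a plain R value to every element of the Array
plus_ten <- num_array + 10

# The same addition with the right-hand side wrapped as an Arrow Scalar
plus_ten_scalar <- num_array + Scalar$create(10)

# Convert back to a base R vector to inspect the result
as.vector(plus_ten)
```

Both forms should produce the same values; whether the right-hand side is an R value or an Arrow object is a matter of convenience.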
6.6 Call Arrow compute functions directly on Arrays
You want to call an Arrow compute function directly on an Array.
6.6.1 Solution
first_100_numbers <- Array$create(1:100)
# Calculate the variance of 1 to 100, setting the delta degrees of freedom to 0.
call_function("variance", first_100_numbers, options = list(ddof = 0))
## Scalar
## 833.25
6.6.2 Discussion
You can use call_function() to call Arrow compute functions directly on Scalar, Array, and ChunkedArray objects. The returned object will be an Arrow object.
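As a hedged illustration of how the `options` list changes the result, the sketch below reuses the "variance" compute function with `ddof = 1`, which should correspond to the sample variance that base R's var() computes. The variable names are illustrative, and the comparison with var() is offered as a sanity check rather than as documented behaviour:

```r
library(arrow)

# Illustrative data
first_100_numbers <- Array$create(1:100)

# ddof = 1 requests the sample variance (divide by n - 1)
sample_var <- call_function("variance", first_100_numbers, options = list(ddof = 1))

# The result is an Arrow Scalar; convert it to compare with base R
arrow_result <- as.vector(sample_var)
base_result <- var(1:100)

all.equal(arrow_result, base_result)
```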
6.6.3 See also
For a more in-depth discussion of Arrow compute functions, see the section on using arrow functions in dplyr verbs in arrow