The arrow package provides functionality allowing users to manipulate
tabular Arrow data (Table
and Dataset
objects)
with familiar dplyr syntax. To enable this functionality,
ensure that the arrow and dplyr packages are both loaded. In this
article we will take the starwars
data set included in
dplyr, convert it to an Arrow Table, and then analyze this data. Note
that, although these examples all use an in-memory Table
object, the same functionality works for an on-disk Dataset
object with only minor differences in behavior (documented later in the
article).
To get started let’s load the packages and create the data:
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
sw <- arrow_table(starwars, as_data_frame = FALSE)
One-table dplyr verbs
The arrow package provides support for the dplyr one-table verbs,
allowing users to construct data analysis pipelines in a familiar way.
The example below shows the use of filter()
,
rename()
, mutate()
, arrange()
and
select()
:
result <- sw %>%
filter(homeworld == "Tatooine") %>%
rename(height_cm = height, mass_kg = mass) %>%
mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
arrange(desc(birth_year)) %>%
select(name, height_in, mass_lbs)
It is important to note that arrow uses lazy evaluation to delay
computation until the result is explicitly requested. This speeds up
processing by enabling the Arrow C++ library to perform multiple
computations in one operation. As a consequence of this design choice,
we have not yet performed computations on the sw
data. The
result
variable is an object with class
arrow_dplyr_query
that represents all the computations to
be performed:
result
## Table (query)
## name: string
## height_in: double (divide(cast(height, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2.54, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})))
## mass_lbs: double (multiply_checked(mass, 2.2046))
##
## * Filter: (homeworld == "Tatooine")
## * Sorted by birth_year [desc]
## See $.data for the source Arrow object
To perform these computations and materialize the result, we call
compute()
or collect()
. The difference between
the two determines what kind of object will be returned. Calling
compute()
returns an Arrow Table, suitable for passing to
other arrow or dplyr functions:
compute(result)
## Table
## 10 rows x 3 columns
## $name <string>
## $height_in <double>
## $mass_lbs <double>
In contrast, collect()
returns an R data frame, suitable
for viewing or passing to other R functions for analysis or
visualization:
collect(result)
## # A tibble: 10 x 3
## name height_in mass_lbs
## <chr> <dbl> <dbl>
## 1 C-3PO 65.7 165.
## 2 Cliegg Lars 72.0 NA
## 3 Shmi Skywalker 64.2 NA
## 4 Owen Lars 70.1 265.
## 5 Beru Whitesun Lars 65.0 165.
## 6 Darth Vader 79.5 300.
## 7 Anakin Skywalker 74.0 185.
## 8 Biggs Darklighter 72.0 185.
## 9 Luke Skywalker 67.7 170.
## 10 R5-D4 38.2 70.5
The arrow package has broad support for single-table dplyr verbs,
including those that compute aggregates. For example, it supports
group_by()
and summarize()
, as well as
commonly-used convenience functions such as count()
:
## # A tibble: 38 x 2
## species mean_height
## <chr> <dbl>
## 1 Human 178
## 2 Droid 131.
## 3 Wookiee 231
## 4 Rodian 173
## 5 Hutt 175
## 6 NA 175
## 7 Yoda's species 66
## 8 Trandoshan 190
## 9 Mon Calamari 180
## 10 Ewok 88
## # i 28 more rows
## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 masculine 66
## 2 feminine 17
## 3 NA 4
Note, however, that window functions such as ntile()
are
not yet supported.
Two-table dplyr verbs
Equality joins (e.g. left_join()
,
inner_join()
) are supported for joining multiple tables.
This is illustrated below:
jedi <- data.frame(
name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
jedi = c(FALSE, TRUE, TRUE)
)
sw %>%
select(1:3) %>%
right_join(jedi) %>%
collect()
## # A tibble: 3 x 4
## name height mass jedi
## <chr> <int> <dbl> <lgl>
## 1 Luke Skywalker 172 77 TRUE
## 2 C-3PO 167 75 FALSE
## 3 Obi-Wan Kenobi 182 77 TRUE
Expressions within dplyr verbs
Inside dplyr verbs, Arrow offers support for many functions and operators, with common functions mapped to their base R and tidyverse equivalents: you can find a list of supported functions within dplyr queries in the function documentation. If there are additional functions you would like to see implemented, please file an issue as described in the Getting help guidelines.
Registering custom bindings
The arrow package makes it possible for users to supply bindings for
custom functions in some situations using
register_scalar_function()
. To operate correctly, the
to-be-registered function must have context
as its first
argument, as required by the query engine. For example, suppose we
wanted to implement a function that converts a string to snake case (a
greatly simplified version of janitor::make_clean_names()
).
The function could be written as follows:
to_snake_name <- function(context, string) {
replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
string %>%
stringr::str_replace_all(replace) %>%
stringr::str_to_lower() %>%
stringi::stri_trans_general(id = "Latin-ASCII")
}
To call this within an arrow/dplyr pipeline, it needs to be registered:
register_scalar_function(
name = "to_snake_name",
fun = to_snake_name,
in_type = utf8(),
out_type = utf8(),
auto_convert = TRUE
)
In this expression, the name
argument specifies the name
by which it will be recognized in the context of the arrow/dplyr
pipeline and fun
is the function itself. The
in_type
and out_type
arguments are used to
specify the expected data type for the input and output, and
auto_convert
specifies whether arrow should automatically
convert any R inputs to their Arrow equivalents.
Once registered, the following works:
## # A tibble: 87 x 2
## name snake_name
## <chr> <chr>
## 1 Luke Skywalker luke_skywalker
## 2 C-3PO c3po
## 3 R2-D2 r2d2
## 4 Darth Vader darth_vader
## 5 Leia Organa leia_organa
## 6 Owen Lars owen_lars
## 7 Beru Whitesun Lars beru_whitesun_lars
## 8 R5-D4 r5d4
## 9 Biggs Darklighter biggs_darklighter
## 10 Obi-Wan Kenobi obiwan_kenobi
## # i 77 more rows
To learn more, see
help("register_scalar_function", package = "arrow")
.
Handling unsupported expressions
For dplyr queries on Table objects, which are held in memory and
should usually be representable as data frames, if the arrow package
detects an unimplemented function within a dplyr verb, it automatically
calls collect()
to return the data as an R data frame
before processing that dplyr verb. As an example, neither
lm()
nor residuals()
are implemented, so if we
write code that computes the residuals for a linear regression model,
this automatic collection takes place:
sw %>%
filter(!is.na(height), !is.na(mass)) %>%
transmute(name, height, mass, res = residuals(lm(mass ~ height)))
## Warning: In residuals(lm(mass ~ height)):
## i Expression not supported in Arrow
## > Pulling data into R
## # A tibble: 59 x 4
## name height mass res
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 -18.8
## 2 C-3PO 167 75 -17.7
## 3 R2-D2 96 32 -16.4
## 4 Darth Vader 202 136 21.4
## 5 Leia Organa 150 49 -33.1
## 6 Owen Lars 178 120 20.4
## 7 Beru Whitesun Lars 165 75 -16.5
## 8 R5-D4 97 32 -17.0
## 9 Biggs Darklighter 183 84 -18.7
## 10 Obi-Wan Kenobi 182 77 -25.1
## # i 49 more rows
For queries on Dataset
objects – which can be larger
than memory – arrow is more conservative and always raises an error if
it detects an unsupported expression. To illustrate this behavior, we
can write the starwars
data to disk and then open it as a
Dataset. When we use the same pipeline on the Dataset, we obtain an
error:
# write and open starwars dataset
dataset_path <- tempfile()
write_dataset(starwars, dataset_path)
sw2 <- open_dataset(dataset_path)
# dplyr pipeline with unsupported expressions
sw2 %>%
filter(!is.na(height), !is.na(mass)) %>%
transmute(name, height, mass, res = residuals(lm(mass ~ height)))
## Error in `residuals()`:
## ! Expression not supported in Arrow
## > Call collect() first to pull data into R.
Calling collect()
in the middle of the pipeline fixes
the issue:
sw2 %>%
filter(!is.na(height), !is.na(mass)) %>%
collect() %>%
transmute(name, height, mass, res = residuals(lm(mass ~ height)))
## # A tibble: 59 x 4
## name height mass res
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 -18.8
## 2 C-3PO 167 75 -17.7
## 3 R2-D2 96 32 -16.4
## 4 Darth Vader 202 136 21.4
## 5 Leia Organa 150 49 -33.1
## 6 Owen Lars 178 120 20.4
## 7 Beru Whitesun Lars 165 75 -16.5
## 8 R5-D4 97 32 -17.0
## 9 Biggs Darklighter 183 84 -18.7
## 10 Obi-Wan Kenobi 182 77 -25.1
## # i 49 more rows
For some operations, you can use DuckDB. It supports Arrow natively, so
you can pass the Dataset
or query object to DuckDB without
paying a performance penalty using the helper function
to_duckdb()
and pass the object back to Arrow with
to_arrow()
:
sw %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
to_duckdb() %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE)) %>%
to_arrow() %>%
# perform other arrow operations...
collect()
## # A tibble: 28 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Yoda 66 17 white
## 2 Watto 137 NA black
## 3 Shmi Skywalker 163 NA black
## 4 Eeth Koth 171 NA black
## 5 Luminara Unduli 170 56.2 black
## 6 Barriss Offee 166 50 black
## 7 Luke Skywalker 172 77 blond
## 8 Finis Valorum 170 NA blond
## 9 R4-P17 96 NA none
## 10 Lobot 175 79 none
## # i 18 more rows
Further reading
- To learn more about multi-file datasets, see the dataset article.
- To learn more about user-registered functions, see
help("register_scalar_function", package = "arrow")
. - To learn more about writing dplyr bindings as an arrow developer, see the article on writing bindings.