The arrow package provides functionality allowing users to manipulate tabular Arrow data (Table
and Dataset
objects) with familiar dplyr syntax. To enable this functionality, ensure that the arrow and dplyr packages are both loaded. In this article we will take the starwars
data set included in dplyr, convert it to an Arrow Table, and then analyze this data. Note that, although these examples all use an in-memory Table
object, the same functionality works for an on-disk Dataset
object with only minor differences in behavior (documented later in the article).
To get started let’s load the packages and create the data:
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
sw <- arrow_table(starwars, as_data_frame = FALSE)
One-table dplyr verbs
The arrow package provides support for the dplyr one-table verbs, allowing users to construct data analysis pipelines in a familiar way. The example below shows the use of filter()
, rename()
, mutate()
, arrange()
and select()
:
result <- sw %>%
filter(homeworld == "Tatooine") %>%
rename(height_cm = height, mass_kg = mass) %>%
mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
arrange(desc(birth_year)) %>%
select(name, height_in, mass_lbs)
It is important to note that arrow uses lazy evaluation to delay computation until the result is explicitly requested. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation. As a consequence of this design choice, we have not yet performed computations on the sw
data. The result
variable is an object with class arrow_dplyr_query
that represents all the computations to be performed:
result
## Table (query)
## name: string
## height_in: double (divide(cast(height, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2.54, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})))
## mass_lbs: double (multiply_checked(mass, 2.2046))
##
## * Filter: (homeworld == "Tatooine")
## * Sorted by birth_year [desc]
## See $.data for the source Arrow object
To perform these computations and materialize the result, we call compute()
or collect()
. The difference between the two determines what kind of object will be returned. Calling compute()
returns an Arrow Table, suitable for passing to other arrow or dplyr functions:
compute(result)
## Table
## 10 rows x 3 columns
## $name <string>
## $height_in <double>
## $mass_lbs <double>
In contrast, collect()
returns an R data frame, suitable for viewing or passing to other R functions for analysis or visualization:
collect(result)
## # A tibble: 10 x 3
## name height_in mass_lbs
## <chr> <dbl> <dbl>
## 1 C-3PO 65.7 165.
## 2 Cliegg Lars 72.0 NA
## 3 Shmi Skywalker 64.2 NA
## 4 Owen Lars 70.1 265.
## 5 Beru Whitesun Lars 65.0 165.
## 6 Darth Vader 79.5 300.
## 7 Anakin Skywalker 74.0 185.
## 8 Biggs Darklighter 72.0 185.
## 9 Luke Skywalker 67.7 170.
## 10 R5-D4 38.2 70.5
The arrow package has broad support for single-table dplyr verbs, including those that compute aggregates. For example, it supports group_by()
and summarize()
, as well as commonly-used convenience functions such as count()
:
## # A tibble: 38 x 2
## species mean_height
## <chr> <dbl>
## 1 Human 178
## 2 Droid 131.
## 3 Wookiee 231
## 4 Rodian 173
## 5 Hutt 175
## 6 NA 175
## 7 Yoda's species 66
## 8 Trandoshan 190
## 9 Mon Calamari 180
## 10 Ewok 88
## # i 28 more rows
## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 masculine 66
## 2 feminine 17
## 3 NA 4
Note, however, that window functions such as ntile()
are not yet supported.
Two-table dplyr verbs
Equality joins (e.g. left_join()
, inner_join()
) are supported for joining multiple tables. This is illustrated below:
jedi <- data.frame(
name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
jedi = c(FALSE, TRUE, TRUE)
)
sw %>%
select(1:3) %>%
right_join(jedi) %>%
collect()
## # A tibble: 3 x 4
## name height mass jedi
## <chr> <int> <dbl> <lgl>
## 1 Luke Skywalker 172 77 TRUE
## 2 C-3PO 167 75 FALSE
## 3 Obi-Wan Kenobi 182 77 TRUE
Expressions within dplyr verbs
Inside dplyr verbs, Arrow offers support for many functions and operators, with common functions mapped to their base R and tidyverse equivalents: you can find a list of supported functions within dplyr queries in the function documentation. If there are additional functions you would like to see implemented, please file an issue as described in the Getting help guidelines.
Registering custom bindings
The arrow package makes it possible for users to supply bindings for custom functions in some situations using register_scalar_function()
. To operate correctly, the to-be-registered function must have context
as its first argument, as required by the query engine. For example, suppose we wanted to implement a function that converts a string to snake case (a greatly simplified version of janitor::make_clean_names()
). The function could be written as follows:
to_snake_name <- function(context, string) {
replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
string %>%
stringr::str_replace_all(replace) %>%
stringr::str_to_lower() %>%
stringi::stri_trans_general(id = "Latin-ASCII")
}
To call this within an arrow/dplyr pipeline, it needs to be registered:
register_scalar_function(
name = "to_snake_name",
fun = to_snake_name,
in_type = utf8(),
out_type = utf8(),
auto_convert = TRUE
)
In this expression, the name
argument specifies the name by which it will be recognized in the context of the arrow/dplyr pipeline and fun
is the function itself. The in_type
and out_type
arguments are used to specify the expected data type for the input and output, and auto_convert
specifies whether arrow should automatically convert any R inputs to their Arrow equivalents.
Once registered, the following works:
## # A tibble: 87 x 2
## name snake_name
## <chr> <chr>
## 1 Luke Skywalker luke_skywalker
## 2 C-3PO c3po
## 3 R2-D2 r2d2
## 4 Darth Vader darth_vader
## 5 Leia Organa leia_organa
## 6 Owen Lars owen_lars
## 7 Beru Whitesun Lars beru_whitesun_lars
## 8 R5-D4 r5d4
## 9 Biggs Darklighter biggs_darklighter
## 10 Obi-Wan Kenobi obiwan_kenobi
## # i 77 more rows
To learn more, see help("register_scalar_function", package = "arrow")
.
Handling unsupported expressions
For dplyr queries on Table objects, which are held in memory and should usually be representable as data frames, if the arrow package detects an unimplemented function within a dplyr verb, it automatically calls collect()
to return the data as an R data frame before processing that dplyr verb. As an example, neither lm()
nor residuals()
are implemented, so if we write code that computes the residuals for a linear regression model, this automatic collection takes place:
sw %>%
filter(!is.na(height), !is.na(mass)) %>%
transmute(name, height, mass, res = residuals(lm(mass ~ height)))
## Warning: Expression residuals(lm(mass ~ height)) not supported in Arrow;
## pulling data into R
## # A tibble: 59 x 4
## name height mass res
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 -18.8
## 2 C-3PO 167 75 -17.7
## 3 R2-D2 96 32 -16.4
## 4 Darth Vader 202 136 21.4
## 5 Leia Organa 150 49 -33.1
## 6 Owen Lars 178 120 20.4
## 7 Beru Whitesun Lars 165 75 -16.5
## 8 R5-D4 97 32 -17.0
## 9 Biggs Darklighter 183 84 -18.7
## 10 Obi-Wan Kenobi 182 77 -25.1
## # i 49 more rows
For queries on Dataset
objects – which can be larger than memory – arrow is more conservative and always raises an error if it detects an unsupported expression. To illustrate this behavior, we can write the starwars
data to disk and then open it as a Dataset. When we use the same pipeline on the Dataset, we obtain an error:
# write and open starwars dataset
dataset_path <- tempfile()
write_dataset(starwars, dataset_path)
sw2 <- open_dataset(dataset_path)
# dplyr pipeline with unsupported expressions
sw2 %>%
filter(!is.na(height), !is.na(mass)) %>%
transmute(name, height, mass, res = residuals(lm(mass ~ height)))
## Error: Expression residuals(lm(mass ~ height)) not supported in Arrow
## Call collect() first to pull data into R.
Calling collect()
in the middle of the pipeline fixes the issue:
sw2 %>%
filter(!is.na(height), !is.na(mass)) %>%
collect() %>%
transmute(name, height, mass, res = residuals(lm(mass ~ height)))
## # A tibble: 59 x 4
## name height mass res
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 -18.8
## 2 C-3PO 167 75 -17.7
## 3 R2-D2 96 32 -16.4
## 4 Darth Vader 202 136 21.4
## 5 Leia Organa 150 49 -33.1
## 6 Owen Lars 178 120 20.4
## 7 Beru Whitesun Lars 165 75 -16.5
## 8 R5-D4 97 32 -17.0
## 9 Biggs Darklighter 183 84 -18.7
## 10 Obi-Wan Kenobi 182 77 -25.1
## # i 49 more rows
Because window functions are not supported, computing an aggregation like mean()
on a grouped table or within a rowwise opertation like filter()
is not supported:
sw %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE))
## Warning: Expression height < mean(height, na.rm = TRUE) not supported in Arrow;
## pulling data into R
## # A tibble: 28 x 4
## # Groups: hair_color [5]
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 Leia Organa 150 49 brown
## 3 Beru Whitesun Lars 165 75 brown
## 4 Wedge Antilles 170 77 brown
## 5 Yoda 66 17 white
## 6 Lobot 175 79 none
## 7 Ackbar 180 83 none
## 8 Wicket Systri Warrick 88 20 brown
## 9 Nien Nunb 160 68 none
## 10 Finis Valorum 170 NA blond
## # i 18 more rows
This operation is sometimes referred to as a windowed aggregate and can be accomplished in Arrow by computing the aggregation separately, for example within a join operation:
sw %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
left_join(
sw %>%
group_by(hair_color) %>%
summarize(mean_height = mean(height, na.rm = TRUE))
) %>%
filter(height < mean_height) %>%
select(!mean_height) %>%
collect()
## # A tibble: 28 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 Leia Organa 150 49 brown
## 3 Beru Whitesun Lars 165 75 brown
## 4 Wedge Antilles 170 77 brown
## 5 Yoda 66 17 white
## 6 Lobot 175 79 none
## 7 Ackbar 180 83 none
## 8 Wicket Systri Warrick 88 20 brown
## 9 Nien Nunb 160 68 none
## 10 Finis Valorum 170 NA blond
## # i 18 more rows
Alternatively, DuckDB supports Arrow natively, so you can pass the Table
object to DuckDB without paying a performance penalty using the helper function to_duckdb()
and pass the object back to Arrow with to_arrow()
:
sw %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
to_duckdb() %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE)) %>%
to_arrow() %>%
# perform other arrow operations...
collect()
## # A tibble: 28 x 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 "Luke Skywalker" 172 77 blond
## 2 "Finis Valorum" 170 NA blond
## 3 "Yoda" 66 17 white
## 4 "Leia Organa" 150 49 brown
## 5 "Beru Whitesun Lars" 165 75 brown
## 6 "Wedge Antilles" 170 77 brown
## 7 "Wicket Systri Warrick" 88 20 brown
## 8 "Cord\u00e9" 157 NA brown
## 9 "Dorm\u00e9" 165 NA brown
## 10 "Watto" 137 NA black
## # i 18 more rows
Further reading
- To learn more about multi-file datasets, see the dataset article.
- To learn more about user-registered functions, see
help("register_scalar_function", package = "arrow")
. - To learn more about writing dplyr bindings as an arrow developer, see the article on writing bindings.