7 Manipulating Data - Tables
7.1 Introduction
One of the aims of the Arrow project is to reduce duplication between different data frame implementations. The underlying implementation of a data frame is a conceptually different thing to the code- or the application programming interface (API)-that you write to work with it.
You may have seen this before in packages like dbplyr which allow you to use the dplyr API to interact with SQL databases.
The Arrow R package has been written so that the underlying Arrow Table-like objects can be manipulated using the dplyr API, which allows you to use dplyr verbs.
For example, here’s a short pipeline of data manipulation which uses dplyr exclusively:
library(dplyr)
%>%
starwars filter(species == "Human") %>%
mutate(height_ft = height/30.48) %>%
select(name, height_ft)
## # A tibble: 35 × 2
## name height_ft
## <chr> <dbl>
## 1 Luke Skywalker 5.64
## 2 Darth Vader 6.63
## 3 Leia Organa 4.92
## 4 Owen Lars 5.84
## 5 Beru Whitesun lars 5.41
## 6 Biggs Darklighter 6.00
## 7 Obi-Wan Kenobi 5.97
## 8 Anakin Skywalker 6.17
## 9 Wilhuff Tarkin 5.91
## 10 Han Solo 5.91
## # ℹ 25 more rows
And the same results as using Arrow with dplyr syntax:
arrow_table(starwars) %>%
filter(species == "Human") %>%
mutate(height_ft = height/30.48) %>%
select(name, height_ft) %>%
collect()
## # A tibble: 35 × 2
## name height_ft
## <chr> <dbl>
## 1 Luke Skywalker 5.64
## 2 Darth Vader 6.63
## 3 Leia Organa 4.92
## 4 Owen Lars 5.84
## 5 Beru Whitesun lars 5.41
## 6 Biggs Darklighter 6.00
## 7 Obi-Wan Kenobi 5.97
## 8 Anakin Skywalker 6.17
## 9 Wilhuff Tarkin 5.91
## 10 Han Solo 5.91
## # ℹ 25 more rows
You’ll notice we’ve used collect()
in the Arrow pipeline above. That’s because
one of the ways in which Arrow is efficient is that it works out the instructions
for the calculations it needs to perform (expressions) and only runs them
using Arrow once you actually pull the data into your R session. This means
instead of doing lots of separate operations, it does them all at once in a
more optimised way. This is called lazy evaluation.
It also means that you are able to manipulate data that is larger than you can fit into memory on the machine you’re running your code on, if you only pull data into R when you have selected the desired subset, or when using functions which can operate on chunks of data.
You can also have data which is split across multiple files. For example, you
might have files which are stored in multiple Parquet or Feather files,
partitioned across different directories. You can open partitioned or multi-file datasets
using open_dataset()
as discussed in a previous chapter, and then manipulate
this data using Arrow before even reading any of the data into R.
7.2 Use dplyr verbs in Arrow
You want to use a dplyr verb in Arrow.
7.2.1 Solution
library(dplyr)
arrow_table(starwars) %>%
filter(species == "Human", homeworld == "Tatooine") %>%
collect()
## # A tibble: 8 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 3 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 4 Beru Whi… 165 75 brown light blue 47 fema… femin…
## 5 Biggs Da… 183 84 black light brown 24 male mascu…
## 6 Anakin S… 188 84 blond fair blue 41.9 male mascu…
## 7 Shmi Sky… 163 NA black fair brown 72 fema… femin…
## 8 Cliegg L… 183 NA brown fair blue 82 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## # vehicles <list<character>>, starships <list<character>>
7.2.3 See also
You can find examples of the various dplyr verbs in “Introduction to dplyr” -
run vignette("dplyr", package = "dplyr")
or view on
the pkgdown site.
You can see more information about using arrow_table()
to create Arrow Tables
and collect()
to view them as R data frames in Creating Arrow Objects.
7.3 Use R functions in dplyr verbs in Arrow
You want to use an R function inside a dplyr verb in Arrow.
7.3.1 Solution
arrow_table(starwars) %>%
filter(str_detect(name, "Darth")) %>%
collect()
## # A tibble: 2 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 2 Darth Ma… 175 80 none red yellow 54 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## # vehicles <list<character>>, starships <list<character>>
7.3.2 Discussion
The Arrow R package allows you to use dplyr verbs containing expressions which include base R and many tidyverse functions, but call Arrow functions under the hood. If you find any base R or tidyverse functions which you would like to see a mapping of in Arrow, please open an issue on the project JIRA.
The following packages (amongst some from others) have had many function bindings/mappings written in arrow:
If you try to call a function which does not have arrow mapping, the data will be pulled back into R, and you will see a warning message.
library(stringr)
arrow_table(starwars) %>%
mutate(name_split = str_split_fixed(name, " ", 2)) %>%
collect()
## Warning: Expression str_split_fixed(name, " ", 2) not supported in Arrow;
## pulling data into R
## # A tibble: 87 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## # vehicles <list<character>>, starships <list<character>>,
## # name_split <chr[,2]>
7.4 Use Arrow functions in dplyr verbs in Arrow
You want to use a function which is implemented in Arrow’s C++ library but either:
- it doesn’t have a mapping to a base R or tidyverse equivalent, or
- it has a mapping but nevertheless you want to call the C++ function directly
7.4.1 Solution
arrow_table(starwars) %>%
select(name) %>%
mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
collect()
## # A tibble: 87 × 2
## name padded_name
## <chr> <chr>
## 1 Luke Skywalker Luke Skywalker
## 2 C-3PO *****C-3PO
## 3 R2-D2 *****R2-D2
## 4 Darth Vader Darth Vader
## 5 Leia Organa Leia Organa
## 6 Owen Lars *Owen Lars
## 7 Beru Whitesun lars Beru Whitesun lars
## 8 R5-D4 *****R5-D4
## 9 Biggs Darklighter Biggs Darklighter
## 10 Obi-Wan Kenobi Obi-Wan Kenobi
## # ℹ 77 more rows
7.4.2 Discussion
The vast majority of Arrow C++ compute functions have been mapped to their base R or tidyverse equivalents, and we strongly recommend that you use these mappings where possible, as the original functions are well documented and the mapped versions have been tested to ensure the results returned are as expected.
However, there may be circumstances in which you might want to use a compute function from the Arrow C++ library which does not have a base R or tidyverse equivalent.
You can find documentation of Arrow C++ compute functions in the C++ documention. This documentation lists all available compute functions, any associated options classes they need, and the valid data types that they can be used with.
You can list all available Arrow compute functions from R by calling
list_compute_functions()
.
list_compute_functions()
## [1] "abs" "abs_checked"
## [3] "acos" "acos_checked"
## [5] "add" "add_checked"
## [7] "all" "and"
## [9] "and_kleene" "and_not"
## [11] "and_not_kleene" "any"
## [13] "approximate_median" "array_filter"
## [15] "array_sort_indices" "array_take"
## [17] "ascii_capitalize" "ascii_center"
## [19] "ascii_is_alnum" "ascii_is_alpha"
## [21] "ascii_is_decimal" "ascii_is_lower"
## [23] "ascii_is_printable" "ascii_is_space"
## [25] "ascii_is_title" "ascii_is_upper"
## [27] "ascii_lower" "ascii_lpad"
## [29] "ascii_ltrim" "ascii_ltrim_whitespace"
## [31] "ascii_reverse" "ascii_rpad"
## [33] "ascii_rtrim" "ascii_rtrim_whitespace"
## [35] "ascii_split_whitespace" "ascii_swapcase"
## [37] "ascii_title" "ascii_trim"
## [39] "ascii_trim_whitespace" "ascii_upper"
## [41] "asin" "asin_checked"
## [43] "assume_timezone" "atan"
## [45] "atan2" "binary_join"
## [47] "binary_join_element_wise" "binary_length"
## [49] "binary_repeat" "binary_replace_slice"
## [51] "binary_reverse" "binary_slice"
## [53] "bit_wise_and" "bit_wise_not"
## [55] "bit_wise_or" "bit_wise_xor"
## [57] "case_when" "cast"
## [59] "ceil" "ceil_temporal"
## [61] "choose" "coalesce"
## [63] "cos" "cos_checked"
## [65] "count" "count_all"
## [67] "count_distinct" "count_substring"
## [69] "count_substring_regex" "cumulative_max"
## [71] "cumulative_min" "cumulative_prod"
## [73] "cumulative_prod_checked" "cumulative_sum"
## [75] "cumulative_sum_checked" "day"
## [77] "day_of_week" "day_of_year"
## [79] "day_time_interval_between" "days_between"
## [81] "dictionary_encode" "divide"
## [83] "divide_checked" "drop_null"
## [85] "ends_with" "equal"
## [87] "exp" "extract_regex"
## [89] "fill_null_backward" "fill_null_forward"
## [91] "filter" "find_substring"
## [93] "find_substring_regex" "first"
## [95] "first_last" "floor"
## [97] "floor_temporal" "greater"
## [99] "greater_equal" "hour"
## [101] "hours_between" "if_else"
## [103] "index" "index_in"
## [105] "index_in_meta_binary" "indices_nonzero"
## [107] "invert" "is_dst"
## [109] "is_finite" "is_in"
## [111] "is_in_meta_binary" "is_inf"
## [113] "is_leap_year" "is_nan"
## [115] "is_null" "is_valid"
## [117] "iso_calendar" "iso_week"
## [119] "iso_year" "last"
## [121] "less" "less_equal"
## [123] "list_element" "list_flatten"
## [125] "list_parent_indices" "list_slice"
## [127] "list_value_length" "ln"
## [129] "ln_checked" "local_timestamp"
## [131] "log10" "log10_checked"
## [133] "log1p" "log1p_checked"
## [135] "log2" "log2_checked"
## [137] "logb" "logb_checked"
## [139] "make_struct" "map_lookup"
## [141] "match_like" "match_substring"
## [143] "match_substring_regex" "max"
## [145] "max_element_wise" "mean"
## [147] "microsecond" "microseconds_between"
## [149] "millisecond" "milliseconds_between"
## [151] "min" "min_element_wise"
## [153] "min_max" "minute"
## [155] "minutes_between" "mode"
## [157] "month" "month_day_nano_interval_between"
## [159] "month_interval_between" "multiply"
## [161] "multiply_checked" "nanosecond"
## [163] "nanoseconds_between" "negate"
## [165] "negate_checked" "not_equal"
## [167] "or" "or_kleene"
## [169] "pairwise_diff" "pairwise_diff_checked"
## [171] "partition_nth_indices" "power"
## [173] "power_checked" "product"
## [175] "quantile" "quarter"
## [177] "quarters_between" "random"
## [179] "rank" "replace_substring"
## [181] "replace_substring_regex" "replace_with_mask"
## [183] "round" "round_binary"
## [185] "round_temporal" "round_to_multiple"
## [187] "run_end_decode" "run_end_encode"
## [189] "second" "seconds_between"
## [191] "select_k_unstable" "shift_left"
## [193] "shift_left_checked" "shift_right"
## [195] "shift_right_checked" "sign"
## [197] "sin" "sin_checked"
## [199] "sort_indices" "split_pattern"
## [201] "split_pattern_regex" "sqrt"
## [203] "sqrt_checked" "starts_with"
## [205] "stddev" "strftime"
## [207] "string_is_ascii" "strptime"
## [209] "struct_field" "subsecond"
## [211] "subtract" "subtract_checked"
## [213] "sum" "take"
## [215] "tan" "tan_checked"
## [217] "tdigest" "true_unless_null"
## [219] "trunc" "unique"
## [221] "us_week" "us_year"
## [223] "utf8_capitalize" "utf8_center"
## [225] "utf8_is_alnum" "utf8_is_alpha"
## [227] "utf8_is_decimal" "utf8_is_digit"
## [229] "utf8_is_lower" "utf8_is_numeric"
## [231] "utf8_is_printable" "utf8_is_space"
## [233] "utf8_is_title" "utf8_is_upper"
## [235] "utf8_length" "utf8_lower"
## [237] "utf8_lpad" "utf8_ltrim"
## [239] "utf8_ltrim_whitespace" "utf8_normalize"
## [241] "utf8_replace_slice" "utf8_reverse"
## [243] "utf8_rpad" "utf8_rtrim"
## [245] "utf8_rtrim_whitespace" "utf8_slice_codeunits"
## [247] "utf8_split_whitespace" "utf8_swapcase"
## [249] "utf8_title" "utf8_trim"
## [251] "utf8_trim_whitespace" "utf8_upper"
## [253] "value_counts" "variance"
## [255] "week" "weeks_between"
## [257] "xor" "year"
## [259] "year_month_day" "years_between"
The majority of functions here have been mapped to their base R or tidyverse equivalent and can be called within a dplyr query as usual. For functions which don’t have a base R or tidyverse equivalent, or you want to supply custom options, you can call them by prefixing their name with “arrow_”.
For example, base R’s is.na()
function is the equivalent of the Arrow C++
compute function is_null()
with the option nan_is_null
set to TRUE
.
A mapping between these functions (with nan_is_null
set to TRUE
) has been
created in arrow.
<- data.frame(x = c(1, 2, 3, NA, NaN))
demo_df
arrow_table(demo_df) %>%
mutate(y = is.na(x)) %>%
collect()
## # A tibble: 5 × 2
## x y
## <dbl> <lgl>
## 1 1 FALSE
## 2 2 FALSE
## 3 3 FALSE
## 4 NA TRUE
## 5 NaN TRUE
If you want to call Arrow’s is_null()
function but with nan_is_null
set to
FALSE
(so it returns TRUE
when a value being examined is NA
but FALSE
when the value being examined is NaN
), you must call is_null()
directly and
specify the option nan_is_null = FALSE
.
arrow_table(demo_df) %>%
mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
collect()
## # A tibble: 5 × 2
## x y
## <dbl> <lgl>
## 1 1 FALSE
## 2 2 FALSE
## 3 3 FALSE
## 4 NA TRUE
## 5 NaN FALSE
7.4.2.1 Compute functions with options
Although not all Arrow C++ compute functions require options to be specified, most do. For these functions to work in R, they must be linked up with the appropriate libarrow options C++ class via the R package’s C++ code. At the time of writing, all compute functions available in the development version of the Arrow R package had been associated with their options classes. However, as the Arrow C++ library’s functionality extends, compute functions may be added which do not yet have an R binding. If you find a C++ compute function which you wish to use from the R package, please open an issue on the Github project.
7.5 Compute Window Aggregates
You want to apply an aggregation (e.g. mean()
) on a grouped table or within a rowwise operation like filter()
:
7.5.1 Solution
arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
left_join(
arrow_table(starwars) %>%
group_by(hair_color) %>%
summarize(mean_height = mean(height, na.rm = TRUE))
%>%
) filter(height < mean_height) %>%
select(!mean_height) %>%
collect()
## # A tibble: 29 × 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 Leia Organa 150 49 brown
## 3 Beru Whitesun lars 165 75 brown
## 4 Wedge Antilles 170 77 brown
## 5 Yoda 66 17 white
## 6 Lobot 175 79 none
## 7 Ackbar 180 83 none
## 8 Wicket Systri Warrick 88 20 brown
## 9 Nien Nunb 160 68 none
## 10 Finis Valorum 170 NA blond
## # ℹ 19 more rows
Or using to_duckdb()
arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
to_duckdb() %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE)) %>%
to_arrow() %>%
collect()
## # A tibble: 29 × 4
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 Finis Valorum 170 NA blond
## 3 Yoda 66 17 white
## 4 Leia Organa 150 49 brown
## 5 Beru Whitesun lars 165 75 brown
## 6 Wedge Antilles 170 77 brown
## 7 Wicket Systri Warrick 88 20 brown
## 8 Cordé 157 NA brown
## 9 Dormé 165 NA brown
## 10 Padmé Amidala 165 45 brown
## # ℹ 19 more rows
7.5.2 Discusson
Arrow does not support window functions, and pulls the data into R. For large tables, this sacrifices performance.
arrow_table(starwars) %>%
select(1:4) %>%
filter(!is.na(hair_color)) %>%
group_by(hair_color) %>%
filter(height < mean(height, na.rm = TRUE))
## Warning: Expression height < mean(height, na.rm = TRUE) not supported in Arrow;
## pulling data into R
## # A tibble: 29 × 4
## # Groups: hair_color [5]
## name height mass hair_color
## <chr> <int> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond
## 2 Leia Organa 150 49 brown
## 3 Beru Whitesun lars 165 75 brown
## 4 Wedge Antilles 170 77 brown
## 5 Yoda 66 17 white
## 6 Lobot 175 79 none
## 7 Ackbar 180 83 none
## 8 Wicket Systri Warrick 88 20 brown
## 9 Nien Nunb 160 68 none
## 10 Finis Valorum 170 NA blond
## # ℹ 19 more rows
You can perform these window aggregate operations on Arrow tables by:
- Computing the aggregation separately, and joining the result
- Passing the data to DuckDB, and use the DuckDB query engine to perform the operations
Arrow supports zero-copy integration with DuckDB, and DuckDB can query Arrow datasets directly and stream query results back to Arrow. This integreation uses zero-copy streaming of data between DuckDB and Arrow and vice versa so that you can compose a query using both together, all the while not paying any cost to (re)serialize the data when you pass it back and forth. This is especially useful in cases where something is supported in one of Arrow or DuckDB query engines but not the other. You can find more information about this integration on the Arrow blog post.