This function allows you to write a dataset to disk. By writing to a more efficient binary storage format, and by specifying relevant partitioning, you can make the data much faster to read and query.
```r
write_dataset(
  dataset,
  path,
  format = c("parquet", "feather", "arrow", "ipc", "csv"),
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  existing_data_behavior = c("overwrite", "error", "delete_matching"),
  ...
)
```
| Argument | Description |
|---|---|
| `dataset` | Dataset, RecordBatch, Table, `arrow_dplyr_query`, or `data.frame`. If an `arrow_dplyr_query`, the query will be evaluated and the result will be written. |
| `path` | string path, URI, or `SubTreeFileSystem` referencing a directory to write to (the directory will be created if it does not exist) |
| `format` | a string identifier of the file format. Default is to use "parquet" (see FileFormat) |
| `partitioning` | `Partitioning` or a character vector of columns to use as partition keys (to be written as path segments). Default is to use the current `group_by()` groups, if any. |
| `basename_template` | string template for the names of files to be written. Must contain `"{i}"`, which will be replaced with an autoincremented integer to generate basenames of data files (e.g. `"part-{i}.parquet"` yields `"part-0.parquet"`, `"part-1.parquet"`, ...). |
| `hive_style` | logical: write partition segments as Hive-style (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`. |
| `existing_data_behavior` | The behavior to use when there is already data in the destination directory. Must be one of `"overwrite"`, `"error"`, or `"delete_matching"`. With `"overwrite"` (the default), any new files created will overwrite existing files of the same name. With `"error"`, the operation will fail if the destination directory is not empty. With `"delete_matching"`, the writer will delete any existing partitions that data will be written to, and will leave alone partitions that receive no new data. |
| `...` | additional format-specific arguments. For available Parquet options, see `write_parquet()`. |
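The difference between the three `existing_data_behavior` values is easiest to see in a short sketch. This assumes a writable temporary directory and uses `mtcars` as in the examples below; the `dst` and `res` names are illustrative, not part of the API.

```r
library(arrow)
library(dplyr)

dst <- tempfile()

# First write: partition by "cyl", creating cyl=4/, cyl=6/, cyl=8/.
write_dataset(mtcars, dst, partitioning = "cyl")

# "error" refuses to write into a non-empty destination directory.
res <- try(
  write_dataset(mtcars, dst, partitioning = "cyl",
                existing_data_behavior = "error"),
  silent = TRUE
)

# "delete_matching" replaces only the partitions being written:
# here only cyl=4/ is rewritten; cyl=6/ and cyl=8/ are left alone.
mtcars %>%
  filter(cyl == 4) %>%
  write_dataset(dst, partitioning = "cyl",
                existing_data_behavior = "delete_matching")
```

The default `"overwrite"` sits between the two: it replaces files whose generated names collide but does not clean out partitions first, so stale files with different names can survive a rewrite.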
The input `dataset`, invisibly.
```r
# You can write datasets partitioned by the values in a column (here: "cyl").
# This creates a structure of the form cyl=X/part-Z.parquet.
one_level_tree <- tempfile()
write_dataset(mtcars, one_level_tree, partitioning = "cyl")
list.files(one_level_tree, recursive = TRUE)
#> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"

# You can also partition by the values in multiple columns
# (here: "cyl" and "gear").
# This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_levels_tree <- tempfile()
write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
list.files(two_levels_tree, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"

# In the two previous examples we would have:
# X = {4,6,8}, the number of cylinders.
# Y = {3,4,5}, the number of forward gears.
# Z = {0,1,2}, the number of saved parts, starting from 0.

# You can obtain the same result as the previous examples using arrow with
# a dplyr pipeline. This will be the same as two_levels_tree above, but the
# output directory will be different.
library(dplyr)
two_levels_tree_2 <- tempfile()
mtcars %>%
  group_by(cyl, gear) %>%
  write_dataset(two_levels_tree_2)
list.files(two_levels_tree_2, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"

# And you can also turn off the Hive-style directory naming where the column
# name is included with the values by using `hive_style = FALSE`.

# Write a structure X/Y/part-Z.parquet.
two_levels_tree_no_hive <- tempfile()
mtcars %>%
  group_by(cyl, gear) %>%
  write_dataset(two_levels_tree_no_hive, hive_style = FALSE)
list.files(two_levels_tree_no_hive, recursive = TRUE)
#> [1] "4/3/part-0.parquet" "4/4/part-0.parquet" "4/5/part-0.parquet"
#> [4] "6/3/part-0.parquet" "6/4/part-0.parquet" "6/5/part-0.parquet"
#> [7] "8/3/part-0.parquet" "8/5/part-0.parquet"
```
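The partitioned layout pays off when reading the data back: `open_dataset()` recovers the partition columns from the Hive-style directory names, and filters on those columns only scan the matching files. A minimal sketch (self-contained, so it rewrites a small partitioned dataset first; the `tree`, `ds`, and `four_cyl` names are illustrative):

```r
library(arrow)
library(dplyr)

# Write a dataset partitioned by "cyl", as in the examples above.
tree <- tempfile()
write_dataset(mtcars, tree, partitioning = "cyl")

# open_dataset() recovers the "cyl" column from the Hive-style
# directory names (cyl=4, cyl=6, cyl=8).
ds <- open_dataset(tree)

# Filtering on a partition column prunes whole directories:
# this query only reads the files under cyl=4/.
four_cyl <- ds %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, gear) %>%
  collect()
```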