Write a dataset — write_dataset • Arrow R Package

This function writes a dataset to a directory. Writing to an efficient binary storage format and specifying a relevant partitioning can make the dataset much faster to read and query.

write_dataset(
  dataset,
  path,
  format = dataset$format,
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  ...
)
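
As a minimal sketch of the call above, the following writes a plain `data.frame` as a Parquet dataset partitioned on one column. The built-in `mtcars` data and the temporary output directory are assumptions chosen for illustration, and `format` is passed explicitly so the example does not depend on the default.

```r
library(arrow)

# Hypothetical output location: a fresh temporary directory
out_dir <- tempfile("mtcars_ds_")

# Write mtcars as Parquet, using `cyl` as the partition key
write_dataset(mtcars, out_dir, format = "parquet", partitioning = "cyl")

# With the default hive_style = TRUE, each distinct value of `cyl`
# gets its own `cyl=<value>` directory under out_dir
partition_dirs <- list.files(out_dir)
print(partition_dirs)
```

Each partition directory then holds one or more data files named according to `basename_template`.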

Arguments

dataset	Dataset, RecordBatch, Table, `arrow_dplyr_query`, or `data.frame`. If an `arrow_dplyr_query` or `grouped_df`, `schema` and `partitioning` will be taken from the result of any `select()` and `group_by()` operations done on the dataset. `filter()` queries will be applied to restrict written rows. Note that `select()`-ed columns may not be renamed.
path	string path, URI, or `SubTreeFileSystem` referencing a directory to write to (directory will be created if it does not exist)
format	file format to write the dataset to. Currently supported formats are "feather" (aka "ipc") and "parquet". Default is to write to the same format as `dataset`.
partitioning	`Partitioning` or a character vector of columns to use as partition keys (to be written as path segments). Default is to use the current `group_by()` columns.
basename_template	string template for the names of files to be written. Must contain `"{i}"`, which will be replaced with an autoincremented integer to generate basenames of datafiles. For example, `"part-{i}.feather"` will yield `"part-0.feather", ...`.
hive_style	logical: write partition segments as Hive-style (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`.
...	additional format-specific arguments. For available Parquet options, see `write_parquet()`. The available Feather options are:
	- `use_legacy_format`: logical; write data formatted so that Arrow library versions 0.14 and lower can read it. Default is `FALSE`. You can also enable this by setting the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1`.
	- `metadata_version`: a string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable `ARROW_PRE_1_0_METADATA_VERSION=1` is set, in which case it will be V4.
	- `codec`: a Codec which will be used to compress body buffers of written files. Default (NULL) will not compress body buffers.
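
The interaction between `partitioning`, `basename_template`, and `hive_style` can be sketched as follows. This example assumes the dplyr package is installed; the `mtcars` data, the temporary directory, and the `"chunk-{i}.parquet"` template are illustrative choices, not defaults. The partition key is taken from `group_by()` rather than passed directly.

```r
library(arrow)
library(dplyr)

# Hypothetical output location: a fresh temporary directory
out_dir <- tempfile("mtcars_bare_")

mtcars %>%
  group_by(cyl) %>%                          # grouping column becomes the partition key
  write_dataset(
    out_dir,
    format = "parquet",
    basename_template = "chunk-{i}.parquet", # must contain "{i}"
    hive_style = FALSE                       # bare values: "4/", not "cyl=4/"
  )

# Relative paths of all written files, one directory level per partition key
written <- list.files(out_dir, recursive = TRUE)
print(written)
```

With `hive_style = FALSE` the directory names carry only the values, so a reader needs to be told the partition column names separately when the dataset is opened again.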

Value

The input dataset, invisibly.
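
Because nothing is printed by the call itself, one way to confirm a write is to reopen the directory and query it. The sketch below assumes dplyr is installed and uses `open_dataset()` from the same package with Hive-style partition directories; the `mtcars` data and the temporary path are illustrative.

```r
library(arrow)
library(dplyr)

# Hypothetical output location: a fresh temporary directory
out_dir <- tempfile("mtcars_roundtrip_")

# Called at top level, this prints nothing to the console
write_dataset(mtcars, out_dir, format = "parquet", partitioning = "cyl")

# Reopen the directory; `cyl` is recovered from the `cyl=<value>` path segments
ds <- open_dataset(out_dir, format = "parquet")

# Read everything back, and read only the rows from one partition
all_rows <- ds %>% collect()
six_cyl <- ds %>% filter(cyl == 6) %>% collect()

print(nrow(all_rows))
print(nrow(six_cyl))
```

Filtering on the partition column is where partitioning tends to pay off, since only the matching directories need to be read.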