This function writes a dataset to a directory. By writing to a more
efficient binary storage format and choosing a relevant partitioning, you
can make the data much faster to read and query.
write_dataset(
  dataset,
  path,
  format = dataset$format,
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  ...
)
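As a rough illustration of the call above, the following sketch writes the built-in mtcars data frame as Parquet, partitioned by cyl, and then reads one partition back. The output directory name is hypothetical, and the read-back step assumes that open_dataset() and the dplyr verbs behave as in recent arrow releases.

```r
library(arrow)
library(dplyr)

# Hypothetical output directory under the session's temporary directory
out_dir <- file.path(tempdir(), "mtcars_by_cyl")

# Write the built-in mtcars data frame as Parquet, partitioned by `cyl`
write_dataset(mtcars, out_dir, format = "parquet", partitioning = "cyl")

# With the default hive_style = TRUE, each file sits under a
# "cyl=<value>" directory
files <- list.files(out_dir, recursive = TRUE)
print(files)

# Reopen the directory as a Dataset and keep only one partition key value;
# filtering on the partition column lets the reader skip the other directories
four_cyl <- open_dataset(out_dir) %>%
  filter(cyl == 4) %>%
  collect()
```

The format is passed explicitly here because a plain data.frame has no format of its own to fall back on.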
Arguments
| dataset |
Dataset, RecordBatch, Table, arrow_dplyr_query, or
data.frame. If an arrow_dplyr_query or grouped_df,
schema and partitioning will be taken from the result of any select()
and group_by() operations done on the dataset. filter() queries will be
applied to restrict written rows.
Note that select()-ed columns may not be renamed. |
| path |
string path, URI, or SubTreeFileSystem referencing a directory
to write to (directory will be created if it does not exist) |
| format |
file format to write the dataset to. Currently supported
formats are "feather" (aka "ipc") and "parquet". Default is to write to the
same format as dataset. |
| partitioning |
Partitioning or a character vector of columns to
use as partition keys (to be written as path segments). Default is to
use the current group_by() columns. |
| basename_template |
string template for the names of files to be written.
Must contain "{i}", which will be replaced with an autoincremented
integer to generate basenames of data files. For example, "part-{i}.feather"
will yield "part-0.feather", "part-1.feather", and so on. |
| hive_style |
logical: write partition segments as Hive-style
(key1=value1/key2=value2/file.ext) or as just bare values. Default is TRUE. |
| ... |
additional format-specific arguments. For available Parquet
options, see write_parquet(). The available Feather options are:
use_legacy_format: logical; write data formatted so that Arrow library
versions 0.14 and lower can read it. Default is FALSE. You can also
enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
metadata_version: a string like "V5" or the equivalent integer indicating
the Arrow IPC MetadataVersion. Default (NULL) will use the latest version,
unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in
which case it will be V4.
codec: a Codec which will be used to compress body buffers of written
files. Default (NULL) will not compress body buffers. |
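The arguments above can be combined as in the following sketch, which is an illustration under stated assumptions rather than a canonical recipe. It writes a grouped data frame as Feather files, so the partitioning comes from group_by(), uses a custom basename_template, and sets hive_style = FALSE. The directory name and the "chunk-{i}.feather" template are hypothetical choices.

```r
library(arrow)
library(dplyr)

# Hypothetical output directory under the session's temporary directory
out_dir <- file.path(tempdir(), "mtcars_by_gear")

mtcars %>%
  filter(mpg > 20) %>%   # keep only the rows to be written
  group_by(gear) %>%     # group_by() columns become the partition keys
  write_dataset(
    out_dir,
    format = "feather",
    basename_template = "chunk-{i}.feather",  # must contain "{i}"
    hive_style = FALSE   # directories are bare values, not "gear=<value>"
  )

files <- list.files(out_dir, recursive = TRUE)
print(files)

# Bare-value directories do not carry the column name, so it is supplied
# again when reading the data back
back <- open_dataset(out_dir, format = "feather", partitioning = "gear") %>%
  collect()
```

With hive_style = TRUE the key names are stored in the paths, so the partitioning argument would not be needed on read.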
Value
The input dataset, invisibly.