This function allows you to write a dataset. By writing to a more efficient
binary storage format, and by specifying relevant partitioning, you can
make the data much faster to read and query.
write_dataset(
  dataset,
  path,
  format = dataset$format,
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  ...
)
Arguments
dataset |
Dataset, RecordBatch, Table, arrow_dplyr_query, or
data.frame. If an arrow_dplyr_query or grouped_df,
schema and partitioning will be taken from the result of any select()
and group_by() operations done on the dataset. filter() queries will be
applied to restrict written rows.
Note that select()-ed columns may not be renamed. |
path |
string path, URI, or SubTreeFileSystem referencing a directory
to write to (directory will be created if it does not exist) |
format |
file format to write the dataset to. Currently supported
formats are "feather" (aka "ipc") and "parquet". Default is to write to the
same format as dataset. |
partitioning |
Partitioning or a character vector of columns to
use as partition keys (to be written as path segments). Default is to
use the current group_by() columns.
|
basename_template |
string template for the names of files to be written.
Must contain "{i}", which will be replaced with an autoincremented
integer to generate basenames of data files. For example, "part-{i}.feather"
will yield "part-0.feather", ... |
hive_style |
logical: write partition segments as Hive-style
(key1=value1/key2=value2/file.ext) or as just bare values. Default is TRUE. |
... |
additional format-specific arguments. For available Parquet
options, see write_parquet(). The available Feather options are:
- use_legacy_format: logical; write data formatted so that Arrow library
  versions 0.14 and lower can read it. Default is FALSE. You can also
  enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
- metadata_version: A string like "V5" or the equivalent integer indicating
  the Arrow IPC MetadataVersion. Default (NULL) will use the latest version,
  unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in
  which case it will be V4.
- codec: A Codec which will be used to compress body buffers of written
  files. Default (NULL) will not compress body buffers.
|
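As a rough sketch of how partitioning, hive_style, and basename_template interact, the following assumes the arrow package is installed and uses the built-in mtcars data. The temporary directory name and the "chunk-{i}.feather" template are illustrative choices made for this sketch, not defaults.

```r
library(arrow)

# Hypothetical output location, used only for this sketch
bare_dir <- tempfile("mtcars_bare_")

# Partition on cyl, writing bare values ("4/") rather than Hive-style ("cyl=4/")
write_dataset(
  mtcars,
  bare_dir,
  format = "feather",
  partitioning = "cyl",
  hive_style = FALSE,
  basename_template = "chunk-{i}.feather"
)

# One directory per distinct cyl value
list.files(bare_dir)

# Files inside follow the template, with {i} replaced by an integer
list.files(bare_dir, recursive = TRUE)
```

Because bare path segments carry no column names, reading such a dataset back generally requires naming the partition columns, for example open_dataset(bare_dir, format = "feather", partitioning = "cyl").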
Value
The input dataset, invisibly.
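A minimal sketch of a typical call, assuming the arrow and dplyr packages are installed: the group_by() columns become the partition keys, and filter() restricts which rows are written. The mtcars data, the mpg threshold, and the temporary directory are illustrative assumptions.

```r
library(arrow)
library(dplyr)

# Hypothetical output location, used only for this sketch
hive_dir <- tempfile("mtcars_hive_")

# cyl is taken from group_by() as the partition key;
# rows with mpg <= 15 are dropped before writing
mtcars %>%
  filter(mpg > 15) %>%
  group_by(cyl) %>%
  write_dataset(hive_dir, format = "parquet")

# Hive-style partition directories, one per cyl value
list.files(hive_dir)

# Read the partitioned files back as a single Dataset and query it
ds <- open_dataset(hive_dir)
ds %>%
  filter(cyl == 4) %>%
  collect()
```

Passing format explicitly avoids relying on the default taken from the input, which is most meaningful when the input is already a Dataset.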