A Dataset can be constructed using one or more DatasetFactory objects.
This function helps you construct a DatasetFactory that you can pass to open_dataset().
dataset_factory(
x,
filesystem = NULL,
format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
partitioning = NULL,
hive_style = NA,
...
)
x: A string path to a directory containing data files, a vector of one
or more string paths to data files, or a list of DatasetFactory objects
whose datasets should be combined. If a list of DatasetFactory objects is
supplied, it will be used to construct a UnionDatasetFactory and the other
arguments will be ignored.
filesystem: A FileSystem object; if omitted, the FileSystem will be detected from x.
format: A FileFormat object, or a string identifier of the format of
the files in x. Currently supported values:
"parquet"
"ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
"csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
"tsv", equivalent to passing format = "text", delimiter = "\t"
Default is "parquet", unless a delimiter is also specified, in which case
it is assumed to be "text".
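As a rough illustration of the "tsv" alias described above, the sketch below writes a small tab-separated file and builds two factories that should be equivalent. The directory, file name, and column names are hypothetical, and the factories are wrapped in a list before being passed to open_dataset().

```r
library(arrow)

# Hypothetical scratch directory holding one tab-separated file
tsv_dir <- tempfile()
dir.create(tsv_dir)
write.table(
  data.frame(name = c("a", "b"), score = c(1.5, 2.5)),
  file.path(tsv_dir, "scores.tsv"),
  sep = "\t", row.names = FALSE, quote = FALSE
)

# format = "tsv" is documented as shorthand for format = "text", delimiter = "\t"
tsv_factory <- dataset_factory(tsv_dir, format = "tsv")
text_factory <- dataset_factory(tsv_dir, format = "text", delimiter = "\t")

tsv_ds <- open_dataset(list(tsv_factory))
text_ds <- open_dataset(list(text_factory))

# Both datasets should expose the same column names
names(tsv_ds)
names(text_ds)
```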
partitioning: One of:
A Schema, in which case the file paths relative to x will be
parsed, and path segments will be matched with the schema fields. For
example, schema(year = int16(), month = int8()) would create partitions
for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
A character vector that defines the field names corresponding to those
path segments (that is, you're providing the names that would correspond
to a Schema, but the types will be autodetected)
A HivePartitioning or HivePartitioningFactory, as returned
by hive_partition(), which parses explicit or autodetected fields from
Hive-style path segments
NULL for no partitioning
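The Schema case above can be sketched as follows. This is a minimal example under assumed names: the scratch directory, the file names, and the `value` column are made up for illustration, and the paths mirror the "2019/01/file.parquet" layout mentioned in the text.

```r
library(arrow)

# Hypothetical scratch directory with paths like 2019/01/file.parquet
root <- tempfile()
for (month in c("01", "02")) {
  part_dir <- file.path(root, "2019", month)
  dir.create(part_dir, recursive = TRUE)
  write_parquet(data.frame(value = c(1.5, 2.5)), file.path(part_dir, "file.parquet"))
}

# The path segments are matched, in order, with the schema fields year and month
factory <- dataset_factory(
  root,
  partitioning = schema(year = int16(), month = int8())
)
ds <- open_dataset(list(factory))

# The partition fields appear as columns alongside the file contents
names(ds)
```

Passing `partitioning = c("year", "month")` instead would supply only the field names and leave the types to be autodetected, as described in the list above.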
hive_style: Logical: if partitioning is a character vector or a
Schema, should it be interpreted as specifying Hive-style partitioning?
Default is NA, which means to inspect the file paths for Hive-style
partitioning and behave accordingly.
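A small sketch of forcing the Hive-style interpretation of a Schema, assuming a hypothetical scratch directory whose path segments use the "key=value" form. With the default hive_style = NA the same layout would be expected to be detected from the paths; setting it to TRUE here only makes the intent explicit.

```r
library(arrow)

# Hypothetical scratch directory with Hive-style "key=value" path segments
root <- tempfile()
for (month in 1:2) {
  part_dir <- file.path(root, "year=2019", paste0("month=", month))
  dir.create(part_dir, recursive = TRUE)
  write_parquet(data.frame(value = c(1.5, 2.5)), file.path(part_dir, "file.parquet"))
}

# Interpret the Schema as describing Hive-style partitions
factory <- dataset_factory(
  root,
  partitioning = schema(year = int16(), month = int8()),
  hive_style = TRUE
)
ds <- open_dataset(list(factory))

names(ds)
```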
...: Additional format-specific options, passed to FileFormat$create().
For CSV options, note that you can specify them either
with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the
readr-style naming used in read_csv_arrow() ("delim", "quote", etc.).
Not all readr options are currently supported; please file an issue if you
encounter one that arrow should support.
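To illustrate the two naming styles for CSV options, the sketch below reads the same semicolon-delimited file once with the readr-style `delim` and once with the Arrow C++ style `delimiter`. The directory, file name, and columns are hypothetical.

```r
library(arrow)

# Hypothetical scratch directory with one semicolon-delimited file
csv_dir <- tempfile()
dir.create(csv_dir)
writeLines(c("id;label", "1;x", "2;y"), file.path(csv_dir, "data.csv"))

# readr-style option name
readr_factory <- dataset_factory(csv_dir, format = "csv", delim = ";")
# Arrow C++ library option name
arrow_factory <- dataset_factory(csv_dir, format = "csv", delimiter = ";")

readr_ds <- open_dataset(list(readr_factory))
arrow_ds <- open_dataset(list(arrow_factory))

# Both spellings should yield the same columns
names(readr_ds)
names(arrow_ds)
```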
A DatasetFactory object. Pass this to open_dataset(),
in a list potentially with other DatasetFactory objects, to create
a Dataset.
If you only need a single DatasetFactory (for example, you have a
single directory containing Parquet files), you can call open_dataset()
directly. Use dataset_factory() when you
want to combine different directories, file systems, or file formats.
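A minimal sketch of the combining case, assuming two hypothetical scratch directories that hold the same columns in different file formats (Parquet and Feather version 2). Both files are written from data frames with identical column types so that the schemas of the two factories agree.

```r
library(arrow)

# Hypothetical scratch directories, one per file format
parquet_dir <- tempfile()
feather_dir <- tempfile()
dir.create(parquet_dir)
dir.create(feather_dir)

write_parquet(
  data.frame(name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5)),
  file.path(parquet_dir, "part-0.parquet")
)
write_feather(
  data.frame(name = c("d", "e", "f"), value = c(4.5, 5.5, 6.5)),
  file.path(feather_dir, "part-0.feather")
)

# One factory per directory and format
parquet_factory <- dataset_factory(parquet_dir, format = "parquet")
feather_factory <- dataset_factory(feather_dir, format = "feather")

# Passing a list of factories to open_dataset() combines them into one Dataset
combined <- open_dataset(list(parquet_factory, feather_factory))
nrow(combined)
```

If the directories held columns of different types, the combined schema might not unify cleanly, which is why this sketch keeps the types identical.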