Pass a Partitioning object to a FileSystemDatasetFactory's $create()
method to indicate how the file paths should be interpreted to define
partitioning.
DirectoryPartitioning describes how to interpret raw path segments, in
order. For example, schema(year = int16(), month = int8()) would define
partitions for file paths like "2019/01/file.parquet",
"2019/02/file.parquet", etc. In this scheme, NULL values are skipped.
Continuing the previous example: when writing a dataset, if the month is
NA (or NULL), the files are placed in "2019/file.parquet". When reading,
the rows in "2019/file.parquet" return NA for the month column. An error
is raised if an outer directory is NULL and an inner directory is not.
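The directory scheme described above can be sketched as a small round trip. This is a minimal illustration, not the package's canonical example: the data frame, its value column, and the temporary directory are made up for the purpose, and write_dataset() with hive_style = FALSE is assumed to produce bare path segments such as "2019/1/...".

```r
library(arrow)

# Hypothetical data for illustration only
df <- data.frame(
  year  = c(2019L, 2019L),
  month = c(1L, 2L),
  value = c(1.5, 2.5)
)

# Write with raw (non-Hive) path segments: <year>/<month>/<file>
tmp <- tempfile()
write_dataset(df, tmp, partitioning = c("year", "month"), hive_style = FALSE)

# Describe the segments, in order, with an explicit schema
part <- DirectoryPartitioning$create(schema(year = int16(), month = int8()))
ds <- open_dataset(tmp, partitioning = part)

# The partition fields appear in the dataset schema alongside the file columns
field_names <- ds$schema$names
print(field_names)
```

Because the path segments carry no field names, the order of the fields in the schema has to match the order of the directories.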
HivePartitioning is for Hive-style partitioning, which embeds field
names and values in path segments, such as
"/year=2019/month=2/data.parquet". Because fields are named in the path
segments, order does not matter. This partitioning scheme allows NULL
values. When writing, they are replaced by a configurable null_fallback,
which defaults to the string "__HIVE_DEFAULT_PARTITION__". When reading,
the null_fallback string is replaced with NA as appropriate.
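The null handling for Hive-style paths can be sketched as follows. The data frame and temporary directory are hypothetical, and the sketch assumes write_dataset() uses Hive-style paths with the default null_fallback when no other options are given.

```r
library(arrow)

# Hypothetical data with a missing month
df <- data.frame(
  year  = c(2019L, 2019L),
  month = c(1L, NA),
  value = c(1.5, 2.5)
)

tmp <- tempfile()
write_dataset(df, tmp, partitioning = c("year", "month"))

# Paths embed field names; the NA month is written using the fallback string
files <- list.files(tmp, recursive = TRUE)
print(files)

# Read back with an explicit Hive partitioning
ds <- open_dataset(tmp, partitioning = hive_partition(year = int16(), month = int8()))
result <- as.data.frame(Scanner$create(ds)$ToTable())

# The fallback segment is read back as NA in the month column
print(result)
```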
PartitioningFactory subclasses instruct the DatasetFactory to detect
partition features from the file paths.
Factory
Both DirectoryPartitioning$create() and HivePartitioning$create()
methods take a Schema as a single input argument. The helper
function hive_partition(...) is shorthand for
HivePartitioning$create(schema(...)).
With DirectoryPartitioningFactory$create(), you can provide just the
names of the path segments (in our example, c("year", "month")), and
the DatasetFactory will infer the data types for those partition variables.
HivePartitioningFactory$create() takes no arguments: both variable names
and their types can be inferred from the file paths. hive_partition() with
no arguments returns a HivePartitioningFactory.
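The difference between a fully specified Partitioning and a PartitioningFactory can be sketched as below. The dataset written here is hypothetical and serves only to give the factory some paths to inspect; the inferred types are printed rather than assumed.

```r
library(arrow)

# Fully specified: names and types are given up front
explicit <- hive_partition(year = int16(), month = int8())

# Factories: types (and, for Hive, names) are detected from the file paths
hive_factory <- hive_partition()
dir_factory  <- DirectoryPartitioningFactory$create(c("year", "month"))

print(class(explicit))
print(class(hive_factory))

# Hypothetical dataset with raw directory segments
df <- data.frame(year = c(2019L, 2019L), month = c(1L, 2L), value = c(1.5, 2.5))
tmp <- tempfile()
write_dataset(df, tmp, partitioning = c("year", "month"), hive_style = FALSE)

# Only the segment names are supplied; the types are inferred
ds <- open_dataset(tmp, partitioning = dir_factory)
print(ds$schema)
```

Supplying a Schema is the safer choice when specific types are required, since a factory can only infer them from the values present in the paths.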