Pass a Partitioning
object to a FileSystemDatasetFactory's $create()
method to indicate how the file's paths should be interpreted to define
partitioning.
DirectoryPartitioning
describes how to interpret raw path segments, in
order. For example, schema(year = int16(), month = int8())
would define
partitions for file paths like "2019/01/file.parquet",
"2019/02/file.parquet", etc.
HivePartitioning
is for Hive-style partitioning, which embeds field
names and values in path segments, such as
"/year=2019/month=2/data.parquet". Because fields are named in the path
segments, order does not matter.
PartitioningFactory
subclasses instruct the DatasetFactory
to detect
partition features from the file paths.
Both DirectoryPartitioning$create()
and HivePartitioning$create()
methods take a Schema as a single input argument. The helper
function hive_partition(...)
is shorthand for
HivePartitioning$create(schema(...))
.
With DirectoryPartitioningFactory$create()
, you can provide just the
names of the path segments (in our example, c("year", "month")
), and
the DatasetFactory
will infer the data types for those partition variables.
HivePartitioningFactory$create()
takes no arguments: both variable names
and their types can be inferred from the file paths. hive_partition()
with
no arguments returns a HivePartitioningFactory
.