pyarrow.dataset.partitioning

pyarrow.dataset.partitioning(schema=None, field_names=None, flavor=None, dictionaries=None)

Specify a partitioning scheme.

The supported schemes include:

  • “DirectoryPartitioning”: this scheme expects one segment in the file path for each field in the specified schema (all fields are required to be present). For example, given schema<year:int16, month:int8>, the path “/2009/11” would be parsed to (“year” == 2009 and “month” == 11).

  • “HivePartitioning”: a scheme for “/$key=$value/” nested directories as found in Apache Hive. This is a multi-level, directory based partitioning scheme. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names. For example, given schema<year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15” (but the field order does not need to match).

  • “FilenamePartitioning”: this scheme expects partition filenames to contain the field values separated by “_”. For example, given schema<year:int16, month:int8>, a possible partition filename “2009_11_part-0.parquet” would be parsed to (“year” == 2009 and “month” == 11).

Parameters:
schema : pyarrow.Schema, default None

The schema that describes the partitions present in the file path. If not specified, and field_names and/or flavor are specified, the schema will be inferred from the file path (and a PartitioningFactory is returned).

field_names : list of str, default None

A list of strings (field names). If specified, the schema’s types are inferred from the file paths (only valid for DirectoryPartitioning).

flavor : str, default None

The default is DirectoryPartitioning. Specify flavor="hive" for a HivePartitioning, and flavor="filename" for a FilenamePartitioning.

dictionaries : dict[str, Array], default None

If the type of any field of schema is a dictionary type, the corresponding entry of dictionaries must be an array containing every value which may be taken by the corresponding column, or an error will be raised in parsing. Alternatively, pass dictionaries="infer" to have Arrow discover the dictionary values, in which case a PartitioningFactory is returned.

Returns:
Partitioning or PartitioningFactory

The partitioning scheme

Examples

Specify the Schema for paths like “/2009/June”:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> part = ds.partitioning(pa.schema([("year", pa.int16()),
...                                   ("month", pa.string())]))

or let the types be inferred by only specifying the field names:

>>> part = ds.partitioning(field_names=["year", "month"])

For paths like “/2009/June”, the year will be inferred as int32 while month will be inferred as string.

Specify a Schema with dictionary encoding, providing dictionary values:

>>> part = ds.partitioning(
...     pa.schema([
...         ("year", pa.int16()),
...         ("month", pa.dictionary(pa.int8(), pa.string()))
...     ]),
...     dictionaries={
...         "month": pa.array(["January", "February", "March"]),
...     })

Alternatively, specify a Schema with dictionary encoding, but have Arrow infer the dictionary values:

>>> part = ds.partitioning(
...     pa.schema([
...         ("year", pa.int16()),
...         ("month", pa.dictionary(pa.int8(), pa.string()))
...     ]),
...     dictionaries="infer")

Create a Hive scheme for a path like “/year=2009/month=11”:

>>> part = ds.partitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]),
...     flavor="hive")

A Hive scheme can also be discovered from the directory structure (and types will be inferred):

>>> part = ds.partitioning(flavor="hive")