pyarrow.dataset.DirectoryPartitioning
- class pyarrow.dataset.DirectoryPartitioning(Schema schema, dictionaries=None, segment_encoding='uri')
Bases: KeyValuePartitioning
A Partitioning based on a specified Schema.
The DirectoryPartitioning expects one segment in the file path for each field in the schema (all fields are required to be present). For example, given schema<year:int16, month:int8>, the path “/2009/11” would be parsed to (“year” == 2009 and “month” == 11).
- Parameters:
- schema : Schema
The schema that describes the partitions present in the file path.
- dictionaries : dict[str, Array]
If the type of any field of schema is a dictionary type, the corresponding entry of dictionaries must be an array containing every value which may be taken by the corresponding column, or an error will be raised in parsing.
- segment_encoding : str, default “uri”
After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).
- Returns:
DirectoryPartitioning
Examples
>>> import pyarrow as pa
>>> from pyarrow.dataset import DirectoryPartitioning
>>> partitioning = DirectoryPartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/2009/11/"))
((year == 2009) and (month == 11))
- __init__(*args, **kwargs)
Methods
- __init__(*args, **kwargs)
- discover([field_names, infer_dictionary, ...]): Discover a DirectoryPartitioning.
- format(self, expr): Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme.
- parse(self, path): Parse a path into a partition expression.
Attributes
- dictionaries: The unique values for each partition field, if available.
- schema: The arrow Schema attached to the partitioning.
- dictionaries
The unique values for each partition field, if available.
Those values are only available if the Partitioning object was created through dataset discovery from a PartitioningFactory, or if the dictionaries were manually specified in the constructor. If no dictionary field is available, this returns an empty list.
- static discover(field_names=None, infer_dictionary=False, max_partition_dictionary_size=0, schema=None, segment_encoding='uri')
Discover a DirectoryPartitioning.
- Parameters:
- field_names : list of str
The names to associate with the values from the subdirectory names. If schema is given, will be populated from the schema.
- infer_dictionary : bool, default False
When inferring a schema for partition fields, yield dictionary encoded types instead of plain types. This can be more efficient when materializing virtual columns, and Expressions parsed by the finished Partitioning will include dictionaries of all unique inspected values for each field.
- max_partition_dictionary_size : int, default 0
Synonymous with infer_dictionary for backwards compatibility with 1.0: setting this to -1 or None is equivalent to passing infer_dictionary=True.
- schema : Schema, default None
Use this schema instead of inferring a schema from partition values. Partition values will be validated against this schema before accumulation into the Partitioning’s dictionary.
- segment_encoding : str, default “uri”
After splitting paths into segments, decode the segments. Valid values are “uri” (URI-decode segments) and “none” (leave as-is).
- Returns:
PartitioningFactory
To be used in the FileSystemFactoryOptions.
- format(self, expr)
Convert a filter expression into a tuple of (directory, filename) using the current partitioning scheme
- Parameters:
- expr : pyarrow.dataset.Expression
- Returns:
tuple[str, str]
Examples
Specify the Schema for paths like “/2009/June”:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.compute as pc
>>> part = ds.partitioning(pa.schema([("year", pa.int16()),
...                                   ("month", pa.string())]))
>>> part.format(
...     (pc.field("year") == 1862) & (pc.field("month") == "Jan")
... )
('1862/Jan', '')
- schema
The arrow Schema attached to the partitioning.