pyarrow.dataset.HivePartitioning
class pyarrow.dataset.HivePartitioning
Bases: pyarrow._dataset.Partitioning
A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.
Multi-level, directory-based partitioning scheme originating from Apache Hive, with all data files stored in the leaf directories. Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names.
For example, given schema<year:int16, month:int8, day:int8>, a possible path would be “/year=2009/month=11/day=15”.
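The path convention described above can be illustrated with a short plain-Python sketch. This is not Arrow's implementation: the helper name parse_hive_path and the dict of per-field converters are hypothetical and exist only to show how key=value segments map to fields, and how field order and unrecognized or missing keys are tolerated.

```python
from typing import Callable, Dict


def parse_hive_path(
    path: str, fields: Dict[str, Callable[[str], object]]
) -> Dict[str, object]:
    """Extract key=value segments from a path, keeping only known fields."""
    values: Dict[str, object] = {}
    for segment in path.strip("/").split("/"):
        key, sep, raw = segment.partition("=")
        if not sep or key not in fields:
            # Not a key=value segment, or a field the schema does not know.
            continue
        values[key] = fields[key](raw)
    return values


# Hypothetical stand-in for schema<year:int16, month:int8, day:int8>.
fields = {"year": int, "month": int, "day": int}

full = parse_hive_path("/year=2009/month=11/day=15", fields)
# Different order, one unknown key, one missing field.
partial = parse_hive_path("/day=15/extra=x/year=2009", fields)
print(full)
print(partial)
```

In this sketch a missing field is simply absent from the result. The real class returns an Expression rather than a dict, as in the Examples section.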
- Parameters
schema (Schema) – The schema that describes the partitions present in the file path.
- Returns
HivePartitioning
Examples
>>> import pyarrow as pa
>>> from pyarrow.dataset import HivePartitioning
>>> partitioning = HivePartitioning(
...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> print(partitioning.parse("/year=2009/month=11"))
((year == 2009:int16) and (month == 11:int8))
__init__(*args, **kwargs)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(*args, **kwargs) – Initialize self.
discover() – Discover a HivePartitioning.
Attributes
schema – The arrow Schema attached to the partitioning.
static discover()
Discover a HivePartitioning.
- Parameters
infer_dictionary (bool, default False) – When inferring a schema for partition fields, yield dictionary-encoded types instead of plain types. This can be more efficient when materializing virtual columns, and Expressions parsed by the finished Partitioning will include dictionaries of all unique inspected values for each field.
max_partition_dictionary_size (int, default 0) – Synonymous with infer_dictionary for backwards compatibility with 1.0: setting this to -1 or None is equivalent to passing infer_dictionary=True.
- Returns
PartitioningFactory – To be used in the FileSystemFactoryOptions.
parse()
schema
The arrow Schema attached to the partitioning.