pyarrow.dataset.parquet_dataset

pyarrow.dataset.parquet_dataset(metadata_path, schema=None, filesystem=None, format=None, partitioning=None, partition_base_dir=None)

Create a FileSystemDataset from a _metadata file created via pyarrow.parquet.write_metadata.

Parameters
  • metadata_path (path) – Path pointing to a single Parquet metadata file.

  • schema (Schema, optional) – Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.

  • filesystem (FileSystem or URI string, default None) – If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. Note that URIs on Windows must follow the ‘file:///C:…’ or ‘file:/C:…’ patterns.

  • format (ParquetFileFormat) – An instance of a ParquetFileFormat, if special options need to be passed.

  • partitioning (Partitioning, PartitioningFactory, str, list of str) – The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.

  • partition_base_dir (str, optional) – For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.

Returns

FileSystemDataset