pyarrow.dataset.parquet_dataset

pyarrow.dataset.parquet_dataset(metadata_path, schema=None, filesystem=None, format=None, partitioning=None, partition_base_dir=None)

Create a FileSystemDataset from a _metadata file created via pyarrow.parquet.write_metadata.

Parameters:
metadata_path : path

Path pointing to a single Parquet metadata file.

schema : Schema, optional

Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.

filesystem : FileSystem or URI str, default None

If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. Note that URIs on Windows must follow the ‘file:///C:…’ or ‘file:/C:…’ patterns.

format : ParquetFileFormat

An instance of ParquetFileFormat, if special options need to be passed.

partitioning : Partitioning, PartitioningFactory, str, list of str

The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.

partition_base_dir : str, optional

For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.

Returns:
FileSystemDataset

The dataset corresponding to the given metadata.