pyarrow.dataset.parquet_dataset

pyarrow.dataset.parquet_dataset(metadata_path, schema=None, filesystem=None, format=None, partitioning=None, partition_base_dir=None)

Create a FileSystemDataset from a _metadata file created via pyarrow.parquet.write_metadata.

Parameters:

metadata_path (path): Path pointing to a single Parquet metadata file.
schema (Schema, optional): The Schema for the Dataset. If provided, it will not be inferred from the source.
filesystem (FileSystem or URI str, default None): If a single path is given as source and filesystem is None, the filesystem is inferred from the path. If a URI string is passed, a filesystem object is constructed using the URI's optional path component as a directory prefix. Note that URIs on Windows must follow the 'file:///C:…' or 'file:/C:…' patterns.
format (ParquetFileFormat): An instance of ParquetFileFormat, if special options need to be passed.
partitioning (Partitioning, PartitioningFactory, str, or list of str): The partitioning scheme, as specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.
partition_base_dir (str, optional): For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix are skipped during partitioning discovery. The skipped files are still part of the Dataset, but carry no partition information.

Returns:

FileSystemDataset: The dataset corresponding to the given metadata