pyarrow.parquet.ParquetDataset

class pyarrow.parquet.ParquetDataset(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True)[source]

Bases: object

Encapsulates the details of reading a complete Parquet dataset, possibly consisting of multiple files and partitions in subdirectories.

Parameters:
  • path_or_paths (str or List[str]) – A directory name, single file name, or list of file names
  • filesystem (FileSystem, default None) – If None, paths are assumed to be on the local on-disk filesystem
  • metadata (pyarrow.parquet.FileMetaData) – Use metadata obtained elsewhere to validate file schemas
  • schema (pyarrow.parquet.Schema) – Use schema obtained elsewhere to validate file schemas. Alternative to metadata parameter
  • split_row_groups (boolean, default False) – Divide files into pieces for each row group in the file
  • validate_schema (boolean, default True) – Check that individual file schemas are all the same / compatible
__init__(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True)[source]

Methods

__init__(path_or_paths[, filesystem, …])
read([columns, nthreads, use_pandas_metadata]) Read multiple Parquet files as a single pyarrow.Table
read_pandas(**kwargs) Read dataset including pandas metadata, if any.
validate_schemas()
read(columns=None, nthreads=1, use_pandas_metadata=False)[source]

Read multiple Parquet files as a single pyarrow.Table

Parameters:
  • columns (List[str]) – Names of columns to read from the file
  • nthreads (int, default 1) – Number of columns to read in parallel. Requires that the underlying file source is threadsafe
  • use_pandas_metadata (bool, default False) – Passed through to each dataset piece
Returns:

pyarrow.Table – Content of the file as a table (of columns)

read_pandas(**kwargs)[source]

Read dataset including pandas metadata, if any. Other arguments are passed through to ParquetDataset.read; see its docstring for further details.

Returns: pyarrow.Table – Content of the file as a table (of columns)
validate_schemas()[source]