pyarrow.parquet.ParquetDataset

class pyarrow.parquet.ParquetDataset(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=True)[source]

Bases: object

Encapsulates details of reading a complete Parquet dataset, possibly consisting of multiple files and partitions in subdirectories.

Parameters
  • path_or_paths (str or List[str]) – A directory name, single file name, or list of file names

  • filesystem (FileSystem, default None) – If not passed, paths are assumed to refer to the local on-disk filesystem

  • metadata (pyarrow.parquet.FileMetaData) – Use metadata obtained elsewhere to validate file schemas

  • schema (pyarrow.parquet.Schema) – Use schema obtained elsewhere to validate file schemas. Alternative to metadata parameter

  • split_row_groups (boolean, default False) – Divide files into pieces for each row group in the file

  • validate_schema (boolean, default True) – Check that individual file schemas are all the same / compatible

  • filters (List[Tuple] or List[List[Tuple]] or None (default)) –

    List of filters to apply, like [[('x', '=', 0), ...], ...]. This implements partition-level (hive) filtering only, i.e., to prevent the loading of some files of the dataset.

    Predicates are expressed in disjunctive normal form (DNF). This means that each innermost tuple describes a single column predicate. These inner predicates are all combined with a conjunction (AND) into a larger predicate. The outermost list then combines all of these larger predicates with a disjunction (OR). This makes it possible to express any filter that can be written using boolean logic.

    Filters may also be passed as a flat List[Tuple]. These predicates are evaluated as a conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.

  • metadata_nthreads (int, default 1) – Number of threads to allow in the thread pool used to read the dataset metadata. Increasing this can speed up reading partitioned datasets.

  • memory_map (boolean, default True) – If the source is a file path, use a memory map to read each file in the dataset if possible, which can improve performance in some environments
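
The DNF semantics described for the filters parameter can be sketched in plain Python. This is a minimal illustration only, not pyarrow's implementation: the helper names normalize_filters and partition_matches, the operator table, and the example partition keys (year, month) are all assumptions made for this sketch.

```python
import operator

# Hypothetical operator table for this sketch; the set of operators is an
# assumption and is not taken from the parameter description above.
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def normalize_filters(filters):
    """Wrap a flat List[Tuple] in an outer list so it becomes DNF."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def partition_matches(partition, filters):
    """Return True if a partition's key/value pairs satisfy the filters."""
    if filters is None:
        return True  # no filters: every partition is kept
    dnf = normalize_filters(filters)
    # Outer list: OR over conjunctions. Inner list: AND over predicates.
    return any(
        all(_OPS[op](partition[col], val) for col, op, val in conjunction)
        for conjunction in dnf
    )


# (year == 2020 AND month >= 6) OR (year == 2021)
filters = [[("year", "=", 2020), ("month", ">=", 6)], [("year", "=", 2021)]]

kept = partition_matches({"year": 2020, "month": 7}, filters)
skipped = partition_matches({"year": 2020, "month": 3}, filters)
# A flat list is treated as a single conjunction.
flat = partition_matches({"year": 2021, "month": 1}, [("year", "=", 2021)])
```

Here the first partition satisfies the first conjunction and is kept, the second satisfies neither conjunction and is skipped, and the flat list behaves like a single AND group.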

__init__(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, memory_map=True)[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(path_or_paths[, filesystem, …])

Initialize self.

equals(other)

read([columns, use_threads, use_pandas_metadata])

Read multiple Parquet files as a single pyarrow.Table

read_pandas(**kwargs)

Read dataset including pandas metadata, if any.

validate_schemas()

equals(other)[source]

read(columns=None, use_threads=True, use_pandas_metadata=False)[source]

Read multiple Parquet files as a single pyarrow.Table

Parameters
  • columns (List[str]) – Names of columns to read from the dataset

  • use_threads (boolean, default True) – Perform multi-threaded column reads

  • use_pandas_metadata (bool, default False) – Passed through to each dataset piece

Returns

pyarrow.Table – Content of the dataset as a table (of columns)

read_pandas(**kwargs)[source]

Read dataset including pandas metadata, if any. Other arguments are passed through to ParquetDataset.read; see its docstring for further details.

Returns

pyarrow.Table – Content of the dataset as a table (of columns)

validate_schemas()[source]