pyarrow.parquet.ParquetDataset

class pyarrow.parquet.ParquetDataset(path_or_paths=None, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, read_dictionary=None, memory_map=False, buffer_size=0, partitioning='hive', use_legacy_dataset=None, pre_buffer=True, coerce_int96_timestamp_unit=None)[source]

Bases: object

Encapsulates the details of reading a complete Parquet dataset, possibly consisting of multiple files and partitions in subdirectories.

Parameters
  • path_or_paths (str or List[str]) – A directory name, single file name, or list of file names.

  • filesystem (FileSystem, default None) – If None, paths are assumed to be on the local on-disk filesystem.

  • metadata (pyarrow.parquet.FileMetaData) – Use metadata obtained elsewhere to validate file schemas.

  • schema (pyarrow.parquet.Schema) – Use schema obtained elsewhere to validate file schemas. Alternative to metadata parameter.

  • split_row_groups (bool, default False) – Divide files into pieces for each row group in the file.

  • validate_schema (bool, default True) – Check that individual file schemas are all the same / compatible.

  • filters (List[Tuple] or List[List[Tuple]] or None (default)) –

    Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. If use_legacy_dataset is True, filters can only reference partition keys and only a hive-style directory structure is supported. If use_legacy_dataset is False, filtering on rows within files and other partitioning schemes are also supported.

    Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. Each inner list of predicates is interpreted as a conjunction (AND), forming a more selective, multi-column predicate. Finally, the outermost list combines these conjunctions as a disjunction (OR).

    Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.

    Each tuple has the format (key, op, value) and compares the key with the value. The supported ops are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple.

    Examples:

    ('x', '=', 0)
    ('y', 'in', ['a', 'b', 'c'])
    ('z', 'not in', {'a','b'})
    

  • metadata_nthreads (int, default 1) – Number of threads in the thread pool used to read the dataset metadata. Increasing this helps when reading partitioned datasets.

  • read_dictionary (list, default None) – List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded pass the column name. For nested types, you must pass the full column “path”, which could be something like level1.level2.list.item. Refer to the Parquet file’s schema to obtain the paths.

  • memory_map (bool, default False) – If the source is a file path, use a memory map to read the file, which can improve performance in some environments.

  • buffer_size (int, default 0) – If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.

  • partitioning (Partitioning or str or list of str, default “hive”) – The partitioning scheme for a partitioned dataset. The default of “hive” assumes directory names with key=value pairs like “/year=2009/month=11”. In addition, a scheme like “/2009/11” is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details.

  • use_legacy_dataset (bool, default True) – Set to False to enable the new code path (experimental, using the new Arrow Dataset API). Among other things, this allows passing filters for all columns and not only the partition keys, and enables different partitioning schemes.

  • pre_buffer (bool, default True) – Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. This option is only supported for use_legacy_dataset=False. If using a filesystem layer that itself performs readahead (e.g. fsspec’s S3FS), disable readahead for best results.

  • coerce_int96_timestamp_unit (str, default None) – Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’, so INT96 timestamps will be inferred as timestamps in nanoseconds.
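
The DNF semantics described for the filters parameter can be illustrated without pyarrow. The following is a minimal plain-Python sketch, not the library's implementation: the helper names (OPS, normalize, row_matches) and the sample rows are hypothetical, and rows are modelled as simple dicts.

```python
import operator

# Map each documented op string to a Python comparison.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, collection: value in collection,
    "not in": lambda value, collection: value not in collection,
}


def normalize(filters):
    """Wrap a flat List[Tuple] into List[List[Tuple]] (a single AND group)."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def row_matches(row, filters):
    """Return True if `row` (a dict) satisfies the DNF `filters`."""
    # Outer list: OR over groups. Inner list: AND over (key, op, value) tuples.
    return any(
        all(OPS[op](row[key], value) for key, op, value in conjunction)
        for conjunction in normalize(filters)
    )


# Hypothetical sample rows, for illustration only.
rows = [
    {"x": 0, "y": "a"},
    {"x": 1, "y": "b"},
    {"x": 2, "y": "z"},
]

# (x == 0 AND y in ['a', 'b']) OR (x >= 2)
dnf = [[("x", "=", 0), ("y", "in", ["a", "b"])], [("x", ">=", 2)]]
kept = [r for r in rows if row_matches(r, dnf)]

# A flat list is a single conjunction: x > 0 AND y != 'z'
flat = [("x", ">", 0), ("y", "!=", "z")]
kept_flat = [r for r in rows if row_matches(r, flat)]
```

Here kept holds the first and third rows, while kept_flat holds only the second, which shows why OR conditions require the nested List[List[Tuple]] form.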

__init__(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=1, read_dictionary=None, memory_map=False, buffer_size=0, partitioning='hive', use_legacy_dataset=True, pre_buffer=True, coerce_int96_timestamp_unit=None)[source]

Initialize self. See help(type(self)) for accurate signature.
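
To make the two directory layouts mentioned for the partitioning parameter concrete, here is a rough plain-Python sketch of how partition keys could be recovered from a path. It is an illustration only and not how pyarrow does it: the function names and the file name part-0.parquet are hypothetical, paths are assumed to be relative to the dataset root, and values are left as strings with no type inference.

```python
from pathlib import PurePosixPath


def parse_hive_path(path):
    """Extract key=value directory segments from a file path as a dict of strings."""
    keys = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # skip segments that are not of the form key=value
            keys[key] = value
    return keys


def parse_directory_path(path, field_names):
    """Pair bare directory segments with caller-supplied field names."""
    segments = [p for p in PurePosixPath(path).parent.parts if p != "/"]
    return dict(zip(field_names, segments))


# Hive-style layout: the field names are embedded in the directory names.
hive_keys = parse_hive_path("/year=2009/month=11/part-0.parquet")

# Bare layout: the field names must be supplied separately.
plain_keys = parse_directory_path("/2009/11/part-0.parquet", ["year", "month"])
```

Both calls yield the same mapping of year to '2009' and month to '11'. The second layout carries no field names in the path, which is why the field names or a full schema have to be specified for it.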

Methods

__init__(path_or_paths[, filesystem, …])

Initialize self.

equals(other)

read([columns, use_threads, use_pandas_metadata])

Read multiple Parquet files as a single pyarrow.Table.

read_pandas(**kwargs)

Read dataset including pandas metadata, if any.

validate_schemas()

Attributes

buffer_size

common_metadata

fs

memory_map

partitions

pieces

read_dictionary

property buffer_size
property common_metadata

equals(other)[source]
property fs
property memory_map
property partitions
property pieces
read(columns=None, use_threads=True, use_pandas_metadata=False)[source]

Read multiple Parquet files as a single pyarrow.Table.

Parameters
  • columns (List[str]) – Names of columns to read from the file.

  • use_threads (bool, default True) – Perform multi-threaded column reads.

  • use_pandas_metadata (bool, default False) – Passed through to each dataset piece.

Returns

pyarrow.Table – Content of the file as a table (of columns).

property read_dictionary
read_pandas(**kwargs)[source]

Read dataset including pandas metadata, if any. Other arguments passed through to ParquetDataset.read, see docstring for further details.

Parameters

**kwargs (optional) – All additional options to pass to the reader.

Returns

pyarrow.Table – Content of the file as a table (of columns).

validate_schemas()[source]