pyarrow.parquet.ParquetDataset#
- class pyarrow.parquet.ParquetDataset(path_or_paths=None, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=None, read_dictionary=None, memory_map=False, buffer_size=0, partitioning='hive', use_legacy_dataset=None, pre_buffer=True, coerce_int96_timestamp_unit=None, thrift_string_size_limit=None, thrift_container_size_limit=None)[source]#
Bases: object
Encapsulates details of reading a complete Parquet dataset possibly consisting of multiple files and partitions in subdirectories.
- Parameters:
- path_or_paths : str or List[str]
A directory name, single file name, or list of file names.
- filesystem : FileSystem, default None
If nothing is passed, the filesystem will be inferred from the path: the path is first looked up on the local on-disk filesystem, otherwise it is parsed as a URI to determine the filesystem.
- schema : pyarrow.parquet.Schema
Use a schema obtained elsewhere to validate file schemas. Alternative to the metadata parameter.
- metadata : pyarrow.parquet.FileMetaData
Use metadata obtained elsewhere to validate file schemas.
- split_row_groups : bool, default False
Divide files into pieces for each row group in the file.
- validate_schema : bool, default True
Check that individual file schemas are all the same / compatible.
- filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. If use_legacy_dataset is True, filters can only reference partition keys and only a hive-style directory structure is supported. When use_legacy_dataset is set to False, within-file filtering and other partitioning schemes are also supported.
Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the outermost list combines these filters as a disjunction (OR); a full sketch follows this parameter list.
Predicates may also be passed as a List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
Each tuple has the format (key, op, value) and compares the key with the value. The supported op values are: = or ==, !=, <, >, <=, >=, in and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple.
Examples:
Using the Expression API:
import pyarrow.compute as pc
pc.field('x') == 0
pc.field('y').isin(['a', 'b', 'c'])
~pc.field('y').isin({'a', 'b'})
Using the DNF format:
('x', '=', 0)
('y', 'in', ['a', 'b', 'c'])
('z', 'not in', {'a', 'b'})
- metadata_nthreads : int, default 1
How many threads to allow the thread pool which is used to read the dataset metadata. Increasing this is helpful to read partitioned datasets.
- read_dictionary : list, default None
List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded pass the column name. For nested types, you must pass the full column "path", which could be something like level1.level2.list.item. Refer to the Parquet file's schema to obtain the paths.
- memory_map : bool, default False
If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
- buffer_size : int, default 0
If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
- partitioning : pyarrow.dataset.Partitioning or str or list of str, default "hive"
The partitioning scheme for a partitioned dataset. The default of "hive" assumes directory names with key=value pairs like "/year=2009/month=11". In addition, a scheme like "/2009/11" is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details, and the partitioning sketch following this parameter list.
- use_legacy_dataset : bool, default False
Set to False to enable the new code path (using the new Arrow Dataset API). Among other things, this allows passing filters for all columns and not only the partition keys, enables different partitioning schemes, etc.
- pre_buffer : bool, default True
Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3, GCS). If True, Arrow will use a background I/O thread pool. This option is only supported for use_legacy_dataset=False. If using a filesystem layer that itself performs readahead (e.g. fsspec's S3FS), disable readahead for best results. Set to False if you want to prioritize minimal memory usage over maximum speed.
- coerce_int96_timestamp_unit : str, default None
Cast timestamps that are stored in INT96 format to a particular resolution (e.g. 'ms'). Setting to None is equivalent to 'ns' and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.
- thrift_string_size_limit : int, default None
If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- thrift_container_size_limit : int, default None
If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
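As a minimal sketch of the DNF form described under filters (assuming a dataset laid out like the one written in the Examples below, with a year partition column), an OR of two conjunctions can be passed like this:

>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset(
...     'dataset_name/',  # path of the example dataset written below
...     use_legacy_dataset=False,
...     # keep rows where (n_legs == 4 AND year >= 2021) OR (n_legs in {2, 5})
...     filters=[[('n_legs', '=', 4), ('year', '>=', 2021)],
...              [('n_legs', 'in', {2, 5})]])
>>> table = dataset.read()  # rows from any file matching either conjunction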
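And a sketch of the non-hive case mentioned under partitioning, assuming a hypothetical layout like 'dataset_root/2009/11/part-0.parquet' where the directory levels carry only values, so a schema must name the fields:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> # Field names and types here are illustrative assumptions.
>>> part = ds.partitioning(pa.schema([('year', pa.int16()),
...                                   ('month', pa.int8())]))
>>> dataset = pq.ParquetDataset('dataset_root/', use_legacy_dataset=False,
...                             partitioning=part)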
Examples
Generate an example PyArrow Table and write it to a partitioned dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_name',
...                     partition_cols=['year'],
...                     use_legacy_dataset=False)
create a ParquetDataset object from the dataset source:
>>> dataset = pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)
and read the data:
>>> dataset.read().to_pandas()
   n_legs         animal  year
0       5  Brittle stars  2019
1       2       Flamingo  2020
2       4            Dog  2021
3     100      Centipede  2021
4       2         Parrot  2022
5       4          Horse  2022
create a ParquetDataset object with filter:
>>> dataset = pq.ParquetDataset('dataset_name/', use_legacy_dataset=False,
...                             filters=[('n_legs', '=', 4)])
>>> dataset.read().to_pandas()
   n_legs animal  year
0       4    Dog  2021
1       4  Horse  2022
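Equivalently, a sketch of the same selection using a compute Expression instead of the tuple notation, on the dataset created above:

>>> import pyarrow.compute as pc
>>> dataset = pq.ParquetDataset('dataset_name/', use_legacy_dataset=False,
...                             filters=(pc.field('n_legs') == 4))
>>> df = dataset.read().to_pandas()  # expected to match the tuple-filter result above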
- __init__(path_or_paths, filesystem=None, schema=None, metadata=None, split_row_groups=False, validate_schema=True, filters=None, metadata_nthreads=None, read_dictionary=None, memory_map=False, buffer_size=0, partitioning='hive', use_legacy_dataset=None, pre_buffer=True, coerce_int96_timestamp_unit=None, thrift_string_size_limit=None, thrift_container_size_limit=None)[source]#
Methods
- __init__(path_or_paths[, filesystem, ...])
- equals(other)
- read([columns, use_threads, use_pandas_metadata]): Read multiple Parquet files as a single pyarrow.Table.
- read_pandas(**kwargs): Read dataset including pandas metadata, if any.
Attributes
- buffer_size: DEPRECATED
- common_metadata: DEPRECATED
- common_metadata_path: DEPRECATED
- files: A list of absolute Parquet file paths in the Dataset source.
- filesystem: The filesystem type of the Dataset source.
- fragments: A list of the Dataset source fragments or pieces with absolute file paths.
- fs: DEPRECATED
- memory_map: DEPRECATED
- metadata: DEPRECATED
- metadata_path: DEPRECATED
- partitioning: The partitioning of the Dataset source, if discovered.
- partitions: DEPRECATED
- pieces: DEPRECATED
- read_dictionary: DEPRECATED
- property buffer_size#
DEPRECATED
- property common_metadata#
DEPRECATED
- property common_metadata_path#
DEPRECATED
- property files#
A list of absolute Parquet file paths in the Dataset source. To use this property, set use_legacy_dataset=False when constructing the ParquetDataset object.
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_name_files',
...                     partition_cols=['year'],
...                     use_legacy_dataset=False)
>>> dataset = pq.ParquetDataset('dataset_name_files/',
...                             use_legacy_dataset=False)
List the files:
>>> dataset.files
['dataset_name_files/year=2019/...-0.parquet', ...
- property filesystem#
The filesystem type of the Dataset source. To use this property, set use_legacy_dataset=False when constructing the ParquetDataset object.
- property fragments#
A list of the Dataset source fragments or pieces with absolute file paths. To use this property, set use_legacy_dataset=False when constructing the ParquetDataset object.
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_name_fragments',
...                     partition_cols=['year'],
...                     use_legacy_dataset=False)
>>> dataset = pq.ParquetDataset('dataset_name_fragments/',
...                             use_legacy_dataset=False)
List the fragments:
>>> dataset.fragments
[<pyarrow.dataset.ParquetFileFragment path=dataset_name_fragments/...
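Each fragment can also be read on its own through the pyarrow.dataset API; a minimal sketch using the dataset created above:

>>> # Read a single fragment (one file of the partitioned dataset) into a Table.
>>> fragment = dataset.fragments[0]
>>> single_table = fragment.to_table(columns=['n_legs', 'animal'])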
- property fs#
DEPRECATED
- property memory_map#
DEPRECATED
- property metadata#
DEPRECATED
- property metadata_path#
DEPRECATED
- property partitioning#
The partitioning of the Dataset source, if discovered. To use this property, set use_legacy_dataset=False when constructing the ParquetDataset object.
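For a hive-partitioned dataset like the one written in the class-level Examples above, the discovered partition fields can be inspected through the partitioning's schema; a minimal sketch (the field types are whatever discovery inferred, and the partitioning may be None if nothing was discovered):

>>> dataset = pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)
>>> partition_schema = dataset.partitioning.schema  # contains the discovered 'year' field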
- property partitions#
DEPRECATED
- property pieces#
DEPRECATED
- read(columns=None, use_threads=True, use_pandas_metadata=False)[source]#
Read multiple Parquet files as a single pyarrow.Table.
- Parameters:
- columns : List[str] or None, default None
Names of columns to read from the dataset.
- use_threads : bool, default True
Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns:
pyarrow.Table
Content of the file as a table (of columns).
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_name_read',
...                     partition_cols=['year'],
...                     use_legacy_dataset=False)
>>> dataset = pq.ParquetDataset('dataset_name_read/',
...                             use_legacy_dataset=False)
Read multiple Parquet files as a single pyarrow.Table:
>>> dataset.read(columns=["n_legs"])
pyarrow.Table
n_legs: int64
----
n_legs: [[5],[2],[4,100],[2,4]]
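The result can be narrowed further and converted to pandas; a small sketch on the dataset above (the parameter choices are illustrative):

>>> # Read two columns without the thread pool and convert to a DataFrame.
>>> df = dataset.read(columns=['n_legs', 'animal'],
...                   use_threads=False).to_pandas()
>>> sorted(df.columns)
['animal', 'n_legs']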
- property read_dictionary#
DEPRECATED
- read_pandas(**kwargs)[source]#
Read dataset including pandas metadata, if any. Other arguments are passed through to ParquetDataset.read; see its docstring for further details.
- Parameters:
- **kwargs : optional
All additional options to pass to the reader.
- Returns:
pyarrow.Table
Content of the file as a table (of columns).
Examples
Generate an example PyArrow Table and write it to a partitioned dataset:
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                    'n_legs': [2, 2, 4, 4, 5, 100],
...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                               "Brittle stars", "Centipede"]})
>>> table = pa.Table.from_pandas(df)
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'table.parquet')
>>> dataset = pq.ParquetDataset('table.parquet',
...                             use_legacy_dataset=False)
Read dataset including pandas metadata:
>>> dataset.read_pandas(columns=["n_legs"])
pyarrow.Table
n_legs: int64
----
n_legs: [[2,2,4,4,5,100]]
Select pandas metadata:
>>> dataset.read_pandas(columns=["n_legs"]).schema.pandas_metadata
{'index_columns': [{'kind': 'range', 'name': None, 'start': 0, ...}
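Because that pandas metadata travels with the table, converting back to pandas is expected to restore the original DataFrame index; a minimal sketch on the dataset above:

>>> dataset.read_pandas().to_pandas().index
RangeIndex(start=0, stop=6, step=1)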
- property schema#