pyarrow.parquet.ParquetFile

class pyarrow.parquet.ParquetFile(source, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None)

Bases: object

Reader interface for a single Parquet file.

Parameters
source : str, pathlib.Path, pyarrow.NativeFile, or file-like object

Readable source. To pass bytes or a buffer-like object containing a Parquet file, wrap it in pyarrow.BufferReader.

metadata : FileMetaData, default None

Use existing metadata object, rather than reading from file.

common_metadata : FileMetaData, default None

Will be used in reads for pandas schema metadata if it is not found in the main file’s metadata; it has no other uses at the moment.

memory_map : bool, default False

If the source is a file path, use a memory map to read the file, which can improve performance in some environments.

buffer_size : int, default 0

If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.

pre_buffer : bool, default False

Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.

read_dictionary : list

List of column names to read directly as DictionaryArray.

coerce_int96_timestamp_unit : str, default None

Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’, so INT96 timestamps will be inferred as timestamps in nanoseconds.

__init__(source, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None)
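For example, a ParquetFile can be opened from a path or from in-memory Parquet bytes. A minimal sketch; the file name example.parquet and the tiny table are placeholders:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a small table so the example is self-contained.
    table = pa.table({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
    pq.write_table(table, "example.parquet")

    # Open from a path...
    pf = pq.ParquetFile("example.parquet")

    # ...or from in-memory Parquet bytes via pyarrow.BufferReader.
    with open("example.parquet", "rb") as f:
        pf_from_bytes = pq.ParquetFile(pa.BufferReader(f.read()))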

Methods

__init__(source[, metadata, ...])

iter_batches([batch_size, row_groups, ...])

Read streaming batches from a Parquet file.

read([columns, use_threads, use_pandas_metadata])

Read a Table from Parquet format.

read_row_group(i[, columns, use_threads, ...])

Read a single row group from a Parquet file.

read_row_groups(row_groups[, columns, ...])

Read multiple row groups from a Parquet file.

scan_contents([columns, batch_size])

Read the contents of the file for the given columns and batch size.

Attributes

metadata

num_row_groups

schema

Return the Parquet schema, unconverted to Arrow types

schema_arrow

Return the inferred Arrow schema, converted from the whole Parquet file's schema

iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)

Read streaming batches from a Parquet file.

Parameters
batch_size : int, default 64K

Maximum number of records to yield per batch. Batches may be smaller if there aren’t enough rows in the file.

row_groups : list

Only these row groups will be read from the file.

columns : list

If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns
iterator of pyarrow.RecordBatch

Contents of each batch as a record batch.
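A minimal sketch of streaming reads, reusing the placeholder example.parquet written above; batch_size=2 is deliberately tiny so the batching is visible:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    # Each iteration yields a pyarrow.RecordBatch of at most batch_size rows.
    for batch in pf.iter_batches(batch_size=2, columns=["a"]):
        print(batch.num_rows, batch.to_pydict())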

property metadata
property num_row_groups
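Both properties can be read without materializing any data. A quick sketch, continuing with the placeholder example.parquet:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    print(pf.num_row_groups)         # e.g. 1 for a small file
    print(pf.metadata.num_rows)      # FileMetaData also exposes row counts
    print(pf.metadata.num_columns)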
read(columns=None, use_threads=True, use_pandas_metadata=False)

Read a Table from Parquet format.

Parameters
columns : list

If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns
pyarrow.Table

Content of the file as a table (of columns).
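For example, a sketch reusing the placeholder example.parquet; the whole file or only a column projection can be read:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    full = pf.read()                  # whole file as a pyarrow.Table
    subset = pf.read(columns=["a"])   # only the projected column
    print(full.num_rows, subset.column_names)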

read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)

Read a single row group from a Parquet file.

Parameters
i : int

Index of the individual row group that we want to read.

columns : list

If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns
pyarrow.Table

Content of the row group as a table (of columns).
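For example, a sketch using the placeholder example.parquet; row groups are indexed from 0:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    # Read only the first row group, leaving the rest of the file untouched.
    first = pf.read_row_group(0, columns=["a"])
    print(first.num_rows)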

read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)

Read multiple row groups from a Parquet file.

Parameters
row_groups : list

Only these row groups will be read from the file.

columns : list

If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns
pyarrow.Table

Content of the row groups as a table (of columns).
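For example, to stitch a chosen subset of row groups into one table. A sketch; the index list assumes the placeholder file has at least one row group:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    # Indices into the file's row groups, combined into a single Table.
    t = pf.read_row_groups([0], columns=["a"])
    print(t.num_rows)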

scan_contents(columns=None, batch_size=65536)

Read the contents of the file for the given columns and batch size.

Parameters
columns : list of integers, default None

Select columns to read; if None, scan all columns.

batch_size : int, default 64K

Number of rows to read at a time internally.

Returns
num_rows : number of rows in file

Notes

This function’s primary purpose is benchmarking. The scan is executed on a single thread.
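A quick sketch of a benchmarking-style scan; note that columns here are integer indices, not names:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    # Scan only column 0 in 1024-row batches and return the row count.
    n = pf.scan_contents(columns=[0], batch_size=1024)
    print(n)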

property schema

Return the Parquet schema, unconverted to Arrow types

property schema_arrow

Return the inferred Arrow schema, converted from the whole Parquet file’s schema
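The two schema properties expose the same information in different type systems. A quick sketch, again using the placeholder example.parquet:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    print(pf.schema)        # Parquet-level schema (ParquetSchema)
    print(pf.schema_arrow)  # the same schema converted to Arrow types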