pyarrow.parquet.ParquetFile#

class pyarrow.parquet.ParquetFile(source, *, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None, page_checksum_verification=False)[source]#

Bases: object

Reader interface for a single Parquet file.

Parameters:
source : str, pathlib.Path, pyarrow.NativeFile, or file-like object

Readable source. For passing bytes or a buffer-like object containing a Parquet file, use pyarrow.BufferReader.

metadata : FileMetaData, default None

Use existing metadata object, rather than reading from file.

common_metadata : FileMetaData, default None

Will be used in reads for pandas schema metadata if not found in the main file’s metadata; no other uses at the moment.

read_dictionary : list

List of column names to read directly as DictionaryArray.

memory_map : bool, default False

If the source is a file path, use a memory map to read the file, which can improve performance in some environments.

buffer_size : int, default 0

If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.

pre_buffer : bool, default False

Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.

coerce_int96_timestamp_unit : str, default None

Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’, so INT96 timestamps will be inferred as timestamps in nanoseconds.

decryption_properties : FileDecryptionProperties, default None

File decryption properties for Parquet Modular Encryption.

thrift_string_size_limit : int, default None

If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

thrift_container_size_limit : int, default None

If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

filesystem : FileSystem, default None

If nothing passed, will be inferred based on path. The path will first be looked up in the local on-disk filesystem; otherwise it will be parsed as a URI to determine the filesystem.

page_checksum_verification : bool, default False

If True, verify the checksum for each page read from the file.

Examples

Generate an example PyArrow Table and write it to Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')

Create a ParquetFile object from the Parquet file:

>>> parquet_file = pq.ParquetFile('example.parquet')

Read the data:

>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Create a ParquetFile object with “animal” column as DictionaryArray:

>>> parquet_file = pq.ParquetFile('example.parquet',
...                               read_dictionary=["animal"])
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: dictionary<values=string, indices=int32, ordered=0>
----
n_legs: [[2,2,4,4,5,100]]
animal: [  -- dictionary:
["Flamingo","Parrot",...,"Brittle stars","Centipede"]  -- indices:
[0,1,2,3,4,5]]
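
The same file can also be opened with the performance-oriented options from the parameter list above; a minimal sketch (the memory_map and buffer_size values here are illustrative and affect how bytes are read, not the result):

>>> parquet_file = pq.ParquetFile('example.parquet',
...                               memory_map=True, buffer_size=4096)
>>> parquet_file.read().num_rows
6
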
__init__(source, *, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None, page_checksum_verification=False)[source]#

Methods

__init__(source, *[, metadata, ...])

close([force])

iter_batches([batch_size, row_groups, ...])

Read streaming batches from a Parquet file.

read([columns, use_threads, use_pandas_metadata])

Read a Table from Parquet format.

read_row_group(i[, columns, use_threads, ...])

Read a single row group from a Parquet file.

read_row_groups(row_groups[, columns, ...])

Read multiple row groups from a Parquet file.

scan_contents([columns, batch_size])

Read contents of file for the given columns and batch size.

Attributes

closed

metadata

Return the Parquet metadata.

num_row_groups

Return the number of row groups of the Parquet file.

schema

Return the Parquet schema, unconverted to Arrow types

schema_arrow

Return the inferred Arrow schema, converted from the whole Parquet file's schema

close(force: bool = False)[source]#
property closed: bool#
iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)[source]#

Read streaming batches from a Parquet file.

Parameters:
batch_size : int, default 64K

Maximum number of records to yield per batch. Batches may be smaller if there aren’t enough rows in the file.

row_groups : list

Only these row groups will be read from the file.

columns : list

If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Yields:
pyarrow.RecordBatch

Contents of each batch as a record batch

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> for i in parquet_file.iter_batches():
...     print("RecordBatch")
...     print(i.to_pandas())
...
RecordBatch
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede
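
Batch size and column selection can be combined; a minimal sketch (exact batch boundaries depend on the file’s row-group layout, so the counts below assume the single-row-group file written above):

>>> for batch in parquet_file.iter_batches(batch_size=3, columns=["animal"]):
...     print(batch.num_rows)
3
3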
property metadata#

Return the Parquet metadata.
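
Examples

A minimal sketch, reusing the example.parquet file written in the examples above; FileMetaData also exposes fields such as num_row_groups and created_by:

>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.metadata.num_rows
6
>>> parquet_file.metadata.num_columns
2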

property num_row_groups#

Return the number of row groups of the Parquet file.

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.num_row_groups
1
read(columns=None, use_threads=True, use_pandas_metadata=False)[source]#

Read a Table from Parquet format.

Parameters:
columns : list

If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:
pyarrow.table.Table

Content of the file as a table (of columns).

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

Read a Table:

>>> parquet_file.read(columns=["animal"])
pyarrow.Table
animal: string
----
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)[source]#

Read a single row group from a Parquet file.

Parameters:
i : int

Index of the individual row group that we want to read.

columns : list

If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:
pyarrow.table.Table

Content of the row group as a table (of columns)

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.read_row_group(0)
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)[source]#

Read multiple row groups from a Parquet file.

Parameters:
row_groups : list

Only these row groups will be read from the file.

columns : list

If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.

use_threads : bool, default True

Perform multi-threaded column reads.

use_pandas_metadata : bool, default False

If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:
pyarrow.table.Table

Content of the row groups as a table (of columns).

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.read_row_groups([0,0])
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,...,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
scan_contents(columns=None, batch_size=65536)[source]#

Read contents of file for the given columns and batch size.

Parameters:
columns : list of integers, default None

Select columns to read, if None scan all columns.

batch_size : int, default 64K

Number of rows to read at a time internally.

Returns:
num_rows : int

Number of rows in file

Notes

This function’s primary purpose is benchmarking. The scan is executed on a single thread.

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.scan_contents()
6
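
Columns can also be selected by index when scanning; a sketch (the return value is still the total number of rows in the file):

>>> parquet_file.scan_contents(columns=[0], batch_size=3)
6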
property schema#

Return the Parquet schema, unconverted to Arrow types
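
Examples

A minimal sketch, reusing the example.parquet file from above; the Parquet schema exposes the leaf column names:

>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.schema.names
['n_legs', 'animal']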

property schema_arrow#

Return the inferred Arrow schema, converted from the whole Parquet file’s schema

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

Read the Arrow schema:

>>> parquet_file.schema_arrow
n_legs: int64
animal: string