pyarrow.parquet.ParquetFile

class pyarrow.parquet.ParquetFile(source, *, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None)

Bases: object

Reader interface for a single Parquet file.

Parameters:

source : str, pathlib.Path, pyarrow.NativeFile, or file-like object
    Readable source. To pass bytes or a buffer-like object containing a Parquet file, use pyarrow.BufferReader.
metadata : FileMetaData, default None
    Use an existing metadata object rather than reading it from the file.
common_metadata : FileMetaData, default None
    Used in reads for pandas schema metadata if it is not found in the main file’s metadata. It has no other uses at the moment.
read_dictionary : list, default None
    List of column names to read directly as DictionaryArray.
memory_map : bool, default False
    If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
buffer_size : int, default 0
    If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
pre_buffer : bool, default False
    Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.
coerce_int96_timestamp_unit : str, default None
    Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting this to None is equivalent to ‘ns’, so INT96 timestamps will be inferred as timestamps in nanoseconds.
decryption_properties : FileDecryptionProperties, default None
    File decryption properties for Parquet Modular Encryption.
thrift_string_size_limit : int, default None
    If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
thrift_container_size_limit : int, default None
    If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
filesystem : FileSystem, default None
    If nothing is passed, the filesystem is inferred from the path. The path is first looked up in the local on-disk filesystem; if it is not found there, it is parsed as a URI to determine the filesystem.

Examples

Generate an example PyArrow Table and write it to Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})

>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')

Create a ParquetFile object from the Parquet file:

>>> parquet_file = pq.ParquetFile('example.parquet')

Read the data:

>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Create a ParquetFile object with “animal” column as DictionaryArray:

>>> parquet_file = pq.ParquetFile('example.parquet',
...                               read_dictionary=["animal"])
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: dictionary<values=string, indices=int32, ordered=0>
----
n_legs: [[2,2,4,4,5,100]]
animal: [  -- dictionary:
["Flamingo","Parrot",...,"Brittle stars","Centipede"]  -- indices:
[0,1,2,3,4,5]]

__init__(source, *, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None)

Methods

`__init__`(source, *[, metadata, ...])
`close`([force])
`iter_batches`([batch_size, row_groups, ...])	Read streaming batches from a Parquet file.
`read`([columns, use_threads, use_pandas_metadata])	Read a Table from Parquet format.
`read_row_group`(i[, columns, use_threads, ...])	Read a single row group from a Parquet file.
`read_row_groups`(row_groups[, columns, ...])	Read multiple row groups from a Parquet file.
`scan_contents`([columns, batch_size])	Read contents of file for the given columns and batch size.

Attributes

`closed`
`metadata`	Return the Parquet metadata.
`num_row_groups`	Return the number of row groups of the Parquet file.
`schema`	Return the Parquet schema, unconverted to Arrow types
`schema_arrow`	Return the inferred Arrow schema, converted from the whole Parquet file's schema

close(force: bool = False)

property closed: bool

iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)

Read streaming batches from a Parquet file.

Parameters:

batch_size : int, default 64K
    Maximum number of records to yield per batch. Batches may be smaller if there aren’t enough rows in the file.
row_groups : list
    Only these row groups will be read from the file.
columns : list
    If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Yields:

pyarrow.RecordBatch
    Contents of each batch as a record batch.

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> for i in parquet_file.iter_batches():
...     print("RecordBatch")
...     print(i.to_pandas())
...
RecordBatch
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede

property metadata

Return the Parquet metadata.

property num_row_groups

Return the number of row groups of the Parquet file.

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.num_row_groups
1

read(columns=None, use_threads=True, use_pandas_metadata=False)

Read a Table from Parquet format.

Parameters:

columns : list
    If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:

pyarrow.table.Table
    Content of the file as a table (of columns).

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

Read a Table:

>>> parquet_file.read(columns=["animal"])
pyarrow.Table
animal: string
----
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]

read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)

Read a single row group from a Parquet file.

Parameters:

i : int
    Index of the individual row group to read.
columns : list
    If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:

pyarrow.table.Table
    Content of the row group as a table (of columns).

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.read_row_group(0)
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]

read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)

Read multiple row groups from a Parquet file.

Parameters:

row_groups : list
    Only these row groups will be read from the file.
columns : list
    If not None, only these columns will be read from the row groups. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:

pyarrow.table.Table
    Content of the row groups as a table (of columns).

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.read_row_groups([0,0])
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,...,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]

scan_contents(columns=None, batch_size=65536)

Read contents of file for the given columns and batch size.

Parameters:

columns : list of integers, default None
    Select columns to read. If None, scan all columns.
batch_size : int, default 64K
    Number of rows to read at a time internally.

Returns:

num_rows : int
    Number of rows in the file.

Notes

This function’s primary purpose is benchmarking. The scan is executed on a single thread.

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.scan_contents()
6

property schema

Return the Parquet schema, unconverted to Arrow types.

property schema_arrow

Return the inferred Arrow schema, converted from the whole Parquet file’s schema

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

Read the Arrow schema:

>>> parquet_file.schema_arrow
n_legs: int64
animal: string
