pyarrow.parquet.ParquetFile
-
class pyarrow.parquet.ParquetFile(source, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None)
Bases: object
Reader interface for a single Parquet file.
- Parameters
source (str, pathlib.Path, pyarrow.NativeFile, or file-like object) – Readable source. To pass bytes or a buffer-like object containing a Parquet file, use pyarrow.BufferReader.
metadata (FileMetaData, default None) – Use existing metadata object, rather than reading from file.
common_metadata (FileMetaData, default None) – Used in reads for pandas schema metadata if it is not found in the main file’s metadata. It has no other uses at the moment.
memory_map (bool, default False) – If the source is a file path, use a memory map to read file, which can improve performance in some environments.
buffer_size (int, default 0) – If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
pre_buffer (bool, default False) – Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.
read_dictionary (list) – List of column names to read directly as DictionaryArray.
coerce_int96_timestamp_unit (str, default None) – Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting to None is equivalent to ‘ns’, so INT96 timestamps will be inferred as timestamps in nanoseconds.
-
__init__(source, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(source[, metadata, …]) – Initialize self.
iter_batches([batch_size, row_groups, …]) – Read streaming batches from a Parquet file.
read([columns, use_threads, use_pandas_metadata]) – Read a Table from Parquet format.
read_row_group(i[, columns, use_threads, …]) – Read a single row group from a Parquet file.
read_row_groups(row_groups[, columns, …]) – Read multiple row groups from a Parquet file.
scan_contents([columns, batch_size]) – Read contents of file for the given columns and batch size.
Attributes
schema – Return the Parquet schema, unconverted to Arrow types.
schema_arrow – Return the inferred Arrow schema, converted from the whole Parquet file’s schema.
-
iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)
Read streaming batches from a Parquet file.
- Parameters
batch_size (int, default 64K) – Maximum number of records to yield per batch. Batches may be smaller if there aren’t enough rows in the file.
row_groups (list) – Only these row groups will be read from the file.
columns (list) – If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads (boolean, default True) – Perform multi-threaded column reads.
use_pandas_metadata (boolean, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns
iterator of pyarrow.RecordBatch – Contents of each batch as a record batch
-
property metadata
-
property num_row_groups
-
read(columns=None, use_threads=True, use_pandas_metadata=False)
Read a Table from Parquet format.
- Parameters
columns (list) – If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads (bool, default True) – Perform multi-threaded column reads.
use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns
pyarrow.table.Table – Content of the file as a table (of columns).
-
read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)
Read a single row group from a Parquet file.
- Parameters
i (int) – Index of the individual row group that we want to read.
columns (list) – If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads (bool, default True) – Perform multi-threaded column reads.
use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns
pyarrow.table.Table – Content of the row group as a table (of columns)
-
read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)
Read multiple row groups from a Parquet file.
- Parameters
row_groups (list) – Only these row groups will be read from the file.
columns (list) – If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
use_threads (bool, default True) – Perform multi-threaded column reads.
use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
- Returns
pyarrow.table.Table – Content of the row groups as a table (of columns).
-
scan_contents(columns=None, batch_size=65536)
Read contents of file for the given columns and batch size.
Notes
This function’s primary purpose is benchmarking. The scan is executed on a single thread.
- Parameters
columns (list of integers, default None) – Select columns to read. If None, scan all columns.
batch_size (int, default 64K) – Number of rows to read at a time internally.
- Returns
num_rows (int) – Number of rows in file.
-
property schema
Return the Parquet schema, unconverted to Arrow types.
-
property schema_arrow
Return the inferred Arrow schema, converted from the whole Parquet file’s schema.