pyarrow.parquet.ParquetFile
class pyarrow.parquet.ParquetFile(source, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None)
- Bases: object
- Reader interface for a single Parquet file.
- Parameters
- source (str, pathlib.Path, pyarrow.NativeFile, or file-like object) – Readable source. To pass bytes or a buffer-like object containing a Parquet file, use pyarrow.BufferReader. 
- metadata (FileMetaData, default None) – Use existing metadata object, rather than reading from file. 
- common_metadata (FileMetaData, default None) – Used in reads for pandas schema metadata if it is not found in the main file’s metadata. There are no other uses at the moment. 
- memory_map (bool, default False) – If the source is a file path, use a memory map to read the file, which can improve performance in some environments. 
- buffer_size (int, default 0) – If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered. 
- pre_buffer (bool, default False) – Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool. 
- read_dictionary (list) – List of column names to read directly as DictionaryArray. 
- coerce_int96_timestamp_unit (str, default None) – Cast timestamps that are stored in INT96 format to a particular resolution (e.g. ‘ms’). Setting this to None is equivalent to ‘ns’, so INT96 timestamps will be inferred as timestamps in nanoseconds. 
 
__init__(source, metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None)
- Initialize self. See help(type(self)) for accurate signature. 
- Methods
- __init__(source[, metadata, …]) – Initialize self.
- iter_batches([batch_size, row_groups, …]) – Read streaming batches from a Parquet file.
- read([columns, use_threads, use_pandas_metadata]) – Read a Table from Parquet format.
- read_row_group(i[, columns, use_threads, …]) – Read a single row group from a Parquet file.
- read_row_groups(row_groups[, columns, …]) – Read multiple row groups from a Parquet file.
- scan_contents([columns, batch_size]) – Read contents of file for the given columns and batch size.
- Attributes
- schema – Return the Parquet schema, unconverted to Arrow types.
- schema_arrow – Return the inferred Arrow schema, converted from the whole Parquet file’s schema.
iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)
- Read streaming batches from a Parquet file.
- Parameters
- batch_size (int, default 64K) – Maximum number of records to yield per batch. Batches may be smaller if there aren’t enough rows in the file. 
- row_groups (list) – Only these row groups will be read from the file. 
- columns (list) – If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. 
- use_threads (bool, default True) – Perform multi-threaded column reads. 
- use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded. 
 
- Returns
- iterator of pyarrow.RecordBatch – Contents of each batch as a record batch 
 
property metadata
property num_row_groups
read(columns=None, use_threads=True, use_pandas_metadata=False)
- Read a Table from Parquet format.
- Parameters
- columns (list) – If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. 
- use_threads (bool, default True) – Perform multi-threaded column reads. 
- use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded. 
 
- Returns
- pyarrow.Table – Content of the file as a table (of columns). 
 
read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)
- Read a single row group from a Parquet file.
- Parameters
- i (int) – Index of the individual row group that we want to read. 
- columns (list) – If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. 
- use_threads (bool, default True) – Perform multi-threaded column reads. 
- use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded. 
 
- Returns
- pyarrow.Table – Content of the row group as a table (of columns). 
 
read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)
- Read multiple row groups from a Parquet file.
- Parameters
- row_groups (list) – Only these row groups will be read from the file. 
- columns (list) – If not None, only these columns will be read from the row groups. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. 
- use_threads (bool, default True) – Perform multi-threaded column reads. 
- use_pandas_metadata (bool, default False) – If True and file has custom pandas schema metadata, ensure that index columns are also loaded. 
 
- Returns
- pyarrow.Table – Content of the row groups as a table (of columns). 
 
scan_contents(columns=None, batch_size=65536)
- Read contents of file for the given columns and batch size.
- Notes
- This function’s primary purpose is benchmarking. The scan is executed on a single thread.
- Parameters
- columns (list of integers, default None) – Select columns to read, if None scan all columns. 
- batch_size (int, default 64K) – Number of rows to read at a time internally. 
 
- Returns
- num_rows (int) – Number of rows in the file. 
 
property schema
- Return the Parquet schema, unconverted to Arrow types.
property schema_arrow
- Return the inferred Arrow schema, converted from the whole Parquet file’s schema.
 
