pyarrow.parquet.ParquetFile

class pyarrow.parquet.ParquetFile(source, metadata=None, common_metadata=None)

Bases: object

Reader interface for a single Parquet file.

Parameters:
  • source (str or pyarrow.io.NativeFile) – Readable source. For passing Python file objects or byte buffers, see pyarrow.io.PythonFileInterface or pyarrow.io.BufferReader.
  • metadata (ParquetFileMetadata, default None) – Use existing metadata object, rather than reading from file.
  • common_metadata (ParquetFileMetadata, default None) – Used in reads to supply pandas schema metadata when it is not found in the main file’s metadata. It currently has no other use.
__init__(source, metadata=None, common_metadata=None)

Methods

__init__(source[, metadata, common_metadata])
read([columns, nthreads, use_pandas_metadata]) Read a Table from Parquet format
read_row_group(i[, columns, nthreads, …]) Read a single row group from a Parquet file
scan_contents([columns, batch_size]) Read contents of file with a single thread for indicated columns and batch size.
metadata
num_row_groups
read(columns=None, nthreads=1, use_pandas_metadata=False)

Read the whole file into a Table.

Parameters:
  • columns (list) – If not None, only these columns will be read from the file.
  • nthreads (int, default 1) – Number of columns to read in parallel. If greater than 1, the underlying file source must be thread-safe.
  • use_pandas_metadata (boolean, default False) – If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.
Returns:

pyarrow.table.Table – Content of the file as a table (of columns)

read_row_group(i, columns=None, nthreads=1, use_pandas_metadata=False)

Read a single row group from a Parquet file.

Parameters:
  • i (int) – Index of the row group to read.
  • columns (list) – If not None, only these columns will be read from the row group.
  • nthreads (int, default 1) – Number of columns to read in parallel. If greater than 1, the underlying file source must be thread-safe.
  • use_pandas_metadata (boolean, default False) – If True and the file has custom pandas schema metadata, ensure that index columns are also loaded.
Returns:

pyarrow.table.Table – Content of the row group as a table (of columns)

scan_contents(columns=None, batch_size=65536)

Read the contents of the file with a single thread for the indicated columns and batch size, and return the number of rows in the file. This function is used for benchmarking.

Parameters:
  • columns (list of integers, default None) – If None, scan all columns
  • batch_size (int, default 65536) – Number of rows to read at a time internally
Returns:

num_rows (number of rows in file)

schema