pyarrow.parquet.ParquetFile
- class pyarrow.parquet.ParquetFile(source, *, metadata=None, common_metadata=None, read_dictionary=None, binary_type=None, list_type=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None, page_checksum_verification=False, arrow_extensions_enabled=True)

Bases: object

Reader interface for a single Parquet file.

Parameters:
- source : str, pathlib.Path, pyarrow.NativeFile, or file-like object
- Readable source. To pass bytes or a buffer-like object containing a Parquet file, use pyarrow.BufferReader (see the sketch below this parameter list).
- metadata : FileMetaData, default None
- Use existing metadata object, rather than reading from file.
- common_metadata : FileMetaData, default None
- Will be used in reads for pandas schema metadata if not found in the main file's metadata; no other uses at the moment.
- read_dictionary : list
- List of column names to read directly as DictionaryArray.
- binary_type : pyarrow.DataType, default None
- If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
- list_type : subclass of pyarrow.DataType, default None
- If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
- memory_map : bool, default False
- If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
- buffer_size : int, default 0
- If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
- pre_buffer : bool, default False
- Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.
- coerce_int96_timestamp_unit : str, default None
- Cast timestamps that are stored in INT96 format to a particular resolution (e.g. 'ms'). Setting to None is equivalent to 'ns', so INT96 timestamps will be inferred as timestamps in nanoseconds.
- decryption_properties : FileDecryptionProperties, default None
- File decryption properties for Parquet Modular Encryption.
- thrift_string_size_limit : int, default None
- If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- thrift_container_size_limit : int, default None
- If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
- filesystem : FileSystem, default None
- If nothing passed, will be inferred based on path. The path will first be looked up in the local on-disk filesystem; otherwise it will be parsed as a URI to determine the filesystem.
- page_checksum_verification : bool, default False
- If True, verify the checksum for each page read from the file.
- arrow_extensions_enabled : bool, default True
- If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).
 
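As noted for the source parameter above, in-memory Parquet data can be wrapped in pyarrow.BufferReader instead of being written to disk. A minimal sketch of that pattern, building the in-memory file with pa.BufferOutputStream (any bytes or buffer holding a serialized Parquet file works the same way):

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> sink = pa.BufferOutputStream()                    # in-memory sink instead of a file on disk
>>> pq.write_table(pa.table({'x': [1, 2, 3]}), sink)  # serialize a tiny table into the buffer
>>> pf = pq.ParquetFile(pa.BufferReader(sink.getvalue()))
>>> pf.read().num_rows
3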
Examples

Generate an example PyArrow Table and write it to Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')

Create a ParquetFile object from the Parquet file:

>>> parquet_file = pq.ParquetFile('example.parquet')

Read the data:

>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Create a ParquetFile object with "animal" column as DictionaryArray:

>>> parquet_file = pq.ParquetFile('example.parquet',
...                               read_dictionary=["animal"])
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: dictionary<values=string, indices=int32, ordered=0>
----
n_legs: [[2,2,4,4,5,100]]
animal: [  -- dictionary:
["Flamingo","Parrot",...,"Brittle stars","Centipede"]  -- indices:
[0,1,2,3,4,5]]

- __init__(source, *, metadata=None, common_metadata=None, read_dictionary=None, binary_type=None, list_type=None, memory_map=False, buffer_size=0, pre_buffer=False, coerce_int96_timestamp_unit=None, decryption_properties=None, thrift_string_size_limit=None, thrift_container_size_limit=None, filesystem=None, page_checksum_verification=False, arrow_extensions_enabled=True)
Methods

- __init__(source, *[, metadata, ...])
- close([force])
- iter_batches([batch_size, row_groups, ...]): Read streaming batches from a Parquet file.
- read([columns, use_threads, use_pandas_metadata]): Read a Table from Parquet format.
- read_row_group(i[, columns, use_threads, ...]): Read a single row group from a Parquet file.
- read_row_groups(row_groups[, columns, ...]): Read multiple row groups from a Parquet file.
- scan_contents([columns, batch_size]): Read contents of file for the given columns and batch size.

Attributes

- metadata: Return the Parquet metadata.
- num_row_groups: Return the number of row groups of the Parquet file.
- schema: Return the Parquet schema, unconverted to Arrow types.
- schema_arrow: Return the inferred Arrow schema, converted from the whole Parquet file's schema.

- iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)
Read streaming batches from a Parquet file.

Parameters:
- batch_size : int, default 64K
- Maximum number of records to yield per batch. Batches may be smaller if there aren't enough rows in the file.
- row_groups : list
- Only these row groups will be read from the file.
- columns : list
- If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- use_threads : bool, default True
- Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
- If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
 
- Yields:
- pyarrow.RecordBatch
- Contents of each batch as a record batch 
 
Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> for i in parquet_file.iter_batches():
...     print("RecordBatch")
...     print(i.to_pandas())
...
RecordBatch
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede
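The batch_size, row_groups and columns parameters described above can be combined to stream only part of the file. A minimal sketch, reusing the example.parquet file generated above (six rows in a single row group, so this should yield three two-row batches):

>>> for batch in parquet_file.iter_batches(batch_size=2, columns=["animal"],
...                                        row_groups=[0]):
...     print(batch.num_rows)
2
2
2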
- property metadata
- Return the Parquet metadata. 
- property num_row_groups

Return the number of row groups of the Parquet file.

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.num_row_groups
1
- read(columns=None, use_threads=True, use_pandas_metadata=False)

Read a Table from Parquet format.

Parameters:
- columns : list
- If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- use_threads : bool, default True
- Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
- If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
 
- Returns:
- pyarrow.table.Table
- Content of the file as a table (of columns). 
 
Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

Read a Table:

>>> parquet_file.read(columns=["animal"])
pyarrow.Table
animal: string
----
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
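The columns and use_threads parameters can also be combined. A minimal sketch, again reusing example.parquet from above:

>>> t = parquet_file.read(columns=["n_legs"], use_threads=False)
>>> t.num_rows, t.column_names
(6, ['n_legs'])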
- read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)

Read a single row group from a Parquet file.

Parameters:
- i : int
- Index of the individual row group that we want to read.
- columns : list
- If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- use_threads : bool, default True
- Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
- If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
 
- Returns:
- pyarrow.table.Table
- Content of the row group as a table (of columns) 
 
Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.read_row_group(0)
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
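To read every row group in turn, the index can be driven by num_row_groups. A minimal sketch (the example file holds a single row group):

>>> parts = [parquet_file.read_row_group(i)
...          for i in range(parquet_file.num_row_groups)]
>>> pa.concat_tables(parts).num_rows
6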
- read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)

Read multiple row groups from a Parquet file.

Parameters:
- row_groups : list
- Only these row groups will be read from the file.
- columns : list
- If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- use_threads : bool, default True
- Perform multi-threaded column reads.
- use_pandas_metadata : bool, default False
- If True and file has custom pandas schema metadata, ensure that index columns are also loaded.
 
- Returns:
- pyarrow.table.Table
- Content of the row groups as a table (of columns). 
 
Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.read_row_groups([0,0])
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,...,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
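The row_groups selection can also be combined with the columns parameter. A minimal sketch, reusing example.parquet:

>>> parquet_file.read_row_groups([0], columns=["animal"]).column_names
['animal']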
- scan_contents(columns=None, batch_size=65536)

Read contents of file for the given columns and batch size.

Parameters:
- columns : list
- If not None, only these columns will be read from the file.
- batch_size : int, default 64K
- Number of rows to read at a time internally.
- Returns:
- num_rows : int
- Number of rows in file 
 
Notes

This function's primary purpose is benchmarking. The scan is executed on a single thread.

Examples

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

>>> parquet_file.scan_contents()
6
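A smaller batch_size only changes how the scan is chunked internally; the returned row count should be unchanged. A minimal sketch:

>>> parquet_file.scan_contents(batch_size=2)
6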
- property schema
- Return the Parquet schema, unconverted to Arrow types 
- property schema_arrow

Return the inferred Arrow schema, converted from the whole Parquet file's schema

Examples

Generate an example Parquet file:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')

Read the Arrow schema:

>>> parquet_file.schema_arrow
n_legs: int64
animal: string
 
 
    