pyarrow.dataset.ParquetFragmentScanOptions

class pyarrow.dataset.ParquetFragmentScanOptions(bool use_buffered_stream=False, *, buffer_size=8192, bool pre_buffer=True, cache_options=None, thrift_string_size_limit=None, thrift_container_size_limit=None, decryption_config=None, decryption_properties=None, bool page_checksum_verification=False, bool arrow_extensions_enabled=False)

Bases: FragmentScanOptions

Scan-specific options for Parquet fragments.

Parameters:

use_buffered_stream : bool, default False
    Read files through buffered input streams rather than loading entire row groups at once. Enabling this may reduce memory overhead.
buffer_size : int, default 8192
    Size in bytes of the buffered stream. Only used when use_buffered_stream is enabled. The default is 8 KiB.
pre_buffer : bool, default True
    If enabled, pre-buffer the raw Parquet data instead of issuing one read per column chunk. This can improve performance on high-latency filesystems (e.g. S3, GCS) by coalescing file reads and issuing them in parallel on a background I/O thread pool. Set to False to prioritize minimal memory usage over maximum speed.
cache_options : pyarrow.CacheOptions, default None
    Cache options used when pre_buffer is enabled. The default values should be good for most use cases. You may want to adjust them if, for example, latency to the file system is exceptionally high.
thrift_string_size_limit : int, default None
    If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
thrift_container_size_limit : int, default None
    If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None
    If not None, use the provided ParquetDecryptionConfig to decrypt the Parquet file.
decryption_properties : pyarrow.parquet.FileDecryptionProperties, default None
    If not None, use the provided FileDecryptionProperties to decrypt the encrypted Parquet file.
page_checksum_verification : bool, default False
    If True, verify the page checksum for each page read from the file.
arrow_extensions_enabled : bool, default False
    If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).

__init__(*args, **kwargs)

Methods

`__init__`(*args, **kwargs)
`equals`(self, ParquetFragmentScanOptions other)

Attributes

`arrow_extensions_enabled`
`buffer_size`
`cache_options`
`decryption_properties`
`page_checksum_verification`
`parquet_decryption_config`
`pre_buffer`
`thrift_container_size_limit`
`thrift_string_size_limit`
`type_name`
`use_buffered_stream`

arrow_extensions_enabled

buffer_size

cache_options

decryption_properties

equals(self, ParquetFragmentScanOptions other)

Parameters:

other : pyarrow.dataset.ParquetFragmentScanOptions

Returns:

bool

page_checksum_verification

parquet_decryption_config

pre_buffer

thrift_container_size_limit

thrift_string_size_limit

type_name

use_buffered_stream