pyarrow.dataset.ParquetFragmentScanOptions

class pyarrow.dataset.ParquetFragmentScanOptions(bool use_buffered_stream=False, *, buffer_size=8192, bool pre_buffer=True, cache_options=None, thrift_string_size_limit=None, thrift_container_size_limit=None, decryption_config=None, bool page_checksum_verification=False)

Bases: FragmentScanOptions

Scan-specific options for Parquet fragments.

Parameters:
use_buffered_stream : bool, default False

Read files through buffered input streams rather than loading entire row groups at once. This may be enabled to reduce memory overhead. Disabled by default.

buffer_size : int, default 8192

Size of the stream buffer in bytes. Only used if use_buffered_stream is enabled. Default is 8 KiB (8192 bytes).

pre_buffer : bool, default True

If enabled, pre-buffer the raw Parquet data instead of issuing one read per column chunk. This can improve performance on high-latency filesystems (e.g. S3, GCS) by coalescing and issuing file reads in parallel using a background I/O thread pool. Set to False if you want to prioritize minimal memory usage over maximum speed.

cache_options : pyarrow.CacheOptions, default None

Cache options used when pre_buffer is enabled. The default values should be good for most use cases. You may want to adjust them, for example, if latency to the file system is exceptionally high.

thrift_string_size_limit : int, default None

If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

thrift_container_size_limit : int, default None

If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.

decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None

If not None, use the provided ParquetDecryptionConfig to decrypt the Parquet file.

page_checksum_verification : bool, default False

If True, verify the page checksum for each page read from the file.

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

equals(self, ParquetFragmentScanOptions other)


Attributes

buffer_size

cache_options

page_checksum_verification

parquet_decryption_config

pre_buffer

thrift_container_size_limit

thrift_string_size_limit

type_name

use_buffered_stream

buffer_size
cache_options
equals(self, ParquetFragmentScanOptions other)
Parameters:
other : pyarrow.dataset.ParquetFragmentScanOptions
Returns:
bool
page_checksum_verification
parquet_decryption_config
pre_buffer
thrift_container_size_limit
thrift_string_size_limit
type_name
use_buffered_stream