pub struct CdcOptions {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}
EXPERIMENTAL: Options for content-defined chunking (CDC).
Content-defined chunking is an experimental feature that optimizes Parquet files for content-addressable storage (CAS) systems by writing data pages according to content-defined chunk boundaries. Because identical data then tends to produce identical pages, data can be deduplicated more efficiently across files, which reduces both network transfer and storage costs.
Each content-defined chunk is written as a separate parquet data page. The following options control the chunks’ size and the chunking process. Note that the chunk size is calculated based on the logical value of the data, before any encoding or compression is applied.
Fields
min_chunk_size: usize
Minimum chunk size in bytes, default is 256 KiB. The rolling hash will not be updated until this size is reached for each chunk. Note that all data sent through the hash function is counted towards the chunk size, including definition and repetition levels if present.
max_chunk_size: usize
Maximum chunk size in bytes, default is 1024 KiB.
The chunker will create a new chunk whenever the chunk size exceeds this value.
Note that the parquet writer has a related data_page_size_limit property that
controls the maximum size of a parquet data page after encoding. While setting
data_page_size_limit to a smaller value than max_chunk_size doesn’t affect
the chunking effectiveness, it results in more small parquet data pages.
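The interplay of the minimum and maximum sizes can be illustrated with a toy gear-hash chunker. This is a minimal sketch under stated assumptions, not the crate's implementation: `gear_table`, `toy_chunk_lengths`, and the `mask` argument are hypothetical names, the gear table is generated rather than fixed, and the sketch hashes raw bytes, whereas the documented chunker hashes logical values together with definition and repetition levels. It also cuts as soon as the maximum size is reached, so no chunk exceeds it.

```rust
/// Builds a deterministic pseudo-random gear table with splitmix64.
/// Assumption for illustration only: a production chunker would ship a fixed table.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Splits `data` into chunks and returns the chunk lengths (hypothetical helper).
/// - Below `min_chunk_size` the rolling hash is not updated at all.
/// - A boundary is declared when the masked hash is zero.
/// - A boundary is forced once the chunk reaches `max_chunk_size`.
pub fn toy_chunk_lengths(
    data: &[u8],
    min_chunk_size: usize,
    max_chunk_size: usize,
    mask: u64,
) -> Vec<usize> {
    let table = gear_table();
    let mut lengths = Vec::new();
    let mut hash: u64 = 0;
    let mut size: usize = 0;
    for &byte in data {
        size += 1;
        if size < min_chunk_size {
            continue; // still below the minimum: skip hashing entirely
        }
        // Gear-hash update: shift out old bytes, mix in the new one.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        if (hash & mask) == 0 || size >= max_chunk_size {
            lengths.push(size);
            hash = 0;
            size = 0;
        }
    }
    if size > 0 {
        lengths.push(size); // trailing partial chunk
    }
    lengths
}
```

Every chunk except possibly the last has a length between the minimum and the maximum, and the lengths always sum to the input length. A mask with more set bits makes the masked hash hit zero less often, so more chunks end up being cut by the maximum size instead of by content.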
norm_level: i32
Number of bits by which the gearhash mask is adjusted in order to center the chunk size around the average size more aggressively, default is 0. Increasing the normalization level increases the probability of finding a chunk boundary, which improves the deduplication ratio but also increases the number of small chunks, resulting in many small parquet data pages. The default value provides a good balance between deduplication ratio and fragmentation. Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the expense of fragmentation. Negative values can also be used to reduce the probability of finding a chunk boundary, resulting in larger chunks and fewer data pages. Values outside [-3, 3] are not recommended; prefer the default value of 0 for most use cases.
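The effect of norm_level on the boundary probability can be sketched as follows. This is an illustration of the general idea only: the formula below is an assumption, not the crate's actual mask computation, and `toy_mask_bits`, `toy_mask`, and `boundary_probability` are hypothetical helpers. The sketch derives a bit count from the distance between the minimum and the average chunk size, then subtracts norm_level, so that a higher level leaves fewer mask bits and makes a boundary more likely.

```rust
/// Hypothetical helper: number of mask bits for the given sizes and level.
/// Assumed formula: floor(log2(average - minimum)) - norm_level, clamped to 1..=63.
pub fn toy_mask_bits(min_chunk_size: usize, max_chunk_size: usize, norm_level: i32) -> u32 {
    let average = (min_chunk_size + max_chunk_size) / 2;
    let target = average.saturating_sub(min_chunk_size).max(1);
    // floor(log2(target)), computed from the leading zero count
    let base_bits = (usize::BITS - 1 - target.leading_zeros()) as i32;
    (base_bits - norm_level).clamp(1, 63) as u32
}

/// Hypothetical helper: a mask with the top `bits` bits set (1 <= bits <= 63).
pub fn toy_mask(bits: u32) -> u64 {
    u64::MAX << (64 - bits)
}

/// Probability that a uniformly random 64-bit hash has all masked bits zero.
pub fn boundary_probability(bits: u32) -> f64 {
    1.0 / (1u64 << bits) as f64
}
```

Under these assumptions, each increment of norm_level removes one mask bit and doubles the per-value boundary probability, while each decrement halves it. That matches the documented direction of the option: positive levels give more, smaller chunks, and negative levels give fewer, larger ones.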
Trait Implementations
impl Clone for CdcOptions
fn clone(&self) -> CdcOptions
fn clone_from(&mut self, source: &Self)