Struct CdcOptions 

pub struct CdcOptions {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

EXPERIMENTAL: Options for content-defined chunking (CDC).

Content-defined chunking is an experimental feature that optimizes Parquet files for content-addressable storage (CAS) systems by writing data pages according to content-defined chunk boundaries. This enables more efficient deduplication of data across files, and therefore more efficient network transfers and storage.

Each content-defined chunk is written as a separate Parquet data page. The following options control the chunk size and the chunking process. Note that the chunk size is calculated from the logical values of the data, before any encoding or compression is applied.

Fields§

§min_chunk_size: usize

Minimum chunk size in bytes; the default is 256 KiB. The rolling hash is not updated until a chunk reaches this size. Note that all data sent through the hash function counts towards the chunk size, including definition and repetition levels if present.

§max_chunk_size: usize

Maximum chunk size in bytes; the default is 1024 KiB (1 MiB). The chunker creates a new chunk whenever the chunk size exceeds this value. Note that the Parquet writer has a related data_page_size_limit property that controls the maximum size of a Parquet data page after encoding. Setting data_page_size_limit to a value smaller than max_chunk_size does not affect chunking effectiveness, but it results in a larger number of small Parquet data pages.
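
The interplay of the minimum and maximum sizes can be sketched with a toy chunker. This is an illustration only, not the crate's implementation: the function names `gear_table`, `chunk_lengths`, and `demo_data`, the table contents, the hash update, and the choice to cut as soon as the maximum is reached are all assumptions made for the example.

```rust
/// Hypothetical 256-entry table of pseudo-random values (splitmix64 steps).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Toy content-defined chunker: returns the length of every chunk in `data`.
/// Assumes `min_size <= max_size`.
fn chunk_lengths(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let gear = gear_table();
    let mut lengths = Vec::new();
    let mut hash: u64 = 0;
    let mut size = 0usize;
    for &byte in data {
        size += 1;
        if size < min_size {
            // Below the minimum size the rolling hash is not updated.
            continue;
        }
        // Gear-style rolling hash update (toy version).
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        // Cut when the masked hash bits are all zero, or when the chunk
        // has grown to the maximum size (a choice made for this sketch).
        if hash & mask == 0 || size >= max_size {
            lengths.push(size);
            hash = 0;
            size = 0;
        }
    }
    if size > 0 {
        // Trailing bytes form a final, possibly short, chunk.
        lengths.push(size);
    }
    lengths
}

/// Deterministic demo input; any byte stream would do.
fn demo_data(len: usize) -> Vec<u8> {
    (0..len as u32)
        .map(|i| (i.wrapping_mul(2_654_435_761) >> 24) as u8)
        .collect()
}

fn main() {
    let data = demo_data(200_000);
    // Mask with the top 8 bits set.
    let mask = u64::MAX << 56;
    let lengths = chunk_lengths(&data, 256, 4096, mask);
    println!(
        "{} chunks, longest {} bytes",
        lengths.len(),
        lengths.iter().max().unwrap()
    );
}
```

In this sketch every chunk except possibly the last is at least `min_size` bytes long, and no chunk is longer than `max_size`.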

§norm_level: i32

Number of bits by which to adjust the gearhash mask in order to center the chunk size around the average size more aggressively; the default is 0. Increasing the normalization level increases the probability of finding a chunk boundary, which improves the deduplication ratio but also increases the number of small chunks, resulting in many small Parquet data pages. The default value provides a good balance between deduplication ratio and fragmentation. Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the expense of fragmentation. Negative values reduce the probability of finding a chunk boundary, resulting in larger chunks and fewer data pages. Values outside [-3, 3] are not recommended; prefer the default value of 0 for most use cases.
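
One way to read the effect of norm_level is as arithmetic on the number of mask bits: if a cut requires `k` masked hash bits to be zero and the hash is roughly uniform, a cut happens with probability 2^-k per hashed byte, so removing one bit doubles that probability. The sketch below assumes exactly this model. The helper names and the base of 16 mask bits are hypothetical, and the way the crate actually derives its mask is not shown in this page.

```rust
/// Hypothetical helper: number of mask bits after applying `norm_level`
/// to a base bit count. A positive level removes bits, a negative one adds them.
fn effective_mask_bits(base_bits: u32, norm_level: i32) -> u32 {
    (base_bits as i64 - norm_level as i64).clamp(0, 64) as u32
}

/// Mask with the top `bits` bits set (0 bits gives an empty mask).
fn mask_with_top_bits(bits: u32) -> u64 {
    if bits == 0 {
        0
    } else {
        u64::MAX << (64 - bits)
    }
}

/// Probability that all `mask_bits` masked bits of a uniformly
/// distributed hash are zero.
fn match_probability(mask_bits: u32) -> f64 {
    0.5f64.powi(mask_bits as i32)
}

fn main() {
    // Assumed base of 16 mask bits, chosen only for illustration.
    let base_bits = 16;
    for norm_level in -3..=3 {
        let bits = effective_mask_bits(base_bits, norm_level);
        println!(
            "norm_level={:>2} mask={:#018x} bits={} p(cut per byte)={:e}",
            norm_level,
            mask_with_top_bits(bits),
            bits,
            match_probability(bits)
        );
    }
}
```

Under this model, each step of norm_level changes the per-byte cut probability by a factor of two, which is consistent with the recommendation to stay within a small range around 0.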

Trait Implementations§

impl Clone for CdcOptions

fn clone(&self) -> CdcOptions

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for CdcOptions

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for CdcOptions

fn default() -> Self

Returns the “default value” for a type.

impl Copy for CdcOptions
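
Because the struct has public fields and implements Default, Clone, and Copy, a common pattern is to start from the defaults and override individual fields with struct update syntax. The sketch below uses a local stand-in type that mirrors the documented fields and default values, so that it compiles without the crate; the stand-in and the helper `higher_dedup_options` are assumptions for illustration, and how the options are handed to the writer is not covered here.

```rust
/// Local stand-in mirroring the documented fields of `CdcOptions`.
#[derive(Debug, Clone, Copy)]
struct CdcOptions {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        // Defaults as documented: 256 KiB, 1024 KiB, and 0.
        Self {
            min_chunk_size: 256 * 1024,
            max_chunk_size: 1024 * 1024,
            norm_level: 0,
        }
    }
}

/// Hypothetical helper: defaults with a slightly higher normalization level.
fn higher_dedup_options() -> CdcOptions {
    CdcOptions {
        norm_level: 1,
        ..CdcOptions::default()
    }
}

fn main() {
    let defaults = CdcOptions::default();
    let tuned = higher_dedup_options();
    // `CdcOptions` is `Copy`, so `defaults` stays usable after being copied.
    let copy = defaults;
    println!("{:?}", defaults);
    println!("{:?}", copy);
    println!("{:?}", tuned);
}
```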

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬 This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,

impl<T> Ungil for T
where T: Send,