pub(crate) struct ContentDefinedChunker {
max_def_level: i16,
max_rep_level: i16,
repeated_ancestor_def_level: i16,
min_chunk_size: i64,
max_chunk_size: i64,
rolling_hash_mask: u64,
rolling_hash: u64,
has_matched: bool,
nth_run: usize,
chunk_size: i64,
}
CDC (Content-Defined Chunking) divides data into variable-sized chunks based on content rather than fixed-size boundaries.
For example, given this sequence of values in a column:
File1: [1,2,3, 4,5,6, 7,8,9]
        chunk1 chunk2 chunk3

If a value is inserted between 3 and 4:

File2: [1,2,3,0, 4,5,6, 7,8,9]
        new-chunk chunk2 chunk3

The chunking process adjusts to maintain stable boundaries across data modifications. Each chunk defines a new parquet data page, which is contiguously written to the file. Since each page is compressed independently, the files' contents look like:
File1: [Page1][Page2][Page3]...
File2: [Page4][Page2][Page3]...

When uploaded to a content-addressable storage (CAS) system, the CAS splits the byte stream into content-defined blobs with unique identifiers. Identical blobs are stored only once, so Page2 and Page3 are deduplicated across File1 and File2.
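The boundary stability illustrated above can be demonstrated with a toy byte-level chunker. This is a sketch only: the real chunker hashes (def_level, rep_level, value) triplets and uses multiple gear tables, and the gear table here is filled with an arbitrary deterministic sequence.

```rust
// Toy single-table gear-hash chunker (illustrative sketch only).
const GEAR: [u64; 256] = {
    let mut t = [0u64; 256];
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut i = 0;
    while i < 256 {
        // Simple deterministic pseudo-random fill (LCG step).
        x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        t[i] = x;
        i += 1;
    }
    t
};

/// Return the positions just after each chunk boundary in `data`.
fn chunk_boundaries(data: &[u8], mask: u64) -> Vec<usize> {
    let mut hash = 0u64;
    let mut out = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(GEAR[b as usize]);
        if hash & mask == 0 {
            out.push(i + 1); // cut after this byte
            hash = 0;        // start the next chunk fresh
        }
    }
    out
}
```

Because a gear hash depends only on the most recent bytes (each byte is shifted out of the 64-bit state after 64 steps), boundaries downstream of an edit realign with the original data, which is what lets Page2 and Page3 deduplicate.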
Implementation
Only the parquet writer needs to be aware of content-defined chunking; the reader is
unaffected. Each parquet column writer holds a ContentDefinedChunker instance when
enabled by the writer's properties. The chunker's state is maintained across the
entire column without being reset between pages and row groups.
This implements a FastCDC-inspired algorithm using gear hashing. The input data is
fed byte-by-byte into a rolling hash; when the hash matches a predefined mask, a new
chunk boundary candidate is recorded. To reduce the exponential variance of chunk
sizes inherent in a single gear hash, the algorithm requires 8 consecutive mask
matches — each against a different pre-computed gear hash table — before committing
to a boundary. This central-limit-theorem normalization makes the chunk size
distribution approximately normal between min_chunk_size and max_chunk_size.
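The 8-table normalization can be sketched as follows. The gear function here is a hypothetical stand-in (the real implementation uses pre-computed tables), but the control flow — advancing to the next table on each match and committing a boundary only on the 8th — follows the description above.

```rust
const NUM_GEARHASH_TABLES: usize = 8;

// Hypothetical gear function standing in for 8 pre-computed tables:
// each table hashes the same byte differently.
fn gear(table: usize, byte: u8) -> u64 {
    let x = (byte as u64 ^ (table as u64).wrapping_mul(0xA076_1D64_78BD_642F))
        .wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^ (x >> 29)
}

/// Count how many chunk boundaries the CLT-normalized scheme finds in `data`.
fn count_boundaries(data: &[u8], mask: u64) -> usize {
    let mut hash = 0u64;
    let mut nth_run = 0usize;
    let mut boundaries = 0usize;
    for &b in data {
        // Each run of matches uses the next table, so 8 independent
        // matches are required before a boundary is committed.
        hash = (hash << 1).wrapping_add(gear(nth_run, b));
        if hash & mask == 0 {
            nth_run += 1;
            if nth_run == NUM_GEARHASH_TABLES {
                boundaries += 1;
                nth_run = 0;
                hash = 0;
            }
        }
    }
    boundaries
}
```

With a mask of 0 every byte matches, so a boundary is committed exactly every 8 bytes — a degenerate case that makes the "8 matches per boundary" mechanic easy to see.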
The chunker receives the record-shredded column data (def_levels, rep_levels, values)
and iterates over the (def_level, rep_level, value) triplets while adjusting the
column-global rolling hash. Whenever the rolling hash matches, the chunker creates a
new chunk. For nested data (lists, maps, structs) chunk boundaries are restricted to
top-level record boundaries (rep_level == 0) so that a nested row is never split
across chunks.
Note that boundaries are deterministically calculated exclusively based on the data itself, so the same data always produces the same chunks given the same configuration.
Ported from the C++ implementation in apache/arrow#45360
(cpp/src/parquet/chunker_internal.cc).
Fields
max_def_level: i16
Maximum definition level for this column.

max_rep_level: i16
Maximum repetition level for this column.

repeated_ancestor_def_level: i16
Definition level at the nearest REPEATED ancestor.

min_chunk_size: i64
Minimum chunk size in bytes. The rolling hash will not be updated until this size is reached for each chunk. All data sent through the hash function counts towards the chunk size, including definition and repetition levels if present.

max_chunk_size: i64
Maximum chunk size in bytes.
A new chunk is created whenever the chunk size exceeds this value. The chunk size
distribution approximates a normal distribution between min_chunk_size and
max_chunk_size. Note that the parquet writer has a related data_pagesize
property that controls the maximum size of a parquet data page after encoding.
While setting data_pagesize smaller than max_chunk_size doesn’t affect
chunking effectiveness, it results in more small parquet data pages.
rolling_hash_mask: u64
Mask for matching against the rolling hash.

rolling_hash: u64
Rolling hash state, never reset; initialized once for the entire column.

has_matched: bool
Whether the rolling hash has matched the mask since the last chunk boundary.

nth_run: usize
Current run count for the central-limit-theorem normalization.

chunk_size: i64
Current chunk size in bytes.
Implementations
impl ContentDefinedChunker
pub fn new(desc: &ColumnDescriptor, options: &CdcOptions) -> Result<Self>
fn calculate_mask(
    min_chunk_size: i64,
    max_chunk_size: i64,
    norm_level: i32,
) -> Result<u64>
Calculate the mask used to determine chunk boundaries from the rolling hash.
The mask is calculated so that the expected chunk size distribution approximates a normal distribution between min and max chunk sizes.
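One plausible derivation is sketched below. This is an assumption, not the exact formula used here: it targets the midpoint of the min and max sizes, divides by the number of gear tables (since a boundary needs 8 matches), and sets that many low bits in the mask. The NUM_GEARHASH_TABLES constant and the error type are stand-ins.

```rust
const NUM_GEARHASH_TABLES: i64 = 8; // stand-in for the crate's constant

// Sketch of a mask derivation (assumption: not necessarily the exact formula).
fn calculate_mask(
    min_chunk_size: i64,
    max_chunk_size: i64,
    norm_level: i32,
) -> Result<u64, String> {
    if min_chunk_size < 0 || max_chunk_size <= min_chunk_size {
        return Err("invalid chunk size bounds".into());
    }
    // Aim at the midpoint; each of the 8 gear tables must match once, so the
    // per-match target size is the midpoint divided by the number of tables.
    let avg = (min_chunk_size + max_chunk_size) / 2;
    let per_match = (avg / NUM_GEARHASH_TABLES).max(1) as u64;
    // A match fires when (hash & mask) == 0, i.e. with probability 1/2^bits,
    // so set roughly log2(per_match) bits, adjusted by the normalization level.
    let bits = (per_match.ilog2() as i32 + norm_level).clamp(1, 63);
    Ok((1u64 << bits) - 1)
}
```

For example, with min = 256 KiB and max = 1 MiB the midpoint is 640 KiB, the per-match target is 80 KiB, and the mask gets 16 low bits set.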
fn roll(&mut self, bytes: &[u8])
Feed raw bytes into the rolling hash.
The byte count always accumulates toward chunk_size, but the actual hash
update is skipped until min_chunk_size has been reached. This “skip window”
is the FastCDC optimization that prevents boundaries from appearing too early
in a chunk.
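The skip window can be sketched with a minimal stand-in state (field names follow the struct above; the skip granularity is simplified to whole calls, and the gear value b + 1 is a stand-in for the real table lookup so the example is easy to trace):

```rust
// Minimal stand-in state to illustrate the skip window (sketch only).
struct Roller {
    min_chunk_size: i64,
    chunk_size: i64,
    rolling_hash: u64,
}

impl Roller {
    fn roll(&mut self, bytes: &[u8]) {
        // The byte count always accumulates toward the chunk size...
        self.chunk_size += bytes.len() as i64;
        // ...but the hash update is skipped until the minimum chunk size is
        // reached, preventing boundaries from appearing too early in a chunk.
        if self.chunk_size < self.min_chunk_size {
            return;
        }
        for &b in bytes {
            // Stand-in gear value (b + 1); the real chunker looks up a table.
            let g = b as u64 + 1;
            self.rolling_hash = (self.rolling_hash << 1).wrapping_add(g);
        }
    }
}
```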
fn roll_fixed<const N: usize>(&mut self, bytes: &[u8; N])
Feed exactly N bytes into the rolling hash (compile-time width).
Like roll, but the byte count is known at compile time,
allowing the compiler to unroll the inner loop.
fn roll_level(&mut self, level: i16)
Feed a definition or repetition level (i16) into the rolling hash.
fn need_new_chunk(&mut self) -> bool
Check whether a new chunk boundary should be created.
A boundary is created when either of two conditions holds:
- CLT normalization: the rolling hash has matched the mask (has_matched) and this is the 8th consecutive such match (nth_run reaches NUM_GEARHASH_TABLES). Each match advances to the next gear hash table, so 8 independent matches are required. A single hash table would yield exponentially distributed chunk sizes; requiring 8 independent matches approximates a normal (Gaussian) distribution by the central limit theorem.
- Hard size limit: chunk_size has reached max_chunk_size. This caps chunk size even if the CLT normalization sequence has not completed.
Note: when max_chunk_size forces a boundary, nth_run is not reset, so
the CLT sequence continues from where it left off in the next chunk. This
matches the C++ behavior.
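The two conditions, including the non-reset of nth_run on a forced cut, can be sketched with a simplified state (field names follow the struct above):

```rust
const NUM_GEARHASH_TABLES: usize = 8;

// Simplified boundary-decision state (sketch of the logic described above).
struct Chunker {
    max_chunk_size: i64,
    chunk_size: i64,
    has_matched: bool,
    nth_run: usize,
}

impl Chunker {
    fn need_new_chunk(&mut self) -> bool {
        if self.has_matched {
            self.has_matched = false;
            self.nth_run += 1;
            if self.nth_run >= NUM_GEARHASH_TABLES {
                // CLT normalization: the 8th consecutive match commits a boundary.
                self.nth_run = 0;
                self.chunk_size = 0;
                return true;
            }
        }
        if self.chunk_size >= self.max_chunk_size {
            // Hard size limit; nth_run is deliberately NOT reset here, so the
            // match sequence continues into the next chunk.
            self.chunk_size = 0;
            return true;
        }
        false
    }
}
```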
fn calculate<F>(
    &mut self,
    def_levels: Option<&[i16]>,
    rep_levels: Option<&[i16]>,
    num_levels: usize,
    roll_value: F,
) -> Vec<CdcChunk>
Compute chunk boundaries for the given column data.
The chunking state is maintained across the entire column without being reset between pages and row groups. This enables the chunking process to be continued between different write calls.
We go over the (def_level, rep_level, value) triplets one by one while
adjusting the column-global rolling hash based on the triplet. Whenever
the rolling hash matches a predefined mask it sets has_matched to true.
After each triplet need_new_chunk is called to
evaluate if we need to create a new chunk.
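The shape of this loop can be sketched as follows. The CdcChunk fields and the need_new_chunk closure are simplified stand-ins; the real method also threads def levels and a roll_value closure through the hasher state. The key invariant shown is that cuts only happen at top-level record boundaries (rep_level == 0), so a nested row is never split.

```rust
// Simplified stand-in for the chunk descriptor (hypothetical fields).
#[derive(Debug, PartialEq)]
struct CdcChunk { start: usize, len: usize }

/// Sketch of the chunking loop: cut only at top-level records (rep_level == 0)
/// and only where the hasher asks for a cut.
fn calculate(
    rep_levels: Option<&[i16]>,
    num_levels: usize,
    mut need_new_chunk: impl FnMut(usize) -> bool,
) -> Vec<CdcChunk> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for i in 0..num_levels {
        // A flat column (no rep levels) has rep_level == 0 everywhere.
        let rep = rep_levels.map_or(0, |r| r[i]);
        // In the real chunker the rolling hash is updated here from the
        // (def_level, rep_level, value) triplet before this check.
        if rep == 0 && i > start && need_new_chunk(i) {
            chunks.push(CdcChunk { start, len: i - start });
            start = i;
        }
    }
    if num_levels > start {
        chunks.push(CdcChunk { start, len: num_levels - start });
    }
    chunks
}
```

In the nested case below, even though the hasher asks for a cut at every position, boundaries only land where a new top-level record begins.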
pub(crate) fn get_arrow_chunks(
    &mut self,
    def_levels: Option<&[i16]>,
    rep_levels: Option<&[i16]>,
    array: &dyn Array,
) -> Result<Vec<CdcChunk>>
Compute CDC chunk boundaries by dispatching on the Arrow array’s data type to feed value bytes into the rolling hash.
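The dispatch can be pictured as matching on the value type and feeding each value's raw bytes through the hasher. The Value enum below is a hypothetical simplified stand-in, not the arrow-rs Array API:

```rust
// Hypothetical simplified stand-in for Arrow values (not the arrow-rs API).
enum Value {
    Int32(i32),
    Int64(i64),
    Float64(f64),
    Binary(Vec<u8>),
}

/// Feed one value's raw bytes into a rolling-hash closure, dispatching on type.
fn roll_value(value: &Value, mut roll: impl FnMut(&[u8])) {
    match value {
        // Fixed-width types contribute their little-endian byte encoding.
        Value::Int32(v) => roll(&v.to_le_bytes()),
        Value::Int64(v) => roll(&v.to_le_bytes()),
        Value::Float64(v) => roll(&v.to_le_bytes()),
        // Variable-width types contribute their bytes directly.
        Value::Binary(v) => roll(v),
    }
}
```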