Skip to main content

Sbbf

Struct Sbbf 

Source
pub struct Sbbf(Vec<Block>);
Expand description

A split block Bloom filter (SBBF).

An SBBF partitions its bit space into fixed-size 256-bit (32-byte) blocks, each fitting in a single CPU cache line. Each block contains eight 32-bit words, aligned with SIMD lanes for parallel bit manipulation. When checking membership, only one block is accessed per query, eliminating the cache-miss penalty of standard Bloom filters.

§Sizing and folding

Filters are initially sized for a maximum expected number of distinct values (NDV) via Sbbf::new_with_ndv_fpp. After all values are inserted, the filter is compacted by calling Sbbf::fold_to_target_fpp, which folds the filter down to the smallest size that still meets the target false positive probability.

The creation of this structure is based on the crate::file::properties::BloomFilterProperties struct set via crate::file::properties::WriterProperties and is thus hidden by default.

Tuple Fields§

§0: Vec<Block>

Implementations§

Source§

impl Sbbf

Source

pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self, ParquetError>

Create a new Sbbf with given number of distinct values and false positive probability. Will return an error if fpp is greater than or equal to 1.0 or less than 0.0.

Source

pub fn new_with_num_of_bytes(num_bytes: usize) -> Self

Create a new Sbbf with given number of bytes, the exact number of bytes will be adjusted to the next power of two bounded by BITSET_MIN_LENGTH and BITSET_MAX_LENGTH.

Source

pub fn new(bitset: &[u8]) -> Self

Creates a new Sbbf from a raw byte slice.

Source

pub fn write<W: Write>(&self, writer: W) -> Result<(), ParquetError>

Write the bloom filter data (header and then bitset) to the output. This doesn’t flush the writer in order to boost performance of bulk writing all blocks. Caller must remember to flush the writer. This method usually is used in conjunction with Self::from_bytes for serialization/deserialization.

Source

pub fn write_bitset<W: Write>(&self, writer: W) -> Result<(), ParquetError>

Write the bitset in serialized form to the writer.

Source

fn header(&self) -> BloomFilterHeader

Create and populate BloomFilterHeader from this bitset for writing to serialized form

Source

pub fn read_from_column_chunk<R: ChunkReader>( column_metadata: &ColumnChunkMetaData, reader: &R, ) -> Result<Option<Self>, ParquetError>

Read a new bloom filter from the given offset in the given reader.

Source

fn hash_to_block_index(&self, hash: u64) -> usize

Map a 64-bit hash to a block index in [0, num_blocks).

Uses the “multiply-and-shift” trick (a fast alternative to modulo):

  upper32 = hash >> 32           // take the top 32 bits of the hash
  index   = (upper32 * N) >> 32  // ∈ [0, N)  where N = num_blocks

Why this matters for folding (Lemma 1): when N is a power of two and you halve it to N/2, the index also halves:

  index_N   = (upper32 * N)   >> 32
  index_N/2 = (upper32 * N/2) >> 32 = index_N / 2  (integer division)

So the block that held hash h in the big filter is at index / 2 in the half-sized filter — exactly where fold ORs it.

Source

pub fn insert<T: AsBytes + ?Sized>(&mut self, value: &T)

Insert an AsBytes value into the filter

Source

fn insert_hash(&mut self, hash: u64)

Insert a hash into the filter

Source

pub fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool

Check if an AsBytes value is probably present or definitely absent in the filter

Source

fn check_hash(&self, hash: u64) -> bool

Check if a hash is in the filter. May return true for values that was never inserted (“false positive”) but will always return false if a hash has not been inserted.

Source

pub(crate) fn estimated_memory_size(&self) -> usize

Return the total in memory size of this bloom filter in bytes

Source

pub fn num_blocks(&self) -> usize

Returns the number of blocks in this bloom filter.

Source

pub fn fold_to_target_fpp(&mut self, target_fpp: f64)

Fold the bloom filter down to the smallest size that still meets the target FPP (False Positive Percentage).

Folds the filter by merging groups of adjacent blocks via bitwise OR, where each fold level halves the number of blocks. The fold count is chosen as the maximum number of folds whose estimated FPP stays within target_fpp. The filter stops at a minimum size of 1 block (32 bytes).

§How it works

SBBFs use multiplicative hashing for block selection:

block_index = ((hash >> 32) * num_blocks) >> 32

A single fold halves the block count: when num_blocks is halved, the new index becomes floor(original_index / 2), so blocks 2i and 2i+1 map to the same position. More generally, k folds reduce the block count by 2^k, merging groups of 2^k adjacent blocks in a single pass:

folded[i] = blocks[i*2^k] | blocks[i*2^k + 1] | ... | blocks[i*2^k + 2^k - 1]

This differs from standard Bloom filter folding, which merges the two halves (B[i] | B[i + m/2]) because standard filters use modular hashing where h(x) mod (m/2) maps indices i and i + m/2 to the same position.

§Correctness

Folding never introduces false negatives. Every bit that was set in the original filter remains set in the folded filter (via bitwise OR). The only effect is a controlled increase in FPP as set bits from different blocks are merged together. This is was originally proven in Sailhan & Stehr 2012 for standard bloom filters and is empirically demonstrated for SBBFs in Lemma 1 and Lemma 2 of the tests.

§References
Source

fn num_folds_for_target_fpp(&self, target_fpp: f64) -> u32

Determine how many folds can be applied without exceeding target_fpp.

Computes the average per-block fill rate in a single pass (no allocation), then analytically estimates the FPP at each fold level.

When two blocks with independent fill rate f are OR’d, the expected fill of the merged block is 1 - (1-f)^2. After k folds (merging 2^k blocks):

f_k = 1 - (1 - f)^(2^k)

SBBF membership checks perform k=8 bit checks within one 256-bit block, so the estimated FPP at fold level k is f_k^8.

Source

fn fold_n(&mut self, num_folds: u32)

Fold the filter num_folds times in a single pass.

Merges groups of 2^num_folds adjacent blocks via bitwise OR, producing len / 2^num_folds output blocks. The original allocation is reused.

§Panics

Panics if num_folds is 0 or would reduce the filter below 1 block.

Source

pub fn from_bytes(bytes: &[u8]) -> Result<Self, ParquetError>

Reads a Sbff from Thrift encoded bytes

§Examples
// In a real application, you would read serialized bloom filter bytes from a cache.
// This example demonstrates the deserialization process.
// Assuming you have bloom filter bytes from a Parquet file:
let bloom_filter = Sbbf::from_bytes(&serialized_bytes)?;
// Now you can use the bloom filter to check for values
if bloom_filter.check(&"some_value") {
    println!("Value might be present (or false positive)");
} else {
    println!("Value is definitely not present");
}

Trait Implementations§

Source§

impl Clone for Sbbf

Source§

fn clone(&self) -> Sbbf

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for Sbbf

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl Freeze for Sbbf

§

impl RefUnwindSafe for Sbbf

§

impl Send for Sbbf

§

impl Sync for Sbbf

§

impl Unpin for Sbbf

§

impl UnsafeUnpin for Sbbf

§

impl UnwindSafe for Sbbf

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

§

fn vzip(self) -> V

§

impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,

§

impl<T> Ungil for T
where T: Send,