pub struct Sbbf(Vec<Block>);
A split block Bloom filter (SBBF).
An SBBF partitions its bit space into fixed-size 256-bit (32-byte) blocks, each fitting in a single CPU cache line. Each block contains eight 32-bit words, aligned with SIMD lanes for parallel bit manipulation. When checking membership, only one block is accessed per query, eliminating the cache-miss penalty of standard Bloom filters.
§Sizing and folding
Filters are initially sized for a maximum expected number of distinct values (NDV) via
Sbbf::new_with_ndv_fpp. After all values are inserted, the filter is compacted by
calling Sbbf::fold_to_target_fpp, which folds the filter down to the smallest size
that still meets the target false positive probability.
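The initial sizing can be sketched as follows. This is a standalone illustration, not the crate's code: the closed-form bit count comes from the split-block Bloom filter analysis (each check tests 8 bits, so the target FPP fixes the fill rate), and the constant values for BITSET_MIN_LENGTH and BITSET_MAX_LENGTH are assumptions based on the bounds mentioned below.

```rust
// Assumed bounds; the real constants live inside the crate.
const BITSET_MIN_LENGTH: usize = 32; // one 256-bit block
const BITSET_MAX_LENGTH: usize = 128 * 1024 * 1024;

/// Estimate the bitset size in bytes for `ndv` distinct values at `fpp`,
/// rounded up to a power of two and clamped to the supported range.
fn optimal_num_of_bytes(ndv: u64, fpp: f64) -> usize {
    // An SBBF check tests 8 bits, so fpp ≈ fill^8; solving for the
    // required bits per value gives this closed form.
    let num_bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    let num_bytes = (num_bits / 8.0).ceil() as usize;
    num_bytes
        .next_power_of_two()
        .clamp(BITSET_MIN_LENGTH, BITSET_MAX_LENGTH)
}

fn main() {
    let bytes = optimal_num_of_bytes(1000, 0.01);
    assert!(bytes.is_power_of_two());
    assert!(bytes >= BITSET_MIN_LENGTH);
    println!("{bytes} bytes for ndv=1000, fpp=0.01");
}
```

Rounding to a power of two is what makes the folding described below work: halving a power-of-two block count always yields a whole number of blocks.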
Whether this structure is created is controlled by the crate::file::properties::BloomFilterProperties settings in crate::file::properties::WriterProperties; bloom filter writing is disabled by default.
§Tuple Fields
0: Vec<Block>
§Implementations
impl Sbbf
pub fn new_with_ndv_fpp(ndv: u64, fpp: f64) -> Result<Self, ParquetError>
Create a new Sbbf with given number of distinct values and false positive probability.
Will return an error if fpp is greater than or equal to 1.0 or less than 0.0.
pub fn new_with_num_of_bytes(num_bytes: usize) -> Self
Create a new Sbbf with the given number of bytes; the exact size is adjusted to the next power of two, bounded by BITSET_MIN_LENGTH and BITSET_MAX_LENGTH.
pub fn write<W: Write>(&self, writer: W) -> Result<(), ParquetError>
Write the bloom filter data (header, then bitset) to the output. This does not
flush the writer, to keep bulk writes of all blocks fast; the caller must
remember to flush the writer.
This method usually is used in conjunction with Self::from_bytes for serialization/deserialization.
pub fn write_bitset<W: Write>(&self, writer: W) -> Result<(), ParquetError>
Write the bitset in serialized form to the writer.
fn header(&self) -> BloomFilterHeader
Create and populate a BloomFilterHeader from this bitset, for writing in serialized form.
pub fn read_from_column_chunk<R: ChunkReader>(
    column_metadata: &ColumnChunkMetaData,
    reader: &R,
) -> Result<Option<Self>, ParquetError>
Read a bloom filter from the reader at the offset recorded in the column chunk metadata; returns Ok(None) if the column has no bloom filter.
fn hash_to_block_index(&self, hash: u64) -> usize
Map a 64-bit hash to a block index in [0, num_blocks).

Uses the “multiply-and-shift” trick (a fast alternative to modulo):

    upper32 = hash >> 32          // take the top 32 bits of the hash
    index   = (upper32 * N) >> 32 // ∈ [0, N) where N = num_blocks

Why this matters for folding (Lemma 1): when N is a power of two and you halve
it to N/2, the index also halves:

    index_N   = (upper32 * N) >> 32
    index_N/2 = (upper32 * N/2) >> 32 = index_N / 2   (integer division)

So the block that held hash h in the big filter is at index / 2 in the
half-sized filter, which is exactly where fold ORs it.
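The multiply-and-shift selection and the halving property can be checked with a small standalone sketch (this is an illustration of the technique, not the crate's internal method):

```rust
/// Multiply-and-shift block selection: maps a 64-bit hash into
/// [0, num_blocks) without a modulo, using only the top 32 hash bits.
fn hash_to_block_index(hash: u64, num_blocks: usize) -> usize {
    let upper32 = hash >> 32;
    ((upper32 * num_blocks as u64) >> 32) as usize
}

fn main() {
    let n = 64; // power-of-two block count
    for hash in [0x0123_4567_89ab_cdef_u64, u64::MAX, 0x8000_0000_0000_0000] {
        let big = hash_to_block_index(hash, n);
        let small = hash_to_block_index(hash, n / 2);
        // Lemma 1: halving the block count halves the index
        // (integer division), so folding preserves membership.
        assert_eq!(small, big / 2);
    }
    println!("index halving holds");
}
```

The equality holds because (u * N/2) >> 32 = floor(u * N / 2^33), which is exactly floor(((u * N) >> 32) / 2).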
pub fn insert<T: AsBytes + ?Sized>(&mut self, value: &T)
Insert an AsBytes value into the filter
fn insert_hash(&mut self, hash: u64)
Insert a hash into the filter
pub fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool
Check if an AsBytes value is probably present or definitely absent in the filter
fn check_hash(&self, hash: u64) -> bool
Check if a hash is in the filter. May return true for a hash that was never inserted (a “false positive”), but always returns true for a hash that was inserted (no false negatives).
pub(crate) fn estimated_memory_size(&self) -> usize
Return the total in-memory size of this bloom filter in bytes.
pub fn num_blocks(&self) -> usize
Returns the number of blocks in this bloom filter.
pub fn fold_to_target_fpp(&mut self, target_fpp: f64)
Fold the bloom filter down to the smallest size that still meets the target FPP (false positive probability).
Folds the filter by merging groups of adjacent blocks via bitwise OR, where each
fold level halves the number of blocks. The fold count is chosen as the maximum
number of folds whose estimated FPP stays within target_fpp. The filter stops
at a minimum size of 1 block (32 bytes).
§How it works
SBBFs use multiplicative hashing for block selection:

    block_index = ((hash >> 32) * num_blocks) >> 32

A single fold halves the block count: when num_blocks is halved, the new index
becomes floor(original_index / 2), so blocks 2i and 2i+1 map to the same
position. More generally, k folds reduce the block count by 2^k, merging
groups of 2^k adjacent blocks in a single pass:

    folded[i] = blocks[i*2^k] | blocks[i*2^k + 1] | ... | blocks[i*2^k + 2^k - 1]

This differs from standard Bloom filter folding, which merges the two halves
(B[i] | B[i + m/2]), because standard filters use modular hashing, where
h(x) mod (m/2) maps indices i and i + m/2 to the same position.
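The merge step above can be sketched as a standalone function. Here `Block` is modeled as `[u32; 8]` (eight 32-bit words per 256-bit block); the real block type and in-place implementation live inside the crate.

```rust
/// Illustrative 256-bit block: eight 32-bit words.
type Block = [u32; 8];

/// One-pass fold: merge groups of 2^num_folds adjacent blocks with
/// bitwise OR, producing len / 2^num_folds output blocks.
fn fold_n(blocks: &mut Vec<Block>, num_folds: u32) {
    let group = 1usize << num_folds;
    assert!(num_folds > 0 && blocks.len() % group == 0);
    let new_len = blocks.len() / group;
    for i in 0..new_len {
        let mut merged = [0u32; 8];
        for b in &blocks[i * group..(i + 1) * group] {
            for (m, w) in merged.iter_mut().zip(b) {
                *m |= w; // OR keeps every set bit: no false negatives
            }
        }
        blocks[i] = merged; // write index never passes the read cursor
    }
    blocks.truncate(new_len); // reuse the original allocation
}

fn main() {
    // Four blocks with one distinct bit each; one fold (k=1) merges pairs.
    let mut blocks: Vec<Block> = vec![[1, 0, 0, 0, 0, 0, 0, 0]; 4];
    blocks[1][0] = 2;
    blocks[2][0] = 4;
    blocks[3][0] = 8;
    fold_n(&mut blocks, 1);
    assert_eq!(blocks.len(), 2);
    assert_eq!(blocks[0][0], 1 | 2);
    assert_eq!(blocks[1][0], 4 | 8);
}
```

Because output index i is always at or before input index i*2^k, the merge can safely overwrite the vector front to back in a single pass.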
§Correctness
Folding never introduces false negatives. Every bit that was set in the original filter remains set in the folded filter (via bitwise OR). The only effect is a controlled increase in FPP as set bits from different blocks are merged together. This was originally proven in Sailhan & Stehr 2012 for standard Bloom filters and is empirically demonstrated for SBBFs in Lemma 1 and Lemma 2 of the tests.
fn num_folds_for_target_fpp(&self, target_fpp: f64) -> u32
Determine how many folds can be applied without exceeding target_fpp.
Computes the average per-block fill rate in a single pass (no allocation), then analytically estimates the FPP at each fold level.

When two blocks with independent fill rate f are OR’d, the expected fill of the merged block is 1 - (1-f)^2. After k folds (merging 2^k blocks):

    f_k = 1 - (1 - f)^(2^k)

An SBBF membership check tests eight bits within one 256-bit block (one per word), so the estimated FPP at fold level k is f_k^8.
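The estimate can be expressed directly. This is a sketch of the formula above with assumed names, not the crate's private method:

```rust
/// Estimate the FPP at fold level `num_folds`, given the average
/// per-block fill rate `fill`: f_k = 1 - (1 - f)^(2^k), fpp_k = f_k^8
/// (a membership check tests eight bits in one block).
fn estimated_fpp_after_folds(fill: f64, num_folds: u32) -> f64 {
    let f_k = 1.0 - (1.0 - fill).powi(1 << num_folds);
    f_k.powi(8)
}

fn main() {
    let f = 0.05; // 5% of bits set per block on average
    // At fold level 0 the estimate reduces to fill^8.
    assert!((estimated_fpp_after_folds(f, 0) - f.powi(8)).abs() < 1e-12);
    // Each fold trades size for a strictly higher estimated FPP.
    assert!(estimated_fpp_after_folds(f, 1) > estimated_fpp_after_folds(f, 0));
    println!("fpp after one fold: {:e}", estimated_fpp_after_folds(f, 1));
}
```

The fold count chosen is then simply the largest k for which this estimate stays at or below target_fpp.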
fn fold_n(&mut self, num_folds: u32)
Fold the filter num_folds times in a single pass.
Merges groups of 2^num_folds adjacent blocks via bitwise OR, producing
len / 2^num_folds output blocks. The original allocation is reused.
§Panics
Panics if num_folds is 0 or would reduce the filter below 1 block.
pub fn from_bytes(bytes: &[u8]) -> Result<Self, ParquetError>
Reads an Sbbf from Thrift-encoded bytes.
§Examples
// In a real application, you would read serialized bloom filter bytes from a cache.
// This example demonstrates the deserialization process.
// Assuming you have bloom filter bytes from a Parquet file:
let bloom_filter = Sbbf::from_bytes(&serialized_bytes)?;
// Now you can use the bloom filter to check for values
if bloom_filter.check(&"some_value") {
    println!("Value might be present (or false positive)");
} else {
    println!("Value is definitely not present");
}