Skip to main content

Module bloom_filter

Module bloom_filter 

Source
Expand description

Bloom filter implementation specific to Parquet, as described in the spec.

§Bloom Filter Size

Parquet uses the Split Block Bloom Filter (SBBF) as its bloom filter implementation. For each column upon which bloom filters are enabled, the offset and length of an SBBF is stored in the metadata for each row group in the parquet file. The size of each filter is initialized using a calculation based on the desired number of distinct values (NDV) and false positive probability (FPP). The FPP for a SBBF can be approximated as1:

f = (1 - e^(-k * n / m))^k

Where, f is the FPP, k the number of hash functions, n the NDV, and m the total number of bits in the bloom filter. This can be re-arranged to determine the total number of bits required to achieve a given FPP and NDV:

m = -k * n / ln(1 - f^(1/k))

SBBFs use eight hash functions to cleanly fit in SIMD lanes2, therefore k is set to 8. The SBBF will spread those m bits accross a set of b blocks that are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:

b = NP2(m/8) / 32

Where, NP2 denotes the next power of two, and m is divided by 8 to be represented as bytes.

Here is a table of calculated sizes for various FPP and NDV:

NDVFPPbSize (KB)
10,0000.12568
10,0000.0151216
10,0000.0011,02432
10,0000.00011,02432
100,0000.14,096128
100,0000.014,096128
100,0000.0018,192256
100,0000.000116,384512
100,0000.0000116,384512
1,000,0000.132,7681,024
1,000,0000.0165,5362,048
1,000,0000.00165,5362,048
1,000,0000.0001131,0724,096
1,000,0000.00001131,0724,096
1,000,0000.000001262,1448,192

§Structure: Filter → Blocks → Words → Bits

An SBBF is an array of blocks. Each block is 256 bits (32 bytes), divided into eight 32-bit words. A word is just a u32 — an array of 32 individual bits that can each be “set” (1) or “not set” (0).

  Sbbf (the whole filter)
  ┌──────────┬──────────┬──────────┬─── ─── ──┬──────────┐
  │ Block 0  │ Block 1  │ Block 2  │   ...    │ Block N-1│
  └──────────┴──────────┴──────────┴─── ─── ──┴──────────┘
       │
       ▼
  One Block = 256 bits = 8 words
  ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
  │ word 0 │ word 1 │ word 2 │ word 3 │ word 4 │ word 5 │ word 6 │ word 7 │
  │ (u32)  │ (u32)  │ (u32)  │ (u32)  │ (u32)  │ (u32)  │ (u32)  │ (u32)  │
  └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
       │
       ▼
  One Word = 32 individual bits
  ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
  │0│0│1│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│  ← bit 29 is set
  └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘

Inserting a value hashes it to a 64-bit number, then:

  1. The upper 32 bits pick which block (via Sbbf::hash_to_block_index).
  2. The lower 32 bits pick one bit position in each of the 8 words (via Block::mask). So each insert sets exactly 8 bits (one per word) in a single block.

Checking does the same two steps and returns true only if all 8 bits are already set — meaning the value was probably inserted (or is a false positive).

§Bloom Filter Folding

After inserting all values into a bloom filter it can be “folded” to minimize it’s size. See Sbbf::fold_to_target_fpp for details on the algorithm and its mathematical basis.

Structs§

Block 🔒
A single 256-bit block, the basic unit of the Split Block Bloom Filter.
BloomFilterHeader
Bloom filter header is stored at beginning of Bloom filter data of each column and followed by its bitset.
Sbbf
A split block Bloom filter (SBBF).

Constants§

BITSET_MAX_LENGTH
The maximum number of bytes for a bloom filter bitset.
BITSET_MIN_LENGTH
The minimum number of bytes for a bloom filter bitset.
SALT 🔒
Salt as defined in the spec.
SBBF_HEADER_SIZE_ESTIMATE 🔒
SEED 🔒

Functions§

chunk_read_bloom_filter_header_and_offset 🔒
given an initial offset, and a byte buffer, try to read out a bloom filter header and return both the header and the offset after it (for bitset).
hash_as_bytes 🔒
num_of_bits_from_ndv_fpp 🔒
optimal_num_of_bytes 🔒
read_bloom_filter_header_and_length 🔒
given a [Bytes] buffer, try to read out a bloom filter header and return both the header and length of the header.
read_bloom_filter_header_and_length_from_bytes 🔒
Given a byte slice, try to read out a bloom filter header and return both the header and length of the header.