Expand description
Bloom filter implementation specific to Parquet, as described in the spec.
§Bloom Filter Size
Parquet uses the Split Block Bloom Filter (SBBF) as its bloom filter implementation. For each column upon which bloom filters are enabled, the offset and length of an SBBF is stored in the metadata for each row group in the parquet file. The size of each filter is initialized using a calculation based on the desired number of distinct values (NDV) and false positive probability (FPP). The FPP for a SBBF can be approximated as1:
f = (1 - e^(-k * n / m))^kWhere, f is the FPP, k the number of hash functions, n the NDV, and m the total number
of bits in the bloom filter. This can be re-arranged to determine the total number of bits
required to achieve a given FPP and NDV:
m = -k * n / ln(1 - f^(1/k))SBBFs use eight hash functions to cleanly fit in SIMD lanes2, therefore
k is set to 8. The SBBF will spread those m bits accross a set of b blocks that
are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:
b = NP2(m/8) / 32Where, NP2 denotes the next power of two, and m is divided by 8 to be represented as bytes.
Here is a table of calculated sizes for various FPP and NDV:
| NDV | FPP | b | Size (KB) |
|---|---|---|---|
| 10,000 | 0.1 | 256 | 8 |
| 10,000 | 0.01 | 512 | 16 |
| 10,000 | 0.001 | 1,024 | 32 |
| 10,000 | 0.0001 | 1,024 | 32 |
| 100,000 | 0.1 | 4,096 | 128 |
| 100,000 | 0.01 | 4,096 | 128 |
| 100,000 | 0.001 | 8,192 | 256 |
| 100,000 | 0.0001 | 16,384 | 512 |
| 100,000 | 0.00001 | 16,384 | 512 |
| 1,000,000 | 0.1 | 32,768 | 1,024 |
| 1,000,000 | 0.01 | 65,536 | 2,048 |
| 1,000,000 | 0.001 | 65,536 | 2,048 |
| 1,000,000 | 0.0001 | 131,072 | 4,096 |
| 1,000,000 | 0.00001 | 131,072 | 4,096 |
| 1,000,000 | 0.000001 | 262,144 | 8,192 |
§Structure: Filter → Blocks → Words → Bits
An SBBF is an array of blocks. Each block is 256 bits (32 bytes),
divided into eight 32-bit words. A word is just a u32 — an array of
32 individual bits that can each be “set” (1) or “not set” (0).
Sbbf (the whole filter)
┌──────────┬──────────┬──────────┬─── ─── ──┬──────────┐
│ Block 0 │ Block 1 │ Block 2 │ ... │ Block N-1│
└──────────┴──────────┴──────────┴─── ─── ──┴──────────┘
│
▼
One Block = 256 bits = 8 words
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ word 0 │ word 1 │ word 2 │ word 3 │ word 4 │ word 5 │ word 6 │ word 7 │
│ (u32) │ (u32) │ (u32) │ (u32) │ (u32) │ (u32) │ (u32) │ (u32) │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
│
▼
One Word = 32 individual bits
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│0│0│1│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│0│ ← bit 29 is set
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘Inserting a value hashes it to a 64-bit number, then:
- The upper 32 bits pick which block (via
Sbbf::hash_to_block_index). - The lower 32 bits pick one bit position in each of the 8 words (via
Block::mask). So each insert sets exactly 8 bits (one per word) in a single block.
Checking does the same two steps and returns true only if all 8 bits
are already set — meaning the value was probably inserted (or is a false
positive).
§Bloom Filter Folding
After inserting all values into a bloom filter it can be “folded” to minimize it’s size.
See Sbbf::fold_to_target_fpp for details on the algorithm and its mathematical basis.
Structs§
- Block 🔒
- A single 256-bit block, the basic unit of the Split Block Bloom Filter.
- Bloom
Filter Header - Bloom filter header is stored at beginning of Bloom filter data of each column and followed by its bitset.
- Sbbf
- A split block Bloom filter (SBBF).
Constants§
- BITSET_
MAX_ LENGTH - The maximum number of bytes for a bloom filter bitset.
- BITSET_
MIN_ LENGTH - The minimum number of bytes for a bloom filter bitset.
- SALT 🔒
- Salt as defined in the spec.
- SBBF_
HEADER_ 🔒SIZE_ ESTIMATE - SEED 🔒
Functions§
- chunk_
read_ 🔒bloom_ filter_ header_ and_ offset - given an initial offset, and a byte buffer, try to read out a bloom filter header and return both the header and the offset after it (for bitset).
- hash_
as_ 🔒bytes - num_
of_ 🔒bits_ from_ ndv_ fpp - optimal_
num_ 🔒of_ bytes - read_
bloom_ 🔒filter_ header_ and_ length - given a [Bytes] buffer, try to read out a bloom filter header and return both the header and length of the header.
- read_
bloom_ 🔒filter_ header_ and_ length_ from_ bytes - Given a byte slice, try to read out a bloom filter header and return both the header and length of the header.