Expand description
Bloom filter implementation specific to Parquet, as described in the spec.
Β§Bloom Filter Size
Parquet uses the Split Block Bloom Filter (SBBF) as its bloom filter implementation. For each column upon which bloom filters are enabled, the offset and length of an SBBF is stored in the metadata for each row group in the parquet file. The size of each filter is initialized using a calculation based on the desired number of distinct values (NDV) and false positive probability (FPP). The FPP for a SBBF can be approximated as1:
f = (1 - e^(-k * n / m))^k
Where, f
is the FPP, k
the number of hash functions, n
the NDV, and m
the total number
of bits in the bloom filter. This can be re-arranged to determine the total number of bits
required to achieve a given FPP and NDV:
m = -k * n / ln(1 - f^(1/k))
SBBFs use eight hash functions to cleanly fit in SIMD lanes2, therefore
k
is set to 8. The SBBF will spread those m
bits accross a set of b
blocks that
are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:
b = NP2(m/8) / 32
Where, NP2
denotes the next power of two, and m
is divided by 8 to be represented as bytes.
Here is a table of calculated sizes for various FPP and NDV:
NDV | FPP | b | Size (KB) |
---|---|---|---|
10,000 | 0.1 | 256 | 8 |
10,000 | 0.01 | 512 | 16 |
10,000 | 0.001 | 1,024 | 32 |
10,000 | 0.0001 | 1,024 | 32 |
100,000 | 0.1 | 4,096 | 128 |
100,000 | 0.01 | 4,096 | 128 |
100,000 | 0.001 | 8,192 | 256 |
100,000 | 0.0001 | 16,384 | 512 |
100,000 | 0.00001 | 16,384 | 512 |
1,000,000 | 0.1 | 32,768 | 1,024 |
1,000,000 | 0.01 | 65,536 | 2,048 |
1,000,000 | 0.001 | 65,536 | 2,048 |
1,000,000 | 0.0001 | 131,072 | 4,096 |
1,000,000 | 0.00001 | 131,072 | 4,096 |
1,000,000 | 0.000001 | 262,144 | 8,192 |
Structs§
- Block πEach block is 256 bits, broken up into eight contiguous βwordsβ, each consisting of 32 bits. Each word is thought of as an array of bits; each bit is either βsetβ or βnot setβ.
- A split block Bloom filter.
Constants§
- BITSET_
MAX_ πLENGTH - BITSET_
MIN_ πLENGTH - SALT πSalt as defined in the spec.
- SEED π
Functions§
- given an initial offset, and a byte buffer, try to read out a bloom filter header and return both the header and the offset after it (for bitset).
- hash_
as_ πbytes - optimal_
num_ πof_ bytes - given a [Bytes] buffer, try to read out a bloom filter header and return both the header and length of the header.