pub struct BloomFilterProperties {
    pub fpp: f64,
    pub ndv: u64,
}
Controls the bloom filter to be computed by the writer.
The bloom filter is initially sized for ndv distinct values at the given fpp, then
automatically folded down after all values are inserted to achieve optimal size while
maintaining the target fpp. See Sbbf::fold_to_target_fpp for details on the
folding algorithm.
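The idea behind folding can be sketched with a toy filter. This is a minimal illustration, not the crate's implementation: `ToyFilter`, `position`, and `fold_once` are hypothetical names, the filter sets a single bit per value, and the rule `Sbbf::fold_to_target_fpp` uses to decide how many times to fold is not modelled here. The sketch only shows why halving a power-of-two-sized filter by OR-ing its two halves never loses an inserted value.

```rust
/// A toy single-hash Bloom filter over 64-bit words.
/// Hypothetical type for illustration only; it is not the crate's `Sbbf`.
struct ToyFilter {
    /// The number of words is always a power of two.
    words: Vec<u64>,
}

impl ToyFilter {
    fn new(num_words: usize) -> Self {
        assert!(num_words.is_power_of_two());
        ToyFilter { words: vec![0; num_words] }
    }

    /// Upper hash bits select the word, the low 6 bits select the bit inside it.
    fn position(&self, hash: u64) -> (usize, u64) {
        let word = (hash >> 6) as usize & (self.words.len() - 1);
        (word, 1u64 << (hash & 63))
    }

    fn insert(&mut self, hash: u64) {
        let (word, mask) = self.position(hash);
        self.words[word] |= mask;
    }

    fn contains(&self, hash: u64) -> bool {
        let (word, mask) = self.position(hash);
        self.words[word] & mask != 0
    }

    /// Halve the filter by OR-ing the upper half into the lower half.
    /// A value that mapped to word `i + half` now maps to word `i`,
    /// which already contains its bit, so no false negatives appear.
    fn fold_once(&mut self) {
        assert!(self.words.len() >= 2, "cannot fold a single-word filter");
        let half = self.words.len() / 2;
        for i in 0..half {
            self.words[i] |= self.words[i + half];
        }
        self.words.truncate(half);
    }
}
```

Each fold halves the size and raises the fraction of set bits, so the false positive rate grows. A folding scheme therefore has to stop before the estimated rate exceeds the target fpp.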
Fields
fpp: f64
False positive probability. This should always be between 0 and 1, exclusive. Defaults to DEFAULT_BLOOM_FILTER_FPP.
You should set this value by calling WriterPropertiesBuilder::set_bloom_filter_fpp.
The bloom filter data structure is a trade-off between disk and memory space on one side and fpp on the other: the smaller the fpp, the more memory and disk space is required. Setting it to a reasonable value such as 0.1, 0.05, or 0.001 is therefore recommended.
This value also serves as the target FPP for bloom filter folding: after all values are inserted, the filter is folded down to the smallest size that still meets this FPP.
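To get a feel for the trade-off, the initial size can be estimated with the classic Bloom filter approximation, fixing the number of bits set per value at 8 as in a split block filter. This is a sketch under that assumption: `estimated_bits` and `estimated_kib` are hypothetical helpers, and the crate's actual sizing may round or clamp the result differently.

```rust
/// Approximate number of bits needed to hold `ndv` distinct values at false
/// positive probability `fpp`, assuming 8 bits are set per inserted value.
/// Derived from fpp = (1 - e^(-8 * ndv / bits))^8, solved for `bits`.
/// Hypothetical helper for illustration; not part of the crate's API.
fn estimated_bits(ndv: u64, fpp: f64) -> f64 {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be strictly between 0 and 1");
    -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln()
}

/// The same estimate expressed in KiB.
fn estimated_kib(ndv: u64, fpp: f64) -> f64 {
    estimated_bits(ndv, fpp) / 8.0 / 1024.0
}
```

Under this approximation, one million distinct values at an fpp of 0.01 need roughly 1.2 MB before folding. The estimate grows linearly with ndv and increases as fpp shrinks.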
ndv: u64
Maximum expected number of distinct values. Defaults to DEFAULT_BLOOM_FILTER_NDV.
You should set this value by calling WriterPropertiesBuilder::set_bloom_filter_ndv.
When not explicitly set via the builder, this defaults to max_row_group_row_count
(resolved at build time). The bloom filter is initially sized for this many distinct
values at the
given fpp, then folded down after insertion to achieve optimal size. A good heuristic
is to set this to the expected number of rows in the row group. If fewer distinct values
are actually written, the filter will be automatically compacted via folding.
Thus the only downside of overestimating this value is that the bloom filter uses more memory during writing than necessary; the final bloom filter size on disk is not affected.
If you wish to reduce memory usage during writing and are able to make a reasonable estimate
of the number of distinct values in a row group, it is recommended to set this value explicitly
rather than relying on the default dynamic sizing based on max_row_group_row_count.
If you do set this value explicitly, it is probably best to set it for each column
individually via WriterPropertiesBuilder::set_column_bloom_filter_ndv rather than globally,
since different columns may have different numbers of distinct values.
Trait Implementations
impl Clone for BloomFilterProperties
fn clone(&self) -> BloomFilterProperties
fn clone_from(&mut self, source: &Self)