Overview

A unified C++ framework for probabilistic data structures based on the fundamental distinction between latent (true but unobservable) values and their observed (noisy but measurable) approximations. By making approximation a first-class concept at the type level, Bernoulli types enable formal reasoning about error propagation in probabilistic algorithms.

Key Concepts

Latent vs. Observed

The framework distinguishes:

  • Latent values (S, f, x): True mathematical objects we care about
  • Observed values (~S, ~f, ~x): Noisy approximations we compute with
// Latent set S - exists mathematically
Set<int> S = {1, 2, 3, 4, 5};

// Observed set ~S - Bloom filter approximation
bloom_filter<int> obs_S(1000, 3);
obs_S.insert(1);
obs_S.insert(2);

// Query may have false positives
if (obs_S.contains(6)) {  // Might be true even though 6 ∉ S
    // Handle approximate result
}

Examples in Practice

  • Bloom filters: Observe latent sets through membership queries
  • Count-Min sketches: Observe latent frequency distributions
  • MinHash: Observe latent set similarities
  • Miller-Rabin: Observe latent primality property

Mathematical Foundation

Type System

  • Type erasure: Store different implementations uniformly
  • Expression templates: Lazy evaluation of set operations
  • Error propagation: Track uncertainty through computations
  • Interval arithmetic: Handle uncertain error rates

Operations

Set-theoretic operations preserve Bernoulli properties:

  • Union: ~S1 ∪ ~S2 with error rate composition
  • Intersection: ~S1 ∩ ~S2 with compounded approximation
  • Difference: ~S1 \ ~S2 with modified error characteristics

Features

Data Structures

  • Bloom filters - Space-efficient set membership
  • Count-Min sketch - Frequency estimation
  • MinHash - Set similarity
  • Cuckoo filters - Deletable Bloom filters

Algorithms

  • Space-optimal constructions - Achieve information-theoretic bounds
  • Adaptive error rates - Adjust parameters dynamically
  • Majority voting - Improve accuracy with multiple observations

Example

#include <bernoulli/observed_bool.hpp>
#include <bernoulli/bloom_filter.hpp>

// Observed boolean with 10% error rate
observed_bool obs_true(true, 0.1);

// Bloom filter as observed set
bloom_filter<std::string> filter(10000, 4);
filter.insert("hello");
filter.insert("world");

// Membership test with explicit error
auto fpr = filter.false_positive_rate();
std::cout << "FPR: [" << fpr.min << ", " << fpr.max << "]" << std::endl;

Research Papers

The repository includes LaTeX source for seven research papers:

  1. Foundations of Bernoulli Types - Core framework and mathematical foundations
  2. Bernoulli Sets and Boolean Algebra - Set-theoretic operations and error propagation
  3. Universal Bernoulli Maps - Function approximation and composition
  4. Regular Types and Bernoulli Types - Type algebra and limitations
  5. Applications to Search and Retrieval - Information retrieval applications
  6. Entropy Maps Implementation - Space-optimal implementations
  7. Statistical Analysis - Distributions and confidence intervals

Applications

  • Privacy-preserving data structures (oblivious types)
  • Space-efficient approximate algorithms
  • Distributed systems with probabilistic guarantees
  • Stream processing with bounded memory
  • Encrypted search with controlled leakage

Resources