The Bernoulli Model: A Probabilistic Framework for Data Structures and Types

Last updated on Jul 25, 2024 18 min read boolean, bernoulli, probabilistic-data-structure, bloom-filter

Motivation

The Bernoulli Model is a general framework for thinking about probabilistic data structures and types of a particular sort. A big reason for developing the Bernoulli Model formalism is so that we can use Bernoulli Models of data types to develop Oblivious Data Types. We will go into that in a separate document, but the basic idea is that Bernoulli approximations have a lot of desirable properties for developing oblivious data types, and the Bernoulli Model formalism allows us to reason about the correctness of the oblivious data types and to make them more space-efficient by trading accuracy for space while allowing for $O (1)$ time complexity.

The Bernoulli Model also provides a formalism for how to think about various probabilistic data structures, like the Bloom filter, Count-Min sketch, or my invention, the Bernoulli data type, which comprises an entire family of data structures that are all based on the Bernoulli Model, from sets (like the Bloom filter) to maps in a near-space optimal way, while allowing for more savings by trading accuracy for space in a controlled way.

Introduction: Bernoulli Boolean

The Boolean type, denoted by $bool$ , models the set of values given by ${true, false}$ , or more compactly ${0, 1}$ . This document entertains the replacement of $bool$ with a type $B_{bool}$ , which represents a sort of *noisy* Boolean. In general, we can have a Bernoulli type for any type $T$ , denoted by $B_{T}$ .

As special case, data structures like Bloom filters can be thought of as a Bernoulli type for the set indicator function, $1_{A} : X \mapsto 0, 1$ , but more on that later.

There are two types that have fewer values than the Boolean type, the absurd type, denoted by $void$ , which is the type with no values (and thus values of this type cannot be constructed) and the unit type, denoted by $()$ , which has only a single value, also denoted by $()$ . Since there is only a single value of the unit type, there is no uncertainty in the unit type.

As degenerate cases, the Bernoulli Model of the absurd is just the absurd type, $B_{void} \equiv void$ , and Bernoulli Model of the unit type is just the unit type, $B_{()} \equiv ()$ .

The Boolean type $bool$ has two possible values, $true$ and $false$ , and is thus the first type for which we can introduce uncertainty.

Every Bernoulli Model also has an order, an integer greater than 1, which essentially describes the number of independent ways in which the process that generated the Bernoulli approximation can produce errors. We denote that a Bernoulli Model of type $T$ has order $k$ with $B_{T}^{(k)}$ . For the absurd and unit types, the maximum order is $0$ , and in general $T \equiv B_{T}^{(0)}$ , i.e., if there are no ways to introduce errors, then that is equivalent to the type itself.

Unless it is useful, we drop the order information and simply write $B_{T}$ . In the Bernoulli Model, we may denote the latent value $x$ that is being approximated using the notation $B_{T} (x)$ and we say that it is latent if we do not know the value of $x$ and we are trying to approximate it using the Bernoulli Model approximation $B_{T} (x)$ . In this case, the latent value $x$ is unobservable and we can think of $B_{T} (x)$ as a noisy measurement of $x$ .

In the Bernoulli Boolean Model, $B_{bool} (x)$ is a random Bernoulli variable such that $Pr {B_{bool} (x) \neq x} = ε (x)$ for each $x \in {0, 1}$ , where $0 < ε (x) < 1$ is the probability of an error. In most practical situations, $ε (x)$ is known or its expectation is known, and it can be pre-specified to balance factors like space complexity and accuracy.

Binary Channels

Let’s begin by thinking about the Binary Symmetric Channel and the Binary Asymmetric Channel. The Bernoulli Boolean model can exhibit two distinct behaviors, represented as different “channels” through which Boolean values are transmitted:

Binary Symmetric Channel (First-order Bernoulli Model): The probability of an equality error is the same for $1$ and $0$ . We denote this by the type $B_{bool}^{(1)}$ .
Binary Asymmetric Channel (Second-order Bernoulli Model): The probability of an equality error differs for $1$ and $0$ . We denote this by the type $B_{bool}^{(2)}$ .

False Positives and Negatives

Errors in the Bernoulli Boolean model can be understood in terms of false negatives and false positives:

$B_{bool} (0) = 1$ is a *false negative*.
$B_{bool} (1) = 0$ is a *false positive*.

In the first-order model, the probability of a false negative equals the probability of a false positive. In the second-order model, these probabilities differ. In a specific but common version of the second-order Bernoulli Boolean model, false negatives occur with probability 0 and false positives occur with probability $0 < ε < 1$ .

Prediction

$B_{T} (x)$ is correlated with the latent value $x$ , and thus provides information about the latent value. For instance, it allows one to predict $x$ given $B_{T} (x)$ better than if no observations where given whatsoever, assuming we know the error rate $ε (x)$ is better than a random guess, e.g., in the case of $B_{bool}$ , $ε (x) < 0.5$ . If we have no prior information about the latent variable, the best (maximum likelihood) estimate of of its value is the observation $B_{T} (x)$ .

However, because $ε (x)$ is normally known a priori, we can estimate the probability that the latent variable is $X = x$ using Bayes' rule: $Pr {X = x ∣ B_{T} (x) = x} \propto Pr {B_{T} (x) = x ∣ X = x} Pr {X = x},$ where $Pr {X = x}$ is the prior probability that $X = x$ and $Pr {B_{T} (x) = x ∣ X = x}$ is the probability that $B_{T} (x) = x$ given that $X = x$ (the probability that the observation is correct). In the Bernoulli Boolean Model, for $X = 1$ this is the true positive rate $τ$ and for $X = 0$ this is the true negative rate $ν$ .

In the first-order model, if the probability of being correct $τ$ , then: $Pr {x ∣ B_{bool}^{(1)} (x)} = \frac{τ Pr {X = x}}{τ Pr {X = x} + (1 - τ) (1 - Pr {X = x})} .$

Dividing the numerator and denominator by $Pr {X = x}$ , we get: $Pr {x ∣ B_{bool}^{(1)} (x)} = \frac{τ}{τ + (1 - τ) (1 / Pr {X = x} - 1)} .$

Assuming maximum ignorance (maximum entropy) about $x$ (i.e., $Pr {X = 1} = 0.5$ ), the following expression is obtained for the first-order Bernoulli Boolean Model, the above equation simplifies to $Pr {x ∣ B_{bool}^{(1)} (x)} = τ$ .

Suppose we have $n$ noisy i.i.d. measurements of the latent variable $x$ , ${B_{1, bool}^{(1)} (x), \dots, B_{n, bool}^{(1)} (x)},$ in which case the maximum likelihood estimate of $x$ is the majority vote, i.e., the value that occurs most frequently in the observations. As the number of independent sources of observations goes to infinity, the majority vote converges in probability to $x$ .

This is not a typical use-case for the Bernoulli Boolean model, since it will mostly be a analytical result of probabilistic data structures that may be framed in the context of a Bernoulli Model.

Inducing Bernoulli Types

Here, we discuss how to generalize the results.

Unit Functions

We discussed the idea of the Bernoulli Model for value types like $bool$ . We can think of these Bernoulli Models as Bernoulli approximations of the set of unit types $() \mapsto X$ . A unit type has no input, and it maps to a single constant value in the output. There are $| X |$ functions of this type.

If we replace $() \mapsto X$ with $B_{X}$ or, equivalently, $B_{() \mapsto X}$ , we allow for the possibility of errors. The confusion matrix for $B_{() \mapsto bool}^{(2)}$ is provided in Table 1.

Table 1: Second-order Bernoulli Boolean Model for $() \mapsto bool$

$x / B_{bool}^{(2)}$	observe $1$	observe $0$
latent $1$	$τ = 1 - η$	$η$
latent $0$	$ε$	$ν = 1 - ε$

So, we might observe $B_{() \mapsto bool} (x) = 1$ and, according to the confusion matrix, the latent value $x$ is $1$ with probability $τ$ (true positive rate) and $0$ with probability $ε$ (false positive rate). Likewise, we might observe $B_{() \mapsto bool} (x) = 0$ and the latent value $x$ is $1$ with probability $η$ (false negative rate) and $0$ with probability $ν$ (true negative rate).

We see that the Bernoulli Model for $() \mapsto bool$ has a maximum order of 2, since $η$ and $ε$ are independent probabilities that fully describe the model. No additional free parameters are possible, since each row must sum to 1 (the total probability theorem), i.e., given a latent value $x$ , the probability of observing $1$ or $0$ must sum to $1$ since that is the only possible two outcomes in the Bernoulli Boolean Model.

We can think of this as the asymmetric binary channel, where the probability of an error is different for $1$ and $0$ .

A first-order Bernoulli Boolean model is a model where $ϵ = ε = η$ . See the confusion matrix in Table 2.

Table 2: First-Order Bernoulli Boolean Model for $() \mapsto bool$

$x / B_{bool}^{(1)}$	observe $1$	observe $0$
latent $1$	$τ = 1 - ϵ$	$η = ϵ$
latent $0$	$ε = ϵ$	$ν = 1 - ϵ$

We can see that there is only one free parameter, $ϵ$ , corresponding to the first-order Bernoulli Boolean Model. We can think of this as a binary symmetric channel, where the probability of an error is the same for $1$ and $0$ .

For completeness, we can write down the confusion matrix for the zeroth-order model, which is just the standard Boolean model:

Table 3: Zeroth-Order Bernoulli Boolean Model for $() \mapsto bool$

$x / B_{bool}^{(1)}$	observe $1$	observe $0$
latent $1$	1	0
latent $0$	0	1

We see that there are 0 free parameters, and the model is deterministic.

Prediction: Boolean Values

Earlier, we discussed how to predict latent values from Bernoulli Model observations. Let’s apply these insights to the first-order Bernoulli Model for $() \mapsto bool$ , which we can denote as $B_{bool}^{(1)}$ .

We can use Bayes' rule to compute the probability that the latent value is $X = 1$ given that we observed $B_{b o o l}^{(1)} = 1$ : $Pr {X = 1 | B_{b o o l}^{(1)} = 1} = \frac{Pr {B_{b o o l}^{(1)} = 1 | X = 1} Pr {X = 1}}{Pr {B_{b o o l}^{(1)} = 1}}$ We know $Pr {B_{b o o l}^{(1)} = 1 | X = 1}$ from the confusion matrix; it’s just the true positive rate, $τ = 1 - ϵ$ : $Pr {X = 1 | B_{b o o l}^{(1)} = 1} = \frac{(1 - ϵ) Pr {X = 1}}{Pr {B_{b o o l}^{(1)} = 1}}$ What is $Pr {B_{b o o l}^{(1)} = 1}$ ? By the total probability theorem: $\begin{array}{r} Pr {B_{b o o l}^{(1)} = 1} = Pr {B_{b o o l}^{(1)} = 1 | X = 1} Pr {X = 1} + Pr {B_{b o o l}^{(1)} = 1 | X = 0} Pr {X = 0} . \end{array}$ We can substitute in the confusion matrix values: $Pr {B_{b o o l}^{(1)} = 1} = (1 - ϵ) Pr {X = 1} + ϵ Pr {X = 0} .$

If we substitute this back into the expression for $Pr {X = 1 | B_{b o o l}^{(1)} = 1}$ , divide the numerator and denominator by $Pr {X = 1}$ , and then simplify, we get $Pr {X = 1 | B_{b o o l}^{(1)} = 1} = \frac{1 - ϵ}{1 - ϵ (1 - q / (1 - q))}$ where $q = Pr {X = 0}$ .

Let’s evaluate $q = Pr {X = 0}$ at some interesting points:

At $q = 0$ , $Pr {X = 1 | B_{b o o l}^{(1)} = 1} = 1$ . This makes sense. We know that the latent value $1$ occurs with probability $1 - q = 1$ , therefore whatever no matter what we observe, the latent value is $1$ .
As $q \to 1$ , $Pr {X = 1 | B_{b o o l}^{(1)} = 1} \to 0$ . This also makes sense; we know that the latent value $1$ occurs with probability $q = 1$ , so no matter what we observe, the latent value is $0$ .
At $q = 0.5$ , $Pr {X = 1 | B_{b o o l}^{(1)} = 1} = 1 - ϵ$ . This is the maximum entropy case, where we have no prior information about the latent value and assume maximum ignorance.

Unary Bernoulli Functions

In this section, we expand our focus to unary functions.

Lifing Unary Functions

If we have a function $f : b o o l \mapsto b o o l$ , then the space of all possible functions is given by Table 4.

Table 4: All possible functions of type $b o o l \mapsto b o o l$

$f$	$f (t r u e)$	$f (f a l s e)$
$i d$	$t r u e$	$f a l s e$
$n o t$	$f a l s e$	$t r u e$
$t r u e$	$t r u e$	$t r u e$
$f a l s e$	$f a l s e$	$f a l s e$

Suppose we replace the Boolean inputs with Bernoulli Boolean values and ask the question, “What is the probability that $f (B_{b o o l}^{(1)} (x)) = f (x)$ ?”

Notice that $f (B_{b o o l}^{(1)} (x))$ is $f (x)$ with some probability, but $f (x)$ may be latent depending on $f$ . For the constant fuctions, $true$ and $false$ , we get the same function, i.e., $true : B_{b o o l}^{(k)} \mapsto B_{b o o l}^{(0)} \equiv true : bool \mapsto bool$ since $true : bool \mapsto bool$ always outputs $true$ , and similiarly for $false : bool \mapsto bool$ .

However, the $id$ and $not$ functions are different. For instance, suppose $Pr {B_{b o o l}^{(1)} (x) = x} = p$ . Then, when we input $B_{b o o l}^{(1)} (true)$ into $id$ , we get the correct output $true$ with probability $p$ and the incorrect output $false$ with probability $1 - p$ . Likewise,when we input $B_{bool}^{(1)} (false)$ into $id$ , we get the correct output $false$ with probability $p$ and the incorrect output $true$ with probability $1 - p$ .

Since we can think of these outputs as either correct or incorrect with probability $p$ and $1 - p$ respectively, they are Bernoulli Boolean values, e.g., this $id$ function on Bernoulli Booleans is of type $B_{bool}^{(1)} \mapsto B_{bool}^{(1)}$ . We monadically lift $id : bool \mapsto bool$ to a function of type $B_{bool}^{(1)} \mapsto B_{bool}^{(1)}$ .

Notice that this is not a Bernoulli Model of $B_{bool \mapsto bool}$ , but rather a function of type $B_{bool} \mapsto B_{bool}$ . This may surprise the reader, but it is important to think about how these Bernoulli Models compose.

For instance, if we have the function $true : bool \mapsto bool$ , then when we provide it with a Bernoulli Boolean value, we know that we $true$ is the correct output – there are no latent values. However, a Bernoulli Model of the (function) $true : bool \mapsto bool$ is a completely different concept. The Bernoulli approximation of this funcction is of type $B_{bool \mapsto bool}$ , and the latent function, $true : bool \mapsto bool$ , is not observable, but $B_{bool \mapsto bool} (true)$ is observable and provides some information about the latent function.

Presumably, some process generated the Bernoulli approximation $B_{bool \mapsto bool} (true)$ , and we wish to use that approxmiate function as a replacement for the latent function we are actually interested in, which is $true : bool \mapsto bool$ .

There are only 4 possible functions of type $bool \mapsto bool$ , and in Table 5 we show the confusion matrix for the highest-order model, $B_{bool \mapsto bool}^{(12)}$ .

Table 5: Bernoulli Model for $bool \mapsto bool$

	observe $id$	observe $not$	observe $true$	observe $false$
latent $id$	$p_{11}$	$p_{12}$	$p_{13}$	$p_{14}$
latent $not$	$p_{21}$	$p_{22}$	$p_{23}$	$p_{24}$
latent $true$	$p_{31}$	$p_{32}$	$p_{33}$	$p_{34}$
latent $false$	$p_{41}$	$p_{42}$	$p_{43}$	$p_{44}$

Each row must sum to 1, $\sum_{j} p_{i j} = 1$ , so we only have up to a maximum of $4 (4 - 1) = 12$ degrees of freedom. This means the highest Bernoulli Boolean order is 12, but we normally drop the order information and just write $B_{bool \mapsto bool}$ (and propogate error rates using interval arithmetic).

Often, the order is either first-order or for various reasons. The first-order Bernoulli Model of $bool \mapsto bool$ is a model where every entry in the confusion matrix is a function of some single value. The maximum entropy confusion matrix, given error rates $ϵ$ , is given in Table 6.

Table 5: Bernoulli Model for $bool \mapsto bool$

	observe $id$	observe $not$	observe $true$	observe $false$
latent $id$	$1 - ϵ$	$ϵ / 3$	$ϵ / 3$	$ϵ / 3$
latent $not$	$ϵ / 3$	$1 - ϵ$	$ϵ / 3$	$ϵ / 3$
latent $true$	$ϵ / 3$	$ϵ / 3$	$1 - ϵ$	$ϵ / 3$
latent $false$	$ϵ / 3$	$ϵ / 3$	$ϵ / 3$	$1 - ϵ$

When we have a Bernoulli Model approximation of some latent function of type $bool \mapsto bool$ , we wish to store the error information in the output so that we can propagate error information through the computation.

We do this by saying that the output is a Bernoulli Boolean, because it may or may not be correct, i.e., the Bernoulli Model generates a function of type $bool \mapsto B_{bool}$ where the output is a Bernoulli Boolean value. In our algorithms, we created a type system for this, using interval arithmetic to propagate the error rates through the computation.

Notice that the Bernoulli Model on $bool \mapsto bool$ does not change the type of the input, $bool$ . We can, of course, also provide as input to this function a Bernoulli Booleans, in which case we will usually get an even higher-order Bernoulli Boolean as output.

Since functions are values, we can ask the question, what is the probability that the latent function is equal to the observed function: $Pr {B_{bool \mapsto bool} (f) = f} .$

In this case, we are asking about the equality of the functions, which is mathematically equivalent to asking whether each input in the domain maps to the same output as the latent function: $\begin{array}{r} Pr {B_{bool \mapsto bool}^{(1)} (id) (true) = id (true)} \times \\ Pr {B_{bool \mapsto bool}^{(1)} (id) (false) = id (false)} \end{array}$

From the confusion matrix, we know this product of probabilities is $1 - ϵ$ . In fact, normally, the process that generates the Bernoulli Model of the latent function is based on these kinds of probabilities on individual inputs.

Higher-Order Bernoulli Models

If we are trying to estimate the latent function, a higher order complicates the estimation problem (more parameters to estimate). However, a higher-order may be desirable in many cases, since it allows for more capacity to approximate the latent function. In general, when looking at the confusion matrix, we want the diagonal to be as close to 1 as possible. For the off-diagonal elements, we want functions that are more similiar to the latent function to have larger probabilities than functions that are less similiar to the latent function. This is just a way of minimizing a loss function.

Set-Indicator Functions

The set-indicator function is a function that maps an element of a set to a Boolean value. We can think of this as a function of type $X \mapsto bool$ , which returns true if the input is in the set and false otherwise.

Let us consider Bernoulli Models for set-indicator functions. The Bloom filter, for instance, is a Bernoulli model of the set-indicator function. It is a second-order model, since false negatives occur with probability 0 and false positives occur with probability $ε$ . This is actually the expectation of the false positive rate, and the true false positive rate is a random variable that cannot usually be computed unless $X$ is a finite set.

HashSet: Approximate Set-Indicator Functions

Suppose we have a cryptographic hash function $h : X \mapsto {0, 1}^{n}$ , where $n$ is the number of bits in the hash value. We define an approximate set-indicator function, denoted by $HashSet$ , in the following way:

We are given a set $A$ to (approximately) represent.
We find a seed $s$ such that $h (x s) = 0^{n}$ for all $x \in A$ .
We do not look at any $x \notin A$ , and by the properties of $h$ , for $x \notin A$ , $h (x s) = 0^{n}$ with probability $2^{- n}$ corresponding to a false positive, and otherwise $h (x s) \neq 0^{n}$ with probability $1 - 2^{- n}$ .
We define membership as $x \in A \equiv h (x s) = 0^{n}$ .

So, we construct a $HashSet$ by finding a seed $s$ such that $h (x s) = 0^{n}$ and that is the only value we need to store, but since the probability that all $x \in A$ hash to $0^{n}$ occurs with probability $2^{- n}$ , each trial is Bernoulli distributed with probability $\prod_{j = 1}^{| A |} 2^{- n} = 2^{- n | A |},$ which will take an expected number of trials $2^{n | A |}$ . We can store the seed as a string of $n | A |$ bits, or $n$ bits per element in the set $A$ . Since it has a false positive rate $ε = 2^{- n}$ , $n = - \log_{2} ε$ , and we can reparametrize the space complexity as $- \log_{2} ε$ bits per element. This is the information theoretic lower-bound, but we achieved it only by using an algorithm with exponential time complexity. We can do better by using different algorithms, but it comes at a cost to space complexity.

Bernoulli Model for Set-Indicator Functions

The number of functions of type $X \mapsto bool$ is $2^{| X |}$ . We can think of these as the set of all possible set-indicator functions. We can approximate this set of functions with a Bernoulli Model, which we denote $B_{X \mapsto bool}$ .

We can use the simple $HashSet$ construction to induce the Bernoulli Model for set-indicator functions. Specifically, in the $HashSet$ , false positives occur with some positive probability $ϵ$ and false negatives occur with probability $0$ .

Technically, however, the Bernoulli Model is being applied to the set-indicator function, but this also specifies a Bernoulli Model on the Boolean output of the set-indicator function:

$\in: X \times P (X) \mapsto B_{bool} .$

Let’s consider $X = {1, 2}$ and $A = {2}$ . The confusion matrix for the this contruction is given in Table 7.

Table 7: Bernoulli Model for $X \mapsto bool$

	$1_{\emptyset}$	$1_{{1}}$	$1_{{2}}$	$1_{{1, 2}}$
$1_{\emptyset}$	$(1 - ϵ)^{2}$	$ϵ (1 - ϵ)$	$ϵ (1 - ϵ)$	$ϵ^{2}$
$1_{{1}}$	$0$	$1 - ϵ$	$0$	$ϵ$
$1_{{2}}$	$0$	$0$	$1 - ϵ$	$ϵ$
$1_{{1, 2}}$	$0$	$0$	$0$	$1$

We do not know the latent set $A = {2}$ , we only have the approximation $B_{X \mapsto bool} (A)$ , and we use this as a replacement for the latent set $A$ .

Notice that if we start with $A = {2}$ , in row 3, we have zeros in the first two columns for the empty set and ${1}$ , which makes sense as by construction no false negatives are possible, only false positives.

We see that with probability $1 - ϵ$ , the output is correct, and with probability $ϵ$ , the output is incorrect. This is a Bernoulli Model of the set-indicator function (or Bernoulli Set), and we can use this to provide information about the latent set.

On the equality of Bernoulli Sets, we can ask the question, what is the probability that $B_{X \mapsto bool} (A) = A$ ? This is just $1 - ϵ$ , as shown in the confusion matrix. However, this is not typically what people are interested in. They are interested in the probability that an element that tests positive is in fact in the latent set $A$ : $Pr {x \in B_{X \mapsto bool} (A)} .$

This is just the true positive rate, which is $1 - ϵ$ by construction. Thus, false positives occur with probability $ϵ$ , and false negatives occur with probability 0. We see that when we do membership tests, we can think of the output as a Bernoulli Boolean value: $x \in B_{X \mapsto bool} (A) : X \mapsto {bool}^{(2)} .$

Conclusion

The Bernoulli Model is a way of thinking about the uncertainty in the output of a function, and how that uncertainty propagates through a computation, and typically the uncertainty is due to a trade-off between space complexity and accuracy. The more space we use to represent the function, the more closely it is expected to approximate the latent function.