The Bernoulli Model: A Probabilistic Framework for Data Structures and Types
Motivation
The Bernoulli Model is a general framework for thinking about
probabilistic data structures and types of a particular sort.
A big reason for developing the Bernoulli Model formalism is so that we can use
Bernoulli Models of data types to develop Oblivious Data Types. We will
go into that in a separate document, but the basic idea is that Bernoulli
approximations have a lot of desirable properties for developing
oblivious data types, and the Bernoulli Model formalism allows us to reason about
the correctness of the oblivious data types and to make them more space-efficient
by trading accuracy for space while allowing for
The Bernoulli Model also provides a formalism for how to think about various probabilistic data structures, like the Bloom filter, Count-Min sketch, or my invention, the Bernoulli data type, which comprises an entire family of data structures that are all based on the Bernoulli Model, from sets (like the Bloom filter) to maps in a near-space optimal way, while allowing for more savings by trading accuracy for space in a controlled way.
Introduction: Bernoulli Boolean
The Boolean type, denoted by
As special case, data structures like Bloom filters can be thought of as a Bernoulli type for the set indicator function,
, but more on that later.
There are two types that have fewer values than the Boolean type, the absurd type, denoted by
As degenerate cases, the Bernoulli Model of the absurd is just the absurd type,
The Boolean type
Every Bernoulli Model also has an order, an integer greater than 1, which essentially describes the number of independent ways in which the process that generated the Bernoulli approximation can produce errors. We denote that a Bernoulli Model of type
Unless it is useful, we drop the order information and simply write
In the Bernoulli Boolean Model,
Binary Channels
Let’s begin by thinking about the Binary Symmetric Channel and the Binary Asymmetric Channel. The Bernoulli Boolean model can exhibit two distinct behaviors, represented as different “channels” through which Boolean values are transmitted:
Binary Symmetric Channel (First-order Bernoulli Model): The probability of an equality error is the same for
and . We denote this by the type .Binary Asymmetric Channel (Second-order Bernoulli Model): The probability of an equality error differs for
and . We denote this by the type .
False Positives and Negatives
Errors in the Bernoulli Boolean model can be understood in terms of false negatives and false positives:
is a *false negative*. is a *false positive*.
In the first-order model, the probability of a false negative equals the probability
of a false positive. In the second-order model, these probabilities differ.
In a specific but common version of the second-order Bernoulli Boolean model, false
negatives occur with probability 0 and false positives occur with probability
Prediction
However, because
In the first-order model, if the probability of being correct
Dividing the numerator and denominator by
Assuming maximum ignorance (maximum entropy) about
Suppose we have
This is not a typical use-case for the Bernoulli Boolean model, since it will mostly be a analytical result of probabilistic data structures that may be framed in the context of a Bernoulli Model.
Inducing Bernoulli Types
Here, we discuss how to generalize the results.
Unit Functions
We discussed the idea of the Bernoulli Model for value types like
If we replace
Table 1: Second-order Bernoulli Boolean Model for
observe | observe | |
---|---|---|
latent | ||
latent |
So, we might observe
We see that the Bernoulli Model for
We can think of this as the asymmetric binary channel, where the probability of an error is different for
A first-order Bernoulli Boolean model is a model where
Table 2: First-Order Bernoulli Boolean Model for
observe | observe | |
---|---|---|
latent | ||
latent |
We can see that there is only one free parameter,
For completeness, we can write down the confusion matrix for the zeroth-order model, which is just the standard Boolean model:
Table 3: Zeroth-Order Bernoulli Boolean Model for
observe | observe | |
---|---|---|
latent | 1 | 0 |
latent | 0 | 1 |
We see that there are 0 free parameters, and the model is deterministic.
Prediction: Boolean Values
Earlier, we discussed how to predict latent values from Bernoulli Model observations. Let’s apply these insights to the first-order Bernoulli Model for
We can use Bayes' rule to compute the probability that the latent value is
If we substitute this back into the expression for
Let’s evaluate
- At
, . This makes sense. We know that the latent value occurs with probability , therefore whatever no matter what we observe, the latent value is . - As
, . This also makes sense; we know that the latent value occurs with probability , so no matter what we observe, the latent value is . - At
, . This is the maximum entropy case, where we have no prior information about the latent value and assume maximum ignorance.
Unary Bernoulli Functions
In this section, we expand our focus to unary functions.
Lifing Unary Functions
If we have a function
Table 4: All possible functions of type
Suppose we replace the Boolean inputs with Bernoulli Boolean values and ask the question, “What is the probability that
Notice that
However, the
Since we can think of these outputs as either correct or incorrect with probability
Notice that this is not a Bernoulli Model of
For instance, if we have the function
Presumably, some process generated the Bernoulli approximation
There are only 4 possible functions of type
Table 5: Bernoulli Model for
observe | observe | observe | observe | |
---|---|---|---|---|
latent | ||||
latent | ||||
latent | ||||
latent |
Each row must sum to 1,
Often, the order is either first-order or for various reasons. The first-order
Bernoulli Model of
Table 5: Bernoulli Model for
observe | observe | observe | observe | |
---|---|---|---|---|
latent | ||||
latent | ||||
latent | ||||
latent |
When we have a Bernoulli Model approximation of some latent function of type
We do this by saying that the output is a Bernoulli Boolean, because it may or may not be correct, i.e., the Bernoulli Model generates a function of type
Notice that the Bernoulli Model on
Since functions are values, we can ask the question, what is the probability
that the latent function is equal to the observed function:
In this case, we are asking about the equality of the functions, which is mathematically equivalent to asking whether each input in the domain maps to the same output as the latent function:
From the confusion matrix, we know this product of probabilities is
Higher-Order Bernoulli Models
If we are trying to estimate the latent function, a higher order complicates the estimation problem (more parameters to estimate). However, a higher-order may be desirable in many cases, since it allows for more capacity to approximate the latent function. In general, when looking at the confusion matrix, we want the diagonal to be as close to 1 as possible. For the off-diagonal elements, we want functions that are more similiar to the latent function to have larger probabilities than functions that are less similiar to the latent function. This is just a way of minimizing a loss function.
Set-Indicator Functions
The set-indicator function is a function that maps an element of a set to a
Boolean value. We can think of this as a function of type
Let us consider Bernoulli Models for set-indicator functions. The Bloom filter,
for instance, is a Bernoulli model of the set-indicator function. It is a second-order
model, since false negatives occur with probability 0 and false positives occur with
probability
HashSet: Approximate Set-Indicator Functions
Suppose we have a cryptographic hash function
- We are given a set
to (approximately) represent. - We find a seed
such that for all . - We do not look at any
, and by the properties of , for , with probability corresponding to a false positive, and otherwise with probability . - We define membership as
.
So, we construct a
Bernoulli Model for Set-Indicator Functions
The number of functions of type
We can use the simple
Technically, however, the Bernoulli Model is being applied to the set-indicator function, but this also specifies a Bernoulli Model on the Boolean output of the set-indicator function:
Let’s consider
Table 7: Bernoulli Model for
We do not know the latent set
Notice that if we start with
We see that with probability
On the equality of Bernoulli Sets, we can ask the question, what is the probability
that
This is just the true positive rate, which is
Conclusion
The Bernoulli Model is a way of thinking about the uncertainty in the output of a function, and how that uncertainty propagates through a computation, and typically the uncertainty is due to a trade-off between space complexity and accuracy. The more space we use to represent the function, the more closely it is expected to approximate the latent function.