Entropy Maps
The PDF version of this post is available on GitHub.
The basic idea behind an entropy map is to map values in the domain to values in the codomain by hashing to a prefix-free code over the codomain. We do not store anything related to the domain, since we simply hash its values, and a prefix of that hash serves as a code for a value in the codomain.
We actually allow for many different codes for each value in the codomain, so that, for instance, the codes for a value a may be 00, 01, 10, and 11. Notice that we can efficiently decode this as a if the (two-bit prefix of the) hash is less than 4.
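As a rough, hedged sketch of this decoding idea (the hash function, seed handling, and names below are my own, not the post's), here is how a two-bit hash prefix can be decoded against ranges of codes assigned to each codomain value:

```python
import hashlib

# Hypothetical code assignment: the value "a" owns all four 2-bit codes
# (00, 01, 10, 11), so any 2-bit prefix decodes to "a".
CODE_COUNTS = {"a": 4}   # codomain value -> number of 2-bit codes it owns
PREFIX_BITS = 2

def hash_prefix(key, seed, bits=PREFIX_BITS):
    """Return the leading `bits` bits of SHA-256(seed:key) as an integer."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)

def decode(key, seed):
    """Map the hash prefix to the codomain value whose code range contains it."""
    prefix = hash_prefix(key, seed)
    lo = 0
    for value, count in CODE_COUNTS.items():
        if lo <= prefix < lo + count:   # here: decode as "a" whenever prefix < 4
            return value
        lo += count
    return None

print(decode("hello", seed=0))   # always "a", since "a" owns every 2-bit code
```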
Suppose
If
Rate distortion: Bernoulli maps
We can allow rate distortion, too, by failing to code for some of the elements
properly. For instance, a popular choice is when one of the values, say
Bernoulli set-indicator function
For instance, suppose we have a set-indicator function
There is a countably infinite set of random hash functions that hash all
elements in
For any
It is interesting to note that the entropy map initially frames the problem as a compression problem, but we can also think of it as a rate-distortion problem. Implicitly, in the above set-indicator function approximation, we are choosing to minimize a loss function in which false negatives are much more costly than false positives, either because it is unlikely we will test a negative element for membership, or because false positives are not nearly as costly as false negatives, e.g., falsely thinking a rustling in the bushes is a tiger (false positive) is much less costly than failing to notice a tiger in the bushes (false negative).
In either case, we call this set-indicator approximation a Bernoulli set-indicator function, bernoulli<(set<X>, X) -> bool>. This is the function that is communicated, not the latent set-indicator function.
A randomly chosen random hash function that satisfies (is conditioned on) the
property that it hashes all elements in
Table 1: Conditional distribution of Bernoulli set-indicator functions given latent
set-indicator function on
latent/observed | ||||
---|---|---|---|---|
We see that the constraint of no false negatives generates a confusion matrix
with a lot of zeros. If we observe bernoulli<(set<X>, X) -> bool>, then the latent set-indicator function is either
What is the total number of degrees-of-freedom for a confusion matrix of this type?
Table 2: Confusion matrix with maximum degrees-of-freedom
latent/observed | ||||
---|---|---|---|---|
We see that there are
The degrees-of-freedom are one way to think about the complexity of a model. The more degrees-of-freedom, the more complex the model, and the more complex the model, the more data we need to estimate its parameters. Frequently, however, we already know the parameters, since they may have been specified as part of the algorithm that generated the Bernoulli approximation.
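As a hedged aside (the exact count the post states above is lost in this copy), the usual way to count degrees-of-freedom for a confusion matrix is to note that each row is a probability distribution, so

$$
\underbrace{k^{2}}_{\text{cells}} \;-\; \underbrace{k}_{\text{row-sum constraints}} \;=\; k(k-1)
$$

free parameters for a k-by-k matrix; each structural zero, such as the forbidden false-negative cells above, then removes a further degree of freedom.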
The confusion matrices in Tables 1 and 2 represent the conditional distribution of the Bernoulli set-indicator function given the latent set-indicator function, which we denote by bernoulli<(set<X>, X) -> bool>.
Boolean Bernoulli as constant function
How many functions are there of type () -> bool? There are two: true and false.
That is, there are
So, we can also think of Boolean values as functions of type () -> bool. Then, when we apply the Bernoulli model bernoulli<() -> bool>, we get the same result as before.
Table 3: Confusion matrix for Bernoulli model applied to Boolean values
latent/observed | true | false
---|---|---
true | |
false | |
This confusion matrix has a maximum of two degrees-of-freedom, since there are two error parameters, one for each latent value.
In the binary symmetric channel, the two error probabilities are equal, so there is only one free parameter.
Table 4: Confusion matrix for Bernoulli model applied to Boolean values
latent/observed | true | false
---|---|---
true | |
false | |
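As a sketch of this Boolean case (the parameter names and the uniform latent prior below are my own assumptions, not the post's), the following simulates a Bernoulli Boolean with separate error rates for each latent value and estimates the confusion matrix empirically; setting the two rates equal gives the binary symmetric channel of Table 4.

```python
import random

def bernoulli_bool(latent, p_err_true, p_err_false):
    """Observe `latent` through a noisy channel that flips it with a rate
    depending on the latent value (equal rates = binary symmetric channel)."""
    p_flip = p_err_true if latent else p_err_false
    return (not latent) if random.random() < p_flip else latent

def estimate_confusion(p_err_true, p_err_false, trials=100_000):
    counts = {(lat, obs): 0 for lat in (True, False) for obs in (True, False)}
    for _ in range(trials):
        latent = random.random() < 0.5          # assumed uniform latent prior
        observed = bernoulli_bool(latent, p_err_true, p_err_false)
        counts[(latent, observed)] += 1
    for lat in (True, False):                    # normalize rows: P(observed | latent)
        total = counts[(lat, True)] + counts[(lat, False)]
        print(lat, {obs: round(counts[(lat, obs)] / total, 3) for obs in (True, False)})

estimate_confusion(p_err_true=0.1, p_err_false=0.1)   # symmetric channel
```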
Conditional distribution of latent function given observed function
Once we have an observation, say bernoulli<(set<X>, X) -> bool>{x}, what does the confusion matrix tell us? Let’s abstract the problem a bit so we can focus on deriving the result.
Let
Suppose we are interested in
So, to compute
Second, what is
So, let’s replace x with the latent set-indicator function and y with the observed bernoulli<(set<X>, X) -> bool>{y}. Then, we can compute the conditional distribution of the latent given the observed by looking at the confusion matrix in Table 2, picking out the specific row and column of interest, and then normalizing by the sum of the column.
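Here is a minimal numerical sketch of that column-normalization step; the two-by-two confusion matrix, its placeholder probabilities, and the uniform prior are assumptions for illustration (they are not the post's actual table values), chosen so that false negatives are impossible:

```python
# confusion[latent][observed] = P(observed | latent); placeholder values only.
confusion = {
    "latent_empty": {"obs_empty": 0.75, "obs_full": 0.25},  # false positive possible
    "latent_full":  {"obs_empty": 0.00, "obs_full": 1.00},  # no false negatives
}
prior = {"latent_empty": 0.5, "latent_full": 0.5}           # assumed uniform prior

def posterior(observed):
    """P(latent | observed): take the observed column, weight by the prior,
    and normalize by the column sum (with a uniform prior this is exactly
    the column normalization described above)."""
    column = {lat: confusion[lat][observed] * prior[lat] for lat in confusion}
    total = sum(column.values())
    return {lat: round(p / total, 3) for lat, p in column.items()}

print(posterior("obs_empty"))   # observing "empty" pins the latent down with certainty
print(posterior("obs_full"))    # observing "full" leaves some residual uncertainty
```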
For instance, suppose we observe
More generally, we have
Table 5: Conditional probability of latent set-indicator function given observed set-indicator function
observed/latent | ||||
---|---|---|---|---|
The conditional distribution in Table 5 is one way to think about the uncertainty of the latent set-indicator function given the observed set-indicator function. The entropy is another way to think about the uncertainty, but we will not compute it here.
We see that when we observe the empty set for a Bernoulli model in which false negatives are not possible, we know for certain that the latent set-indicator function is the indicator function of the empty set.
Algorithms
The simplest algorithm is a one-level hash function evaluation, where we hash each domain value concatenated with some bit string (a seed).
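One plausible reading of this one-level scheme (my sketch, not necessarily the post's exact construction) is to brute-force a seed under which every member of the set hashes to the prefix region reserved for true; the function names, the SHA-256 choice, and the all-zeros true region are assumptions:

```python
import hashlib
from itertools import count

def hash_bits(key, seed, bits):
    """Leading `bits` bits of SHA-256(seed:key), as an integer."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)

def find_seed(members, bits):
    """Brute-force a seed under which every member hashes to the all-zeros
    prefix (the region we reserve as a code for `true`). The expected number
    of trials is roughly 2 ** (bits * len(members)), which is why the
    one-level scheme only scales for small sets or a small number of bits."""
    for seed in count():
        if all(hash_bits(x, seed, bits) == 0 for x in members):
            return seed

def contains(key, seed, bits):
    """Members always decode to true (no false negatives); a non-member
    collides with the true region with probability about 2 ** -bits."""
    return hash_bits(key, seed, bits) == 0

members = {"apple", "banana", "cherry"}
seed = find_seed(members, bits=3)
print(all(contains(x, seed, 3) for x in members))   # True: no false negatives
print(contains("durian", seed, 3))                  # usually False; true with prob ~1/8
```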
Two-level hash function evaluation
The more practical solution is a two-level hash scheme. First, we hash each
This way, we can keep the probability
Since each decoding is an independent Bernoulli trial, we see that
the probability that a particular
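Under the same assumptions, here is a sketch of the two-level idea as I read it (first hash members into buckets, then search a small per-bucket seed), so that no single seed search has to satisfy every member at once; the bucket count and names are again my own:

```python
import hashlib
from itertools import count

def hash_bits(key, seed, bits):
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)

def build(members, num_buckets, bits):
    """Level one: hash each member into a bucket. Level two: for each bucket,
    search a seed under which all of its members hash to the all-zeros prefix.
    Each per-bucket search is cheap because buckets are small."""
    buckets = [[] for _ in range(num_buckets)]
    for x in members:
        buckets[hash_bits(x, "bucket", 32) % num_buckets].append(x)
    seeds = []
    for bucket in buckets:
        for seed in count():
            if all(hash_bits(x, seed, bits) == 0 for x in bucket):
                seeds.append(seed)
                break
    return seeds

def contains(key, seeds, bits):
    bucket = hash_bits(key, "bucket", 32) % len(seeds)
    return hash_bits(key, seeds[bucket], bits) == 0

seeds = build({"apple", "banana", "cherry", "date"}, num_buckets=4, bits=4)
print(contains("apple", seeds, 4))   # True: members never produce false negatives
print(contains("zebra", seeds, 4))   # false positive with probability about 1/16
```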
Oblivious entropy maps
An oblivious entropy map is just an entropy map where the hash function is applied to trapdoors of the domain values rather than to the values themselves.
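A minimal sketch of what such a trapdoor might look like; using an HMAC with a secret key is my assumption here, not something the post specifies:

```python
import hmac, hashlib

SECRET = b"secret-key"   # hypothetical key held by whoever may form queries

def trapdoor(key):
    """Keyed one-way transform of a plaintext domain value. The entropy map is
    built over trapdoor(x) rather than x, so the stored structure reveals
    nothing directly about the plaintext domain values."""
    return hmac.new(SECRET, key.encode(), hashlib.sha256).hexdigest()

# e.g., reuse the earlier sketch as find_seed({trapdoor(x) for x in members}, bits)
print(trapdoor("apple")[:16])
```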