
Entropy Maps

The PDF version of this post is available on GitHub.

An entropy map approximates a function $f : \mathcal{X} \to \mathcal{Y}$ by hashing domain values to prefix-free codes in the codomain. We store nothing about the domain itself. We just hash, and a prefix of that hash serves as a code for a codomain value.

We allow multiple codes per codomain value. For instance, the value a might be encoded by 00, 01, 10, and 11. If the hash's two-bit prefix is any of these, we decode it as a.

Suppose $\Pr\lbrace f(X) = y\rbrace = p_y$ where $X \sim p_X$. The optimally space-efficient code, assuming a uniform hash function $h$, assigns prefix-free codes for $y$ whose probability of being selected by $h$ sums to $p_y$. The expected bit length is

$$\ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y,$$

which is the entropy of the output distribution. That is why we call it an entropy map.

If $\mathcal{X}$ is finite, we can think of it as implicitly encoding the domain and storing the prefix-free code for each domain element. The average bit length per element is $\ell$, and the total is $|\mathcal{X}|\,\ell$.
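As a concrete sketch (all names here are illustrative, not the author's implementation), we can give each codomain value a canonical prefix-free code of length roughly $-\log_2 p_y$, hash a domain value, and decode the leading bits of the hash:

```python
import hashlib
import math

def canonical_codes(probs):
    """Map each codomain value to a prefix-free code of length
    ceil(-log2 p_y), so a uniform hash selects y with probability ~ p_y."""
    lengths = sorted((max(1, math.ceil(-math.log2(p))), y) for y, p in probs.items())
    code, codes, prev_len = 0, {}, lengths[0][0]
    for length, y in lengths:
        code <<= length - prev_len          # canonical (Huffman-style) assignment
        codes[y] = format(code, f"0{length}b")
        code, prev_len = code + 1, length
    return codes

def hash_bits(x, n=32):
    """Leading n bits of a hash of x, as a bit string."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return "".join(f"{byte:08b}" for byte in digest)[:n]

def decode(bits, codes):
    """Read leading bits until they match some code; return its codomain value."""
    inv = {c: y for y, c in codes.items()}
    for i in range(1, len(bits) + 1):
        if bits[:i] in inv:
            return inv[bits[:i]]
    return None  # no code matched (possible if codes do not cover all prefixes)
```

For dyadic probabilities like `{"a": 0.5, "b": 0.25, "c": 0.25}` this yields the codes `0`, `10`, `11`, and `decode(hash_bits(x), codes)` returns `a` for about half of all inputs, matching the expected bit length $\ell$.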

Rate distortion: Bernoulli maps

We can allow errors. If one codomain value $y'$ is very common (say $p_{y'} > .99$), we can give it a prefix-free code that covers probability $p_{y'}$ and then skip coding for it in the entropy map. A random $x \in \mathcal{X}$ will map to $y'$ with probability $p_{y'}$ (which can be made as close to 1 as desired by trading space for accuracy). For the remaining domain values, we code them correctly, or allow errors on those too after attempting correct coding.

Bernoulli set-indicator function

Consider a set-indicator function

$$1_{\mathcal{A}} : \mathcal{X} \to \lbrace 0,1 \rbrace,$$

where $\mathcal{A} \subseteq \mathcal{X}$ and $\mathcal{X}$ is very large (possibly infinite). We assign prefix-free codes for codomain value $1$ such that a random hash maps an element of $\mathcal{X}$ to a code for $1$ with probability $\varepsilon$, where $\varepsilon$ is small (say $2^{-10}$).

There exists a (countably infinite) set of hash functions that hash all elements in $\mathcal{A}$ to codes for $1$ and elements in $\mathcal{A}' = \mathcal{X} \setminus \mathcal{A}$ to codes for either $0$ or $1$. Choosing a random hash function with this property, we expect a fraction $\varepsilon$ of the elements in $\mathcal{A}'$ to hash to $1$ (false positives) and the remaining $1 - \varepsilon$ to hash to $0$.

For any $x \in \mathcal{X}$, we test membership by checking whether a prefix of $h(x)$ is a code for $0$ or $1$. If it is a code for $0$, then $x$ is definitely not in $\mathcal{A}$. If it is a code for $1$, then $x$ is in $\mathcal{A}$ with a false positive rate of $\varepsilon$ and a true positive rate of $1$, since we explicitly chose a hash function that maps all elements of $\mathcal{A}$ to codes for $1$.
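A minimal sketch of this construction (the hash, the 20-bit width, and the function names are my illustrative choices, not the author's implementation): with $\varepsilon = 2^{-k}$, treat hashes whose top $k$ bits are zero as codes for $1$, and search for a seed that sends every member of $\mathcal{A}$ to such a code:

```python
import hashlib

BITS = 20  # hash width (illustrative)

def h(x, seed):
    """Uniform-ish hash of (x, seed) to a BITS-bit integer."""
    digest = hashlib.sha256(f"{seed}|{x}".encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - BITS)

def build(members, k):
    """Find a seed mapping every member to a code for 1 (top k bits zero).
    Expected search time grows like 2**(k * len(members))."""
    seed = 0
    while not all(h(x, seed) < 2 ** (BITS - k) for x in members):
        seed += 1
    return seed

def query(x, seed, k):
    """True positives always; false positives at rate eps = 2**-k."""
    return h(x, seed) < 2 ** (BITS - k)
```

With `k = 4` ($\varepsilon = 1/16$) and a handful of members the seed search is fast; the exponential cost in $|\mathcal{A}|$ is what the two-level scheme in the Algorithms section avoids.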

There is an interesting framing here. The entropy map starts as a compression problem, but it is also a rate-distortion problem. In the set-indicator approximation above, we are implicitly choosing to minimize a loss function where false negatives are much more costly than false positives, either because negative elements are rarely queried or because false positives are cheap compared to false negatives (falsely thinking a rustling in the bushes is a tiger is much less costly than failing to notice an actual tiger).

We call this a Bernoulli set-indicator function, bernoulli<(set<X>, X) -> bool>{ $1_A$ }. This is what gets communicated, not the latent function $1_A$.

The confusion matrix for a hash function conditioned on mapping all of $\mathcal{A}$ to codes for $1$:

Table 1: Conditional distribution of Bernoulli set-indicator functions on $\mathcal{X} = \lbrace a,b\rbrace$

| latent / observed | $1_\emptyset$ | $1_{\lbrace a\rbrace}$ | $1_{\lbrace b\rbrace}$ | $1_{\lbrace a,b\rbrace}$ |
|---|---|---|---|---|
| $1_\emptyset$ | $(1-\varepsilon)^2$ | $(1-\varepsilon)\varepsilon$ | $(1-\varepsilon)\varepsilon$ | $\varepsilon^2$ |
| $1_{\lbrace a\rbrace}$ | $0$ | $1-\varepsilon$ | $0$ | $\varepsilon$ |
| $1_{\lbrace b\rbrace}$ | $0$ | $0$ | $1-\varepsilon$ | $\varepsilon$ |
| $1_{\lbrace a,b\rbrace}$ | $0$ | $0$ | $0$ | $1$ |

The no-false-negatives constraint produces a lot of zeros. If we observe bernoulli<(set<X>, X) -> bool>{ $1_{\lbrace a\rbrace}$ }, the latent function is either $1_{\emptyset}$ or $1_{\lbrace a\rbrace}$. Since $\varepsilon$ is small, $1_{\lbrace a\rbrace}$ is much more likely.

The maximum degrees of freedom for this confusion matrix:

Table 2: Confusion matrix with maximum degrees of freedom

| latent / observed | $1_\emptyset$ | $1_{\lbrace a\rbrace}$ | $1_{\lbrace b\rbrace}$ | $1_{\lbrace a,b\rbrace}$ |
|---|---|---|---|---|
| $1_\emptyset$ | $p_{11}$ | $p_{12}$ | $p_{13}$ | $1-p_{11}-p_{12}-p_{13}$ |
| $1_{\lbrace a\rbrace}$ | $p_{21}$ | $p_{22}$ | $p_{23}$ | $1-p_{21}-p_{22}-p_{23}$ |
| $1_{\lbrace b\rbrace}$ | $p_{31}$ | $p_{32}$ | $p_{33}$ | $1-p_{31}-p_{32}-p_{33}$ |
| $1_{\lbrace a,b\rbrace}$ | $p_{41}$ | $p_{42}$ | $p_{43}$ | $1-p_{41}-p_{42}-p_{43}$ |

That is $4 \times (4 - 1) = 12$ degrees of freedom. Table 1 has just 1 degree of freedom ($\varepsilon$).

Degrees of freedom give a measure of model complexity. More parameters means more data needed for estimation, though frequently we already know the parameters because they were specified by the algorithm that generated the Bernoulli approximation.

Boolean Bernoulli as constant function

How many functions of type () -> bool? Two: true and false. That is $|\lbrace \text{true}, \text{false}\rbrace|^{|\lbrace 1 \rbrace|} = 2^1 = 2$.

We can think of Boolean values as functions () -> bool. Applying the Bernoulli model bernoulli<() -> bool> gives the same result as before.

Table 3: Confusion matrix for Bernoulli model on Boolean values

| latent / observed | true | false |
|---|---|---|
| true | $p_{11}$ | $1-p_{11}$ |
| false | $1-p_{22}$ | $p_{22}$ |

Maximum of two degrees of freedom. In the binary symmetric channel, $p_{11} = p_{22}$:

Table 4: Symmetric confusion matrix for Bernoulli Boolean

| latent / observed | true | false |
|---|---|---|
| true | $p$ | $1-p$ |
| false | $1-p$ | $p$ |

Conditional distribution of latent function given observed function

Once we have an observation bernoulli<(set<X>, X) -> bool>{x}, what does the confusion matrix tell us? Let me abstract the problem.

Let $X$ and $Y$ be random variables. Suppose $P(X = x \mid Y = y)$ is hard to compute, but $P(Y = y \mid X = x)$ is easy. (This is exactly the situation with confusion matrices: we know the conditional distribution of the observed function given the latent one, but we want the reverse.)

Bayes’ rule:

$$P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\, P(X = x)}{P(Y = y)}$$

We need two things. First, $P(X = x)$, the prior. If we know the distribution of $X$, use it. Otherwise use an uninformative prior (uniform).

Second, $P(Y = y)$, the normalizing constant:

$$P(Y = y) = \sum_{x'} P(Y = y \mid X = x')\, P(X = x')$$

With a uniform prior over a finite set of $|X|$ values, $P(X = x) = 1/|X|$, and:

$$P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)}{\sum_{x'} P(Y = y \mid X = x')}$$
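With a uniform prior this is just column normalization. A small sketch (the function name is mine):

```python
def posterior(column, prior=None):
    """P(X = x | Y = y) from the likelihood column P(Y = y | X = x),
    via Bayes' rule. Defaults to a uniform (uninformative) prior."""
    if prior is None:
        prior = [1.0 / len(column)] * len(column)
    joint = [lik * p for lik, p in zip(column, prior)]  # P(Y = y | X = x) P(X = x)
    z = sum(joint)                                      # P(Y = y), the normalizer
    return [j / z for j in joint]
```

For instance, observing $1_{\lbrace a\rbrace}$ in Table 1 gives the column $((1-\varepsilon)\varepsilon,\ 1-\varepsilon,\ 0,\ 0)$, and `posterior` returns $(\varepsilon/(1+\varepsilon),\ 1/(1+\varepsilon),\ 0,\ 0)$.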

Applying this to the set-indicator case: replace $X$ with the latent indicator and $Y$ with the observed Bernoulli indicator. Read the confusion matrix, pick out the relevant column, normalize.

For example, observing $1_{\lbrace a\rbrace}$: the column is $(p_{12}, p_{22}, p_{32}, p_{42})'$. The conditional probability is

$$p_{k \mid 2} = \frac{p_{k2}}{\sum_{j=1}^4 p_{j2}},$$

where $k$ indexes the latent function and we condition on column 2.

More generally:

$$p_{k \mid i} = \frac{p_{ki}}{\sum_{j=1}^4 p_{ji}}.$$

Doing this for all four observed indicators using the one-parameter confusion matrix from Table 1:

Table 5: Conditional probability of latent indicator given observed indicator

| observed / latent | $1_\emptyset$ | $1_{\lbrace a\rbrace}$ | $1_{\lbrace b\rbrace}$ | $1_{\lbrace a,b\rbrace}$ |
|---|---|---|---|---|
| $1_\emptyset$ | $1$ | $0$ | $0$ | $0$ |
| $1_{\lbrace a\rbrace}$ | $\varepsilon/(1+\varepsilon)$ | $1/(1+\varepsilon)$ | $0$ | $0$ |
| $1_{\lbrace b\rbrace}$ | $\varepsilon/(1+\varepsilon)$ | $0$ | $1/(1+\varepsilon)$ | $0$ |
| $1_{\lbrace a,b\rbrace}$ | $\varepsilon^2/(1+\varepsilon)^2$ | $\varepsilon/(1+\varepsilon)^2$ | $\varepsilon/(1+\varepsilon)^2$ | $1/(1+\varepsilon)^2$ |
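These entries can be checked mechanically by normalizing the columns of Table 1 with exact rational arithmetic (a verification sketch; the variable names are mine):

```python
from fractions import Fraction

eps = Fraction(1, 1024)  # eps = 2**-10

# Table 1: P(observed | latent); rows are latent, columns are observed,
# in the order 1_{}, 1_{a}, 1_{b}, 1_{a,b}.
T = [
    [(1 - eps) ** 2, (1 - eps) * eps, (1 - eps) * eps, eps ** 2],
    [0,              1 - eps,         0,               eps     ],
    [0,              0,               1 - eps,         eps     ],
    [0,              0,               0,               1       ],
]

def column_posterior(i):
    """P(latent = k | observed = i) under a uniform prior: normalize column i."""
    col = [row[i] for row in T]
    z = sum(col)
    return [c / z for c in col]

# Observing 1_{a,b} (column 3) reproduces the last row of Table 5 exactly:
assert column_posterior(3) == [eps**2 / (1 + eps)**2, eps / (1 + eps)**2,
                               eps / (1 + eps)**2, 1 / (1 + eps)**2]
```

The other columns check out the same way, e.g. column 0 normalizes to $(1, 0, 0, 0)$.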

When we observe the empty set in a model where false negatives are impossible, we know the latent function is $1_\emptyset$ with certainty. Observing $1_{\lbrace a\rbrace}$, we know the latent is either $1_{\emptyset}$ or $1_{\lbrace a\rbrace}$, with the latter much more likely since $\varepsilon$ is small.

Algorithms

The simplest algorithm: hash domain values concatenated with a bit string $b$ and decode the result. Find $b$ such that $h(x + b)$ decodes to the correct $f(x)$ for all $x \in \mathcal{X}$.
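A self-contained sketch of this search (the codes, the output values, the hash, and the names are all illustrative choices of mine): give three outputs the prefix-free codes 0, 10, 11, and increment $b$ until every domain value decodes correctly:

```python
import hashlib

CODES = {"0": "yes", "10": "no", "11": "maybe"}  # prefix-free codes (illustrative)

def hash_decode(s):
    """Hash s and read its leading bits until they match a prefix-free code."""
    digest = hashlib.sha256(s.encode()).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    for i in (1, 2):
        if bits[:i] in CODES:
            return CODES[bits[:i]]

def find_seed(f):
    """Find b such that h(x + b) decodes to f(x) for every x in the domain.
    Expected number of tries: 1 / prod_x p_{f(x)}."""
    b = 0
    while not all(hash_decode(f"{x}|{b}") == y for x, y in f.items()):
        b += 1
    return b
```

The expected search time is exponential in the domain size, which is why the two-level scheme below is the practical version.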

Two-level hash function evaluation

The practical version uses a two-level scheme. First, hash each $x \in \mathcal{X}$ concatenated with a bit string $b$ (same as before), but use the result to index into a hash table $H$ at position $j$. Then, for each bucket $j$, choose a bit string $H[j]$ such that $f(x) = \text{decode}(h(x + H[j]))$ for every $x$ mapping to $j$.

This keeps the success probability $p_j = \prod_x \Pr\lbrace f(x) = \text{decode}(h(x + H[j]))\rbrace$ (the product running over the $x$ mapping to bucket $j$) roughly constant for each bucket, independent of the total domain size, by choosing an appropriately sized hash table.

Each decoding is an independent Bernoulli trial. The probability that a particular $x$ hashing to bucket $j$ is decoded correctly is the fraction of hash values that are valid prefix-free codes for $f(x)$.
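The two-level scheme can be sketched as follows (the bucket hashing, codes, and names are my illustrative choices): a first-level hash assigns each $x$ to a bucket, and a per-bucket seed $H[j]$ is searched exactly as in the one-level case, but only over the constant-size bucket:

```python
import hashlib

CODES = {"0": "yes", "10": "no", "11": "maybe"}  # prefix-free codes (illustrative)

def bits_of(s):
    digest = hashlib.sha256(s.encode()).digest()
    return "".join(f"{byte:08b}" for byte in digest)

def decode(bits):
    """Match the leading bits of a hash against the prefix-free codes."""
    return CODES.get(bits[:1]) or CODES.get(bits[:2])

def build(f, num_buckets):
    """First level: hash x to a bucket j. Second level: search a seed H[j]
    so that every x in bucket j decodes to f(x)."""
    buckets = {}
    for x, y in f.items():
        j = int(bits_of(f"bucket|{x}")[:16], 2) % num_buckets
        buckets.setdefault(j, []).append((x, y))
    H = [0] * num_buckets
    for j, items in buckets.items():
        while not all(decode(bits_of(f"{x}|{H[j]}")) == y for x, y in items):
            H[j] += 1
    return H

def lookup(x, H):
    j = int(bits_of(f"bucket|{x}")[:16], 2) % len(H)
    return decode(bits_of(f"{x}|{H[j]}"))
```

Because each bucket holds a roughly constant number of elements, each per-bucket search has roughly constant expected cost, so the total build time grows linearly rather than exponentially in the domain size.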

Oblivious entropy maps

An oblivious entropy map applies the hash function to trapdoors of $\mathcal{X}$ rather than elements directly, and the prefix-free codes for $\mathcal{Y}$ have no discernible structure (a random assignment of hash values to codomain values). This is relevant to my work on encrypted search.

Discussion