Rank-Ordered Encrypted Search

Consider a map from a set of keys to a set of values, where the keys are search keys (or relations between search keys) and the values are information about those keys that may be used to compute a rank ordering, such as frequency or proximity information.

Encrypted search enables untrusted systems to obliviously search documents on behalf of authorized users, where the search query, the confidential documents being searched over, and the search results are oblivious values.

Oblivious values are values that do not reveal any information about the underlying plaintext values, such as an oblivious Boolean value which may be true or false (or neither, in the case of noisy oblivious values). In practice, these requirements may not be fully met. For example, the search results may reveal the number of search keys in the query (although this too can be mitigated). However, the goal is to minimize the amount of information that is revealed to an untrusted system that is performing the search.

One of the earlier papers on encrypted search observed that individuals may wish to outsource storage logistics to cloud storage providers but do not trust the providers to meet their confidentiality requirements (Son00). To enable an untrusted system to search a confidential document on an authorized user’s behalf, the untrusted system must have some sort of searchable representation of the document, which we call a secure index.

In rank-ordered Encrypted Search, a confidential document has a degree-of-relevance to a hidden query. The degree-of-relevance can be based on many criteria, such as the multiplicity or proximity of the search keys in the document. Thus, to facilitate rank-ordered search, we need a data structure that is able to store arbitrary information about the search keys in a document.

In what follows, we consider secure indexes represented by an oblivious entropy map, which permits nearly any kind of search, including Boolean search, rank-ordered search, set-theoretic search, proximity search, and so on. As a special case, we consider secure indexes represented by an oblivious set, which permits only set-theoretic Boolean search.

Oblivious entropy map: theory and implementation

The basic theory behind an entropy map is to map values in the domain to values in the codomain by hashing to a prefix-free code in the codomain. We do not store anything related to the domain, since we simply hash the domain values, and a prefix of that hash is used as a code for a value in the codomain.

We actually allow many different codes for each value in the codomain, so that, for instance, the codes for a value $a$ may be 00, 01, 10, and 11. Notice that we can efficiently decode a hash as $a$ by checking whether its two-bit prefix, interpreted as an integer, is less than 4.

Suppose $\Pr\lbrace f(X) = y\rbrace = p_y$ where $X \sim p_X$. Then the optimally space-efficient code, assuming a uniform hash function $h$, is to assign prefix-free codes for $y$ whose probability of being mapped to by $h$ sums to $p_y$. In this case, the expected bit length is given by

$$\ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y,$$

which means that if we imagine sampling $x$ from $\mathcal{X}$ according to $p_X$, mapping it to $y = f(x)$, and observing the sequence of $y$’s, then the expected bit length is the entropy of that sequence. This is why we call it an entropy map.
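
For instance, if $p_0 = 0.99$ and $p_1 = 0.01$, then $\ell = -(0.99 \log_2 0.99 + 0.01 \log_2 0.01) \approx 0.081$ bits per mapped element, far fewer than the one bit a naive Boolean encoding would spend.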

If $\mathcal{X}$ is finite, then we can just imagine implicitly encoding the domain and then, for each value in the domain, storing the prefix-free code that it maps to, which has an average bit length of $\ell$ and a total bit length of $|\mathcal{X}|\,\ell$.

We can allow rate distortion, too, by failing to code for some of the elements properly. A popular choice arises when one of the values, say $y'$, is extremely common, for instance $p_{y'} > 0.99$. We can give $y'$ a set of prefix-free codes whose probability sums to $p_{y'}$ and then not code for it in the entropy map, in which case a randomly selected $x \in \mathcal{X}$ will be mapped to $y'$ with probability $p_{y'}$ (which is sufficiently large, and can be made as close to $1$ as desired if we wish to trade space for accuracy). For the remaining values in the domain, we code for them correctly (or also allow errors on them, but only after trying to find correct codes for each of them).

For instance, suppose we have a set-indicator function $1_{\mathcal{A}} : \mathcal{X} \to \lbrace 0,1\rbrace$, where $\mathcal{A} \subseteq \mathcal{X}$ and $\mathcal{X}$ is a very large set. We may seek to give $0$ a set of prefix-free codes that sums to $p_0 = 1-\varepsilon > 0.99$, and then only find codes for $1$ on the elements of $\mathcal{A}$, where $p_1 = 1 - p_0 = \varepsilon \leq 0.01$. Then, for any $x \in \mathcal{X}$, we can test whether $1_{\mathcal{A}}(x) = 1$ by testing whether $h(x)$ is a code for $0$ or for $1$. If it is a code for $0$, then $x$ is definitely not a member of $\mathcal{A}$; if it is a code for $1$, then $x$ is a member of $\mathcal{A}$ with a false positive rate of $\varepsilon$ and a true positive rate of $1$, since a randomly drawn element not in $\mathcal{A}$ maps to a code for $1$ with probability $\varepsilon$ and any element in $\mathcal{A}$ maps to a code for $1$ with probability $1$ (since we explicitly code for it). In this formulation, we can think of the entropy map as a rate-distorted approximation of a set-indicator function, which we refer to as a Bernoulli set-indicator function (or, more generally, a Bernoulli map over $\mathcal{X} \to \text{bool}$).

The Bloom filter is a special case of an entropy map over set-indicator functions: implicitly, its membership test has a false positive rate of $\varepsilon$.
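
To make this concrete, here is a minimal Python sketch of a Bernoulli set-indicator along the lines described above: the all-zero $k$-bit hash prefix serves as the code for membership (probability $\varepsilon \approx 2^{-k}$ under a uniform hash), and a per-bucket seed is searched for so that every stored element hashes to that code. The class name, bucketing scheme, and seed search are illustrative choices rather than a prescribed implementation, and the build step is only practical for small examples.

```python
import hashlib

def _hash_bits(x: str, seed: int, k: int) -> int:
    """Interpret the first k bits of a keyed SHA-256 hash of x as an integer."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - k)

class BernoulliSet:
    """Rate-distorted set-indicator function with false positive rate ~2**-k."""

    def __init__(self, elements, k=8, buckets_per_element=4):
        self.k = k
        self.r = max(1, buckets_per_element * len(elements))
        buckets = [[] for _ in range(self.r)]
        for x in elements:
            buckets[self._bucket(x)].append(x)
        # For each bucket, search for a seed under which every stored element
        # hashes to the all-zero k-bit prefix (the code for "member").
        # Build time grows as ~2**(k * bucket size), so keep buckets tiny.
        self.seeds = [self._find_seed(bucket) for bucket in buckets]

    def _bucket(self, x: str) -> int:
        return _hash_bits(x, 0, 32) % self.r

    def _find_seed(self, bucket) -> int:
        seed = 1
        while not all(_hash_bits(x, seed, self.k) == 0 for x in bucket):
            seed += 1
        return seed

    def __contains__(self, x: str) -> bool:
        # Members always hash to the member code; non-members do so with
        # probability about 2**-k (a false positive).
        return _hash_bits(x, self.seeds[self._bucket(x)], self.k) == 0

# Usage: members always test positive; non-members rarely do.
s = BernoulliSet(["alpha", "bravo", "charlie"], k=8)
assert all(x in s for x in ["alpha", "bravo", "charlie"])
print("delta" in s)  # True with probability about 1/256
```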

The oblivious entropy map is just an entropy map where the hash function is applied to trapdoors of $\mathcal{X}$ and the prefix-free codes for $\mathcal{Y}$ have no discernible pattern.

In Boolean Encrypted Search, a confidential document is relevant to a hidden query if the secure index of the document contains every trapdoor search key in the hidden query. That is, the secure index may be modeled as an oblivious set.

Sets enable queries about the presence of a trapdoor. Let

contains : (secure_index, trapdoor) -> bool

be a set-membership test (set-indicator function), e.g., contains(S, x) tests whether trapdoor x (which may be, say, the trapdoor of a unigram, a bigram, or something else, like a stemmed phrase) is a member of secure index S.

This can be used to implement Boolean search and set-theoretic queries, such as intersection, union, and difference. However, it is not sufficient for rank-ordered search, which requires more information about how the query (a set of trapdoors) relates to each document (secure index).
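
For illustration, a conjunctive Boolean search needs nothing beyond the contains predicate. In the sketch below, trapdoor is a stand-in for whatever keyed transformation the scheme applies to plaintext search keys, and the function names are not a prescribed API.

```python
from typing import Callable, Iterable, List

def boolean_search(secure_indexes: List[object],
                   query_keys: Iterable[str],
                   trapdoor: Callable[[str], str],
                   contains: Callable[[object, str], bool]) -> List[int]:
    """Return the ids of documents whose secure index contains every
    trapdoor in the (hidden) conjunctive query."""
    trapdoors = [trapdoor(w) for w in query_keys]
    return [doc_id for doc_id, s in enumerate(secure_indexes)
            if all(contains(s, t) for t in trapdoors)]
```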

Oblivious multi-sets

Multi-sets, which enable queries about the multiplicity of a trapdoor, enable rank-ordered searches that consider the frequency of words, such as BM25.

In this case, we need a data structure that is able to store the multiplicity of search keys in a document. This is enabled by an oblivious multi-set where the keys are the search keys and the values are the multiplicity of the search keys. The expected bit length is given by

$$\mathcal{O}\!\left(-\log_2 \varepsilon + \log_2 q_{\max}\right) = \mathcal{O}\!\left(\log_2 \frac{f_{\max} - f_{\min} + 1}{\varepsilon}\right)$$

bits per trapdoor if we assume that the frequency of a word in a document is uniformly distributed between $f_{\min}$ and $f_{\max}$.
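
For example, with $f_{\min} = 1$, $f_{\max} = 16$, and $\varepsilon = 2^{-8}$, this works out to $\log_2(16 \cdot 256) = 12$ bits per trapdoor.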

Let

contains : (secure_index, unigram) -> bool

be a set-membership test (set-indicator function), e.g., contains(s, x) tests whether unigram x is a member of secure index s. This is the bare minimum and only supports Boolean search.

The interface also provides a multiplicity function: $\text{multiplicity}(s, x) \mapsto \mathbb{Z}$ counts the multiplicity of unigram $x$ in secure index $s$.

Secure postings list

A secure postings list uses a variable-length perfect map filter. Keys are unigrams and map to array elements. The first bit indicates the initial position and the gaps are encoded in the remaining bits. If this information cannot fit, then the remaining bits serve as a pointer into a larger bit string where the first sequence (using a universal coder) indicates starting position.
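
The encoding admits several realizations; as one hedged sketch, positions can be stored as the first position followed by the gaps between consecutive positions, each written with a universal code such as Elias gamma (function names here are illustrative).

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code for a positive integer n (a universal prefix code)."""
    assert n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def encode_postings(positions):
    """Encode a sorted list of 1-based term positions as the starting
    position followed by the gaps, each with a universal (gamma) code."""
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return "".join(elias_gamma(g) for g in gaps)

# e.g., positions (3, 7, 8, 15) -> gaps (3, 4, 1, 7)
print(encode_postings([3, 7, 8, 15]))
```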

The interface elements for a secure postings list include:

  • contains(s, x) -> bool: Test unigram $x$ for membership in secure index $s$. This is the bare minimum and only supports Boolean search.
  • multiplicity(s, x) -> int: Count the multiplicity of unigram $x$ in secure index $s$.
  • locations(s, x): Retrieve the locations of unigram $x$ in secure index $s$.
  • containsSequence(s, x_1 x_2 \cdots x_k): Test the ordered $k$-gram $x_1 x_2 \cdots x_k$ for membership in secure index $s$. If a biword model is used, then contains may be used to implement containsSequence.
  • locationsSequence(s, x_1 x_2 \cdots x_k): Retrieve the locations of the $k$-gram $x_1 x_2 \cdots x_k$ in secure index $s$.

These interface elements may be used to implement other interfaces, like rank-ordered search (e.g., BM25 or MinDistX) or set-theoretic queries. Note that if the locations interface is supported, then minimum distance and multiplicity can be computed from the location data.
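
For example, assuming locations returns a sorted list of integer positions, multiplicity is just the number of locations, and the minimum pairwise distance between two terms can be computed with a linear merge over their location lists.

```python
def multiplicity_from_locations(locations):
    """Multiplicity is just the number of recorded locations."""
    return len(locations)

def min_pairwise_distance(locs_a, locs_b):
    """Minimum absolute distance between any location of term a and any
    location of term b (a linear merge over the two sorted lists)."""
    i = j = 0
    best = float("inf")
    while i < len(locs_a) and j < len(locs_b):
        best = min(best, abs(locs_a[i] - locs_b[j]))
        if locs_a[i] < locs_b[j]:
            i += 1
        else:
            j += 1
    return best
```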

Secure rank

The secure index does not disclose which plaintext words are contained in a document, nor any information attached to the plaintext word, until a hidden query is submitted against the document. However, at the point of submission, while the plaintext word is not necessarily disclosed, the information attached to the word is.

Such information may be sufficient to allow the adversary to estimate (with a probability of success greater than random chance) the plaintext word that corresponds to a trapdoor in a hidden query. Depending upon the nature of the information, the adversary may be able to estimate other characteristics about the confidential document. In the case of a secure postings list, assuming the postings have minimal error, the adversary may be able to estimate much of the plaintext of an entire confidential document.

One way to combat this technique is to precompute the results of queries (perhaps just reasonably plausible queries) with respect to a given confidential document. Then, the untrusted system does not need to know any information about particular atomic words, but rather, it needs to know information about how a confidential document is relevant to particular queries.

Suppose the query model is a bag-of-words and the maximum number of words in a query is $p$. As a bag-of-words, the order of the words in the query is irrelevant. Since a document may only contain some of the words in the query, we must code all $2^p - 1$ subsets (excluding the empty set).

Given a bag-of-words query, we could canonically order the bag and map this to any desired ranking. That is, we precompute the ranking. This can have a high up-front cost and a high storage cost.
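
A hedged sketch of this precomputation: canonical keys are order-independent hashes of trapdoor subsets, score is a placeholder for whatever ranking metric is being precomputed (e.g., BM25 or MinDistX), and the fallback policy in lookup is one choice among several.

```python
from itertools import combinations
import hashlib

def canonical_key(trapdoors):
    """Canonical key for a bag of trapdoors: an order-independent hash."""
    return hashlib.sha256("|".join(sorted(trapdoors)).encode()).hexdigest()

def precompute_rankings(doc_trapdoors, score, p=3):
    """Map every non-empty subset (up to size p) of the document's
    trapdoors to its precomputed relevance score."""
    table = {}
    for j in range(1, p + 1):
        for subset in combinations(sorted(doc_trapdoors), j):
            table[canonical_key(subset)] = score(subset)
    return table

def lookup(table, query_trapdoors):
    """Score a query against one precomputed table: try the full bag of
    trapdoors first, then fall back to smaller subsets, returning the best
    score found at the largest matching subset size."""
    for j in range(len(query_trapdoors), 0, -1):
        scores = [table[canonical_key(sub)]
                  for sub in combinations(sorted(query_trapdoors), j)
                  if canonical_key(sub) in table]
        if scores:
            return max(scores)
    return 0.0
```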

If the document consists of $k$ unique words, then there are

$$\sum_{j=1}^{p} \binom{k}{j}$$

canonical keys. If $p$ is unbounded, then there are

$$2^k - 1$$

canonical keys, which is impractical for most documents with even a relatively small number of unique words, e.g., if $k=30$, then there are approximately a billion keys.

In either case, the expected lower-bound on bit length for the oblivious map is given by

$$\sum_{j=1}^{p} \binom{k}{j}\left(\mu - \log_2 \varepsilon\right)\,,$$

where $\mu$ is the average bit length of the ranking coder which represents some rank-ordered search metric, like BM25 or MinDist, or the proximity of the words for Boolean proximity searches (or multiple such metrics may be coded).

Suppose the maximum query size is $p=3$. Then the map has on the order of

$$\mathcal{O}(k^3)$$

keys and the oblivious map has a lower-bound on the order of

$$\mathcal{O}(k^3)\,(\mu - \log_2 \varepsilon) \text{ bits}$$

or

$$\mathcal{O}(k^2)\,(\mu - \log_2 \varepsilon) \text{ bits per search key}$$

If $k \approx 100$, which may be a reasonably large document (especially if common words are removed and stemming is used to normalize words), then the lower-bound is around

$$10^4\,(\mu - \log_2 \varepsilon) \text{ bits per search key}$$
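
As a quick back-of-the-envelope check of these numbers (the values of $\varepsilon$ and $\mu$ below are illustrative):

```python
from math import comb, log2

def secure_rank_bits(k: int, p: int, epsilon: float, mu: float) -> float:
    """Lower-bound storage estimate (in bits) for the precomputed-ranking map:
    number of canonical keys times (mu - log2(epsilon)) bits per key."""
    keys = sum(comb(k, j) for j in range(1, p + 1))
    return keys * (mu - log2(epsilon))

# k = 100 unique words, queries of at most p = 3 words,
# epsilon = 2**-10 and mu = 8 bits per ranking value:
# 166,750 canonical keys -> about 3.0 million bits in total.
print(secure_rank_bits(100, 3, 2**-10, 8))
```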

Rank-ordered information retrieval provides a mapping between queries and degree-of-relevancy for each document in a collection.

In information retrieval (IR), finding effective ways to measure relevancy is of fundamental importance. To that end, many algorithms and heuristics have been devised to rank-order documents by their estimated relevancy to a given query. We explore term weighting and term proximity weighting heuristics in the context of secure indexes.

Precision and recall

Precision and recall are relevant metrics for Boolean searches, which do not rank retrieved documents the way BM25 or MinDistX do; a document is either considered relevant (it contains all of the terms in a query) or non-relevant.

Precision measures the proportion of retrieved documents that are relevant to the query. It is defined as:

$$\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}} = \frac{|R \cap A|}{|A|}$$

where $R$ is the set of relevant documents and $A$ is the set of retrieved documents.

Precision has a range of $[0, 1]$. Recall measures the proportion of relevant documents that were retrieved. It is defined as:

$$\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}} = \frac{|R \cap A|}{|R|}$$

Recall also has a range of $[0, 1]$. It is trivial to achieve a recall of $1$ (100%) by retrieving every document in the corpus. However, this comes at the cost of decreased precision. Thus, in general there is a trade-off between precision and recall.
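
As a small sketch, treating relevance and retrieval as plain (unranked) sets:

```python
def precision_recall(relevant: set, retrieved: set):
    """Precision and recall for a Boolean (unranked) retrieval result."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```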

Mean average precision

Mean average precision (MAP) is a popular way to measure the performance of information retrieval systems with respect to degree of relevancy, however relevancy may be defined. In particular, we use MAP to measure how closely the rank-ordered output (of either BM25 or MinDistX) of the secure indexes matches the rank-ordered output of non-secure, canonical indexes which provide perfect location and frequency information.

The more approximate a secure index’s representation of a document, the less information one can infer about the document from the secure index. Thus, to what extent the secure index can approximate a document while still achieving a reasonable MAP score is an important question.

MAP is calculated by taking the mean of the average precisions over a set of queries. The precision at $k$ is given by:

$$\text{precision}(q, k) = \frac{|\lbrace k \text{ most relevant docs to query } q\rbrace \cap \lbrace \text{top } k \text{ retrieved docs for query } q\rbrace|}{k}$$

In other words, precision at $k$ measures how many of the top $k$ retrieved documents are actually in the top $k$ most relevant documents.

The average precision for the top $n$ documents is given by:

$$\text{avgprecision}(q, n) = \frac{1}{n} \sum_{k=1}^{n} \text{precision}(q, k)$$

This averages the precision values across all cutoff points from 1 to $n$, giving higher weight to systems that return relevant documents earlier.

The MAP for the top $n$ documents over a query set $Q$ is given by:

$$\text{MAP}(Q, n) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{avgprecision}(q_i, n)$$

To estimate MAP scores, we need a set of documents $D$ and a query set $Q$. Then, we construct a set of secure indexes $SI$ for $D$, and rank-order both $SI$ and $D$ according to each $q \in Q$. Finally, we calculate the MAP over these rank-ordered outputs, using the rank-ordered outputs for $D$ as the true, canonical output.

Example

Consider the following. Suppose the rank-ordered list of relevant documents to a query is $(3, 0, 1, 2, 4)$, and the retrieved rank-ordered list (by a secure index) is $(2, 4, 3, 0, 1)$. The precision at $k=1$ is given by

$$\frac{|\lbrace 3 \rbrace \cap \lbrace 2 \rbrace|}{1}=\frac{|\emptyset|}{1}=0\,,$$

the precision at $k=2$ is given by

$$\frac{|\lbrace 3,0 \rbrace \cap \lbrace 2,4 \rbrace|}{2}=\frac{|\emptyset|}{2}=0\,,$$

the precision at $k=3$ is given by

$$\frac{|\lbrace 3,0,1 \rbrace \cap \lbrace 2,4,3 \rbrace|}{3}=\frac{|\lbrace 3 \rbrace|}{3}=\frac{1}{3}\,,$$

the precision at $k=4$ is given by

$$\frac{|\lbrace 3,0,1,2 \rbrace \cap \lbrace 2,4,3,0 \rbrace|}{4}=\frac{|\lbrace 3,0,2 \rbrace|}{4}=\frac{3}{4}\,,$$

and the precision at $k=5$ is given by

$$\frac{|\lbrace 3,0,1,2,4 \rbrace \cap \lbrace 2,4,3,0,1 \rbrace|}{5}=\frac{|\lbrace 3,0,1,2,4 \rbrace|}{5}=\frac{5}{5}=1\,.$$

Thus, the average precision is given by

$$\frac{0+0+\frac{1}{3}+\frac{3}{4}+1}{5}=\frac{5}{12}\,.$$

The MAP would simply be the mean of the average precisions for $M$ queries.
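
A small sketch of these definitions, using top-$k$ set overlap as the precision at $k$; it reproduces the $5/12$ of the worked example above.

```python
def precision_at_k(relevant_ranked, retrieved_ranked, k):
    """Overlap of the top-k relevant and top-k retrieved documents."""
    return len(set(relevant_ranked[:k]) & set(retrieved_ranked[:k])) / k

def average_precision(relevant_ranked, retrieved_ranked, n):
    return sum(precision_at_k(relevant_ranked, retrieved_ranked, k)
               for k in range(1, n + 1)) / n

def mean_average_precision(pairs, n):
    """pairs is a list of (relevant_ranked, retrieved_ranked) tuples,
    one per query."""
    return sum(average_precision(rel, ret, n) for rel, ret in pairs) / len(pairs)

# Reproduces the worked example: 5/12 ~= 0.4167
print(average_precision([3, 0, 1, 2, 4], [2, 4, 3, 0, 1], 5))
```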

Note that the precision at the last value of $k$ is necessarily $1$ if, by that iteration of $k$, the relevant set and the retrieved set contain the same elements. However, in general, this is not the case; for instance, if the relevant ranked list of documents to a query is $(A, B)$ and the retrieved ranked list is $(D, C, B, A)$, then, computing precision only at $k=1$ and $k=2$, the average precision is $0$. In our simulations, we do a variation of this.

Suppose the relevant ranked list of documents to a query is $(A=0.9, B=0.85, C=0, D=0)$, and the retrieved rank-ordered list is $(A=0.9, C=0.85, D=0.5, B=0)$. Then, we calculate the average precision for the top $k=3$ instead of the top $k=4$ or top $k=2$. In this example, document $B$ is not included in any of the precision at $k=1$ to $k=3$ calculations.

Finally, in one of the experiments, we conduct a “page one” MAP test, i.e., we find the mean average precision using only the top 10 results. The randomized algorithm does much more poorly in this instance, e.g., with over 85% probability, the mean average precision will be less than 0.05.

Term importance: BM25

BM25, a well-established tf-idf variant, is one of the relevancy measures chosen for our experiments. Mathematically:

$$\text{BM25}(d, t, D) = \text{idf}(t, D) \cdot \frac{\text{tf}(t, d) \cdot (k_1 + 1)}{\text{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where $\text{avgdl}$ is the average size (in words) of documents in $D$.

Parameters $b$ and $k_1$ are free parameters. In the experiments, they are set to typical values: $b$ is set to $0.75$ and $k_1$ is set to $1.2$. Ideally these parameters would be automatically tuned for each secure index. The function $\text{tf}$ stands for term frequency and simply returns the number of occurrences of term $t$ in document $d$. The function $\text{idf}$ stands for inverse document frequency:

$$\text{idf}(t, D) = \ln\left(\frac{|D| - \text{count}(t, D) + 0.5}{\text{count}(t, D) + 0.5} + 1\right)$$

where $|D|$ is the number of documents in $D$ and $\text{count}(t, D)$ is a function which returns the number of documents in $D$ that have one or more occurrences of term $t$.

When using BM25 to rank search results, each document $d$ in $D$ is ranked according to query $Q$ by the function $\text{BM25Score}(d, Q) = \sum_{t \in Q} \text{BM25}(d, t, D)$, where $t$ is a term in query $Q$. After giving every document a BM25 rank, the results are sorted in descending order of rank as the final output.
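
A compact sketch of BM25Score as defined above, with documents represented as lists of normalized terms; the parameter defaults follow the typical values mentioned earlier, and the toy corpus is purely illustrative.

```python
import math

def idf(term, docs):
    """Inverse document frequency as defined above."""
    n_t = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n_t + 0.5) / (n_t + 0.5) + 1)

def bm25(doc, term, docs, k1=1.2, b=0.75):
    """BM25 weight of a single term in a document."""
    tf = doc.count(term)                               # term frequency
    avgdl = sum(len(d) for d in docs) / len(docs)      # average document length
    return idf(term, docs) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

def bm25_score(doc, query, docs, k1=1.2, b=0.75):
    """Sum of per-term BM25 weights (the BM25Score used here)."""
    return sum(bm25(doc, t, docs, k1, b) for t in query)

# Documents as lists of (stemmed) terms.
docs = [["computer", "science", "theory"], ["computer", "games"], ["pure", "science"]]
print(bm25_score(docs[0], ["computer", "science"], docs))
```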

Note that BM25Score is not used in most real-world information retrieval systems; instead, they typically employ a vector space model, in which a vector $v(d)$ is constructed for each document $d$ in $D$, where each dimension represents a word $t$ with a weight equal to $\text{BM25}(d, t, D)$. Subsequently, a document can be ranked according to a query by taking the cosine similarity of their respective vectors. However, this representation is problematic in light of the limited information (for confidentiality) available in secure indexes.

Term proximity: MinDistX

MinDist, like BM25, ranks documents according to their proximity relevance to a given query. It is a less established ranking heuristic than BM25, but in experiments it has performed well compared to other proximity heuristics. In our experiments, we add additional tunable parameters to MinDist and call the result MinDistX.

MinDistX is a proximity heuristic in which the minimum distances between each pair of query terms present in the document are summed.

Example: Consider a document

$$\vec{d} = (A, B, D, D, A, D, C)$$

and a query

$$\vec{q} = \lbrace A, B, C \rbrace$$

where $A$, $B$, $C$, and $D$ are searchable terms. Then, the minimum pairwise distances are given by

$$d(A, B) = 1, \quad d(A, C) = 2, \quad d(B, C) = 5$$

and the summation of these distances is simply $s = 1 + 2 + 5 = 8$.
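
A small sketch that computes the summed minimum pairwise distance (and the number of matched query terms) from a plaintext term sequence; it reproduces $s = 8$ for the example above.

```python
from itertools import combinations

def min_pairwise_sum(doc_terms, query_terms):
    """Sum of minimum pairwise distances over query-term pairs that both
    occur in the document, plus the count of matched query terms."""
    positions = {}
    for i, t in enumerate(doc_terms):
        positions.setdefault(t, []).append(i)
    present = [t for t in query_terms if t in positions]
    s = 0
    for a, b in combinations(present, 2):
        s += min(abs(i - j) for i in positions[a] for j in positions[b])
    return s, len(present)

# d = (A, B, D, D, A, D, C), q = {A, B, C}  ->  (8, 3)
print(min_pairwise_sum(list("ABDDADC"), ["A", "B", "C"]))
```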

MinDistX is a scoring function dependent upon the summation of the minimum pairwise distances, as given by $s$ in the example above. Thus, MinDistX requires that a secure index represent distances between terms. The most obvious way to accomplish this is by storing approximate location information.

The more concentrated the query terms in a document are, the more relevant the document is to the query. However, this is true only up to a certain point.

Example: Consider a query

$$\vec{q} = \lbrace \text{computer}, \text{science} \rbrace$$

and two documents $A$ and $B$, where $A$ contains both computer and science on page 7 and $B$ contains computer on page 7 and science on page 20. It is obvious that $A$ should be considered more relevant to $\vec{q}$ since the two keywords of interest are much closer together.

However, consider a third document $C$ that contains computer on page 7 and science on page 100. This is not much worse than $B$; both documents are not that relevant, and $B$ is only marginally more relevant at best.

MinDistX has the following functional form.

Definition: Let $\mathbb{Q}$ be the set of query terms, $\mathbb{Q}^{\ast}$ be the subset of $\mathbb{Q}$ that exists in the given document $\vec{d}$, and $s$ be the sum of the minimum pairwise distances between terms in $\mathbb{Q}^{\ast}$.

$$\text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \alpha, \gamma, \beta, \theta) = \ln\left(\alpha + \gamma \cdot \exp\left\lbrace -\frac{\beta s}{|\mathbb{Q}^{\ast}|^{\theta}} \right\rbrace\right)$$

where $\alpha, \gamma, \beta, \theta, s > 0$.
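
A direct sketch of the scoring function; the parameter values in the usage line are illustrative only.

```python
import math

def mindistx(s, q_matched, alpha, gamma, beta, theta):
    """MinDistX score: s is the summed minimum pairwise distance and
    q_matched is |Q*|, the number of query terms present in the document."""
    p = beta * s / (q_matched ** theta)
    return math.log(alpha + gamma * math.exp(-p))

# Using the example above (s = 8, |Q*| = 3) with illustrative parameters.
print(mindistx(8, 3, alpha=1.0, gamma=1.0, beta=0.5, theta=1.0))
```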

To determine whether MinDistX matches the intuition (a strictly decreasing function that flattens out as $s$ increases), it may be instructive to consider the limits. Asymptotically, as $s \to 0$:

$$\text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \cdot) \to \ln(\alpha + \gamma)$$

and as $s \to \infty$:

$$\text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \cdot) \to \ln\alpha$$

To determine if these end points are the maximum and minimum values respectively, let us consider the partial derivative:

$$\frac{\partial}{\partial s} \text{MinDistX}(s, |\mathbb{Q}^{\ast}| \mid \alpha, \gamma, \beta, \theta) = -|\mathbb{Q}^{\ast}|^{-\theta} \frac{\beta \gamma \exp(-p)}{\alpha + \gamma \exp(-p)}$$

where

$$p = \frac{\beta s}{|\mathbb{Q}^{\ast}|^\theta}$$

This partial derivative is negative for all supported values. As $s \to 0$, it approaches

$$-\frac{1}{|\mathbb{Q}^{\ast}|^{\theta}} \left(\frac{\beta \gamma}{\alpha + \gamma}\right)$$

and as $s \to \infty$, the partial approaches $0$. This matches the desired intuition: for small $s$, a small increase in $s$ corresponds to a large decrease in MinDistX; and, for large $s$, a small increase in $s$ corresponds to a small decrease in MinDistX.

Note that for large $|\mathbb{Q}^{\ast}|$, the MinDistX function decreases less rapidly than for small $|\mathbb{Q}^{\ast}|$, which is the desired behavior. Recall that $|\mathbb{Q}^{\ast}|$ corresponds to the number of query terms matched in the document. We do not wish to penalize (at least not too harshly) a document that contains more of the query’s terms spread out over a larger region compared to a document that contains fewer of the query’s terms spread out less.

MinDistX may be used as a way to add proximity sensitivity to already established scoring methods, like BM25. For example, a linear combination of their scores can be used as the final output of a scoring function that is both sensitive to proximity and term frequencies:

$$\text{Score}(d, Q, D) = \lambda \cdot \text{MinDistX} + (1 - \lambda) \cdot \text{BM25Score}(d, Q, D)$$

Uncertainty vs mean average precision

In Encrypted Boolean Search (EBS), the only information disclosed to the untrusted system about a collection of confidential objects is which objects are relevant to obfuscated search keys (trapdoors of search keys).

To facilitate Encrypted Rank-Ordered Search (ERS), more information about the confidential object is required to compute a degree-of-relevance.

A key insight is that a significant amount of error or uncertainty is acceptable if the rank-ordered search operations remain sufficiently useful. The trade-off between information leakage and search utility is fundamental to the design of secure indexes.

False positives (from approximate sets and approximate maps) represent one kind of error. We can also deliberately inject other kinds of errors or approximations:

  • Quantization noise: Storing only approximate term frequencies or positions
  • Random perturbation: Adding noise to stored values
  • Truncation: Limiting the precision of stored information
  • Aggregation: Combining information in ways that obscure individual values

The key metric is how these approximations affect Mean Average Precision. Empirically, secure indexes can tolerate substantial approximation while maintaining acceptable MAP scores, especially when the goal is ranking rather than exact retrieval.
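
As hedged sketches of the quantization and truncation items above, positions can be coarsened to buckets and term frequencies clamped to a few representative levels (the bucket size and levels are illustrative).

```python
def quantize_positions(positions, bucket_size=16):
    """Store only coarse position buckets instead of exact locations,
    trading proximity precision for less information leakage."""
    return sorted({p // bucket_size for p in positions})

def quantize_frequency(freq, levels=(1, 2, 4, 8, 16)):
    """Clamp a term frequency to a small set of representative levels."""
    return max(l for l in levels if l <= freq) if freq >= levels[0] else 0
```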

Conclusion

Rank-ordered encrypted search extends Boolean encrypted search by enabling relevance scoring without revealing document contents. The core data structures—oblivious entropy maps and their specializations—provide flexible representations that trade off between storage efficiency, information leakage, and search quality.

Key takeaways:

  1. Entropy maps provide the theoretical foundation for space-optimal approximate function representation
  2. Oblivious variants apply trapdoor functions to prevent content disclosure
  3. BM25 and MinDistX can be computed from secure indexes with acceptable accuracy
  4. MAP scores provide a principled way to evaluate the quality-privacy trade-off

The central challenge remains balancing the amount of information stored in secure indexes against the privacy guarantees required. More information enables better ranking but increases potential for inference attacks. The frameworks presented here provide a foundation for reasoning about this trade-off quantitatively.
