Rank-ordered Encrypted Search
Consider a map from a set of keys to a set of values, where the keys are search keys (or relations between search keys) and the values are information about the search keys that may be used to compute a rank-ordering, such as frequency or proximity information.
Encrypted search enables untrusted systems to obliviously search documents on behalf of authorized users, where the search query, the confidential documents being searched over, and the search results are oblivious values.
Oblivious values are values that do not reveal any information about the underlying plaintext values, such as an oblivious Boolean value which may be true or false (or neither, in the case of noisy oblivious values). In practice, these requirements may not be fully met. For example, the search results may reveal the number of search keys in the query (although this too can be mitigated). However, the goal is to minimize the amount of information that is revealed to the untrusted system performing the search.
One of the earliest papers on encrypted search observed that individuals may wish to outsource storage logistics to cloud storage providers but do not trust the providers with their confidentiality requirements (Son00). To enable an untrusted system to search a confidential document on an authorized user's behalf, the untrusted system must have some sort of searchable representation of the document, which we call a secure index.
In rank-ordered Encrypted Search, a confidential document has a degree-of-relevance to a hidden query. The degree-of-relevance can be based on many criteria, such as the multiplicity or proximity of the search keys in the document. Thus, to facilitate rank-ordered search, we need a data structure that is able to store arbitrary information about the search keys in a document.
In what follows, we consider secure indexes represented by an oblivious entropy map, which permits nearly any kind of search, including Boolean search, rank-ordered search, set-theoretic search, proximity search, and so on. As a special case, we consider secure indexes represented by an oblivious set, which permits only set-theoretic Boolean search.
Oblivious entropy map: theory and implementation
The basic theory behind an entropy map is to map values in the domain to values in the codomain by hashing to a prefix-free code in the codomain. We do not store anything related to the domain values, since we simply hash them, and a prefix of that hash serves as a code for a value in the codomain.

We allow many different codes for each value in the codomain, so that, for instance, the codes for the value a may be 00, 01, 10, and 11; a hash can then be efficiently decoded as a by checking whether its two-bit prefix, read as an integer, is less than 4.
Suppose $f : X \to Y$, and let $p(y)$ denote the probability that a value drawn uniformly from $X$ maps to $y$. Then the optimally space-efficient code, assuming a uniform hash function $h$, is to assign prefix-free codes for each $y \in Y$ whose probability of being mapped to by $h$ sums to $p(y)$. In this case, the expected bit length is given by
\begin{equation*}
\sum_{y \in Y} p(y) \log_2 \frac{1}{p(y)},
\end{equation*}
which is an entropy: if we imagine sampling $x_1, x_2, \ldots$ from $X$, mapping each to $y_i = f(x_i)$, and observing the sequence of $y_i$'s, then the expected bit length is the entropy of the sequence of $y$'s. This is why we call it an entropy map.
If $X$ is finite, then we can just imagine implicitly encoding the domain and then, for each value in the domain, storing the prefix-free code that it maps to, which has an average bit length of $H(f(X))$ and a total bit length of $|X| \, H(f(X))$.
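As a quick numeric illustration of this expected bit length (a sketch; the distribution is made up for the example):

```python
# Expected code length for p = {a: 1/2, b: 1/4, c: 1/4}: the optimal
# prefix-free assignment gives a, b, c code regions of measure 1/2,
# 1/4, 1/4, so the expected bit length is the entropy H = 1.5 bits.
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}
H = sum(q * math.log2(1 / q) for q in p.values())
print(H)  # 1.5
```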
We can allow rate distortion, too, by failing to code for some of the elements properly. For instance, a popular choice is when one of the values, say $y^*$, is extremely common, e.g., $p(y^*) = 1 - \epsilon$ for small $\epsilon$: we can give $y^*$ a set of prefix-free codes whose probabilities sum to $1 - \epsilon$ and then not explicitly code for its preimages in the entropy map, in which case a randomly selected $x$ with $f(x) = y^*$ will be mapped to $y^*$ with probability $1 - \epsilon$ (which is sufficiently large, and can be made as close to $1$ as desired if we wish to trade space for accuracy). For the remaining values in the domain, we code for them correctly (or also allow errors on them, but only after trying to find correct codes for each of them).
For instance, suppose we have a set-indicator function $\mathbb{1}_S : X \to \lbrace 0, 1 \rbrace$, where $S \subseteq X$ and $X$ is a very large set. Then we may seek to give $0$ (non-membership) a set of prefix-free codes whose probabilities sum to $1 - \epsilon$, and then only find codes for the elements of $S$, each of which maps to $1$; the codes for $1$ have a probability of $\epsilon$. Then, for any $x \in X$, we can test if $x \in S$ by testing if $h(x)$ is a code for $0$ or $1$. If it is a code for $0$, then we know that $x$ is definitely not a member of $S$, but if it is a code for $1$, then $x$ is a member of $S$ with a false positive rate of $\epsilon$ and a true positive rate of $1$, since a randomly drawn element not in $S$ will map to $1$ with probability $\epsilon$ and any element in $S$ will map to $1$ with probability $1$ (since we explicitly code for it). In this formulation, we can think of the entropy map as a rate-distorted approximation of a set-indicator function, which we refer to as a Bernoulli set-indicator function (or, more generally, a Bernoulli map over $X \to \lbrace 0, 1 \rbrace$).
The Bloom filter is an entropy map in which, implicitly, the probability of a negative membership test for a non-member is $1 - \epsilon$. The Bloom filter is thus a special case of the entropy map over set-indicator functions.
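To make the Bernoulli set-indicator construction concrete, here is a minimal sketch in Python. It assumes a uniform hash; the function names and the brute-force seed search are illustrative only (the search is exponential in the set size, whereas practical constructions partition the set so each part needs only a short seed):

```python
# Bernoulli set-indicator sketch: the "true" codes occupy an eps
# fraction of the hash space, so a non-member tests positive with
# probability ~eps, while members always test positive because the
# seed is chosen so that they hash into the true region.
import hashlib

M = 2**32  # size of the hash space

def h(x: str, seed: int) -> int:
    """Uniform-ish hash of (x, seed) into [0, M)."""
    d = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:4], "big")

def build(members: set[str], eps: float) -> int:
    """Brute-force a seed under which every member hashes into the
    'true' region [0, eps*M); expected (1/eps)^|S| tries."""
    threshold = int(eps * M)
    seed = 0
    while not all(h(x, seed) < threshold for x in members):
        seed += 1
    return seed

def contains(seed: int, x: str, eps: float) -> bool:
    """Approximate membership test with false positive rate ~eps."""
    return h(x, seed) < int(eps * M)

seed = build({"alice", "bob"}, eps=0.25)
assert contains(seed, "alice", 0.25)    # members: always positive
print(contains(seed, "mallory", 0.25))  # non-member: positive w.p. ~0.25
```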
The oblivious entropy map is just an entropy map where the hash function is applied to trapdoors of the keys and the prefix-free codes for the values have no discernible pattern.
Oblivious set: Boolean search
In Boolean Encrypted Search, a confidential document is relevant to a hidden query if the secure index of the document contains every trapdoor search key in the hidden query. That is, the secure index may be modeled as an oblivious set.
Sets enable queries about the presence of a trapdoor. Let

contains : (secure_index, trapdoor) -> bool

be a set-membership test (set-indicator function), e.g., contains(S, x) tests if trapdoor x (which may be, say, a trapdoor of a unigram, a bigram, or something else, like a stemmed phrase) is a member of secure index S.
This can be used to implement Boolean search and set-theoretic queries, such as intersection, union, and difference. However, it is not sufficient for rank-ordered search, which requires more information about the relation between the query (of trapdoors) and the set of documents (secure indexes).
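As a sketch of how the membership test composes into conjunctive Boolean search (the HMAC-based trapdoor and the callable-based index are illustrative assumptions, not the scheme itself):

```python
# Conjunctive Boolean encrypted search: a document is relevant iff its
# secure index reports membership for every trapdoor in the query.
import hashlib
import hmac
from typing import Callable

def trapdoor(term: str, secret: bytes) -> str:
    """Illustrative trapdoor: HMAC of the term under the user's secret,
    so the server never sees the plaintext term."""
    return hmac.new(secret, term.encode(), hashlib.sha256).hexdigest()

def boolean_search(indexes: dict[str, Callable[[str], bool]],
                   query_trapdoors: list[str]) -> list[str]:
    """Return the ids of documents whose secure index contains every
    trapdoor in the query."""
    return [doc_id for doc_id, contains in indexes.items()
            if all(contains(t) for t in query_trapdoors)]
```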
Oblivious multi-sets
Multi-sets support queries about the multiplicity of a trapdoor, enabling rank-ordered searches that consider word frequency, such as BM25.
In this case, we need a data structure that is able to store the multiplicity of search keys in a document. This is enabled by an oblivious multi-set, where the keys are the search keys and the values are the multiplicities of the search keys. The expected bit length is given by
\begin{equation*}
\log_2 n
\end{equation*}
bits per trapdoor if we assume that the frequency of a word in a document is uniformly distributed between $1$ and $n$.
Let

contains : (secure_index, unigram) -> bool

be a set-membership test (set-indicator function), e.g., contains(S, x) tests if unigram x is a member of secure index S. This is the bare minimum and only supports Boolean search.
The interface additionally provides a multiplicity function, multiplicity(S, x), which counts the multiplicity of unigram x in secure index S.
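A minimal sketch of this interface (the class and method names are illustrative; in a real scheme the counts would themselves be coded approximately, per the entropy-map discussion above):

```python
# Oblivious multi-set sketch: keys are trapdoors (pseudorandom
# transforms of search keys), values are multiplicities.
class ObliviousMultiSet:
    def __init__(self) -> None:
        self._counts: dict[str, int] = {}  # trapdoor -> multiplicity

    def insert(self, trapdoor: str) -> None:
        self._counts[trapdoor] = self._counts.get(trapdoor, 0) + 1

    def contains(self, trapdoor: str) -> bool:
        """Bare-minimum membership test (Boolean search only)."""
        return trapdoor in self._counts

    def multiplicity(self, trapdoor: str) -> int:
        """Multiplicity of a trapdoor; if counts are uniform over 1..n,
        coding one takes about log2(n) bits per trapdoor."""
        return self._counts.get(trapdoor, 0)
```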
Secure postings list
A secure postings list uses a variable-length perfect map filter. Keys are unigrams and map to array elements. The first bits of an element encode the initial position, and the gaps between successive positions are encoded in the remaining bits. If this information cannot fit, then the remaining bits instead serve as a pointer into a larger bit string, where the first sequence (using a universal coder) indicates the starting position.
The interface elements for a secure postings list include:
- contains(s, x) -> bool: test unigram $x$ for membership in secure index $s$. This is the bare minimum and only supports Boolean search.
- multiplicity(s, x) -> int: count the multiplicity of unigram $x$ in secure index $s$.
- locations(s, x): retrieve the locations of unigram $x$ in secure index $s$.
- containsSequence(s, x_1 x_2 ... x_k): test the ordered $k$-gram $x_1 x_2 \cdots x_k$ for membership in secure index $s$. If a biword model is used, then contains may be used to implement containsSequence.
- locationsSequence(s, x_1 x_2 ... x_k): retrieve the locations of the $k$-gram $x_1 x_2 \cdots x_k$ in secure index $s$.
These interface elements may be used to implement other interfaces, like rank-ordered search (e.g., BM25 or MinDistX) or set-theoretic queries. Note that if the locations interface is supported, then minimum distance and multiplicity can be computed from the location data.
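A sketch of the gap-encoded positions payload (assuming Elias gamma as the universal coder, which the text leaves unspecified):

```python
# Gap encoding of a postings list: store the starting position, then
# the gaps between successive positions, each with a universal code.
def elias_gamma(n: int) -> str:
    """Elias gamma code for n >= 1: unary length prefix + binary value."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode_postings(positions: list[int]) -> str:
    """Encode sorted 0-based positions as start + gaps."""
    bits = elias_gamma(positions[0] + 1)  # shift by 1: gamma needs n >= 1
    for prev, cur in zip(positions, positions[1:]):
        bits += elias_gamma(cur - prev)
    return bits

print(encode_postings([3, 7, 8, 15]))  # start 3 (+1 -> 4), gaps 4, 1, 7
```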
Secure rank
The secure index does not disclose which plaintext words are contained in a document, nor any information attached to the plaintext word, until a hidden query is submitted against the document. However, at the point of submission, while the plaintext word is not necessarily disclosed, the information attached to the word is.
Such information may be sufficient to allow the adversary to estimate (with a probability of success greater than random chance) the plaintext word that corresponds to a trapdoor in a hidden query. Depending upon the nature of the information, the adversary may be able to estimate other characteristics about the confidential document. In the case of a secure postings list, assuming the postings have minimal error, the adversary may be able to estimate much of the plaintext of an entire confidential document.
One way to combat this technique is to precompute the results of queries (perhaps just reasonably plausible queries) with respect to a given confidential document. Then, the untrusted system does not need to know any information about particular atomic words, but rather, it needs to know information about how a confidential document is relevant to particular queries.
Suppose the query model is a bag-of-words and the maximum number of words in a query is $m$. As a bag-of-words, the order of the words in the query is irrelevant. Since a document may only contain some of the words in the query, we must code for all subsets of the query's words (excluding the empty set).
Given a bag-of-words query, we could canonically order the bag and map this to any desired ranking. That is, we precompute the ranking. This can have a high up-front cost and a high storage cost.
If the document consists of $n$ unique words, then there are
\begin{equation*}
\sum_{k=1}^{m} \binom{n}{k}
\end{equation*}
canonical keys. If $m$ is unbounded, then there are
\begin{equation*}
2^n - 1
\end{equation*}
canonical keys, which is impractical for most documents with even a relatively small number of unique words, e.g., if $n = 30$, then there are approximately a billion keys.
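A quick sanity check of these counts (the values of $n$ and $m$ are illustrative):

```python
# Number of canonical bag-of-words keys: sum_{k=1}^{m} C(n, k);
# with m unbounded this is 2^n - 1.
from math import comb

n = 30
print(sum(comb(n, k) for k in range(1, n + 1)))  # 2**30 - 1 = 1073741823
print(sum(comb(n, k) for k in range(1, 4)))      # m = 3: 4525
```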
In either case, the expected lower-bound on bit length for the oblivious map is given by
\begin{equation*}
\left(\sum_{k=1}^{m} \binom{n}{k}\right) b,
\end{equation*}
where $b$ is the average bit length of the ranking coder, which represents some rank-ordered search metric, like BM25 or MinDist, or the proximity of the words for Boolean proximity searches (or multiple such metrics may be coded).
Suppose the maximum query size is $m \ll n$; then the map has on the order of
\begin{equation*}
\binom{n}{m}
\end{equation*}
keys, and the oblivious map has a lower-bound on the order of
\begin{equation*}
\binom{n}{m}\, b
\end{equation*}
bits, or $O(n^m\, b)$.
If $n$ is in the hundreds of unique words, which may describe a reasonably large document (especially if common words are removed and stemming is used to normalize words), then this lower-bound is impractically large for all but very small $m$.
Rank-ordered search
Rank-ordered information retrieval provides a mapping between queries and degree-of-relevancy for each document in a collection.
In information retrieval (IR), finding effective ways to measure relevancy is of fundamental importance. To that end, many algorithms and heuristics have been devised to rank-order documents by their estimated relevancy to a given query. We explore term weighting and term proximity weighting heuristics in the context of secure indexes.
Precision and recall
Precision and recall are the relevant metrics for Boolean searches, which do not rank retrieved documents the way BM25 or MinDistX do; a document is either considered relevant (it contains all of the terms in a query) or non-relevant.
Precision measures the proportion of retrieved documents that are relevant to the query. It is defined as:
\begin{equation*}
\text{precision} = \frac{|R \cap V|}{|V|},
\end{equation*}
where $R$ is the set of relevant documents and $V$ is the set of retrieved documents.
Precision has a range of $[0, 1]$. Recall measures the proportion of relevant documents that were retrieved. It is defined as:
\begin{equation*}
\text{recall} = \frac{|R \cap V|}{|R|}.
\end{equation*}
Recall also has a range of $[0, 1]$. It is trivial to achieve a recall of $1$ (100%) by retrieving every document in the corpus. However, this comes at the cost of decreased precision. Thus, in general, there is a trade-off between precision and recall.
Mean average precision
Mean average precision (MAP) is a popular way to measure the performance of information retrieval systems with respect to degree of relevancy, however relevancy may be defined. In particular, we use MAP to measure how closely the rank-ordered output (of either BM25 or MinDistX) of the secure indexes matches the rank-ordered output of non-secure, canonical indexes which provide perfect location and frequency information.
The more approximately a secure index represents a document, the less information one can infer about the document from the secure index. Thus, to what extent the secure index can approximate a document while still achieving a reasonable MAP score is an important question.
MAP is calculated by taking the mean of the average precisions of queries. The precision at $k$ is given by:
\begin{equation*}
P@k = \frac{|R_k \cap V_k|}{k},
\end{equation*}
where $R_k$ is the set of the top $k$ most relevant documents and $V_k$ is the set of the top $k$ retrieved documents.
In other words, precision at $k$ measures how many of the top $k$ retrieved documents are actually in the top $k$ most relevant documents.
The average precision for the top $k$ documents is given by:
\begin{equation*}
\mathrm{AP}@k = \frac{1}{k} \sum_{i=1}^{k} P@i.
\end{equation*}
This averages the precision values across all cutoff points from $1$ to $k$, giving higher weight to systems that return relevant documents earlier.
The MAP for the top $k$ documents over a query set $\mathcal{Q}$ is given by:
\begin{equation*}
\mathrm{MAP}@k = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathrm{AP}@k(q).
\end{equation*}
To estimate how well a secure index scheme preserves rank-ordering, we need a set of documents $D$ and a query set $\mathcal{Q}$. Then, we construct a set of secure indexes $S$ for $D$, and rank-order both $D$ and $S$ according to each $q \in \mathcal{Q}$. Finally, we calculate the MAP over these rank-ordered outputs, using the rank-ordered outputs for $D$ as the true, canonical output.
Example
Consider the following. Suppose the rank-ordered list of relevant documents to a query is $(3, 2, 1, 5, 4)$, and the retrieved rank-ordered list (by a secure index) is $(2, 3, 5, 1, 4)$. The precision at $k = 1$ is given by
\begin{equation*}
\frac{|\lbrace 3 \rbrace \cap \lbrace 2 \rbrace|}{1} = \frac{|\emptyset|}{1} = 0,
\end{equation*}
the precision at $k = 2$ is given by
\begin{equation*}
\frac{|\lbrace 3, 2 \rbrace \cap \lbrace 2, 3 \rbrace|}{2} = 1,
\end{equation*}
the precision at $k = 3$ is given by
\begin{equation*}
\frac{|\lbrace 3, 2, 1 \rbrace \cap \lbrace 2, 3, 5 \rbrace|}{3} = \frac{2}{3},
\end{equation*}
the precision at $k = 4$ is given by
\begin{equation*}
\frac{|\lbrace 3, 2, 1, 5 \rbrace \cap \lbrace 2, 3, 5, 1 \rbrace|}{4} = 1,
\end{equation*}
and the precision at $k = 5$ is given by
\begin{equation*}
\frac{|\lbrace 3, 2, 1, 5, 4 \rbrace \cap \lbrace 2, 3, 5, 1, 4 \rbrace|}{5} = 1.
\end{equation*}
Thus, the average precision is given by
\begin{equation*}
\frac{1}{5}\left(0 + 1 + \frac{2}{3} + 1 + 1\right) = \frac{11}{15} \approx 0.73.
\end{equation*}
The MAP would simply be the mean of the average precisions over a set of queries.
Note that the precision at the last value of $k$ is necessarily $1$ if, by that iteration of $k$, the relevant set and the retrieved set contain the same elements (as in the example above). However, in general, this is not the case: the retrieved list may contain documents that never appear in the relevant list, in which case the precision at the last $k$ falls short of $1$. In our simulations, we do a variation of this.
When the relevant ranked list for a query is longer than the retrieved rank-ordered list, we calculate the average precision for the top $k$ given by the length of the shorter list, rather than the full length of the longer one. Documents ranked beyond that cutoff are not included in any of the precision-at-$i$ calculations.
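A short sketch of these calculations, using the top-$k$ definition of precision above; it reproduces the worked example:

```python
# MAP@k with precision at k = |top-k relevant & top-k retrieved| / k.
def precision_at_k(relevant: list[int], retrieved: list[int], k: int) -> float:
    return len(set(relevant[:k]) & set(retrieved[:k])) / k

def average_precision(relevant: list[int], retrieved: list[int], k: int) -> float:
    return sum(precision_at_k(relevant, retrieved, i)
               for i in range(1, k + 1)) / k

def mean_average_precision(pairs: list[tuple[list[int], list[int]]],
                           k: int) -> float:
    return sum(average_precision(r, v, k) for r, v in pairs) / len(pairs)

# Worked example above: AP = (0 + 1 + 2/3 + 1 + 1)/5 = 11/15 = 0.733...
print(average_precision([3, 2, 1, 5, 4], [2, 3, 5, 1, 4], 5))
```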
Finally, in one of the experiments, we conduct a “page one” MAP test, i.e., we find the mean average precision using only the top 10 results. The randomized algorithm does much more poorly in this instance, e.g., with over 85% probability, the mean average precision will be less than 0.05.
Term importance: BM25
BM25, a well-established tf-idf variant, is one of the relevancy measures chosen for our experiments. Mathematically:
\begin{equation*}
\mathrm{BM25}(d, t) = \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \frac{|d|}{\mathrm{avgdl}}\right)},
\end{equation*}
where $|d|$ is the size (in words) of document $d$ and $\mathrm{avgdl}$ is the average size (in words) of documents in $D$.
Parameters $k_1$ and $b$ are free parameters. In the experiments, they are set to typical values, commonly $k_1 = 1.2$ and $b = 0.75$. Ideally these parameters would be automatically tuned for each secure index. The function $\mathrm{tf}(t, d)$ stands for term frequency and simply returns the number of occurrences of term $t$ in document $d$. The function $\mathrm{idf}(t)$ stands for inverse document frequency:
\begin{equation*}
\mathrm{idf}(t) = \log \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5},
\end{equation*}
where $N$ is the number of documents in $D$ and $\mathrm{df}(t)$ is a function which returns the number of documents in $D$ which had one or more occurrences of term $t$.
When using BM25 to rank search results, each document $d$ in $D$ is ranked according to query $q$ by the function $\mathrm{BM25Score}(d, q) = \sum_{t \in q} \mathrm{BM25}(d, t)$, where $t$ ranges over the terms in query $q$. After giving every document a BM25 rank, the results are sorted in descending order of rank as the final output.
Note that BM25Score is not used directly in most real-world information retrieval systems; instead, they typically employ a vector space model, in which a vector for each document in $D$ is constructed, where each dimension represents a word with a weight equal to its tf-idf score. Subsequently, a document can be ranked according to a query by taking the cosine similarity of their respective vectors. However, this representation is problematic in light of the limited information (for confidentiality) available in secure indexes.
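A compact sketch of the BM25 scoring above (plain lists stand in for the tf/df statistics that a secure index would supply; this idf variant matches the formula given earlier and can go negative for very common terms, which some implementations clamp):

```python
import math

def bm25_score(query: list[str], doc: list[str], docs: list[list[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Sum of per-term BM25 contributions for one document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        df = sum(1 for d in docs if t in d)
        if tf == 0 or df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5))
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score
```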
Term proximity: MinDistX
MinDist, like BM25, ranks documents according to their relevance to a given query, in this case proximity-based relevance. It is a less established ranking heuristic than BM25, but in experiments it has performed well compared to other proximity heuristics. In our experiments, we add additional tunable parameters to MinDist and call it MinDistX.
MinDistX is a proximity heuristic in which the minimum distances between each pair of query terms present in the document are summed.
Example: Consider a document
\begin{equation*}
d = (t_1, t_2, x, t_3, x, t_1)
\end{equation*}
and a query
\begin{equation*}
q = \lbrace t_1, t_2, t_3, t_4 \rbrace,
\end{equation*}
where $t_1$, $t_2$, $t_3$, and $t_4$ are searchable terms ($x$ is a non-query term and $t_4$ does not occur in $d$). Then, the minimum pairwise distances are given by
\begin{equation*}
\mathrm{dist}(t_1, t_2) = 1, \quad \mathrm{dist}(t_1, t_3) = 2, \quad \mathrm{dist}(t_2, t_3) = 2,
\end{equation*}
and the summation of these distances is simply $\delta = 1 + 2 + 2 = 5$.
MinDistX is a scoring function dependent upon the summation of the minimum pairwise distances, as given by $\delta = 5$ in the example above. Thus, MinDistX requires that a secure index represent distances between terms. The most obvious way to accomplish this is by storing approximate location information.
The more concentrated the query terms in a document are, the more relevant the document is to the query. However, this is true only up to a certain point.
Example: Consider a query
\begin{equation*}
q = \lbrace \text{computer}, \text{science} \rbrace
\end{equation*}
and two documents $d_1$ and $d_2$, where $d_1$ contains both computer and science on page 7 and $d_2$ contains computer on page 7 and science on page 20. It is obvious that $d_1$ should be considered more relevant to $q$, since the two keywords of interest are much closer together.
However, consider a third document $d_3$ that contains computer on page 7 and science on page 100. This is not much worse than $d_2$; both documents are not that relevant, and $d_2$ is only marginally more relevant at best.
MinDistX has the following functional form.
Definition: Let $Q$ be the set of query terms, $Q'$ be the subset of $Q$ whose terms exist in the given document $d$, and $\delta$ be the sum of the minimum pairwise distances between terms in $Q'$. Then
\begin{equation*}
\mathrm{MinDistX}(d, q) = \log\left(\alpha + e^{-\delta / \eta}\right),
\end{equation*}
where $\alpha > 0$ is a tunable parameter and $\eta = \binom{|Q'|}{2}$ normalizes $\delta$ by the number of matched pairs.
To determine if MinDistX matches the intuition (a strictly decreasing function of $\delta$ that flattens out as $\delta$ increases), it may be instructive to consider the limits. Asymptotically, as $\delta \to 0$:
\begin{equation*}
\mathrm{MinDistX} \to \log(\alpha + 1),
\end{equation*}
and as $\delta \to \infty$:
\begin{equation*}
\mathrm{MinDistX} \to \log \alpha.
\end{equation*}
To determine if these end points are the maximum and minimum values, respectively, let us consider the partial derivative:
\begin{equation*}
\frac{\partial\, \mathrm{MinDistX}}{\partial \delta} = -\frac{e^{-\delta/\eta}}{\eta \left(\alpha + e^{-\delta/\eta}\right)}.
\end{equation*}
This partial, for all supported values, is negative. As $\delta \to 0$:
\begin{equation*}
\frac{\partial\, \mathrm{MinDistX}}{\partial \delta} \to -\frac{1}{\eta (\alpha + 1)},
\end{equation*}
and as $\delta \to \infty$ the partial approaches $0$. This matches the desired intuition: for small $\delta$, a small increase in $\delta$ corresponds to a large decrease in MinDistX; and, for large $\delta$, a small increase in $\delta$ corresponds to a small decrease in MinDistX.
Note that for large $\delta$, the MinDistX function decreases less rapidly than for small $\delta$, which is the desired behavior. Recall that $|Q'|$ corresponds to the number of query terms matched in the document. We do not wish to penalize (at least not too harshly) a document that contains more of the query's terms spread out over a larger region compared to a document that contains fewer of the query's terms spread out less.
MinDistX may be used as a way to add proximity sensitivity to already established scoring methods, like BM25. For example, a linear combination of their scores can be used as the final output of a scoring function that is sensitive both to proximity and to term frequencies:
\begin{equation*}
\mathrm{Score}(d, q) = \lambda\, \mathrm{BM25Score}(d, q) + (1 - \lambda)\, \mathrm{MinDistX}(d, q), \qquad 0 \le \lambda \le 1.
\end{equation*}
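Under the reconstruction above, a minimal sketch of MinDistX (the pair-count normalization and the default value of alpha are illustrative assumptions):

```python
import math
from itertools import combinations

def min_dist_x(doc: list[str], query: set[str], alpha: float = 0.3) -> float:
    """log(alpha + exp(-delta/eta)), where delta is the sum of minimum
    pairwise distances over matched terms and eta is the pair count."""
    positions: dict[str, list[int]] = {}
    for i, term in enumerate(doc):
        if term in query:
            positions.setdefault(term, []).append(i)
    pairs = list(combinations(positions, 2))
    if not pairs:
        return math.log(alpha)  # no pairwise evidence: floor score
    delta = sum(min(abs(i - j) for i in positions[a] for j in positions[b])
                for a, b in pairs)
    return math.log(alpha + math.exp(-delta / len(pairs)))

# The example document and query above: delta = 5 over 3 pairs.
print(min_dist_x(["t1", "t2", "x", "t3", "x", "t1"],
                 {"t1", "t2", "t3", "t4"}))
```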
Uncertainty vs mean average precision
In Encrypted Boolean Search (EBS), the only information disclosed to the untrusted system about a collection of confidential objects is which objects are relevant to obfuscated search keys (trapdoors of search keys).
To facilitate Encrypted Rank-Ordered Search (ERS), more information about the confidential object is required to compute a degree-of-relevance.
A key insight is that a significant amount of error or uncertainty is acceptable if the rank-ordered search operations remain sufficiently useful. The trade-off between information leakage and search utility is fundamental to the design of secure indexes.
False positives (from approximate sets and approximate maps) represent one kind of error. We can also deliberately inject other kinds of errors or approximations:
- Quantization noise: Storing only approximate term frequencies or positions
- Random perturbation: Adding noise to stored values
- Truncation: Limiting the precision of stored information
- Aggregation: Combining information in ways that obscure individual values
The key metric is how these approximations affect Mean Average Precision. Empirically, secure indexes can tolerate substantial approximation while maintaining acceptable MAP scores, especially when the goal is ranking rather than exact retrieval.
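As one concrete instance of the quantization idea listed above (a sketch; the function name and rounding scheme are illustrative):

```python
# Quantize term frequencies to powers of two before storing them, so
# the index leaks only the order of magnitude of each frequency.
def quantize_tf(tf: int) -> int:
    """Round a positive term frequency down to the nearest power of two."""
    return 1 << (tf.bit_length() - 1) if tf > 0 else 0

print([quantize_tf(t) for t in [1, 3, 7, 12, 100]])  # [1, 2, 4, 8, 64]
```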
Conclusion
Rank-ordered encrypted search extends Boolean encrypted search by enabling relevance scoring without revealing document contents. The core data structures—oblivious entropy maps and their specializations—provide flexible representations that trade off between storage efficiency, information leakage, and search quality.
Key takeaways:
- Entropy maps provide the theoretical foundation for space-optimal approximate function representation
- Oblivious variants apply trapdoor functions to prevent content disclosure
- BM25 and MinDistX can be computed from secure indexes with acceptable accuracy
- MAP scores provide a principled way to evaluate the quality-privacy trade-off
The central challenge remains balancing the amount of information stored in secure indexes against the privacy guarantees required. More information enables better ranking but increases potential for inference attacks. The frameworks presented here provide a foundation for reasoning about this trade-off quantitatively.