Boolean Encrypted Search with Approximate Sets:
Error Propagation, Query Obfuscation, and Threshold-Based Retrieval
Abstract
We present a comprehensive framework for Boolean encrypted search using approximate sets—space-efficient data structures supporting membership queries with tunable false positive rate $\varepsilon$ and false negative rate $\eta$. Each document is represented by a secure index encoding trapdoors derived from plaintext terms, enabling arbitrary Boolean queries (conjunction, disjunction, negation) to be evaluated obliviously on untrusted servers.
Our primary contributions are threefold. First, we establish a compositional algebra for error propagation in Boolean queries, proving that conjunctions multiply false positive rates ($\varepsilon^k$ for $k$ terms), disjunctions follow complement probability rules, and negation swaps error types—revealing that even when $\eta = 0$ for atomic tests, negated queries necessarily introduce false negatives. Second, we introduce threshold-based retrieval structures, a novel approximate-set construction using binned seed values that offers simplicity, cache efficiency, and tunable space-accuracy tradeoffs (e.g., roughly 40% of Bloom filter storage at comparable parameters). Third, we formalize $k$-graph indexes as a query obfuscation mechanism that enables randomized query padding, making single-term searches indistinguishable from multi-term queries through combinatorial expansion of term sets.
We derive closed-form expressions for precision and recall as functions of $\varepsilon$, $\eta$, collection size, and query structure, and present a recursive algorithm computing error rates for arbitrary Boolean expressions. Extensions include biword indexes for phrase search and higher-order $k$-graphs trading storage for enhanced privacy. Our analysis demonstrates that aggressive space compression via elevated false positive rates yields near-perfect precision for realistic conjunctive workloads, enabling practical encrypted search systems with 10–100× storage reduction while preserving query privacy through obfuscation.
1 Introduction
Search systems increasingly operate on infrastructure that cannot be fully trusted with the contents of the documents or the intent of end users. Encrypted search promises the ability to execute familiar retrieval workloads while the server only manipulates encrypted artefacts. Realising that promise for Boolean retrieval requires careful coordination between the query mechanism, the data structure that indexes each document, and the quantitative guarantees that bound the inevitable leakage.
This report focuses on secure indexes constructed from approximate sets. Each index supports membership tests for trapdoors derived from plaintext search keys and exhibits tunable false positive rate $\varepsilon$ and false negative rate $\eta$. We examine how these error rates propagate through Boolean query operators and influence classical information retrieval measures. A key insight is that query structure matters: even when atomic membership tests guarantee $\eta = 0$, compound queries involving negation introduce false negatives, fundamentally coupling query semantics to retrieval accuracy.
Our contributions are fourfold:

1. We formalise the Boolean retrieval workflow used throughout the paper and clarify the assumptions placed on search agents, queries, and result sets.
2. We describe how approximate-set secure indexes can be generated from plaintext documents, including algorithmic sketches for query translation and membership testing.
3. We establish a compositional algebra for error propagation in arbitrary Boolean queries, proving that conjunctions multiply false positive rates, disjunctions apply complement probability rules, and negation swaps error types. We present a recursive algorithm that computes false positive and false negative rates for any Boolean expression.
4. We outline practical extensions—such as biword phrase search and $k$-graph indexes—that broaden the expressiveness of hidden queries without sacrificing the security posture.
The remainder of the document proceeds as follows. Section 2 reviews the information retrieval model and introduces the encrypted search interface. Section 3 elaborates on the construction of secure indexes backed by approximate sets and derives expected precision for conjunctive queries. Section 4 generalises the error analysis to arbitrary Boolean expressions involving conjunction, disjunction, and negation, establishing compositional rules and presenting worked examples. Section 5 discusses the oblivious-set abstraction that underpins the data structure. Finally, Section 6 surveys extensions to the base model that enable richer query semantics including phrase search and $k$-graph indexes.
2 Information retrieval model
An information retrieval process begins when a search agent submits a query representing an information need. In response, the system returns a subset of confidential objects from the universe that satisfy that need.
An Encrypted Search system may support many retrieval paradigms. We focus on Boolean search, a well-understood model described below.
Definition 2.1.
In Boolean search, each object is either relevant or irrelevant to a query. The query is represented as a bag of search keys, and an object is relevant exactly when it contains every key from the query.
For a given query, let $R$ denote the corresponding result set.
Assumption 2.1.
The information retrieval model is Boolean search. (See Section 6 for simple extensions to the query interface.)
To support Boolean search operations, each confidential object only needs to answer membership tests for search keys. This leads naturally to a set-based representation of documents. We distinguish between true positives and false positives as follows.
Definition 2.2.
A true positive is any confidential object in $R$ in which every search key from the query is present. A false positive is any confidential object in $R$ that fails at least one of those membership tests.
Classical Boolean search returns all positives and no negatives, thereby ensuring that false positives and false negatives never occur. In the encrypted setting we relax that guarantee in a controlled manner.
2.1 Encrypted search interface
A query submitted to an Encrypted Search system should not be comprehensible to an adversary observing the server.
Definition 2.3.
A hidden query represents the confidential information need of an authorised search agent. It should be interpretable only by that agent and the document owner.
Hidden queries are realised via cryptographic trapdoors.
Definition 2.4 (Trapdoor).
Search agents map each plaintext search key to a cryptographic hash, called a trapdoor.
A trapdoor for a plaintext search key enables an untrusted system to look up the key inside a confidential data set without revealing the underlying word.
Definition 2.5.
A queryable representation of a confidential object is called a secure index.
A secure index must satisfy two broad requirements:

1. If the index has bit length $m$, its encoding should be close to maximally entropic so that it appears indistinguishable from noise.
2. Membership tests derived from authorised queries should succeed with high probability while limiting counterfeit matches.
These requirements motivate the use of approximate-set data structures for secure indexes.
Definition 2.6.
An approximate set is a set-like data structure in which false negatives never occur but false positives may appear with a controllable probability.
We denote by $S$ the exact set we wish to represent and by $S^{+}$ the corresponding approximate set, where $S \subseteq S^{+}$.
Assumption 2.2.
The secure index is an approximate set with false positive rate $\varepsilon$.
Under this assumption a hidden query may evaluate to an approximate result set $R^{+}$ that contains the exact result set $R$, i.e. $R \subseteq R^{+}$. Precision and recall therefore become random variables parameterised by $\varepsilon$. Other properties such as information disclosure are explored in Section 3.
Trapdoors are produced by a one-way transformation.
Definition 2.7 (Trapdoor).
Given a secret $s$ and a search key $x$, the trapdoor is defined as

$\tau(x) = h(s \,\|\, x)$,  (2.1)

where $h$ is a cryptographic hash function.
By one-way we mean that, given a hash value $y$, finding a preimage $x$ such that

$h(s \,\|\, x) = y$  (2.2)

is computationally infeasible. A random oracle models this behaviour by uniformly distributing inputs over the output space.
Assumption 2.3.
The trapdoor function approximates a random oracle.
If a collision between plaintext words $x$ and $x'$ were to occur, the terms would be aliased—indistinguishable—within the Encrypted Search system. We simplify the analysis with an additional assumption.
Assumption 2.4.
The codomain of $h$ is sufficiently large that collisions between search keys are negligible and may be ignored.
Plaintext queries are transformed into hidden queries by the following procedure.
Definition 2.8.
Given a secret $s$ and a plaintext query $q = (x_1, \dots, x_k)$, the function that transforms plaintext queries into hidden queries is denoted by

$\mathrm{hide}(s, q) = (\tau(x_1), \dots, \tau(x_k))$,  (2.3)

which applies the trapdoor function from Definition 2.7 to each search key.
The function $\mathrm{hide}$ may be a simple substitution cipher as shown in Algorithm 1. More sophisticated techniques that improve confidentiality are possible; see Section 6 for options.
Inputs to Algorithm 1:
• A plaintext query.
• The secret key.
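Since Algorithm 1 reduces to a per-key substitution, the following minimal Python sketch illustrates it. HMAC-SHA256 stands in for the keyed one-way function $h(s \,\|\, x)$; the paper only requires random-oracle-like behaviour, so this particular primitive is an implementation assumption.

```python
import hashlib
import hmac

def trapdoor(secret: bytes, key: str) -> bytes:
    """Definition 2.7: map a plaintext search key to a trapdoor.

    HMAC-SHA256 is an assumed instantiation of h(s || x)."""
    return hmac.new(secret, key.encode("utf-8"), hashlib.sha256).digest()

def hide(secret: bytes, query: list[str]) -> list[bytes]:
    """Algorithm 1 / Definition 2.8: substitute a trapdoor for each key."""
    return [trapdoor(secret, k) for k in query]

hidden = hide(b"shared-secret", ["alpha", "bravo"])
```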
The implementation of the membership test in the bag-of-words model is given in Algorithm 2. Inputs:
• The secure index approximating a bag-of-words for a confidential object.
• The set of trapdoors to test for membership.

Algorithm 3, which constructs the index itself, takes as inputs:
• The bag-of-words representation of a document.
• The secret key.
• The target false positive rate $\varepsilon$.
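A combined sketch of Algorithms 2 and 3 in Python, reusing the trapdoor helper above. The Bloom-filter backing, sized by the standard formula $m \approx 1.44\,n\log_2(1/\varepsilon)$ (restated as Equation 3.6 in Section 3.2), is one concrete instantiation of the approximate set; the class and function names are ours.

```python
import hashlib
import math

class BloomSecureIndex:
    """Secure index backed by a Bloom filter: stores only trapdoors."""

    def __init__(self, m_bits: int, n_hashes: int):
        self.m, self.h = m_bits, n_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, td: bytes):
        # Derive h bit positions by hashing the trapdoor with a counter.
        for i in range(self.h):
            d = hashlib.sha256(i.to_bytes(4, "big") + td).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, td: bytes) -> None:
        for p in self._positions(td):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, td: bytes) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(td))

def build_secure_index(bag_of_words, secret: bytes, eps: float) -> BloomSecureIndex:
    """Algorithm 3 sketch: size the filter for eps, then insert trapdoors."""
    n = max(1, len(set(bag_of_words)))
    m = math.ceil(1.44 * n * math.log2(1 / eps))   # Bloom bit budget
    h = max(1, round(math.log2(1 / eps)))          # optimal hash count
    index = BloomSecureIndex(m, h)
    for word in set(bag_of_words):
        index.add(trapdoor(secret, word))
    return index

def matches(index: BloomSecureIndex, hidden_query) -> bool:
    """Algorithm 2 sketch: conjunctive membership test over trapdoors."""
    return all(index.contains(td) for td in hidden_query)
```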
3 Secure index model
Each confidential document is transformed into a secure index via Algorithm 3. The index stores only trapdoors and is materialised as an approximate set. Because false negatives are impossible, any authorised query that matched the original document continues to match the secure index. False positives, however, occur with probability controlled by $\varepsilon$.
3.1 Approximate result sets
Let $D$ be the collection of $N$ confidential objects, and let $R$ be the exact result set for a plaintext query $q$, whose cardinality we denote by $|R|$. Suppose the corresponding hidden query contains $k$ trapdoors. Evaluating the hidden query against the collection yields an approximate result set $R^{+}$ consisting of every secure index whose membership tests succeed for all $k$ trapdoors.
We assume the following independence property, which is satisfied by common approximate-set instantiations (e.g. Bloom filters with optimised hash functions) and captures the behaviour measured in prior empirical work.
Assumption 3.1.
For any secure index that does not contain the trapdoors in the hidden query, the membership tests of distinct trapdoors are independent events. Each individual test has false-positive probability $\varepsilon$.
Under Assumption 3.1, the probability that a non-relevant document is spuriously admitted to $R^{+}$ equals $\varepsilon^{k}$: all $k$ trapdoors must simultaneously yield false positives. Let $\mathrm{FP}$ denote the number of false positives. Because there are $N - |R|$ non-relevant documents, we obtain the binomial model

$\mathrm{FP} \sim \operatorname{Binomial}(N - |R|,\ \varepsilon^{k})$.  (3.1)

Consequently,

$\mathbb{E}[\mathrm{FP}] = (N - |R|)\,\varepsilon^{k}$.  (3.2)

The approximate result set satisfies $R \subseteq R^{+}$ almost surely, so the expected cardinality of $R^{+}$ is

$\mathbb{E}[|R^{+}|] = |R| + (N - |R|)\,\varepsilon^{k}$.  (3.3)
Precision and recall.
Classical precision is defined as $|R \cap R^{+}| / |R^{+}|$. Because hidden queries do not incur false negatives, the numerator equals $|R|$. Taking expectations gives the approximation

$\mathbb{E}[\text{precision}] \approx \dfrac{|R|}{|R| + (N - |R|)\,\varepsilon^{k}}$.  (3.4)

Recall remains identically $1$; the encrypted setting preserves completeness while trading a tunable amount of precision for storage and compute efficiency. The expected F1 score simplifies accordingly:

$\mathbb{E}[F_1] \approx \dfrac{2\,\mathbb{E}[\text{precision}]}{1 + \mathbb{E}[\text{precision}]}$.  (3.5)
Worked example.
Suppose the collection contains $N$ documents of which $|R|$ are relevant. With query length $k$ and per-trapdoor false-positive probability $\varepsilon$, the compound probability of a spurious match is $\varepsilon^{k}$, so the expected number of false positives is $(N - |R|)\,\varepsilon^{k}$. Because $\varepsilon^{k}$ decays exponentially in $k$, precision remains effectively $1$ for realistic conjunctive queries even when $\varepsilon$ is relaxed substantially, demonstrating how aggressively the system can tune $\varepsilon$ before precision begins to degrade.
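A short script evaluating Equations 3.2 and 3.4 for an assumed workload. The figures in the original example were lost in extraction, so the parameters below (a million documents, 100 relevant, a 3-term query) are purely illustrative:

```python
N, R, k = 1_000_000, 100, 3            # assumed workload, not from the paper

for eps in (0.01, 0.1):
    expected_fp = (N - R) * eps ** k    # Eq. (3.2)
    precision = R / (R + expected_fp)   # Eq. (3.4)
    print(f"eps={eps}: E[FP]={expected_fp:.2f}, E[precision]={precision:.3f}")

# eps=0.01 -> E[FP]=1.00,   precision~0.990: essentially exact retrieval.
# eps=0.1  -> E[FP]=999.90, precision~0.091: compression pushed too far
#             for a 3-term query over a million documents.
```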
These expressions make explicit how design choices for the secure index (hash family, filter width, load factor) influence observable retrieval quality. In Section 3.2 we connect $\varepsilon$ to concrete data-structure parameters, enabling end-to-end sizing of encrypted search deployments.
3.2 Storage cost and parameter selection
Let $n$ denote the number of trapdoors recorded in a secure index. For memory-efficient approximate sets such as Bloom filters, the relationship between the false-positive rate $\varepsilon$ and the bit budget $m$ is well known:

$m = \dfrac{n \ln(1/\varepsilon)}{(\ln 2)^2} \approx 1.44\, n \log_2(1/\varepsilon)$.  (3.6)

The optimal number of hash functions is $h = (m/n)\ln 2 = \log_2(1/\varepsilon)$. Together, Equation 3.6 and this choice of $h$ provide closed-form guidance for configuring secure indexes: choosing $m$ fixes the achievable $\varepsilon$, and conversely, targeting a specific $\varepsilon$ determines the necessary bit budget.
Perfect-hash filters and quotient filters obey similar asymptotic behaviour. They typically achieve a leading term of $\log_2(1/\varepsilon)$ bits per element, plus lower-order overheads that capture metadata and cache alignment. Throughout this work we adopt the simplifying approximation

$m \approx n \log_2(1/\varepsilon)$,  (3.7)

which aligns with the dominant term of both Bloom filters and perfect-hash filters. Substituting Equation 3.7 into the combinatorial constructions in Section 6 recovers the storage estimates cited for biwords and $k$-graphs.
Given a fixed storage budget of $m$ bits per document, the allowable false-positive rate follows directly from Equation 3.6:

$\varepsilon = 2^{-m/(1.44\,n)}$.  (3.8)

Coupling this expression with Equation 3.4 yields a simple sizing workflow: pick $m$ to meet latency or cost constraints, compute the implied $\varepsilon$, and verify that the resulting precision satisfies the retrieval objectives for representative workloads.
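A minimal sketch of that sizing workflow; the budget and workload figures in the usage lines are hypothetical:

```python
import math

def eps_from_budget(m_bits: float, n: int) -> float:
    """Eq. (3.8): allowable false positive rate for an m-bit index on n keys."""
    return 2 ** (-m_bits / (1.44 * n))

def expected_precision(N: int, R: int, k: int, eps: float) -> float:
    """Eq. (3.4): expected precision for a k-term conjunctive query."""
    return R / (R + (N - R) * eps ** k)

# Hypothetical deployment: 1 KiB per document, 500 distinct trapdoors,
# a million-document collection with 100 relevant hits per query.
eps = eps_from_budget(8 * 1024, 500)
print(f"eps = {eps:.2e}, precision = {expected_precision(10**6, 100, 2, eps):.4f}")
```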
3.3 Threshold-based retrieval structures
While Bloom filters are the canonical approximate membership structure, we present an alternative construction that is particularly well-suited to secure indexes: threshold-based retrieval structures. This data structure achieves comparable false positive rates to Bloom filters while offering simpler construction and potentially superior space efficiency for certain parameter regimes.
3.3.1 Construction algorithm
Let $T$ be the set of trapdoors to store. The construction partitions $T$ into $B$ bins by a public hash of each trapdoor and then, for each bin, searches seed values $\sigma = 1, 2, 3, \dots$ until it finds one under which every trapdoor $t$ in the bin satisfies $h_\sigma(t) < \varepsilon H$; that seed is recorded for the bin (a code sketch follows the membership-test description below). Inputs:
• $T$: set of trapdoors.
• $\varepsilon$: target false positive rate.
• $B$: number of bins.
• $H$: maximum hash value (typically $2^{32}$ or $2^{64}$).
The membership test is straightforward: locate the query trapdoor's bin and report membership exactly when $h_\sigma(t) < \varepsilon H$ under that bin's recorded seed $\sigma$. Inputs:
• Array of per-bin seeds.
• Query trapdoor.
• False positive rate $\varepsilon$.
• Maximum hash value $H$.
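A self-contained Python sketch of both procedures. It assumes, consistently with the analysis that follows, that the acceptance threshold is $\varepsilon H$ and that seed 0 is reserved for assigning trapdoors to bins; construction cost grows as $\varepsilon^{-b}$ in the bin size $b$, so keep bins small (or $\varepsilon$ large) when experimenting.

```python
import hashlib

def seeded_hash(seed: int, td: bytes, H: int = 2**64) -> int:
    """Hash a trapdoor into [0, H) under a given seed."""
    d = hashlib.sha256(seed.to_bytes(8, "big") + td).digest()
    return int.from_bytes(d[:8], "big") % H

class ThresholdIndex:
    """Threshold-based retrieval structure: one seed per bin, no bit array."""

    def __init__(self, trapdoors, eps: float, num_bins: int, H: int = 2**64):
        self.B, self.H = num_bins, H
        self.threshold = int(eps * H)
        bins = [[] for _ in range(num_bins)]
        for td in trapdoors:                       # seed 0: binning hash
            bins[seeded_hash(0, td, H) % num_bins].append(td)
        self.seeds = []
        for items in bins:                         # geometric seed search
            seed = 1
            while not all(seeded_hash(seed, td, H) < self.threshold
                          for td in items):
                seed += 1
            self.seeds.append(seed)

    def contains(self, td: bytes) -> bool:
        b = seeded_hash(0, td, self.H) % self.B
        return seeded_hash(self.seeds[b], td, self.H) < self.threshold
```

Stored trapdoors can never be missed, since each bin's seed was chosen to pass all of them; an unstored trapdoor passes with probability roughly $\varepsilon$.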
3.3.2 Complexity analysis
Time complexity.
For a bin with $b$ items, each trapdoor independently hashes below the threshold with probability $\varepsilon$. The probability that all $b$ trapdoors simultaneously succeed is $\varepsilon^{b}$. The number of seed trials follows a geometric distribution with success probability $\varepsilon^{b}$, yielding an expected number of trials:

$\mathbb{E}[\text{trials}] = \varepsilon^{-b}$.  (3.9)

If we partition $n$ trapdoors into $B$ bins, each bin contains approximately $b = n/B$ items. The expected construction time is

$O\!\left(B \cdot \varepsilon^{-n/B}\right)$.  (3.10)

Query time is $O(1)$ hash operations (one for the relevant bin).
Space complexity.
The structure stores one seed per bin. If seeds are drawn from $\{0, 1, \dots, 2^{w} - 1\}$, each requires $w$ bits, yielding total storage:

$B \cdot w$ bits.  (3.11)

For $w = 64$, this is $64B$ bits. Compare to a Bloom filter storing $n$ items with false positive rate $\varepsilon$, which requires $\approx 1.44\, n \log_2(1/\varepsilon)$ bits. When $B \cdot w < 1.44\, n \log_2(1/\varepsilon)$ (that is, when bins are large enough), the threshold structure is more space-efficient.
False negative guarantees.
The construction succeeds in finding a valid seed for a bin if at least one of the $2^{w}$ candidate seeds satisfies the threshold condition for all items in that bin. The probability of failure after trying all seeds is

$(1 - \varepsilon^{b})^{2^{w}}$,  (3.12)

where $b$ is the bin size. For a small per-bin failure probability $\delta$, we require

$2^{w} \ge \varepsilon^{-b} \ln(1/\delta)$,  (3.13)

where the approximation $(1 - \varepsilon^{b})^{2^{w}} \approx e^{-2^{w}\varepsilon^{b}}$ holds when $\varepsilon^{b} \ll 1$.
Example 1 [Parameter selection] Consider a set of $n$ trapdoors with target false positive rate $\varepsilon$ and $w$-bit seeds. To ensure false negative probability at most $\delta$ per bin, we need a bin size $b$ such that

$\varepsilon^{-b} \ln(1/\delta) \le 2^{w}$.  (3.14)

Solving for $b$ under the stated parameters yields $b \le 7$: bins should contain at most 7 items. With $B = \lceil n/7 \rceil$ bins, the structure requires $wB$ bits.

A Bloom filter for the same parameters requires $\approx 1.44\, n \log_2(1/\varepsilon)$ bits, making the threshold structure more space-efficient in this regime.
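A small solver for the bound in Equation 3.13. With the hypothetical parameters below ($\varepsilon = 0.01$, 64-bit seeds, $\delta = 10^{-9}$) it returns a maximum bin size of 8, in the same regime as the example's figure of 7 (whose exact parameters were lost in extraction):

```python
import math

def max_bin_size(eps: float, seed_bits: int, delta: float) -> int:
    """Largest b with eps**(-b) * ln(1/delta) <= 2**seed_bits (Eq. 3.13)."""
    b = 1
    while eps ** -(b + 1) * math.log(1 / delta) <= 2 ** seed_bits:
        b += 1
    return b

print(max_bin_size(eps=0.01, seed_bits=64, delta=1e-9))  # -> 8
```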
3.3.3 Comparison to Bloom filters
Threshold-based retrieval structures offer several advantages for secure indexes:

• Simplicity: No bit array, no multiple hash functions—just seeds and threshold tests.
• Space efficiency: When document sizes are moderate and bins are small, storing seeds can be more compact than Bloom filter bit arrays.
• Tunable construction cost: Smaller bins reduce construction time (lower $\varepsilon^{-b}$) at the cost of more seeds.
• Cache efficiency: Sequential seed testing can be more cache-friendly than scattered bit-array accesses.

The primary trade-off is construction time, which grows exponentially in bin size. Careful selection of $B$ based on $n$, $\varepsilon$, and $w$ ensures practical construction while maintaining space efficiency.
3.4 Adversarial considerations
The random-oracle assumption from Assumption 2.3 means that trapdoors reveal no information about the underlying plaintext keys beyond equality under the shared secret. Moreover, the approximate-set representation reduces the leakage footprint to access-pattern observations and the occasional false positive.

The primary information leakage in this model comes from two sources. First, access patterns: which indexes are tested for which trapdoors. An adversary observing the server learns query cardinality, term co-occurrence across documents, and potentially query frequency distributions that enable statistical attacks. Techniques described in Section 6—particularly $k$-graph indexes (Section 6.2)—provide a principled mechanism for obfuscating these patterns by collapsing multi-term queries into single trapdoor lookups.

Second, Boolean result leakage: each document returns a plaintext Boolean value (true/false) indicating whether it satisfies the query. This reveals which documents are relevant, exposing the structure of the result set. The same trapdoor machinery used for search terms can be applied to Boolean values themselves: instead of returning plaintext true/false, the system can return Boolean trapdoors encoding the result under operations (AND, OR, NOT). By treating every value type—including Booleans—as obfuscated, the system approaches a more complete oblivious computation model. When extended recursively to all intermediate values and operations, this framework becomes Turing-complete, enabling arbitrary computations over encrypted data without revealing intermediate or final results to the server. We do not explore this extension in this paper, but note that it represents a natural and effective path toward fuller oblivious computation systems.
The system’s tolerance for elevated false positive rates (demonstrated in Section 3.1) makes the substantial storage overhead of obfuscation techniques practical, enabling meaningful query privacy enhancements in adversarial settings.
4 Boolean query algebra with error propagation
The preceding analysis focused on conjunctive queries, where every search key in the query must be present for a document to match. Real-world search systems, however, support richer Boolean expressions combining conjunction, disjunction, and negation operators. This section generalises the error analysis to arbitrary Boolean queries and establishes compositional rules for computing false positive and false negative rates.
A crucial observation motivates this generalisation: even when the underlying approximate set guarantees zero false negatives on atomic membership tests ($\eta = 0$), compound queries involving negation can exhibit non-zero false negative rates. For instance, the query $\lnot(x \land y)$ applied to a document containing both $x$ and $y$ should evaluate to false. If both trapdoors test positive (as expected with $\eta = 0$), the negated conjunction correctly returns false. However, if the approximate set could have false negatives, at least one trapdoor might test negative, causing $\lnot(x \land y)$ to incorrectly return true—a false positive for the negated query; equivalently, the system fails to identify a true negative, which manifests as a kind of induced error.
We therefore adopt a general framework where atomic membership tests exhibit:

• False positive rate $\varepsilon$ (item not in set, tests positive)
• False negative rate $\eta$ (item in set, tests negative)

Throughout this section we derive expressions parameterised by both $\varepsilon$ and $\eta$. The special case $\eta = 0$ recovers the original model from Section 3, while non-zero $\eta$ accommodates approximate sets with weaker guarantees or systems that deliberately introduce false negatives for adversarial obfuscation.
4.1 Error rates for atomic and compound queries
We begin by formalising error rates for Boolean queries. Let $Q$ denote a Boolean query expression built from atomic search keys and the operators $\land$ (AND), $\lor$ (OR), and $\lnot$ (NOT).

Definition 4.1 (False Positive Rate).
The false positive rate $\mathrm{FPR}(Q)$ is the probability that query $Q$ evaluates to true when the ground truth is false.

Definition 4.2 (False Negative Rate).
The false negative rate $\mathrm{FNR}(Q)$ is the probability that query $Q$ evaluates to false when the ground truth is true.

For an atomic query $x$ corresponding to a single trapdoor, the base error rates are:

$\mathrm{FPR}(x) = \varepsilon, \qquad \mathrm{FNR}(x) = \eta$.  (4.1)
We now state the fundamental theorems governing error propagation through Boolean operators.
Theorem 4.1 (Conjunction error rates).
Let $Q = x_1 \land x_2 \land \dots \land x_k$ be a conjunctive query over $k$ independent atomic queries. Then

$\mathrm{FPR}(Q) = \varepsilon^{k}$  (4.2)

and

$\mathrm{FNR}(Q) = 1 - (1 - \eta)^{k}$.  (4.3)
Proof.
A false positive occurs when none of the search keys appear in the document (ground truth is false), yet all membership tests return positive. Under independence, this probability is $\varepsilon^{k}$.

A false negative occurs when all search keys appear (ground truth is true), but at least one membership test fails. The probability that all tests succeed is $(1 - \eta)^{k}$, so the complement gives $1 - (1 - \eta)^{k}$. ∎
Theorem 4.2 (Disjunction error rates).
Let $Q = x_1 \lor x_2 \lor \dots \lor x_k$ be a disjunctive query over $k$ independent atomic queries. Then

$\mathrm{FPR}(Q) = 1 - (1 - \varepsilon)^{k}$  (4.4)

and

$\mathrm{FNR}(Q) = \eta^{k}$.  (4.5)
Proof.
A false positive occurs when none of the search keys appear (ground truth is false), yet at least one membership test returns positive. The probability that all tests correctly return negative is $(1 - \varepsilon)^{k}$, so the complement gives $1 - (1 - \varepsilon)^{k}$.

A false negative occurs when the search keys appear (ground truth is true), but all membership tests fail; when all $k$ keys are present, this probability is $\eta^{k}$. ∎
Theorem 4.3 (Negation error rates).
Let $Q = \lnot x$ be a negated atomic query. Then

$\mathrm{FPR}(Q) = \eta$  (4.6)

and

$\mathrm{FNR}(Q) = \varepsilon$.  (4.7)
Proof.
For $Q = \lnot x$, the ground truth is false when $x$ is present. A false positive occurs when the membership test for $x$ incorrectly returns negative (with probability $\eta$), causing $Q$ to evaluate to true.

Conversely, the ground truth is true when $x$ is absent. A false negative occurs when the membership test incorrectly returns positive (with probability $\varepsilon$), causing $Q$ to evaluate to false.

Thus negation swaps error types: $\mathrm{FPR}(\lnot x) = \eta$ and $\mathrm{FNR}(\lnot x) = \varepsilon$. ∎
Corollary 4.3.1 (Negated conjunction).
For $Q = \lnot(x_1 \land \dots \land x_k)$,

$\mathrm{FPR}(Q) = 1 - (1 - \eta)^{k}, \qquad \mathrm{FNR}(Q) = \varepsilon^{k}$.  (4.8)
Proof.
Apply Theorem 4.3 to the conjunction from Theorem 4.1. ∎
Corollary 4.3.2 (Negated disjunction).
For $Q = \lnot(x_1 \lor \dots \lor x_k)$,

$\mathrm{FPR}(Q) = \eta^{k}, \qquad \mathrm{FNR}(Q) = 1 - (1 - \varepsilon)^{k}$.  (4.9)
Proof.
Apply Theorem 4.3 to the disjunction from Theorem 4.2. ∎
Summary of error rates.
Table 1 summarises the false positive and false negative rates for the five basic Boolean query patterns. When $\eta = 0$, the FNR column simplifies: conjunctions and disjunctions exhibit zero false negatives, while negations swap all error probability to the FNR term.

| Query type | FPR | FNR |
|---|---|---|
| $x$ (atomic) | $\varepsilon$ | $\eta$ |
| $x_1 \land \dots \land x_k$ | $\varepsilon^{k}$ | $1 - (1-\eta)^{k}$ |
| $x_1 \lor \dots \lor x_k$ | $1 - (1-\varepsilon)^{k}$ | $\eta^{k}$ |
| $\lnot(x_1 \land \dots \land x_k)$ | $1 - (1-\eta)^{k}$ | $\varepsilon^{k}$ |
| $\lnot(x_1 \lor \dots \lor x_k)$ | $\eta^{k}$ | $1 - (1-\varepsilon)^{k}$ |
4.2 Compositional rules for arbitrary Boolean expressions
The theorems in Section 4.1 handle single operators applied to atomic queries or uniform combinations of atoms. Real queries, however, nest operators in arbitrary ways: $(x_1 \lor x_2) \land \lnot x_3$, for instance. We now present compositional rules that compute error rates recursively for any Boolean expression.

Let $Q_1$ and $Q_2$ be arbitrary subqueries with known error rates $(\mathrm{FPR}(Q_1), \mathrm{FNR}(Q_1))$ and $(\mathrm{FPR}(Q_2), \mathrm{FNR}(Q_2))$. Define compound queries:

$Q_{\land} = Q_1 \land Q_2$,  (4.10)
$Q_{\lor} = Q_1 \lor Q_2$,  (4.11)
$Q_{\lnot} = \lnot Q_1$.  (4.12)
Theorem 4.4 (Compositional conjunction).

$\mathrm{FPR}(Q_{\land}) = \mathrm{FPR}(Q_1) \cdot \mathrm{FPR}(Q_2)$  (4.13)

$\mathrm{FNR}(Q_{\land}) = 1 - (1 - \mathrm{FNR}(Q_1))(1 - \mathrm{FNR}(Q_2))$  (4.14)
Proof.
The conjunction is false (ground truth) when at least one subquery is false. A false positive requires both subqueries to incorrectly return true, yielding the product $\mathrm{FPR}(Q_1) \cdot \mathrm{FPR}(Q_2)$.

The conjunction is true when both subqueries are true. It correctly evaluates to true when both succeed, with probability $(1 - \mathrm{FNR}(Q_1))(1 - \mathrm{FNR}(Q_2))$. The false negative probability is the complement. ∎
Theorem 4.5 (Compositional disjunction).

$\mathrm{FPR}(Q_{\lor}) = 1 - (1 - \mathrm{FPR}(Q_1))(1 - \mathrm{FPR}(Q_2))$  (4.15)

$\mathrm{FNR}(Q_{\lor}) = \mathrm{FNR}(Q_1) \cdot \mathrm{FNR}(Q_2)$  (4.16)
Proof.
The disjunction is false when both subqueries are false. A false positive occurs when at least one subquery incorrectly returns true. The probability that both correctly return false is $(1 - \mathrm{FPR}(Q_1))(1 - \mathrm{FPR}(Q_2))$; the complement is the false positive rate.

The disjunction is true when at least one subquery is true. A false negative requires both to incorrectly return false, yielding $\mathrm{FNR}(Q_1) \cdot \mathrm{FNR}(Q_2)$. ∎
Theorem 4.6 (Compositional negation).

$\mathrm{FPR}(Q_{\lnot}) = \mathrm{FNR}(Q_1)$  (4.17)

$\mathrm{FNR}(Q_{\lnot}) = \mathrm{FPR}(Q_1)$  (4.18)
Proof.
Negation swaps truth values. When $Q_1$ is true (so $\lnot Q_1$ should be false), $\lnot Q_1$ incorrectly returns true exactly when $Q_1$ incorrectly returns false—i.e. with probability $\mathrm{FNR}(Q_1)$. Similarly, when $Q_1$ is false, $\lnot Q_1$ incorrectly returns false when $Q_1$ incorrectly returns true, with probability $\mathrm{FPR}(Q_1)$. ∎
These three theorems enable recursive computation of error rates for any Boolean expression. We formalise the procedure in Algorithm 6.
Inputs to Algorithm 6:
• A Boolean query expression $Q$ (abstract syntax tree).
• Base false positive rate $\varepsilon$ for atomic membership tests.
• Base false negative rate $\eta$ for atomic membership tests.
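A direct Python transcription of the recursion (a sketch; the AST encoding is our own choice):

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    op: str                 # "atom", "and", "or", or "not"
    children: tuple = ()

def error_rates(q: Node, eps: float, eta: float) -> tuple[float, float]:
    """Algorithm 6 sketch: return (FPR, FNR) for a Boolean query AST."""
    if q.op == "atom":
        return eps, eta                                   # Eq. (4.1)
    if q.op == "not":
        fpr, fnr = error_rates(q.children[0], eps, eta)
        return fnr, fpr                                   # Theorem 4.6
    rates = [error_rates(c, eps, eta) for c in q.children]
    if q.op == "and":                                     # Theorem 4.4
        return (math.prod(f for f, _ in rates),
                1 - math.prod(1 - n for _, n in rates))
    if q.op == "or":                                      # Theorem 4.5
        return (1 - math.prod(1 - f for f, _ in rates),
                math.prod(n for _, n in rates))
    raise ValueError(f"unknown operator: {q.op}")

# Example 4 below: not(x1 or x2) with eps = 0.01, eta = 0.
q = Node("not", (Node("or", (Node("atom"), Node("atom"))),))
print(error_rates(q, 0.01, 0.0))    # (0.0, 0.0199): FNR ~ 2*eps
```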
4.3 Worked examples
We illustrate Algorithm 6 with concrete Boolean queries, assuming base rates $\varepsilon > 0$ and $\eta = 0$ for atomic tests.

Example 2 [Disjunctive query] Consider $Q = x_1 \lor x_2$. Each atomic query has $\mathrm{FPR} = \varepsilon$. Applying Theorem 4.2 with $k = 2$:

$\mathrm{FPR}(Q) = 1 - (1 - \varepsilon)^{2} \approx 2\varepsilon$  (4.19)
$\mathrm{FNR}(Q) = \eta^{2} = 0$  (4.20)

The disjunction roughly doubles the false positive rate while maintaining perfect recall.

Example 3 [Mixed conjunction and disjunction] Consider $Q = (x_1 \land x_2) \lor x_3$. First compute the conjunction:

$\mathrm{FPR}(x_1 \land x_2) = \varepsilon^{2}$  (4.21)
$\mathrm{FNR}(x_1 \land x_2) = 0$  (4.22)

Now apply compositional disjunction with the atomic term $x_3$:

$\mathrm{FPR}(Q) = 1 - (1 - \varepsilon^{2})(1 - \varepsilon) \approx \varepsilon + \varepsilon^{2}$  (4.23)
$\mathrm{FNR}(Q) = 0$  (4.24)

The compound query exhibits a false positive rate slightly above $\varepsilon$, reflecting the dominant contribution from the atomic term $x_3$.

Example 4 [Negation introducing false negatives] Consider $Q = \lnot(x_1 \lor x_2)$ with $\varepsilon > 0$ and $\eta = 0$. By Corollary 4.3.2:

$\mathrm{FPR}(Q) = \eta^{2} = 0$  (4.25)
$\mathrm{FNR}(Q) = 1 - (1 - \varepsilon)^{2} \approx 2\varepsilon$  (4.26)

Despite $\eta = 0$ for atomic tests, the negated disjunction exhibits a non-zero false negative rate. This occurs because when neither $x_1$ nor $x_2$ appears (so $Q$ should be true), a false positive on either $x_1$ or $x_2$ causes the disjunction to incorrectly return true, making the negation return false.

Example 5 [De Morgan equivalence] De Morgan’s law states $\lnot(x_1 \lor x_2) = \lnot x_1 \land \lnot x_2$. We verify that both formulations yield identical error rates. For $\lnot x_1$ we have $\mathrm{FPR} = \eta$ and $\mathrm{FNR} = \varepsilon$, and similarly for $\lnot x_2$. Applying compositional conjunction:

$\mathrm{FPR}(\lnot x_1 \land \lnot x_2) = \eta^{2}$  (4.27)
$\mathrm{FNR}(\lnot x_1 \land \lnot x_2) = 1 - (1 - \varepsilon)^{2}$  (4.28)

matching Equations 4.25 and 4.26. This confirms that logically equivalent queries have identical error characteristics under our compositional semantics.
4.4 Implications for query optimisation and system design
The compositional error framework reveals several actionable insights:
Query rewriting for error minimisation.
Logically equivalent Boolean expressions can exhibit dramatically different error rates. For instance, a conjunctive query has $\mathrm{FPR} = \varepsilon^{k}$, which decreases exponentially with $k$. Disjunctive queries, conversely, have $\mathrm{FPR} \approx k\varepsilon$ for small $\varepsilon$, growing linearly in $k$. Users or query optimisers can exploit this asymmetry: rewriting disjunctions as unions of conjunctive subqueries may reduce aggregate false positive rates when combined with client-side filtering.
Negation as an error-type swap.
Theorems 4.3 and 4.6 show that negation exchanges FPR and FNR. Systems built on approximate sets with $\varepsilon > 0$ sacrifice perfect recall when queries involve negation. If maintaining high recall is critical, queries should avoid or minimise negation depth. Alternatively, systems could implement exact membership structures for a subset of high-value terms, enabling precise negation without FNR degradation.
Predictable precision and recall.
Given $\varepsilon$, $\eta$, and a query $Q$, Algorithm 6 computes $\mathrm{FPR}(Q)$ and $\mathrm{FNR}(Q)$ before execution. These rates directly determine expected precision and recall. Extending the analysis from Section 3.1, suppose the exact result set has cardinality $|R|$ and the collection contains $N$ documents. The expected number of false positives is

$\mathbb{E}[\mathrm{FP}] = (N - |R|)\,\mathrm{FPR}(Q)$  (4.29)

and the expected number of false negatives is

$\mathbb{E}[\mathrm{FN}] = |R|\,\mathrm{FNR}(Q)$.  (4.30)

Expected precision and recall become

$\mathbb{E}[\text{precision}] \approx \dfrac{|R|\,(1 - \mathrm{FNR}(Q))}{|R|\,(1 - \mathrm{FNR}(Q)) + (N - |R|)\,\mathrm{FPR}(Q)}$  (4.31)

$\mathbb{E}[\text{recall}] = 1 - \mathrm{FNR}(Q)$.  (4.32)

When $\eta = 0$ and $Q$ is a pure conjunction or disjunction (no negation), Equation 4.32 yields $\mathbb{E}[\text{recall}] = 1$, recovering the perfect-recall guarantee from Section 3.
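These expressions translate into a small pre-execution planning helper (a sketch; the workload figures in the usage lines are hypothetical):

```python
def expected_metrics(N: int, R: int, fpr: float, fnr: float):
    """Eqs. (4.29)-(4.32): expected counts and quality for a query Q."""
    e_fp = (N - R) * fpr                 # Eq. (4.29)
    e_fn = R * fnr                       # Eq. (4.30)
    tp = R * (1 - fnr)
    precision = tp / (tp + e_fp) if (tp + e_fp) > 0 else 0.0
    recall = 1 - fnr                     # Eq. (4.32)
    return e_fp, e_fn, precision, recall

# Negated disjunction from Example 4 (FPR = 0, FNR ~ 0.0199) on a
# hypothetical million-document collection with 100 relevant hits.
print(expected_metrics(10**6, 100, fpr=0.0, fnr=0.0199))
```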
Space–accuracy tradeoffs for compound queries.
The storage cost of an approximate set is proportional to $\log_2(1/\varepsilon)$ (see Section 3.2). Designers can select $\varepsilon$ to balance space and error rates. Table 1 shows that conjunctions tolerate aggressive compression (small bit budgets, hence elevated $\varepsilon$), since their FPR scales as $\varepsilon^{k}$. Disjunctions and negations, however, are more sensitive to elevated $\varepsilon$, suggesting that mixed-mode indexes—high precision for conjunctive subqueries, moderate precision for disjunctive expansions—may be cost-effective.
5 The oblivious set abstract data type
Boolean search treats each document as a bag-of-words set. Replacing the exact set with an oblivious set retains the ability to answer membership queries while hiding the underlying vocabulary. An oblivious set stores only trapdoors and therefore exposes neither the plaintext terms nor their multiplicities.
Instead of storing just the search keys that appear in the document, we can precompute a richer set of candidate trapdoors that are expected to be relevant. This design turns the secure index into a catalogue of potential matches rather than a verbatim mirror of the document. The abstraction is particularly useful when combined with approximate-set implementations such as Bloom filters or perfect hash filters: the index inherits the compactness guarantees of the data structure while maintaining the same trapdoor interface.
6 Extensions to the query model
While the core model focuses on bag-of-words Boolean queries, the same machinery supports richer semantics with modest adjustments.
6.1 Biword model
The function $\mathrm{hide}$ admits phrase search by lifting the secure index to operate on adjacent bigrams. Suppose the search keys in a document are ordered as $x_1, x_2, \dots, x_n$. We add each unigram to the oblivious set as before and, additionally, insert every unique bigram $(x_i, x_{i+1})$ for $i = 1, \dots, n-1$.
This strategy does not require the search provider to learn that a phrase query is being issued, although the correlation among bigrams can reveal structural hints. Implementing the secure index as an oblivious set, consider a document with $n$ search keys. In the worst case where no key repeats, the index contains $n$ unigrams and $n - 1$ bigrams, for a total of $2n - 1$ trapdoors. Applying Equation 3.7, the storage requirement becomes

$(2n - 1)\log_2(1/\varepsilon)$ bits.  (6.1)

If repetitions reduce the number of unique bigrams to $n' < n - 1$, the expression simply substitutes $n + n'$ for $2n - 1$. The approximation therefore scales linearly with the vocabulary richness captured by the biword index.
The implementation of the phrase membership test in the biword model is given in Algorithm 7. Inputs:
• A secure index storing unigram and bigram trapdoors.
• A sequence of trapdoors representing the phrase to test.
• The secret key.
• The confidential document.
• The false positive rate $\varepsilon$.
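A sketch of the biword lift, reusing trapdoor(), build_secure_index(), and matches() from the earlier sketches. Serialising a bigram as the two space-separated keys is our representation choice, not something the paper mandates:

```python
def build_biword_index(secret: bytes, keys: list[str], eps: float):
    """Insert unigram and adjacent-bigram trapdoors (Section 6.1)."""
    terms = set(keys)
    terms |= {f"{keys[i]} {keys[i + 1]}" for i in range(len(keys) - 1)}
    return build_secure_index(terms, secret, eps)

def hidden_phrase_query(secret: bytes, phrase: list[str]) -> list[bytes]:
    """Algorithm 7 sketch: one trapdoor per adjacent bigram of the phrase."""
    return [trapdoor(secret, f"{phrase[i]} {phrase[i + 1]}")
            for i in range(len(phrase) - 1)]

index = build_biword_index(b"secret", ["attack", "at", "dawn"], eps=0.01)
assert matches(index, hidden_phrase_query(b"secret", ["attack", "at", "dawn"]))
```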
6.2 Digraphs, trigraphs, and $k$-graphs for query obfuscation
The primary goal of $k$-graphs is query obfuscation: entangling multiple search terms together to obscure term correlations and distort the observable query distribution away from the plaintext distribution. A $k$-graph secure index stores trapdoors for all combinations of up to $k$ search keys, tagged with their intended Boolean operation (AND or OR). This enables multi-term queries to be collapsed into single trapdoor lookups, concealing query structure from the server.
Motivation: obfuscating term correlations and query distributions.
In the baseline unigram model, every query consists of individual term lookups: a three-term query such as $x_1 \land x_2 \land x_3$ issues the trapdoors $\tau(x_1)$, $\tau(x_2)$, $\tau(x_3)$ separately. An adversary observing the server learns:

1. The exact query cardinality (three terms)
2. Co-occurrence patterns of these three terms across the document collection
3. The distribution of queries, which mirrors the plaintext query distribution (just with substituted trapdoors)
4. Statistical correlations enabling frequency analysis or query reconstruction attacks
A $k$-graph with $k \ge 3$ replaces this with a single compound trapdoor: $\tau(x_1 \land x_2 \land x_3)$ for conjunction or $\tau(x_1 \lor x_2 \lor x_3)$ for disjunction. The adversary observes only that some compound trapdoor was queried, without learning:

• How many distinct terms are encoded (could be 1 to $k$)
• Which specific terms are involved
• Whether this represents a genuine $k$-term query or a padded lower-cardinality query
• The true distribution of plaintext queries

This fundamentally breaks the isomorphism between encrypted and plaintext query distributions. Where the baseline model applies a simple substitution cipher to queries (preserving all distributional properties), $k$-graphs introduce entanglement: terms are cryptographically combined in ways that obscure their individual identities and relationships.
Storage cost and the role of elevated false positive rates.
Suppose a document contains $n$ unique search keys. A $k$-graph stores trapdoors for all combinations of up to $k$ search keys, tagged with Boolean operations: for each combination we store both the AND-tagged and OR-tagged trapdoor (and potentially NOT-tagged, though space constraints often limit this). Counting combinations with repetition—repetition is what enables the padding tricks described below—the number of trapdoors is

$T(n, k) = 2\dbinom{n + k - 1}{k}$.  (6.2)

Using Equation 3.7, the storage requirement is

$T(n, k)\,\log_2(1/\varepsilon)$ bits.  (6.3)
For specific values of $k$ (with both AND and OR tags):

• $k = 1$ (baseline): $2n$ trapdoors, $2n \log_2(1/\varepsilon)$ bits
• $k = 2$ (digraphs): $2\binom{n+1}{2} = n(n+1)$ trapdoors, $n(n+1)\log_2(1/\varepsilon)$ bits
• $k = 3$ (trigraphs): $2\binom{n+2}{3}$ trapdoors, $2\binom{n+2}{3}\log_2(1/\varepsilon)$ bits
• $k = n$ (power set): exponentially many trapdoors (every possible combination of the document's keys) and a correspondingly exponential bit budget
Example 6 [Concrete storage costs] Consider a typical document with $n = 100$ unique search keys (e.g., a 500-word document with Zipf-distributed vocabulary). Storage requirements for various configurations (including both AND and OR tags), at a conservative and a relaxed false positive rate:

| $k$ | Trapdoors | Storage (smaller $\varepsilon$) | Storage (larger $\varepsilon$) |
|---|---|---|---|
| 1 | 200 | 2.0 KB | 0.8 KB |
| 2 | 10,100 | 101 KB | 40.4 KB |
| 3 | 343,400 | 3.4 MB | 1.37 MB |

The digraph case increases storage by 50× over baseline, while the trigraph case inflates it by 1700×. This exponential growth makes elevated $\varepsilon$ essential: relaxing $\varepsilon$ reduces storage by 2.5×, making the trigraph index feasible at 1.37 MB per document instead of 3.4 MB.

For a smaller document with $n = 20$ unique keys (e.g., an email or short article), storage becomes more manageable at the relaxed rate:

| $k$ | Trapdoors | Storage |
|---|---|---|
| 2 | 420 | 1.68 KB |
| 3 | 3,080 | 12.3 KB |
| 4 | 17,710 | 70.8 KB |

Here even $k = 4$ remains practical, enabling strong obfuscation for up to 4-term queries with modest overhead.
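The trapdoor counts above follow from Equation 6.2; a quick check with a sketch calculator (math.comb requires Python 3.8+):

```python
import math

def kgraph_trapdoors(n: int, k: int) -> int:
    """Eq. (6.2): AND- and OR-tagged size-k combinations with repetition."""
    return 2 * math.comb(n + k - 1, k)

def kgraph_storage_bits(n: int, k: int, eps: float) -> float:
    """Eq. (6.3): trapdoor count times log2(1/eps) bits per trapdoor."""
    return kgraph_trapdoors(n, k) * math.log2(1 / eps)

for k in (1, 2, 3):
    print(k, kgraph_trapdoors(100, k))   # 200, 10100, 343400 -- Example 6
```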
This substantial storage overhead is precisely why the system’s tolerance for elevated false positive rates becomes advantageous. Recall from Section 4 that conjunctive queries with $k$ terms achieve $\mathrm{FPR} = \varepsilon^{k}$, which decreases exponentially in $k$. If atomic trapdoors use a relaxed $\varepsilon$, a three-term query still has effective FPR $\varepsilon^{3}$, yielding near-perfect precision even with aggressive compression. The analysis in Section 3.1 demonstrated that precision remains acceptable at substantially elevated $\varepsilon$ for realistic workloads. By accepting modestly elevated $\varepsilon$, we obtain order-of-magnitude storage reductions that make $k$-graph indexes practical:

$\dfrac{\text{storage at } \varepsilon_1}{\text{storage at } \varepsilon_2} = \dfrac{\log_2(1/\varepsilon_1)}{\log_2(1/\varepsilon_2)}$.  (6.4)
Inherent cardinality obfuscation.
A fundamental property of $k$-graphs is that they provide inherent cardinality obfuscation: any query with at most $k$ distinct terms collapses to a single trapdoor lookup, making it impossible for the server to determine the true number of terms. A 1-term query and a 3-term query both appear as single compound trapdoor lookups to the server. The adversary observes only that some compound trapdoor was queried, and learns neither how many distinct terms it encodes nor which specific terms they are.
This stands in stark contrast to the baseline unigram model, where query cardinality is immediately observable: a 3-term query requires three separate trapdoor lookups, directly revealing the query size.
Distribution obfuscation via term repetition.
While $k$-graphs already mask cardinality, the client can further obscure the query distribution by exploiting multiple encodings of the same logical query. For a single search term $x$, the following trapdoors are all semantically equivalent (each matches documents containing $x$):

• $\tau(x)$ (unigram encoding)
• $\tau(x \land x)$ (repeated bigram encoding)
• $\tau(x \land x \land x)$ (repeated trigram encoding)
• etc., up to $\tau(x \land \dots \land x)$ with $k$ repetitions

Because conjunction is idempotent ($x \land x = x$), all encodings yield identical results. However, each produces a cryptographically distinct trapdoor due to the hash function. By randomly selecting among these encodings each time term $x$ is queried, the client distorts the observable query distribution:
• The baseline model preserves the plaintext query distribution (just with substituted trapdoors)
• With term repetition, the same logical query appears as different trapdoors over time
• The server observes a flattened, artificially uniform distribution that bears no statistical relationship to the underlying plaintext query patterns

This breaks frequency analysis attacks: even if term $x$ dominates the plaintext query distribution, its encoded representations appear with roughly $1/k$ of its frequency each, obscuring which trapdoors correspond to popular versus rare queries.
The error rates remain unchanged: each encoding is a single membership test with false positive rate $\varepsilon$ (Section 4), and by idempotence every encoding answers the same logical query. Distribution obfuscation via term repetition therefore adds no accuracy degradation.
Injecting fake trapdoors for additional noise.
For disjunctive queries, the client can inject fake trapdoors—randomly sampled hash values that correspond to no actual search terms—to further obfuscate query structure. Consider a 2-term disjunctive query $x_1 \lor x_2$ submitted to a $k \ge 4$ index. The client can augment it with noise:

$\tau(x_1 \lor x_2 \lor f_1 \lor f_2)$,  (6.5)

where $f_1$ and $f_2$ are random hash values. Because disjunction semantics mean “match if any term is present,” the fake trapdoors are unlikely to affect query results: they test positive only with probability $\varepsilon$ (the base false positive rate). For typical $\varepsilon$, the probability that a fake trapdoor causes a spurious match is negligible.

This technique provides controlled noise injection: the server observes a 4-term OR query but cannot determine which terms are genuine versus fake. Unlike conjunctive queries (where every term must match, so fake terms would break the query), disjunctions tolerate noise gracefully. The expected number of false positives introduced by each fake trapdoor is approximately $N\varepsilon$ additional spurious documents, which remains small for reasonable choices of $N$ and $\varepsilon$.

Fake trapdoor injection is particularly effective when combined with term repetition: the client can pad a disjunctive query with both repeated genuine terms (for distribution obfuscation) and random fake terms (for cardinality inflation beyond genuine terms), creating maximum uncertainty about query structure.
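A client-side sketch of both obfuscation tricks. The serialisations (joining repeated terms with the operator name) are assumptions: whatever canonical encoding the index builder uses for compound trapdoors, the query side must mirror it, and server-side evaluation of OR queries is out of scope here.

```python
import os
import random

def encode_repeated(secret: bytes, term: str, k: int) -> bytes:
    """Pick one of the k idempotent AND-encodings of a single term:
    tau(x), tau(x AND x), ..., tau(x AND ... AND x)."""
    reps = random.randint(1, k)
    return trapdoor(secret, " AND ".join([term] * reps))

def pad_or_query(secret: bytes, terms: list[str], total: int) -> list[bytes]:
    """Pad a disjunctive query with fake trapdoors (random values matching
    no real term), then shuffle so genuine and fake entries look alike."""
    lookups = [trapdoor(secret, t) for t in terms]
    lookups += [os.urandom(32) for _ in range(total - len(terms))]
    random.shuffle(lookups)
    return lookups

# A 2-term OR query padded to appear as a 4-term query, in the spirit of Eq. (6.5).
padded = pad_or_query(b"secret", ["alpha", "bravo"], total=4)
```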
Security versus performance tradeoff.
While $k$-graphs do reduce query round-trips (a query with $j \le k$ terms requires one trapdoor lookup instead of $j$), this performance benefit is secondary. The dominant cost in encrypted search is typically index transfer and client-side decryption, not the number of membership tests. The primary motivation is security:
• Inherent cardinality concealment: Any query with at most $k$ distinct terms becomes a single trapdoor lookup, making it impossible to distinguish 1-term from $k$-term queries.
• Term correlation hiding: Co-occurrence patterns of search keys become opaque to the server, as multi-term queries are cryptographically entangled.
• Distribution obfuscation: Term repetition provides $k$ distinct encodings per term, flattening the observable distribution and breaking the isomorphism with plaintext query patterns.
• Frequency analysis resistance: Popular queries appear as multiple distinct trapdoors, each carrying only a fraction of the original frequency, obscuring which trapdoors correspond to common versus rare queries.
• Temporal unlinkability: The same logical query appears as different trapdoors across time when using randomized encodings, preventing the server from linking repeated queries.
• Noise injection compatibility: OR queries can incorporate fake trapdoors with minimal accuracy impact, further distorting observable query patterns.

Together, these mechanisms trade storage complexity (exponential in $k$) and time complexity (construction cost grows with the number of stored trapdoors) for substantial reductions in information leakage. The system’s tolerance for elevated $\varepsilon$ makes this tradeoff practical.
Practical deployment considerations.
System designers should select $k$ based on adversarial threat models, query patterns, and storage budgets:

• $k = 1$ (baseline): Minimal storage, maximum leakage. Query distribution mirrors plaintext distribution (modulo substitution). Suitable when the server is semi-trusted or queries are low-sensitivity.
• $k = 2$ (digraphs): Moderate storage (roughly 50× overhead in the example above), conceals pairwise term correlations. Enables 2-term queries as single trapdoors and provides 2 distinct encodings per term for distribution flattening.
• $k = 3$ (trigraphs): Substantial storage (roughly 1700× overhead), strong obfuscation. Supports up to 3-term queries as single lookups, provides 3 encodings per term, and enables meaningful fake trapdoor injection for OR queries. Practical when elevated $\varepsilon$ is acceptable.
• $3 < k < n$: Exponential storage growth, near-maximal obfuscation. Justifiable for high-value, high-sensitivity workloads with small document vocabularies (e.g., on the order of 20 unique terms).
• $k = n$ (full power set): Complete obfuscation—every possible query becomes a single trapdoor lookup, and every term has $n$ distinct encodings. Practical only for very small vocabularies or documents.

The key insight is that $k$-graphs and elevated $\varepsilon$ work synergistically: the system’s ability to maintain acceptable precision at elevated $\varepsilon$ (due to exponential FPR decay in conjunctive queries) makes the substantial storage overhead tolerable. Combined with term repetition and fake trapdoor injection, $k$-graphs enable meaningful query privacy enhancements by breaking the distributional link between encrypted and plaintext query patterns.