Boolean Encrypted Search using Approximate Sets

Overview

Academic paper presenting a Boolean encrypted search system using approximate sets for secure indexes. Each index supports membership tests for trapdoors derived from plaintext search keys, with explicit information retrieval performance measures based on false positive rates.

Cryptographic Model

Trapdoor Function

Converts plaintext search keys to encrypted form:

y ← h(x ++ k)

where:

h is a cryptographic hash function (approximates random oracle)
x is the plaintext search key
k is the secret key
++ denotes concatenation
Collisions between search keys assumed negligible

Core Algorithms

1. Hidden Query Generation (Algorithm 1)

Simple substitution cipher mapping plaintext queries to trapdoor sets:

Input: Plaintext query (set of keywords)
Output: Hidden query (set of trapdoors)
Each keyword mapped to trapdoor via h(keyword ++ k)

2. Contains Function (Algorithm 2)

Tests if all trapdoors in a query are present in a secure index:

Input: Hidden query, secure index
Output: Boolean (approximate membership test)
Returns true if all trapdoors found (may have false positives)

3. Secure Index Construction (Algorithm 3)

Creates approximate set from document bag-of-words:

Input: Document keywords, secret key
Output: Secure index (approximate set of trapdoors)
Supports configurable false positive rate

4-5. Biword Phrase Search

Extensions for phrase queries using bigram trapdoors:

Indexes adjacent word pairs (bigrams)
Enables phrase search: “distributed systems” → trapdoor(“distributed systems”)
Higher precision than single-keyword queries

Extensions

k-graphs

Higher-order extensions storing combinations of up to k search keys:

Digraphs (k=2):

Space complexity: n(n+1)/2 log₂ε bits
Stores all 2-combinations of keywords
Enables more precise phrase matching

Trigraphs (k=3) and beyond:

Trade storage for fewer query round trips
Exponential space growth: O(n^k)
Practical for small k (2-4) and moderate vocabularies

Information Retrieval Measures

Performance characterized by:

False Positive Rate (FPR): Probability secure index incorrectly matches irrelevant document
True Positive Rate (TPR): Probability of matching relevant document
Precision: Fraction of returned documents that are relevant
Recall: Fraction of relevant documents returned

All measures derived from underlying approximate set error rates.

Security Considerations

Attack Vectors

Frequency Analysis: Observing query patterns may reveal common keywords

Index Poisoning: Adversary may inject trapdoors to inflate false positives

Substitution Cipher Pattern: Known plaintext attacks if query patterns revealed

Defenses

Query padding and dummy trapdoors
Minimum false positive rate requirements
Query mixing and batching
Periodic re-keying and index rotation

Space Complexity

Basic secure index: O(n log ε) bits for n keywords with FPR ε

Biword index: O(n² log ε) bits for phrase search

k-graph index: O(n^k log ε) bits for k-keyword combinations

Applications

Cloud document search with untrusted server
Encrypted email search
Privacy-preserving web search
Secure database queries
Federated search across encrypted repositories

Resources

PDF: Boolean Encrypted Search
Code: GitHub Repository

Encrypted Search Thesis - Theoretical foundations
Cipher Trapdoor Sets - Implementation framework
Probabilistic Confidentiality Estimator - Privacy measurement
Bernoulli Types - Approximate set theory