
Introduction
In cloud computing, it is tempting to store confidential data on (untrusted) cloud storage providers. However, a system administrator may be able to compromise the confidentiality of the data, threatening to prevent further adoption of cloud computing and electronic information retrieval in general.
The primary challenge is a trade-off problem between confidentiality and usability of the data stored on untrusted systems. Encrypted Search attempts to resolve this trade-off problem.
Encrypted Search allows authorized search agents to investigate the presence of specific search terms in a confidential target data set, such as a database of encrypted documents, while the contents of the target data set and the search terms, especially their meaning, are hidden from unauthorized personnel, including the system administrators of the cloud server.
Essentially, Encrypted Search enables oblivious search. For instance, a user may search a confidential database stored on an untrusted remote system without other parties being able to determine the user's information need (and, in more sophisticated systems, without their being able to determine which documents were relevant to that need).
We denote any untrusted party that has full access to the untrusted remote system (where the confidential data is stored) the adversary.
Despite the potential of Encrypted Search, perfect confidentiality is not theoretically possible. There are many ways confidentiality may be compromised. In this paper, we consider an adversary whose primary objective is to comprehend the confidential information needs of the search agents by analyzing their history of encrypted queries.
A simple measure of confidentiality is given by the proportion of queries the adversary is able to comprehend. We consider an adversary that employs a known-plaintext attack. However, since confidentiality is a function of the history of queries, different histories will result in different levels of confidentiality over time.
We apply time series analysis to estimate the forecast distribution of the confidentiality measure. The forecast distribution provides a framework for answering important security-related questions such as "what will our mean confidentiality be six months from now?"

We are interested in reasonable medium-term forecasts so that we can plan for the future, e.g., determining how frequently passwords should be reset to maintain a base level of confidentiality. Resetting them too frequently poses an independent set of problems, from both a security and a usability standpoint, but failing to reset them when the risk of compromise is too high defeats the central purpose of Encrypted Search.
Encrypted search model
An information retrieval process begins when a search agent submits a query to an information system, where a query represents an information need. In response, the information system returns a set of relevant objects, such as documents, that satisfy the information need.
An Encrypted Search system may support many different kinds of queries, but we assume the query model is a sequence-of-words.
The adversary is given by the following definition.
Definition 1 The adversary is an untrusted agent that is able to observe the sequence of queries submitted by authorized search agents.
The objective of the Encrypted Search system is to prevent the adversary from being able to comprehend the sequence of queries.
A hidden query represents a confidential information need of an authorized search agent that is supposed to be incomprehensible to the adversary.
The primary means by which Encrypted Search is enabled is by the use of cryptographic trapdoors as given by the following definition.
Definition 2 Search agents map plaintext search keys to cryptographic hashes, denoted trapdoors.
A trapdoor for a plaintext search key is necessary to allow an untrusted Encrypted Search system to look for the key in a corresponding confidential data set.
The Encrypted Search system uses a simple substitution cipher in which each search key is mapped to a unique trapdoor signature. The simple substitution cipher is denoted by $c \colon \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the set of plaintext search keys and $\mathcal{Y}$ is the set of trapdoors. Since $c$ is fixed, every occurrence of a given search key maps to the same trapdoor, so the frequency profile of the hidden queries is visible to any observer of the trapdoors.
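To make this concrete, here is a minimal sketch of trapdoor generation in R, assuming the digest package; the secret and the example search key are illustrative, not part of the formal model:

library(digest)

# A trapdoor is a keyed cryptographic hash (HMAC) of the plaintext search
# key. Only agents holding the secret can compute it; the server sees only
# the hash.
trapdoor <- function(key, secret) {
  hmac(secret, key, algo = "sha256")
}

trapdoor("confidential", secret = "agent-secret")

Because the mapping is deterministic, repeated searches for the same key produce the same trapdoor, which is precisely the frequency leak the adversary exploits.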
In a time series, we have one entity and a sequence of measurements of that entity indexed by time. We use upper-case letters to denote random variables and lower-case letters to denote realizations; thus, a realization of the time series $\{X_t\}$ is denoted $\{x_t\}$.

The time series of plaintext keyword searches submitted by the search agents is denoted by $\{X_t\}$. The adversary may not observe $\{X_t\}$; rather, the adversary may directly observe the cipher time series $\{Y_t\}$, where $Y_t = c(X_t)$. Since the time series of plaintext searches is a priori non-deterministic, we model it as a random time series, and since $c$ is a deterministic function of the plaintext, $\{Y_t\}$ is a random time series as well.
Threat model: known-plaintext attack
The primary source of information available to the adversary is the observable time series of ciphers $\{Y_t\}$.
Other potential sources of information, such as side-channel information, are not included in the model we consider in this paper. See section Future work for some preliminary thoughts on this expanded topic.
In our threat model, the adversary is interested in estimating the plaintext time series $\{X_t\}$ from the observed ciphers.

In a known-plaintext attack, the objective of the adversary is to learn how to undo the substitution cipher $c$. The inverse substitution cipher is denoted by $c^{-1}$, which maps trapdoors back to plaintext search keys. A maximum likelihood estimator of $c^{-1}$ chooses the mapping under which the observed cipher sequence is most probable with respect to the distribution of plaintext searches. If two plaintexts occur with (nearly) equal probability, their ciphers are (nearly) indistinguishable to such an estimator; the greater the uniformity of the plaintext distribution, the less the cipher frequencies reveal.

The adversary does not know the true plaintext distribution, but knows some approximation of it, which we call the known-plaintext distribution. The known-plaintext distribution may be used to solve an approximation of the MLE.
In a known-plaintext attack, the adversary substitutes the unknown true distribution with the known-plaintext distribution and solves the MLE under this substituted distribution.
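Under a unigram model, this approximate MLE reduces to frequency matching: assign the most frequent observed cipher to the most probable plaintext under the known-plaintext distribution, the second most frequent to the second most probable, and so on. A minimal sketch, with function and variable names of our own choosing:

# ciphers:  character vector of observed trapdoors (the cipher history)
# kp_probs: named numeric vector, the known-plaintext distribution
# Returns a named character vector mapping ciphers to plaintext guesses.
estimate_inverse <- function(ciphers, kp_probs) {
  cipher_rank    <- names(sort(table(ciphers), decreasing = TRUE))
  plaintext_rank <- names(sort(kp_probs, decreasing = TRUE))
  n <- min(length(cipher_rank), length(plaintext_rank))
  setNames(plaintext_rank[seq_len(n)], cipher_rank[seq_len(n)])
}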
Confidentiality measure
We are interested in measuring the degree of confidentiality as given by the following definition.
Definition 3 Given a time series of hidden queries, the accuracy of the adversary at time $t$, denoted by $A_t$, is the proportion of the queries observed up to time $t$ that the adversary correctly maps back to their plaintexts.

Note that $A_t$ takes values in $[0, 1]$, with $A_t = 0$ denoting perfect confidentiality and $A_t = 1$ denoting a total compromise of confidentiality. The measure is a random quantity, since it is a function of the random query history. If we specify a threshold past which we consider confidentiality compromised, then forecasts of $A_t$ estimate when intervention, such as resetting the trapdoors, is warranted.
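A sketch of how such a measure might be computed from a simulated history, reusing estimate_inverse from above; ciphers the adversary has never seen, or maps incorrectly, count as incomprehensible:

# plaintexts: the true (hidden) queries x_1, ..., x_t
# ciphers:    the corresponding trapdoors y_1, ..., y_t
# c_inv_hat:  the adversary's estimated inverse mapping
accuracy <- function(plaintexts, ciphers, c_inv_hat) {
  guesses <- c_inv_hat[ciphers]   # NA for ciphers not in the mapping
  mean(!is.na(guesses) & guesses == plaintexts)
}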
Forecasting model
As a function of a random time series, the confidentiality measure $\{A_t\}$ is itself a random time series. Our primary interest is in forecasting an observed time series $\{a_t\}$, i.e., estimating the distribution of $A_{t+h}$ for horizons $h > 0$ given the observations up to time $t$.
Data description
The accuracy of the adversary was estimated by simulating the Encrypted Search model and the known-plaintext attack described previously. The confidentiality data was generated as follows.

The parameters of a Bigram language model were estimated from a large corpus of plaintext. (The source of the particular corpus used has been lost.) The estimated Bigram language model was conditionally sampled from to generate the plaintexts $\{x_t\}$. Each plaintext $x_t$ was cryptographically hashed to generate the ciphers $\{y_t\}$, where $y_t = c(x_t)$.

The function $\hat{c}^{-1}_t$ that maps ciphers to plaintexts is estimated after every $n$ observations of the cipher time series using an MLE under a unigram language model (some information in the bigram model is not used by the estimator, which reduces its efficiency) whose parameters were estimated from a different corpus judged to be similar to the one used to generate $\{x_t\}$. Thus, the unigram MLE at time $t$ is given by $\hat{c}^{-1}_t = \operatorname{arg\,max}_{g} \prod_{j=1}^{t} \hat{p}(g(y_j))$, where $\hat{p}$ denotes the known-plaintext distribution and the maximum is taken over mappings $g$ from ciphers to plaintexts. Note that $\hat{c}^{-1}_t$ is inconsistent, since it does not converge in probability to $c^{-1}$, as a consequence of the adversary estimating the true plaintext distribution with the known-plaintext distribution. This inconsistency was motivated by a desire to be more realistic, since an adversary performing the known-plaintext attack cannot in practice know the underlying distribution used to generate the keyword searches.

The confidentiality measure at time $t$, denoted by $a_t$, is computed using $\hat{c}^{-1}_t$.
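A compressed sketch of this generative process; the transition matrix bigram (with words as row and column names) is an assumption, since the original implementation is unavailable:

library(digest)

# Sample n plaintext searches by walking the bigram transition matrix,
# then hash each plaintext to its cipher (trapdoor).
generate_data <- function(bigram, n, start, secret) {
  x <- character(n)
  prev <- start
  for (t in seq_len(n)) {
    x[t] <- sample(colnames(bigram), size = 1, prob = bigram[prev, ])
    prev <- x[t]
  }
  y <- vapply(x, function(key) hmac(secret, key, algo = "sha256"), character(1))
  list(plaintexts = x, ciphers = y)
}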
Time series analysis of the confidentiality measure
It seems clear that the adversary's accuracy at a particular time will be correlated with lagged (previous) values of its accuracy, and the closer in time they are, the more heavily correlated they will generally be (barring exceptions like seasonality).
We partition the data into a training set and a test set. We will not look at the test set until later when we evaluate the model. Here is a quick glimpse of the training set data:
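A sketch of the partitioning; acc denotes the observed accuracy series, and the 80/20 split proportion is our assumption:

n_train <- floor(0.8 * length(acc))
train <- acc[seq_len(n_train)]             # used for model fitting
test  <- acc[(n_train + 1):length(acc)]    # held out for evaluation
head(train)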
## [1] 0.358159 0.351208 0.347271 0.346403 0.352666 0.350445
Visualization and stationary transformations
If the time series data can be transformed to meet the stationarity conditions, then an ARIMA model for the (correlated) residuals is generally a reasonable choice. A stationary time series satisfies the following conditions:
The mean is not a function of time.
The variance is constant.
The autocorrelation is a function of lag rather than time.
A plot of the training partition of $\{a_t\}$ is shown in figure 1. It appears non-stationary. Plots of the sample ACF and PACF are shown in figure 2.

We see that the sample ACF decays very slowly, which indicates that the series is non-stationary.
Since there is not necessarily an obvious pattern in the data, we will avoid procedures like fitting a regression model (for detrending) and instead try some order of differencing. Differencing is a non-parametric approach that can often transform a non-stationary time series into a stationary one, where the first difference is defined by $\nabla a_t = a_t - a_{t-1}$. In figure 3, we plot the differenced process $\{\nabla a_t\}$. It appears stationary, which we corroborate with an Augmented Dickey-Fuller test:
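The test can be run as follows, assuming the tseries package:

library(tseries)
adf.test(diff(train))   # null hypothesis: the differenced series is non-stationary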
## Augmented Dickey-Fuller Test
## Dickey-Fuller = -20.37, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
The small $p$-value leads us to reject the null hypothesis of non-stationarity in favor of the alternative; we proceed as though the differenced series is stationary.
ARIMA model selection
There are perhaps three primary reasons we would want to infer a general model for $\{a_t\}$; chief among them, for our purposes, is forecasting.

Guided by the principle of parsimony, we have a bias for simpler models, i.e., Occam's razor. As justification for this bias, consider the following. Assume there is some unknown process generating the observed data. A simple model that fits the data well is less likely to be fitting the noise in the sample. However, if a simple model fails to capture salient structure in the data, the additional complexity of a larger model is warranted.

Since ARIMA models are parameterized by $(p, d, q)$, the respective orders of the autoregressive, differencing, and moving-average components, parsimony directs us to prefer models with small $p$ and $q$.
A plot of the sample ACF and PACF of the differenced time series is shown in figure 4. Since the ACF appears to cut off after the first lag, a low-order moving-average model of the differenced series is a natural candidate. To survey candidate orders more systematically, we compute the extended ACF (EACF) table, where "x" marks a significant correlation and "o" an insignificant one:
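The table below can be produced with the TSA package, whose default maximum orders match the dimensions shown:

library(TSA)
eacf(diff(train))   # extended ACF of the differenced training series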
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o x x x o o o o
## 1 x x o o o o x o o o o o o o
## 2 x x x o o o x o o o o o o o
## 3 x x x x o o o o o o o o o o
## 4 x x x x x o o o o o o o o o
## 5 x x x x x x o o o o o o o o
## 6 x x x x x o x o o o o o o o
## 7 x x x x x x x o o o o o o o
Ignoring the small set of x's in the upper-right region of the table, which we attribute to sampling variation, the EACF is consistent with MA(1) or MA(2) behavior in the differenced series, as well as ARMA(1,2). This suggests three candidate models: IMA(1,1), IMA(1,2), and ARIMA(1,1,2).
Model construction and evaluation
A good model will yield residuals with the following properties:

They are uncorrelated. If there are correlations, the residuals contain information that may be used to estimate a better model.
They have zero mean. If they have a non-zero mean, then the model (and forecasts) are biased.
They have constant variance.
IMA(1,1)
This is the simplest model of the three, and the simplest ARIMA model that seemed compatible with the data. When we fit it to the training data, we obtain the following coefficient estimate:
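A sketch of the fit with the base stats::arima function (the other two candidates are fit analogously with order = c(0, 1, 2) and order = c(1, 1, 2)):

model.1 <- arima(train, order = c(0, 1, 1))   # IMA(1,1)
coef(model.1)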
## ma1
## -0.5705635
To assess whether the model results in uncorrelated residuals, we inspect figure 5.
The histogram looks symmetric about zero, but the sample ACF of the residuals reveals significant remaining autocorrelation, so the IMA(1,1) model does not satisfy the first criterion.
IMA(1,2)
When we fit the IMA(1,2) model, we obtain the following coefficient estimates:
## ma1 ma2
## -0.55254743 -0.04124893
To assess whether the model results in uncorrelated residuals, we inspect the plots in figure 6.
The histogram looks symmetric about zero, but the residuals again exhibit significant autocorrelation; the small ma2 coefficient also suggests little is gained over IMA(1,1).
ARIMA(1,1,2)
When we fit the ARIMA(1,1,2) model, we obtain the following coefficient estimates:
## ar1 ma1 ma2
## 0.8839643 -1.4423574 0.4689439
To assess whether the model results in uncorrelated residuals, we inspect figure 7.
The histogram looks symmetric about zero, and the sample ACF of the residuals shows no significant autocorrelation. To corroborate this, we apply the Ljung-Box test to the residuals:
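The test below uses 10 lags with fitdf = 3 to account for the three estimated coefficients, which yields the 7 degrees of freedom reported in the output; the lag choice is our assumption:

Box.test(model.3$residuals, lag = 10, type = "Ljung-Box", fitdf = 3)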
## Box-Ljung test
## data: model.3$residuals
## X-squared = 10.333, df = 7, p-value = 0.1705
The null hypothesis is that the residuals for the model are white noise.
The test reports a $p$-value of $0.1705$, so at the $5\%$ significance level we fail to reject the null hypothesis; the residuals are consistent with white noise.
We choose this model. Since only one of the candidates seemed like a reasonable fit, measures like the AIC were not needed to discriminate between them.

What follows is a summary of the chosen model.
## ARIMA(1,1,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.8840 -1.4424 0.4689
## s.e. 0.0276 0.0338 0.0266
##
## sigma^2 estimated as 1.088e-05: log likelihood=17177.64
## AIC=-34347.27 AICc=-34347.26 BIC=-34322.1
We present it in the more familiar form given by
$$\nabla a_t = 0.8840\,\nabla a_{t-1} + e_t - 1.4424\,e_{t-1} + 0.4689\,e_{t-2},$$
where $\nabla a_t = a_t - a_{t-1}$ and $\{e_t\}$ is a white noise process.
Forecasting
One of the primary goals of this time series analysis is forecasting, or predicting, the future accuracy of the adversary, i.e., estimating the distribution of $A_{t+h}$ for horizons $h > 0$. Figure 8 shows the model's forecast over the test horizon.

The held-out test data remains within the $95\%$ prediction intervals of the forecast, which is evidence that the model generalizes acceptably to data it was not fit on.
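A sketch of the forecast and overlay, assuming the forecast package; the colors match the figure described later:

library(forecast)
h  <- length(test)
fc <- forecast(model.3, h = h)   # point forecasts plus prediction intervals
plot(fc)                         # training data in black
lines(seq(length(train) + 1, length.out = h), test, col = "green")  # held-out test set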
Incorporating a priori information
We have a priori knowledge that we wish to incorporate into the model.
Assuming the plaintext distribution is stationary, we expect the confidentiality measure to satisfy two properties.

The accuracy of the adversary, $A_t$, is a measure between $0$ and $1$.

Under ideal conditions, acquiring more knowledge by observing a larger sample is not expected to harm the adversary's accuracy, in which case the expectation $\operatorname{E}[A_t]$ would be a monotonically increasing function of $t$ with an asymptotic limit at or below $1$.
Of course, at different points in time the adversary’s accuracy may change due to, say, the presence of significant unaccounted covariates.
An ideal model for these properties may be something like the Gompertz model, or even a scaled and shifted cumulative distribution function (cdf). However, for the sake of model simplicity, we assume a logarithmic form, which allows us to use linear regression. The logarithm has no asymptotic limit, but we hypothesize that it is a reasonable approximation over most finite time horizons of interest.
Thus, we suppose the time series $\{A_t\}$ has a mean function of the form $\operatorname{E}[A_t] = \beta_0 + \beta_1 \log t$. Instead of i.i.d. "errors" (deviations from the mean), we have reason to believe the errors are correlated, and we choose to model them in the ARIMA framework; that is, we perform a linear regression on $\log t$ with ARIMA errors.
Actually, this is not quite true: as noted in [@rob_arimax], the presence of lagged values of the response variable in the fitted equation means the regression coefficient cannot be interpreted as a pure trend effect, only conditionally on those past values.
No matter, we press on and fit the model to the data, which yields the sought-after regression with ARIMA(1,1,2) errors for $\{a_t\}$:
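A sketch of the fit, assuming the forecast package; the covariate enters through xreg:

library(forecast)
t_train   <- seq_along(train)
reg_model <- Arima(train, order = c(1, 1, 2), xreg = log(t_train))
summary(reg_model)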
## Regression with ARIMA(1,1,2) errors
##
## Coefficients:
## ar1 ma1 ma2 xreg
## 0.8521 -1.4012 0.4355 0.0292
## s.e. 0.0234 0.0277 0.0211 0.0202
##
## sigma^2 estimated as 6.828e-06: log likelihood=45279.87
## AIC=-90549.73 AICc=-90549.73 BIC=-90513.68
We have used the same ARIMA specification as before, except with a dynamic regression on the logarithm of time, $\log t$, whose estimated coefficient is $0.0292$.
To assess whether the model results in uncorrelated residuals, we inspect figure 9.
The histogram looks symmetric about zero, and the sample ACF of the residuals shows no significant autocorrelation.
## Box-Ljung test
## data: reg_model$residuals
## X-squared = 10.279, df = 6, p-value = 0.1134
At the $5\%$ significance level, we fail to reject the null hypothesis that the residuals are white noise ($p = 0.1134$).
We show a time series plot of the model with both the training set (in black) and the test set (in green) superimposed onto it in figure 10.
We have forecasted much further into the future. The forecast seems reasonable, as it follows the subtle positive non-linear trend.
When we compare it to the previous ARIMA model, we see that the subtle positive trend is not captured by that model. We can address this potential shortcoming by forcing the ARIMA model to include a drift term. When we do this, we get the following results:
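A sketch of the re-fit, again assuming the forecast package:

drift_model <- Arima(train, order = c(1, 1, 2), include.drift = TRUE)
coef(drift_model)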
## ar1 ma1 ma2 drift
## 8.472755e-01 -1.396547e+00 4.324823e-01 4.695287e-06
We see that the drift term is a small positive value, roughly $4.7 \times 10^{-6}$ per step.

The resulting forecast now tracks the positive trend, but a constant drift implies a mean that grows without bound, so this model must eventually predict impossible values of the accuracy measure.
Future work: dynamic regression on co-variates
In our time series analysis, the forecasting model only used lagged values of the confidentiality measure to forecast future values, and we made no attempt to discover other co-variates. Therefore, it extrapolated trends but ignored other information, such as side-channel information, that may help or hinder the adversary's efforts to decode the ciphers.
At a time $t$, the adversary may have access to more than the cipher history, e.g., side-channel measurements or exogenous events. A potentially interesting model is given by the data $\{(a_t, \boldsymbol{z}_t)\}$, where $\boldsymbol{z}_t$ is a vector of co-variates observed at time $t$. Observe that lagged components of $\boldsymbol{z}_t$ may be predictive of future accuracy as well.

When the entropy is reduced or an informative observation comes in, this may have a larger impact on the time series $\{a_t\}$ than any of its own lagged values.

The information gain does not necessarily need to be related to any of the time series previously mentioned, either. For instance, suppose the adversary, through side-channel information, acquires the knowledge that a certain cipher maps to a certain plaintext. Such a discovery could produce a discrete jump in the adversary's accuracy that no model of the lagged accuracies alone could anticipate.
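A sketch of what such a dynamic regression might look like; the leak-count covariate is entirely hypothetical and stands in for whatever side-channel signal is actually observable:

library(forecast)

# Hypothetical covariate: cumulative count of (cipher, plaintext) pairs the
# adversary has learned through side channels up to time t.
set.seed(1)
leaked <- cumsum(rbinom(length(train), size = 1, prob = 0.001))

zt <- cbind(logt = log(seq_along(train)), leaked = leaked)
cov_model <- Arima(train, order = c(1, 1, 2), xreg = zt)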
Conclusion
The statistician George Box once wrote, "All models are wrong, but some are useful." If we include drift in the ARIMA model, it eventually predicts impossible futures. More to the point, it is not a good match for the theoretical model, as its bias grows with time $t$: the theoretical mean accuracy is bounded, while the drift model's mean is not.
The logarithmic model performs better in this regard, as it takes an inordinately long time (1,531,520,000,000,000 steps) for its point forecast to reach impossible values, although the prediction interval reaches them much sooner. In addition, it more closely matches the features of the theoretical underlying distribution.
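To see where a horizon of this magnitude comes from, set the fitted logarithmic mean equal to the maximum possible accuracy and solve for the crossing time (a back-of-the-envelope calculation that ignores the ARIMA error structure):

$$\hat{\beta}_0 + \hat{\beta}_1 \log t^* = 1 \quad\Longrightarrow\quad t^* = \exp\!\left(\frac{1 - \hat{\beta}_0}{\hat{\beta}_1}\right).$$

With $\hat{\beta}_1 = 0.0292$ and an intercept near zero, $t^*$ is on the order of $10^{15}$ steps, consistent with the figure quoted above.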
That said, there is still a lot to be said for the model as a planning tool. Recall that if we specify a threshold on the accuracy past which we consider confidentiality compromised, then the forecast estimates when intervention, such as resetting the trapdoors, is warranted. Inspecting figure 12, we see that the point forecast crosses any plausible threshold only in the distant future. Interestingly, the prediction intervals are far less forgiving, and if we used their upper bound as a pessimistic estimator of the time to compromise, we would be obliged to intervene much sooner and much more often.
To be a useful measure, it would seem that the uncertainty should be lower. Possibly, as we discuss in section Future work, we could incorporate other co-variates that help reduce the uncertainty. Or perhaps we need to impose a more realistic dynamic trend. After all, the logarithm is not a particularly good fit for the data either; it is simply potentially better than the other alternatives.