metafunctor.com

Gave a SLUUG talk "Demystifying Large Language Models (LLMs) on Linux: From Theory to Application"

Fri, 23 Feb 2024 00:00:00 +0000

I gave a presentation for the St. Louis Unix Users Group (SLUUG) titled “Demystifying Large Language Models (LLMs) on Linux: From Theory to Application”. The talk covered the theory behind LLMs, how they work, and how to use them on Linux.

I also demoed two projects:

A simple Colab notebook that uses very simple Python code to generate text using an n-gram model, to illustrate the basic idea behind LLMs and why the n-gram model falls short.
A project that uses ElasticSearch and LLMs to allow for search queries over databases using natural language.

The talk was well-received and I had a great time giving it. I’m looking forward to giving more talks in the future. The content for the talk can be found here.

Approximations of Solomonoff Induction

Mon, 19 Feb 2024 00:00:00 +0000

The GitHub for this can be found here.

Shoot me an email at lex@metafunctor.com if you’re interested in collaborating on any projects.

Fine-Tuning Tiny LLMs for ElasticSearch DSL

Mon, 19 Feb 2024 00:00:00 +0000

I am fine-tuning a tiny LLM for ElasticSearch DSL as a proof of concept. The GitHub repo for this project can be found here. It mostly consists of synthetic data. The data was generated in two stages: initially from GPT-4, and then from a script I made that samples from those outputs and uses them as few-shot examples for Mistral to generate a lot more synthetic data. I still need to reshape the data so that it’s in the expected format. I will then use this data to fine-tune the model and see how well it performs on the ElasticSearch DSL.

Shoot me an email at lex@metafunctor.com if you’re interested in collaborating on any projects.

Master's project: Reliability Estimation in Series Systems

Mon, 19 Feb 2024 00:00:00 +0000

To complete my master’s degree in statistics/mathematics at SIUe, I presented my master’s project in October 2023. There was also a paper associated with the project, titled “Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data”.

See the project here.

Various LLM Research Projects

Mon, 19 Feb 2024 00:00:00 +0000

The GitHub repo for various foundation model / LLM / math projects can be found here.

Shoot me an email at lex@metafunctor.com if you’re interested in collaborating on any projects.

Entropy Maps

Sun, 18 Feb 2024 00:00:00 +0000

The PDF version of this post is available on GitHub.

The basic theory behind an entropy map is to map values in the domain to values in the codomain by hashing to a prefix-free code in the codomain. We do not store anything related to the domain, since we are simply hashing them, and a prefix of that hash will be used as a code for a value in the codomain.

We actually allow for many different codes for each value in the codomain, so that, for instance, a code for, say, the value a may be 00, 01, 10, and 11. Notice that we can efficiently decode this as a if the hash is less than 4.
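As a rough sketch of this kind of decoding, assume a 3-bit hash and a hypothetical assignment in which hash values 0 through 3 code for a, 4 and 5 code for b, and 6 and 7 code for c. The names `UPPER_BOUNDS`, `VALUES`, and `decode` are made up for illustration; the point is only that contiguous blocks of hash values can be decoded with a threshold lookup.

```python
from bisect import bisect_right

# Hypothetical assignment for a 3-bit hash: values 0..3 code for "a",
# 4..5 code for "b", and 6..7 code for "c". Only the boundaries are stored.
UPPER_BOUNDS = [4, 6, 8]  # exclusive upper bound of each block of hash values
VALUES = ["a", "b", "c"]


def decode(hash_value: int) -> str:
    """Map a hash value in [0, 8) to the codomain value whose block contains it."""
    return VALUES[bisect_right(UPPER_BOUNDS, hash_value)]


decoded = [decode(v) for v in range(8)]
```

Here a is hit by half of the hash space, so a uniform hash maps to it with probability 1/2, and b and c with probability 1/4 each.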

Suppose $X \sim p_X$ and $\Pr\{f(X) = y\} = p_y$. Then the optimally space-efficient code, assuming a uniform hash function $h$, assigns prefix-free codes to each $y$ whose probability of being mapped to by $h$ sums to $p_y$. In this case, the expected bit length is given by $$ \ell = -\sum_{y \in \mathcal{Y}} p_y \log_2 p_y. $$ If we imagine sampling $x$ from $\mathcal{X}$ with $p_X$, mapping to $y = f(x)$, and observing the sequence of $y$’s, then the expected bit length is the entropy of the sequence of $y$’s. This is why we call it an entropy map.

If $\mathcal{X}$ is finite, then we can just imagine implicitly encoding the domain and then, for each value in the domain, storing the prefix-free code that it maps to, which has an average bit length of $\ell$ and a total bit length of $|\mathcal{X}| \ell$.
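To make the bit-length formula concrete, here is a minimal sketch that evaluates $\ell$ for a made-up codomain distribution and a made-up domain size of 1000. The distribution `p_y` and the domain size are assumptions chosen only so the arithmetic is easy to follow.

```python
import math


def expected_code_length(p: dict) -> float:
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


# Hypothetical codomain distribution (not taken from any real data set).
p_y = {"a": 0.5, "b": 0.25, "c": 0.25}

ell = expected_code_length(p_y)   # average bits per domain element
domain_size = 1000                # hypothetical |X|
total_bits = domain_size * ell    # total bit length |X| * ell
```

For this distribution the average is 1.5 bits per element, so the whole map costs about 1500 bits.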

Rate distortion: Bernoulli maps

We can allow rate distortion, too, by failing to code for some of the elements properly. For instance, a popular choice arises when one of the values, say $y'$, is extremely common, e.g., $p_{y'} > .99$. We can give it prefix-free codes whose probabilities sum to $p_{y'}$ and then not code for it in the entropy map. In that case, a randomly selected $x \in \mathcal{X}$ will be mapped to $y'$ with probability $p_{y'}$ (which is sufficiently large, and can be made as close to $1$ as desired if we wish to trade space for accuracy). For the remaining values in the domain, we code for them correctly (or also allow errors on them, but only after trying to find correct codes for each of them).

Bernoulli set-indicator function

For instance, suppose we have a set-indicator function $$ 1_{\mathcal{A}} : \mathcal{X} \to \{0,1\}, $$ where $\mathcal{A} \subseteq \mathcal{X}$, and $\mathcal{X}$ is a very large set (even infinite), then we may assign prefix-free codes for the codomain value $1$ s.t. a priori, a random hash function hashes an element in $\mathcal{X}$ to a prefix-free code for $1$ with probability $\varepsilon$, where $\varepsilon$ is very small, e.g., $2^{-10}$.

There is a countably infinite set of random hash functions which hash all elements in $\mathcal{A} \subseteq \mathcal{X}$ to prefix-free codes for $1$ and all other elements, $\mathcal{A}' = \mathcal{X} \setminus \mathcal{A}$, to prefix-free codes for either $0$ or $1$. If we choose a random hash function that satisfies this property, then it is expected that a fraction $\varepsilon$ of the elements in $\mathcal{A}'$ will hash to a prefix-free code for $1$, and the remaining $1 - \varepsilon$ will hash to a prefix-free code for $0$.

For any $x \in \mathcal{X}$, we can test if $1_{\mathcal{A}}(x) = 1$ by testing if a prefix of $h(x)$ is a prefix-free code for $0$ or $1$. If it is a code for $0$, then we know that $x$ is definitely not a member of $\mathcal{A}$. If it is a code for $1$, then $x$ is reported as a member of $\mathcal{A}$, with a false positive rate of $\varepsilon$ and a true positive rate of $1$: a randomly drawn element in $\mathcal{A}'$ will hash to $0$ with probability $1 - \varepsilon$, and any element in $\mathcal{A}$ will map to $1$ with probability $1$ (since we explicitly chose a random hash function that hashes all of the elements in $\mathcal{A}$ to a prefix-free code for $1$).
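A minimal sketch of this construction, under several assumptions: SHA-256 stands in for a random hash function, a hash codes for $1$ when its first `K` bits are zero (so $\varepsilon = 2^{-K}$), and the "choice of hash function" is a brute-force search over an integer seed. The names `h`, `codes_for_one`, `build`, and `contains` are hypothetical, and the search is only practical for tiny sets.

```python
import hashlib
from itertools import count

K = 4  # a hash codes for 1 when its first K bits are zero, so epsilon = 2**-K


def h(x: str, seed: int) -> int:
    """64-bit hash of x concatenated with a seed (SHA-256 as a stand-in)."""
    digest = hashlib.sha256(f"{x}|{seed}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def codes_for_one(x: str, seed: int) -> bool:
    """True when the K-bit prefix of the hash is all zeros."""
    return h(x, seed) >> (64 - K) == 0


def build(members) -> int:
    """Search for the first seed that hashes every member to a code for 1."""
    for seed in count():
        if all(codes_for_one(x, seed) for x in members):
            return seed


members = ["apple", "banana", "cherry"]
seed = build(members)  # the seed is the only thing that needs to be stored


def contains(x: str) -> bool:
    """No false negatives by construction; false positives at rate about epsilon."""
    return codes_for_one(x, seed)
```

Every member tests positive because the seed was selected on exactly that condition, while a non-member tests positive with probability about $2^{-4}$.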

It is interesting to note that the entropy map initially frames the problem as a compression problem, but we can also think of it as a rate-distortion problem. Implicitly, in the above set-indicator function approximation, we are choosing to minimize a loss function in which false negatives are much more costly than false positives, either because it is unlikely we will test a negative element for membership, or because false positives are not nearly as costly as false negatives, e.g., falsely thinking a rustling in the bushes is a tiger (false positive) is much less costly than failing to notice a tiger in the bushes (false negative).

In either case, we call this set-indicator approximation a Bernoulli set-indicator function, bernoulli<(set<X>, X) -> bool>{ $1_A$ }. This is the function that is communicated, not the latent set-indicator function $1_A$.

A randomly chosen random hash function that satisfies (is conditioned on) the property that it hashes all elements in $\mathcal{A}$ to a prefix-free code for $1$ has the confusion matrix in Table 1.

Table 1: Conditional distribution of Bernoulli set-indicator functions given latent set-indicator function on $\mathcal{X} = \{a,b\}$

latent/observed	$1_\emptyset$	$1_{\{a\}}$	$1_{\{b\}}$	$1_{\{a,b\}}$
$1_\emptyset$	$(1-\varepsilon)^2$	$(1-\varepsilon)\varepsilon$	$(1-\varepsilon)\varepsilon$	$\varepsilon^2$
$1_{\{a\}}$	$0$	$1-\varepsilon$	$0$	$\varepsilon$
$1_{\{b\}}$	$0$	$0$	$1-\varepsilon$	$\varepsilon$
$1_{\{a,b\}}$	$0$	$0$	$0$	$1$

We see that the constraint of no false negatives generates a confusion matrix with a lot of zeros. If we observe bernoulli<(set<X>, X) -> bool>{$1_{\{a\}}$}, then the latent set-indicator function is either $1_{\emptyset}$ or $1_{\{a\}}$. Since $\varepsilon$ is very small, we can be fairly certain that the latent set-indicator function is $1_{\{a\}}$.
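The entries of Table 1 follow one pattern: an observed set that misses a latent member has probability $0$, each false positive contributes a factor $\varepsilon$, and each true negative contributes a factor $1 - \varepsilon$. Here is a small sketch that rebuilds the table for any finite universe, assuming each non-member is a false positive independently; the helper names are hypothetical.

```python
from itertools import combinations


def subsets(universe):
    """All subsets of a finite universe, smallest first."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def likelihood(observed, latent, universe, eps):
    """P(observed set | latent set) with no false negatives and false positive rate eps."""
    if not latent <= observed:
        return 0.0  # a latent member is missing, which would be a false negative
    false_positives = len(observed - latent)
    true_negatives = len(universe - observed)
    return eps ** false_positives * (1 - eps) ** true_negatives


universe = frozenset("ab")
eps = 2 ** -10
table = {
    latent: {obs: likelihood(obs, latent, universe, eps) for obs in subsets(universe)}
    for latent in subsets(universe)
}
```

Each row of `table` is a probability distribution over observed sets, which matches the row structure of Table 1.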

What is the total degrees-of-freedom for a confusion matrix of this type?

Table 2: Confusion matrix with maximum degrees-of-freedom

latent/observed	$1_\emptyset$	$1_{\{a\}}$	$1_{\{b\}}$	$1_{\{a,b\}}$
$1_\emptyset$	$p_{1 1}$	$p_{1 2}$	$p_{1 3}$	$1-p_{1 1}-p_{1 2}-p_{1 3}$
$1_{\{a\}}$	$p_{2 1}$	$p_{2 2}$	$p_{2 3}$	$1-p_{2 1}-p_{2 2}-p_{2 3}$
$1_{\{b\}}$	$p_{3 1}$	$p_{3 2}$	$p_{3 3}$	$1-p_{3 1}-p_{3 2}-p_{3 3}$
$1_{\{a,b\}}$	$p_{4 1}$	$p_{4 2}$	$p_{4 3}$	$1-p_{4 1}-p_{4 2}-p_{4 3}$

We see that there are $4 \times (4 - 1) = 12$ degrees-of-freedom for the confusion matrix in Table 2. For the confusion matrix in Table 1, we have only one degree-of-freedom, since there is a single parameter, $\varepsilon$.

The number of degrees-of-freedom is one way to think about the complexity of a model. The more degrees-of-freedom, the more complex the model. The more complex the model, the more data we need to estimate its parameters, although frequently we already know the parameters, since they may have been specified as a part of the algorithm that generated the Bernoulli approximation.

The confusion matrices in Tables 1 and 2 represent the conditional distribution of the Bernoulli set-indicator function given the latent set-indicator function, which we denote by bernoulli<(set<X>, X) -> bool>.

Boolean Bernoulli as constant function

How many functions are there of type () -> bool? There are two, true and false. That is, there are $|\{true, false\}|^{|\{1\}|} = 2^1 = 2$ functions.

So, we can also think of Boolean values as functions of type () -> bool. Then, when we apply the Bernoulli model bernoulli<() -> bool>, we get the same result as before.

Table 3: Confusion matrix for Bernoulli model applied to Boolean values

latent/observed	`true`	`false`
`true`	$p_{1 1}$	$1-p_{1 1}$
`false`	$1-p_{2 2}$	$p_{2 2}$

This confusion matrix has a maximum of two degrees-of-freedom: each row must sum to $1$, which leaves two free parameters, $p_{1 1}$ and $p_{2 2}$.

In the binary symmetric channel, $p_{1 1} = p_{2 2}$:

Table 4: Confusion matrix for the binary symmetric channel

latent/observed	`true`	`false`
`true`	$p$	$1-p$
`false`	$1-p$	$p$

Conditional distribution of latent function given observed function

Once we have an observation, say bernoulli<(set<X>, X) -> bool>{x}, what does the confusion matrix tell us? Let’s abstract the problem a bit so we can focus on deriving the result.

Let $X$ and $Y$ be random variables. Assume that $P(X = x | Y = y)$ is difficult to compute, but $P(Y = y | X = x)$ is easy. (This is the case for the confusion matrix. We know the conditional distribution of the observed set-indicator function given the latent set-indicator function, but we want to know the conditional distribution of the latent set-indicator function given the observed set-indicator function, which is not directly available.)

Suppose we are interested in $P(X = x | Y = y)$ but we only know $P(Y = y | X = x)$. Then, we can use Bayes’ rule to compute $P(X = x | Y = y)$: $$ P(X = x | Y = y) = \frac{P(Y = y | X = x) P(X = x)}{P(Y = y)} $$

So, to compute $P(X = x | Y = y)$, we need to know two additional things. First, what is $P(X = x)$? This is usually a prior. If we know something about the distribution of $X$, then we encode that information in $P(X = x)$; otherwise we can use an uninformative prior, e.g., assign a uniform probability to each possibility.

Second, what is $P(Y = y)$? This is just a normalizing constant that makes the conditional distribution of $X$ given $Y$ sum to $1$, but it can be computed as follows: $$ P(Y = y) = \sum_{x'} P(Y = y | X = x') P(X = x') $$ If $P(X = x)$ is a uniform prior and $|X|$ is finite then $P(X = x) = 1/|X|$. Combining this with the above equation, we get: $$ P(X = x | Y = y) = \frac{P(Y = y | X = x)}{\sum_{x'} P(Y = y | X = x')} $$

So, let’s replace $X$ with the latent set-indicator function x and $Y$ with the observed bernoulli<(set<X>, X) -> bool>{y}. Then, we can compute the conditional distribution of the latent x given the observed bernoulli<(set<X>, X) -> bool>{y} by looking at the confusion matrix in Table 2, picking out the specific row and column of interest, and then normalizing by the sum of the column.

For instance, suppose we observe $1_{\{a\}}$. Then, we want to know the conditional probability of the latent set-indicator function given the observed set-indicator. The column for $1_{\{a\}}$ is $(p_{1 2}, p_{2 2}, p_{3 2}, p_{4 2})'$. The sum of this column is $p_{1 2} + p_{2 2} + p_{3 2} + p_{4 2}$, which, unlike a row sum, need not equal $1$. So, under a uniform prior, the conditional probability of the latent set-indicator function given the observed set-indicator is given by $$ p_{k|2} = \frac{p_{k 2}}{\sum_{j=1}^4 p_{j 2}}, $$ where $k$ is the row corresponding to the latent set-indicator function of interest and we are conditioning on column $2$, the index for the observed set-indicator function $1_{\{a\}}$.

More generally, we have $$ p_{k|i} = \frac{p_{k i}}{\sum_{j=1}^4 p_{j i}}, $$ where $k$ is the row corresponding to the latent set-indicator function of interest and we are conditioning on column $i$, the index for the observed set-indicator function. If we do this for the four possible observed set-indicator functions for Table 1 (the confusion matrix with only one degree-of-freedom), we get Table 5.

Table 5: Conditional probability of latent set-indicator function given observed set-indicator function

observed/latent	$1_\emptyset$	$1_{\{a\}}$	$1_{\{b\}}$	$1_{\{a,b\}}$
$1_\emptyset$	$1$	$0$	$0$	$0$
$1_{\{a\}}$	$\varepsilon/(1+\varepsilon)$	$1/(1+\varepsilon)$	$0$	$0$
$1_{\{b\}}$	$\varepsilon/(1+\varepsilon)$	$0$	$1/(1+\varepsilon)$	$0$
$1_{\{a,b\}}$	$\varepsilon^2/(1+\varepsilon)^2$	$\varepsilon/(1+\varepsilon)^2$	$\varepsilon/(1+\varepsilon)^2$	$1/(1+\varepsilon)^2$

The conditional distribution in Table 5 is one way to think about the uncertainty of the latent set-indicator function given the observed set-indicator function. The entropy is another way to think about the uncertainty, but we will not compute it here.

We see that when we observe the empty set for a Bernoulli model in which false negatives are not possible, then we know for certain that the latent set-indicator function is $1_\emptyset$. However, when we observe $1_{\{a\}}$, we are uncertain about the latent set-indicator function. We know that it is either $1_{\emptyset}$ or $1_{\{a\}}$, but we do not know which one it is. Since $\varepsilon$ is small, it is much more likely to be $1_{\{a\}}$ than $1_{\emptyset}$, though. A similar argument holds for the other two observed set-indicator functions.
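As a numerical check of Table 5, the sketch below normalizes each column of the Table 1 confusion matrix under a uniform prior. The value $\varepsilon = 2^{-10}$ and the names `table1`, `posterior`, and `table5` are assumptions for illustration.

```python
def posterior(column):
    """Normalize one column of P(observed | latent) under a uniform prior."""
    total = sum(column)
    return [v / total for v in column]


eps = 2 ** -10

# Rows are latent functions, columns are observed functions, both in the
# order: emptyset, {a}, {b}, {a,b} (the layout of Table 1).
table1 = [
    [(1 - eps) ** 2, (1 - eps) * eps, (1 - eps) * eps, eps ** 2],
    [0.0, 1 - eps, 0.0, eps],
    [0.0, 0.0, 1 - eps, eps],
    [0.0, 0.0, 0.0, 1.0],
]

# One row per observed function, one column per latent function (Table 5).
table5 = [posterior(column) for column in zip(*table1)]
```

For example, `table5[1]` is the posterior after observing $1_{\{a\}}$, and its first two entries agree with $\varepsilon/(1+\varepsilon)$ and $1/(1+\varepsilon)$.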

Algorithms

The simplest algorithm is a one-level hash function evaluation, where we hash the domain values concatenated with some bit string $b$ such that when we decode the values $h(x + b)$, $x \in \mathcal{X}$, we get a prefix-free code for $y = f(x)$.

Two-level hash function evaluation

The more practical solution is a two-level hash scheme. First, we hash each $x \in \mathcal{X}$ concatenated with the same bit string $b$, same as before. However, we use this hash value to index into a hash table $H$ at, say, index $j$. Now, we choose a bit string for $H[j]$ such that, for each $x \in \mathcal{X}$ that hashes to $j$, $f(x) = \text{decode}(h(x + H[j]))$.

This way, we can keep the probability $p_j = \prod_x \Pr\{ f(x) = \text{decode}(h(x + H[j]))\}$, where the product is over each $x$ that hashes to $j$, more or less constant, independent of the size of the domain $\mathcal{X}$, by choosing an appropriately sized hash table $H$.

Since each decoding is an independent Bernoulli trial, we see that the probability that a particular $x$ that hashes to $j$ is decoded correctly is the number of hashes that are a prefix-free code for $f(x)$ divided by the total number of hashes (e.g., an $N$ bit hash function has $2^N$ possible values).
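The two-level scheme can be sketched as follows, under stated assumptions: SHA-256 stands in for the hash function, the codomain is $\{0, 1\}$ with the first bit of the hash used as the code, and each table entry $H[j]$ is found by brute-force search over integers. All names (`h`, `decode`, `build`, `lookup`) and the toy function `f` are hypothetical.

```python
import hashlib
from itertools import count


def h(x: str, salt: int) -> int:
    """64-bit hash of x concatenated with a salt (SHA-256 as a stand-in)."""
    digest = hashlib.sha256(f"{x}|{salt}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def decode(hash_value: int) -> int:
    """Hypothetical code: the first bit of the 64-bit hash is the decoded value."""
    return hash_value >> 63


def build(f: dict, num_buckets: int, b: int = 0) -> list:
    """For each bucket j, find a salt H[j] that decodes every x in the bucket to f(x)."""
    buckets = [[] for _ in range(num_buckets)]
    for x in f:
        buckets[h(x, b) % num_buckets].append(x)
    return [
        next(s for s in count() if all(decode(h(x, s)) == f[x] for x in bucket))
        for bucket in buckets
    ]


def lookup(x: str, H: list, b: int = 0) -> int:
    """First hash picks the bucket, second hash (salted with H[j]) is decoded."""
    return decode(h(x, H[h(x, b) % len(H)]))


f = {f"key{i}": i % 2 for i in range(32)}  # toy function to encode
H = build(f, num_buckets=8)
```

With about four keys per bucket, each salt search needs on the order of $2^4$ attempts, and adding more buckets keeps that cost roughly constant as the domain grows.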

Oblivious entropy maps

An oblivious entropy map is just an entropy map where the hash function is applied to trapdoors of $\mathcal{X}$ and the prefix-free codes for $\mathcal{Y}$ have no rhyme or reason to them, e.g., a random selection of hash values for each value in $\mathcal{Y}$.

Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data

Sat, 30 Sep 2023 00:00:00 +0000

The Bernoulli Model: A Closer Look at the Boolean Bernoulli Model

Sat, 17 Jun 2023 00:00:00 +0000

Motivation
Bernoulli Boolean
Binary Channels
False Positives and Negatives
Prediction
Inducing Bernoulli types
Noisy Turing machines: noisy logic gates
Summary

Motivation

The Bernoulli model is a general framework for thinking about probabilistic data structures and types of a particular sort. A big reason for developing the Bernoulli Model formalism is so that we can use Bernoulli Models of data types to develop Oblivious Data Types. We will go into that in a separate document, but the basic idea is that Bernoulli approximations have a lot of desirable properties for developing oblivious data types, and the Bernoulli Model formalism allows us to reason about the correctness of the oblivious data types and to make them more space-efficient by trading accuracy for space while allowing for O(1) time complexity.

The Bernoulli Model also provides a formalism for how to think about various probabilistic data structures, like the Bloom filter, Count-Min sketch, or my invention, the Bernoulli data type, which comprises an entire family of data structures that are all based on the Bernoulli Model, from sets (like the Bloom filter) to maps in a near-space optimal way, while allowing for more savings by trading accuracy for space in a controlled way.

The Boolean type, represented as bool in C++, models the set of values given by {true,false}. This document entertains the replacement of bool with a type bernoulli<bool>, which represents a sort of noisy Boolean. In general, we can have a Bernoulli type for any type T, denoted by bernoulli<T>.

Each Bernoulli Model also has an order, a positive integer, which essentially describes the number of independent ways in which the process that generated the Bernoulli approximation can produce errors. We denote that a Bernoulli Model has order K with bernoulli<T,K>. Unless it is useful, we drop the order information and simply write bernoulli<T>.

As a special case, data structures like Bloom filters can be thought of as Bernoulli data structures.

In the Bernoulli Boolean model, a bool is wrapped inside of a Bernoulli type bernoulli<bool>. We use the notation bernoulli<bool>{x} to denote that it is modeling some latent variable x (unobservable). We can think of bernoulli<bool>{x} as a measurement of x, or a noisy version of the original x, and it may or may not equal x.

The Bernoulli model introduces a notion of uncertainty or error. Specifically, a bernoulli<bool>{x} is a random Bernoulli variable such that

Pr{bernoulli<bool>{x} == x} == p(x)

where 0 < p(x) < 1 is the probability of being correct and 1-p(x) is the probability of an error. In most practical situations, the probability p(x) is known and can be adjusted to balance factors like space and accuracy.

Bernoulli Boolean

In this paper, we narrow our focus to the Boolean Bernoulli Model, which is the simplest Bernoulli Model. Later in this document, we consider Bernoulli Models for Boolean functions too, since it provides a natural opportunity to think about the model in a more general way.

Binary Channels

Let’s begin by thinking about the Binary Symmetric Channel and the Binary Asymmetric Channel. The Bernoulli Boolean model can exhibit two distinct behaviors, represented as different “channels” through which Boolean values are transmitted:

Binary Symmetric Channel (First-order Bernoulli model): The probability of an equality error is the same for true and false. We denote this by the type bernoulli<bool,1>.
Binary Asymmetric Channel (Second-order Bernoulli model): The probability of an equality error differs for true and false. We denote this by the type bernoulli<bool,2>.

False Positives and Negatives

Errors in the Bernoulli Boolean model can be understood in terms of false negatives and false positives:

bernoulli<bool>{false} == true is a false positive.
bernoulli<bool>{true} == false is a false negative.

In the first-order model, the probability of a false negative equals the probability of a false positive. In the second-order model, these probabilities differ. In a specific but common version of the second-order Bernoulli Boolean model, false negatives occur with probability 0 and false positives occur with probability $0 < \varepsilon < 1$.
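A small simulation can make the two channels concrete. This is only a sketch: the function `observe`, the fixed random seed, and the rates used are assumptions, with a latent `True` flipped at the false negative rate and a latent `False` flipped at the false positive rate.

```python
import random


def observe(x: bool, fpr: float, fnr: float, rng: random.Random) -> bool:
    """Return a noisy observation of the latent Boolean x."""
    error_rate = fnr if x else fpr  # a latent true can only suffer a false negative
    return (not x) if rng.random() < error_rate else x


rng = random.Random(0)

# Second-order model with no false negatives, as in a Bloom filter.
obs_true = [observe(True, fpr=0.1, fnr=0.0, rng=rng) for _ in range(10_000)]
obs_false = [observe(False, fpr=0.1, fnr=0.0, rng=rng) for _ in range(10_000)]

false_negative_rate = obs_true.count(False) / len(obs_true)
false_positive_rate = obs_false.count(True) / len(obs_false)
```

Setting `fpr == fnr` gives the first-order (binary symmetric) model as a special case.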

Prediction

bernoulli<bool>{x} is correlated with x, and ideally, bernoulli<bool>{x} provides evidence for x, i.e., allows one to predict x given bernoulli<bool>{x} better than if no observations were given whatsoever. If the probability of being correct, p(x), is >= 0.5 and we have no prior information about x, the best (ML) estimate of x is the observation bernoulli<bool>{x}.

However, with prior information about x, we can estimate the probability that the latent variable x is true or false. Using Bayes’ rule, the probability that bernoulli<bool>{x} is correct is:

Pr{x == true | bernoulli<bool>{x} == true} ==
Pr{bernoulli<bool>{x} == true | x == true } * Pr{x == true}
/
(Pr{bernoulli<bool>{x} == true | x == true} * Pr{x == true} +
Pr{bernoulli<bool>{x} == true | x == false} * (1-Pr{x == true}))

In the first-order model, if the probability of being correct is q, then:

Pr{x == true | bernoulli<bool,1>{x} == true} ==
q * Pr{x == true}
/
(q * Pr{x == true} + (1-q) * (1-Pr{x == true}))

Assuming maximum ignorance (maximum entropy) about x (i.e., Pr{x == true} == 0.5), the following expression is obtained:

Pr{x == true | bernoulli<bool,1>{x} == true} == q
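The first-order posterior above is easy to evaluate directly. The sketch below is a plain transcription of that formula; the function name `posterior_true` and the example numbers are made up.

```python
def posterior_true(q: float, prior_true: float) -> float:
    """Pr{x == true | observation == true} for a first-order model with accuracy q."""
    numerator = q * prior_true
    return numerator / (numerator + (1 - q) * (1 - prior_true))


uniform_case = posterior_true(q=0.9, prior_true=0.5)   # reduces to q
rare_case = posterior_true(q=0.9, prior_true=0.01)     # a strong prior dominates
```

With a uniform prior the posterior equals q, but when x is rarely true a priori, even a fairly accurate observation of true leaves the posterior below 0.1.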

One could even imagine having multiple sources of, say, noisy i.i.d. measurements of the same x. For instance, suppose x == true but we don’t know that and we have 3 measurements of x.

y1 = bernoulli<bool,1>{true} == true
y2 = bernoulli<bool,1>{true} == false
y3 = bernoulli<bool,1>{true} == true

This is more information about x than just one noisy observation. Clearly, and informally, the best prediction for the value of x is the majority vote, which is true in this case.

Consider this. The number of true values is Binomially distributed with parameters n=3 (independent trials) and probability p, where p is the probability that a single observation is correct (called q above), so we let N ~ BIN(3,p) denote the random variable representing the number of true values in y1, y2, y3.

Let’s do a case by case analysis to compute the probability that the above majority vote is correct. First, for the majority vote to be correct, N >= 2, which means that N == 2 or N == 3.

The probability that N == 2 is Pr{N == 2} = 3 * p^2 * (1-p).
The probability that N == 3 is Pr{N == 3} = p^3.

Therefore, the probability of no error is 3 * p^2 * (1-p) + p^3. If p = 0.5 (maximum ignorance), we get a no error rate of 0.5, as intuitively expected. For p = 1, we get a no error rate of 1, which is also intuitively expected. The no error rate of a single observation, of course, is just p. Let’s plot these two no error rates together:

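[Figure: no error rate of a single observation, p, and of the three-observation majority vote, 3 * p^2 * (1-p) + p^3, plotted against p.]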

For p > 0.5, we see a slight improvement in the no error rate when we have multiple noisy observations of the same latent variable. As the number of independent sources goes to infinity, the error rate goes to 0.
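The three-observation calculation generalizes to any odd number of observations by summing the Binomial tail. A minimal sketch, where `majority_correct` is a hypothetical helper name:

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n i.i.d. observations are correct (n odd)."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


three_votes = majority_correct(0.9, 3)      # 3 * p^2 * (1-p) + p^3 at p = 0.9
many_votes = majority_correct(0.9, 101)     # approaches 1 as n grows
```

For n = 3 this reproduces 3 * p^2 * (1-p) + p^3, and for p > 0.5 the value climbs toward 1 as n increases.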

This is not a typical use-case for the Bernoulli Boolean model, since it will mostly be an analytical result of probabilistic data structures that may be framed in the context of a Bernoulli model, but it is interesting to see how the model behaves in this case.

Inducing Bernoulli types

If we have a function f : bool -> bool, then the space of all possible functions is given by Table 1.

Table 1: All possible functions f : bool -> bool

f	f(true)	f(false)
id	true	false
not	false	true
true	true	true
false	false	false

It may be interesting to consider what happens when we replace the Boolean inputs with Bernoulli boolean values and ask the question, “What is the probability that f(bernoulli<bool,1>{x}) == f(x)?”

Notice that f(bernoulli<bool,1>{x}) is f(x) with some probability, but f(x) may be latent depending on f. For the constant functions, true and false, we get the same function, i.e., true(bernoulli<bool,1>{true}) == true since true : bool -> bool always outputs true, and similarly for false : bool -> bool.

However, the id and not functions are different. For instance, suppose Pr{bernoulli<bool,1>{x} == x} == p. Then, when we input bernoulli<bool,1>{true} into id, we get the correct output true with probability p and the incorrect output false with probability 1-p. Likewise, when we input bernoulli<bool,1>{false} into id, we get the correct output id(false) == false with probability p and the incorrect output true with probability 1-p, and a similar story for not.
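The case analysis for the four functions can be checked by enumeration. In this sketch the noisy input equals x with probability p and its negation otherwise; `FUNCTIONS` and `prob_output_correct` are hypothetical names.

```python
FUNCTIONS = {
    "id": lambda b: b,
    "not": lambda b: not b,
    "true": lambda b: True,
    "false": lambda b: False,
}


def prob_output_correct(f, x: bool, p: float) -> float:
    """Pr{f(noisy x) == f(x)} when the noisy input equals x with probability p."""
    prob = 0.0
    for observed, weight in ((x, p), (not x, 1 - p)):
        if f(observed) == f(x):
            prob += weight
    return prob


results = {name: prob_output_correct(f, True, 0.9) for name, f in FUNCTIONS.items()}
```

The constant functions come out correct with probability 1, while id and not inherit the input's probability p of being correct.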

Since we can think of these outputs as either correct or incorrect with probability p, we can call them Bernoulli Boolean values too, e.g., this is a function of type

 bernoulli<bool,1> -> bernoulli<bool,1>

What is this function? It’s just id, but it has been monadically lifted into the Bernoulli Boolean model. Notice also that this is distinct from the type

 bool -> bernoulli<bool,1>

which is what we say is a Bernoulli map from bool to bernoulli<bool,1>. In this case, it is a first-order Bernoulli map on the equality of its output, i.e.,

 Pr{bernoulli<bool -> bool,1>{id}(x) == id(x)} == p

Notice what the notation suggests, too. We are writing bernoulli<bool -> bool,1>{id} to indicate that the true value is id but what we observe is bernoulli<bool->bool,1>{id}. We cannot observe id directly. In fact, if we knew it was the identity function, we already know the correct output. We are interested in the case where we don’t know the correct output, and all we are given as evidence is the observation bernoulli<bool->bool,1>{id}.

So, we are applying the bernoulli concept to the function type bool -> bool, which in this case only has 4 possibilities. Clearly, we normally would not use a Bernoulli model for bool -> bool, and rather, the Bernoulli model would be induced by some source of error, such as transmission over a noisy channel, as previously described. We stick to this simple example for now, though, because it is much more manageable to work with, and we can generalize the results to $X \to Y$ where $X$ and $Y$ are arbitrary types, i.e., we observe bernoulli<X->Y,K>{f} and wish to use that to compute the probability that $f(x) = y$ for some $x \in X$ and $y \in Y$.

Notice that we do not change the type of the input, $X$. This is a first-order Bernoulli map. We can, of course, also provide as input to this function a Bernoulli Boolean value, e.g., bernoulli<bool,1>{true}, and we will get an even higher-order Bernoulli Boolean value as output. In this case, we will have a higher-order Bernoulli map of type

 bernoulli<bool,1> -> bernoulli<bool>

where for the output we drop the order information, and track the error rates using interval arithmetic, which we will discuss later.

Since functions are values, we can also ask the question, what is the probability that bernoulli<bool->bool,1>{id} == id? In this case, we are asking about the equality of the functions, which is mathematically equivalent to asking whether each input in the domain maps to the same output, i.e.,

Pr{bernoulli<bool->bool,1>{id}(true) == id(true) &&
bernoulli<bool->bool,1>{id}(false) == id(false)}

Since this is a first-order model, the probability that both conditions are true is just the product of the probabilities of each condition being true, i.e.,

Pr{bernoulli<bool->bool,1>{id}(true) == id(true)} *
Pr{bernoulli<bool->bool,1>{id}(false) == id(false)} = p^2.
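The same product rule gives the probability of observing each of the four functions when the latent function is id. The sketch below assumes, as above, that the output on each input is independently correct with probability p; `NAMES` and `observed_distribution` are hypothetical names.

```python
from itertools import product

# Each function bool -> bool is identified by the pair (f(true), f(false)).
NAMES = {
    (True, False): "id",
    (False, True): "not",
    (True, True): "true",
    (False, False): "false",
}


def observed_distribution(latent, p):
    """P(observed function | latent function), each output independently correct w.p. p."""
    dist = {}
    for outputs in product([True, False], repeat=2):
        prob = 1.0
        for got, want in zip(outputs, latent):
            prob *= p if got == want else 1 - p
        dist[NAMES[outputs]] = prob
    return dist


row_id = observed_distribution((True, False), p=0.9)
```

Observing id itself has probability p^2, observing not has probability (1-p)^2, and each constant function has probability p * (1-p).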

Let’s fix p and consider the confusion matrix for the first-order model, bernoulli<bool->bool,1>. We used the standard naming convention for the outcomes of observations (bernoulli<bool->bool,1>{f}(x)) when compared against the actuality (the latent $f(x)$), where TPR is the true positive rate, FNR is the false negative rate, TNR is the true negative rate, and FPR is the false positive rate. The confusion matrix is given by Table 2.

Table 2: First-Order Bernoulli Model for bool -> bool over Booleans

	observe `true`	observe `false`
latent `true`	TPR $p$	FNR $1-p$
latent `false`	FPR $1-p$	TNR $p$

Note that in the above, we are not discussing the input; it is, after all, observable in this case. We are only discussing the output, which is latent, since we are pretending that we do not know we are dealing with, say, id. We are only given the observation bernoulli<bool->bool,1>{id}. As mentioned previously, there are only 4 possible functions of type bool -> bool, so if the error rate $1-p$ is reasonably small, we can probably estimate the true function with high confidence based on examining inputs with expected outputs.

We might ask the question, can the order N in bernoulli<bool->bool,N> be greater than 2? It is an interesting question. We only have two possible outcomes, true and false, so how could we have a higher-order model? The answer is that we are not tracking the order of the output, but rather, we are tracking the order of the Bernoulli Boolean function approximation. Since we know the type, bool -> bool, we know that there are only 4 possible functions.

Just as before, we knew we had a Boolean value. A Boolean value can only be true or false. We can’t observe the value directly, but we can observe a Bernoulli approximation of the value. For each observed value, we can have a unique probability that the latent value is true or false.

Let’s extend this to the discussion of functions of type bool -> bool. There are only 4 possible functions of this type, id, not, true, and false.

Now suppose we are given a Bernoulli approximation bernoulli<bool->bool>{id}. We do not know that the latent function is id; we only know that we have a function bernoulli<bool->bool>{id}, which can be either id, not, true, or false. The best guess for bernoulli<bool->bool>{id} is the function that it matches, assuming that the process that generates these approximations is unbiased.

Let’s construct the confusion matrix for bernoulli<bool->bool>.

Table 3: Bernoulli Model for bool -> bool

	observe `id`	observe `not`	observe `true`	observe `false`
latent `id`	$p_{1 1}$	$p_{1 2}$	$p_{1 3}$	$p_{1 4}$
latent `not`	$p_{2 1}$	$p_{2 2}$	$p_{2 3}$	$p_{2 4}$
latent `true`	$p_{3 1}$	$p_{3 2}$	$p_{3 3}$	$p_{3 4}$
latent `false`	$p_{4 1}$	$p_{4 2}$	$p_{4 3}$	$p_{4 4}$

Each row must sum to 1, $\sum_j p_{i j} = 1$, so we only have up to a maximum of $4 (4-1) = 12$ degrees of freedom. This means the highest Bernoulli Boolean order is 12 (bernoulli<bool->bool,12>), but we normally drop the order and just write bernoulli<bool->bool> and track the error rates using interval arithmetic, as mentioned a few times previously.

Now, when we have a Bernoulli approximation of some latent function of type bool -> bool, we wish to store the error information in the output so that we can propagate it forward. We do this by saying that the output is a Bernoulli Boolean, because it may or may not be correct, i.e., the Bernoulli process bernoulli<bool->bool> generates a function of type bool -> bernoulli<bool> rather than of type bool -> bool. In our algorithms, we created a type system for this, and this extra information can be discarded when tracking errors is not needed.

So, what happens when we have a Bernoulli model bernoulli<bool->bool>, and then we lift it to

bernoulli<bernoulli<bool>->bernoulli<bool>>

by providing bernoulli<bool> as input? When we compare the true output with this lifted Bernoulli model, we still get a maximum order of 12, but if the order is, say, 2, then this lifted model is likely to have a higher order.

The order of the model is not necessarily that important, but it does complicate estimation problems. It is also desirable to have higher-order models in some cases. For instance, if we have an entropy coder, then we want the diagonal of the confusion matrix to be as close to 1 as possible and the off-diagonal elements to be as close to 0 as possible, but when off-diagonal elements are not 0, we want functions that are more similar to the latent function to have larger probabilities than functions that are less similar to it. This is just a way of minimizing a loss function in ML, where the function truly is latent and we are trying to find the best approximation to it. The higher the order, the more capacity the model has to approximate the latent function, but the more data we need to estimate the parameters of the model.

ML is not really the target of the Bernoulli model, but it is a useful way to think about the model. The Bernoulli model is really a way of thinking about the uncertainty in the output of a function, and how that uncertainty propagates through a computation, and typically the uncertainty is due to a trade-off between space complexity and accuracy. The more space we use to represent the function, the more closely it is expected to approximate the latent function.

Noisy Turing machines: noisy logic gates

As we consider more complex compound data types, which may always be modeled as functions, we will see that there are many ways these types can participate in the Bernoulli Boolean model. When a Bernoulli value is introduced into the computational model, the entire computation outputs a final result that is a Bernoulli type, e.g., bernoulli<pair<T1,T2>>, pair<T1,bernoulli<T2>>, and so on.

The easiest way to think about this is to just consider a Universal Turing machine in which we build programs by composing circuits of binary logic-gates, like and, or, and not. In general, if we replace a single input into the circuit with a Bernoulli Boolean, the output of the circuit is one or more Bernoulli Booleans. Moreover, and more interestingly, we can replace some of the logic gates with noisy logic-gates, or Bernoulli logic-gates, and the output of the circuit is also a Bernoulli Boolean. We can always discard information about the uncertainty in the output of the circuit, and just get a Boolean, but if the uncertainty is non-negligible, then we may want to keep track of it.

So, let’s consider the set of binary functions f : (bool, bool) -> bool.

There are 2^2 = 4 possible functions f : bool -> bool since for each possible input, true or false, we have two possible outputs, true or false.

More generally, if we have f : X -> Y, then we have |Y|^|X| possible functions, where |.| denotes the cardinality of a set. For instance, if X = (bool, bool) and Y = bool, then we have 2^4 = 16 possible functions, since |X| = 4 and |Y| = 2.
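The count |Y|^|X| can be checked by brute force. Here is a minimal Python sketch that represents each function as a lookup table (a dictionary from inputs to outputs); the helper name `all_functions` is made up for illustration.

```python
from itertools import product

def all_functions(domain, codomain):
    """Enumerate every function domain -> codomain as a dict (lookup table)."""
    domain = list(domain)
    # Choosing one output per input gives len(codomain) ** len(domain) tables.
    return [dict(zip(domain, outputs))
            for outputs in product(codomain, repeat=len(domain))]

BOOL = (True, False)
unary = all_functions(BOOL, BOOL)                    # bool -> bool
binary = all_functions(product(BOOL, BOOL), BOOL)    # (bool, bool) -> bool

print(len(unary), len(binary))
```

The table for and is one of the 16 entries in `binary`.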

Each of these functions has a designated name, which we can use to refer to them, like and, xor, etc. However, we are just going to look at and.

Table 4: and : (bool, bool) -> bool

`x1`	`x2`	`and(x1, x2)`
true	true	true
true	false	false
false	true	false
false	false	false

Now, let’s consider

and : (bernoulli<bool,1>, bernoulli<bool,1>) -> bernoulli<bool,2>

This is more complicated than it might first seem. An error occurs if and returns true when it should return false, or vice versa. The input variables represent latent values, so they do not have a definite value.

We will go row by row, and examine the probability that the output is correct for each row.

Case 1: The Correct Output Is True

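In order for the output to be true, both noisy inputs must be true. Writing p1 and p2 for the probabilities that the first and second noisy inputs are correct, this has probability p1 * p2, the product of the probabilities of each condition being true, since they are statistically independent outcomes.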

Case 2: The Correct Output Is False Given `x1 = true` and `x2 = false`

Consider and(bernoulli<bool,1>{true}, bernoulli<bool,1>{false}). For this to be true, the first must be a true positive and the second must be a false positive, which is just p1 * (1-p2). Since we are interested in the probability that it correctly maps to false, that is just 1 - p1 * (1-p2) = 1 - p1 + p1 * p2.

Case 3: The Correct Output Is False Given `x1 = false` and `x2 = true`

Consider and(bernoulli<bool,1>{false}, bernoulli<bool,1>{true}). For this to be true, the first must be a false positive and the second must be a true positive, which is just (1-p1) * p2. Since we are interested in the probability that it maps correctly to false, that is just 1 - (1-p1) * p2 = 1 - p2 + p1 * p2.

Case 4: The Correct Output Is False Given `x1 = false` and `x2 = false`

Consider and(bernoulli<bool,1>{false}, bernoulli<bool,1>{false}). For this to be true, both must be false positives, which is just (1-p1) * (1-p2). Since we are interested in the probability that it maps correctly to false, that is just 1 - (1-p1) * (1-p2) = p1 + p2 - p1 * p2.

Summary

Table 6: and with Bernoulli inputs

`x1`	`x2`	`and(x1,x2)`	`Pr{correct}`
1	1	1	`p1 * p2`
1	0	0	`1 - p1 + p1 * p2`
0	1	0	`1 - p2 + p1 * p2`
0	0	0	`p1 + p2 - p1 * p2`
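The closed forms in the table can be double-checked by enumerating which inputs were observed correctly. The following is a minimal Python sketch under the same assumptions as the cases above (each input is correct independently, with probabilities p1 and p2); the function name `pr_and_correct` is hypothetical.

```python
from itertools import product

def pr_and_correct(x1, x2, p1, p2):
    """Probability that and(noisy x1, noisy x2) equals and(x1, x2).

    p1 and p2 are the probabilities that each input is observed correctly;
    the two observations are assumed independent.
    """
    total = 0.0
    for ok1, ok2 in product((True, False), repeat=2):
        # The observed value is the latent value if correct, flipped otherwise.
        o1 = x1 if ok1 else not x1
        o2 = x2 if ok2 else not x2
        weight = (p1 if ok1 else 1 - p1) * (p2 if ok2 else 1 - p2)
        if (o1 and o2) == (x1 and x2):
            total += weight
    return total

p1, p2 = 0.9, 0.8
for x1, x2 in product((True, False), repeat=2):
    print(x1, x2, round(pr_and_correct(x1, x2, p1, p2), 4))
```

Each printed value should agree with the corresponding `Pr{correct}` expression evaluated at the same p1 and p2.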

We see that and : (bernoulli<bool,1>, bernoulli<bool,1>) -> bernoulli<bool,4> induces an output that is a fourth-order Bernoulli Boolean. How is this possible when there are only two possible outputs? The answer is that the output is dependent on four different combinations of inputs.

Since x1 and x2 are latent, we can only talk about the probability that the output is correct or not. We see that when the correct output is 1, the probability that the output is correct is p1 * p2. When the correct output is 0, the probability that it is correct is more complicated, since it depends on which of the three remaining input combinations occurred.

We could store all of this information in the type bernoulli<bool,4>, but it is probably more convenient to use interval arithmetic, where we store a range of probabilities for the probability that the Boolean value being stored is correct. The best choice is just the minimum-length interval that contains all of the relevant probabilities for the output being correct. When the correct output is 1, the minimum spanning interval is the single point p1 * p2, and when the correct output is 0, it is the minimum span of

min_span{1 - p1 + p1 * p2, 1 - p2 + p1 * p2, p1 + p2 - p1 * p2}
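As a rough sketch of what this looks like in code, here is a small Python version. The names `min_span` and `and_correctness_intervals` are only illustrative, and an interval is represented as a plain `(lo, hi)` tuple.

```python
def min_span(probabilities):
    """Smallest closed interval (lo, hi) containing every probability."""
    values = list(probabilities)
    return (min(values), max(values))

def and_correctness_intervals(p1, p2):
    """Intervals for Pr{correct}, grouped by the correct output of `and`."""
    when_true = min_span([p1 * p2])
    when_false = min_span([
        1 - p1 + p1 * p2,   # x1 = true,  x2 = false
        1 - p2 + p1 * p2,   # x1 = false, x2 = true
        p1 + p2 - p1 * p2,  # x1 = false, x2 = false
    ])
    return when_true, when_false

print(and_correctness_intervals(0.9, 0.8))
```

With p1 = 0.9 and p2 = 0.8, the interval for a correct output of 0 runs from roughly 0.82 to 0.98, while the interval for a correct output of 1 collapses to a single point near 0.72.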

As we compose more and more logic circuits together, we can keep track of the minimum spanning intervals on outputs using interval arithmetic.

Let’s come back to the idea of Bernoulli types over compound types. In particular, let’s consider applying the Bernoulli approximation to binary functions of the type (bool, bool) -> bool.

Now, we can apply the Bernoulli approximation

bernoulli<(bool, bool) -> bool>

which will generate functions of the type

(bool, bool) -> bernoulli<bool>

This may be thought of as a noisy binary logic-gate. For the case of the and gate, what we observe in our model is bernoulli<(bool, bool) -> bool>{and}, and it can generate up to 16 different Bernoulli Boolean functions. That means that the maximum order is $16 (16 - 1) = 240$, which isn’t really important, but it’s interesting to note.

Of course, if we have this noisy and function and then put in noisy inputs, then we get a function of type

(bernoulli<bool>, bernoulli<bool>) -> bernoulli<bool>

Uses and limits of abstractions

Sat, 17 Jun 2023 00:00:00 +0000

I’ve been thinking about the power and limitations of abstractions in our understanding of the world. This blog post is from a chat I had with ChatGPT, which can be found here and here.

I’m not sure if this is a good blog post, but I’m posting it anyway. It’s remarkable how quickly you can slap stuff like this together, and I’m not sure this is saying anything valuable, particularly since it only required a bit of prompting from me.

Uses and limits of abstractions

Reality, in all its richness, is far more complex than we can appreciate. Our attempts to understand and navigate it necessitate the use of abstractions, compressions that retain the salient details relevant to a specific context while discarding the rest. These abstractions are indispensable to human cognition, enabling us to engage with parts of reality despite our limited cognitive capacity and incomplete information, but there are also parts of reality that may be fundamentally off-limits to us.

Limited working memories

Human cognitive abilities are bounded. For instance, our working memory can effectively hold and process only a limited amount of information at once. Cognitive psychology often references the “magic number seven”, suggesting that most adults can hold between five and nine items in their working memory.

Consider a situation where we’re dealing with multiple variables $(x_1, x_2, x_3, x_4)$. Our brain might struggle to simultaneously process the joint distribution of these variables due to the limitation of our working memory. However, if we create an abstraction where $X$ represents $(x_1, x_2)$ and $Y$ represents $(x_3, x_4)$, we simplify the cognitive task to handling the joint distribution of just two variables $(X,Y)$, which is a more manageable task. This constraint necessitates the use of abstractions in order to understand complex systems.

Incomplete information

Beyond our cognitive limitations, we also lack complete information about any real-world system. We cannot have total information about the systems we’re trying to understand.

Again, we may use abstractions to deal with this limitation, abstractions that allow us to think more clearly about some parts of the system (that can be observed and usefully reduced) while ignoring other parts (that are not observable or not usefully compressible). For instance, a key concept in statistical mechanics is entropy, which allows us to reason about the behavior of systems with a large number of particles that behave according to some statistical regularities in the aggregate. We might be able to observe certain features of a system, such as the size of a box and its temperature, but there’s much we don’t know, such as the microstates the system is in at a given moment.

However, knowing the temperature and size of the box, we can make useful predictions about the system, such as what its temperature in one hour will be, or whether it will explode if we add a certain amount of heat to it. We can ask certain questions about it, but we cannot ask questions that require knowing the microstate of the system. And, ultimately, everything about the system is determined by its microstate, and so there are many questions we cannot answer.

Entropy allows us to reason about the system using available observations, while acknowledging the underlying complexity that we can’t observe or don’t yet understand. More generally, despite our limited understanding and inability to perceive the entirety of a complex system, we still aim to make meaningful assertions about it. This is where the role of abstractions, like entropy, becomes particularly significant, acting as a cognitive scaffold that allows us to grasp some aspects of the system’s behavior.

The idea of entropy itself can be generalized. It is a compelling concept in information theory, defined as $$ H(X) = -\sum_{x \in X} p(x) \log p(x), $$ where $X$ is a random variable and $p(x)$ is the probability of $X$ taking on the value $x$. When dealing with microstates and supposing that each state is equally probable (which is a reasonable approximation in the case of a gas-filled box), entropy “simplifies” to the logarithm of the number of different states the system can be in that are compatible with what we can observe about the system.
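To spell out the “simplifies” step: if there are $N$ states compatible with what we can observe ($N$ is notation introduced here for illustration), and each is assumed equally probable so that $p(x) = 1/N$, then $$ H(X) = -\sum_{x \in X} \frac{1}{N} \log \frac{1}{N} = \log N. $$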

Emergent behavior

It’s important to remember, however, that such reasoning can only get us so far. Some systems have a complexity that is fundamentally irreducible, characterized by emergent behavior that can only be discerned when considering the system as a whole. This is related to our limited working memories, and our need for creating abstractions to work around our limitations. However, there are some systems that are so complex that they cannot be reduced to simpler parts. The behavior of the system as a whole is not just the sum of the behavior of its parts, but is something new and different. This is known as emergent behavior, and it is a key feature of many complex systems.

Consider the earlier example, where we were dealing with multiple variables $(x_1, x_2, x_3, x_4)$ and reduced it to just two variables $(X,Y)$, where $X = (x_1, x_2)$ and $Y = (x_3, x_4)$. This abstraction is useful in many contexts, but it is not always appropriate. For instance, if $x_1$ and $x_4$ are correlated in some significant way, perhaps only in the distant future, then the reduction to $X$ and $Y$ may fail to capture this salient feature. In this case, to understand the important parts of the system that we are interested in, we need to consider the joint distribution of the full set of variables $(x_1, x_2, x_3, x_4)$.

A popular example is “water is wet”. This is a true statement, but it is not true of any of the individual molecules that make up water. It is an emergent property of the system as a whole, in which billions of these simple molecules interact in locally simple ways. The wetness of water is not a property of the individual molecules, but an emergent property of the system as a whole.

It may even be the case that something like consciousness is an emergent property in an even more complex way, such that it cannot be reduced to the behavior of individual neurons or small groups of neurons, but only emerges as a property of the integrated behavior of the entire brain. Indeed, we may be talking about a system with $(x_1, x_2, \ldots, x_n)$, $n = \mathcal{O}(\text{# neurons})$.

Let’s consider again the concept of entropy. A system whose representation cannot be significantly compressed without losing vital information may be said to have emergent properties. The earlier example of a box of gas is a good example of a system that can be significantly compressed, by considering only its temperature and dimensions. Knowing that information allows us to say a lot about the system. When a useful compression is not possible, we cannot compress the system in a way that fits our cognitive limitations, or in a way that is not sensitive to a lack of knowledge.

This is why consciousness may feel so mysterious and inexplicable. We cannot understand it because we cannot reduce it to a simpler system that we can, say, program on a computer using our cognition. (This is where machine learning, deep learning, LLMs, and so on become particularly relevant for solving problems that are too complex for us to solve analytically.)

Abstractions as cognitive scaffolds

Abstractions are indispensable to human cognition, enabling us to engage with reality despite our limited cognitive capacity. They allow us to reason about complex systems, even when we lack complete information about them. They also allow us to communicate with others, sharing ideas and knowledge across fields and disciplines. In creating abstractions, we walk a delicate balance. We need to remember these key aspects:

Imperfect representation

Abstractions are, by design, reductions of reality. While they help us manage complexity, most will fail to capture all the necessary information. As our needs evolve, we find ourselves tweaking the abstraction, adding layers of complexity that can make it harder to reason about and thereby diluting its initial utility. There is a constant balancing act in maintaining simplicity while preserving relevance.

This is also known as “the map is not the territory” problem. The map is a useful representation of the territory, but it is not the territory itself.

Abstractions are contextual

They are useful in certain contexts but not in others. We must be aware of the context in which we are using an abstraction and understand its limitations. We must also be aware of the context in which the abstraction was created and the assumptions that were made in its creation. This is particularly important when using abstractions from other fields.

Pedagogical

While an abstraction might not always provide an accurate representation of the object in real-world settings, it serves a valuable educational purpose. It enables us to learn key features about the object and acts as a bootstrap technique for further understanding. Even when ready-made abstractions fall short in our context, we can use principles of reductionism, analogy, and more to grapple with the complexity. Yet, we must remember that many phenomena are cross-cutting or emergent and cannot be fully understood through this process.

Communication

Abstractions are a key tool for communication. They allow us to share ideas and knowledge with others, even those outside our field. However, since an expert is often aware of the many nuances and limitations of an abstraction, they might not be the best person to explain it to a novice. It’s important to frequently ask, “How could I explain this to a five-year-old?” This helps us stay grounded, facilitates cross-pollination of ideas, and allows for broader comprehension.

These points demonstrate a dance between simplicity and complexity that we engage in when creating and using abstractions. Like a map, an abstraction is not the territory but a simplified representation of it. As we journey through the landscape of knowledge, these representations guide us, helping us ‘see the forest for the trees’, even as we acknowledge their inherent limitations.

Conclusion

Abstractions are indispensable to human cognition, enabling us to engage with reality despite our limited cognitive capacity and incomplete information. They allow us to reason about complex systems, even when we lack complete information about them. They also allow us to communicate with others, sharing ideas and knowledge across fields and disciplines.

However, we must be cognizant of the limitations of abstractions. They are, by design, reductions of reality. While they help us manage complexity, most will fail in some way. Much of reality may, in fact, be computationally irreducible (to borrow a phrase from Wolfram).

Working memory as an inductive bias

Sat, 17 Jun 2023 00:00:00 +0000

This blog post is from a chat I had with ChatGPT, which can be found here and here.

I’m not sure if this is a good blog post, but I’m posting it anyway. It’s remarkable how quickly you can slap stuff like this together, and I’m not sure this is saying anything valuable, particularly since it only required a bit of prompting from me.

Working memory as an inductive bias

Human cognitive abilities, while remarkable, are bounded. Our working memory can effectively hold and process only a limited amount of information at once. Cognitive psychology often references the “magic number seven”, suggesting that most adults can hold between five and nine items in their working memory. This constraint necessitates the use of abstractions in order to understand complex systems.

But here’s the rub: In condensing reality to fit our cognitive capacities, we risk losing vital information. For instance, a critical relationship (that may only be critical in a certain context) between $x_2$ and $x_4$ might be discarded in our new model. This often leads to the situation where “the whole is greater than the sum of the parts.” In other words, the full, nuanced understanding of the system may be irreducible, with important aspects of its behaviour emerging only when all variables are considered together. Such emergent phenomena represent a key challenge in working with abstractions.

Working Memory and Inductive Bias

Our small working memory influences our reasoning abilities and shapes our understanding of the world around us, in effect serving as an inductive bias. It’s like a filter, shaping the patterns we detect and the generalizations we form based on the information we encounter.

This constraint, however, might not be entirely detrimental. It could even be an advantage, given the regularities in our reality. Think of it as a form of regularization in machine learning, where constraints prevent the model from overfitting the training data, thereby improving generalization to unseen instances. If we had much larger working memories, we might be prone to overfitting to our past observations, impairing our ability to adapt and survive in new situations, particularly those on the ‘long tail of the distribution’. Our survival, after all, depends on avoiding catastrophic mistakes, even after decades of mostly beneficial decisions.

This perspective aligns with principles like Occam’s razor and Solomonoff’s theory of inductive inference, which favor simpler theories or models that sufficiently explain observed phenomena. The complexity of the model, and thus its capacity, is regulated to avoid overfitting and ensure better generalization.

The Limits of Our Understanding

The inductive bias imposed by our limited working memory might be advantageous in the human niche, but it’s essential to consider its potential shortcomings. Could there be aspects of reality that remain inaccessible to us due to our cognitive constraints?

Take, for instance, the phenomenon of consciousness. Understanding how self-awareness arises in a system may require accounting for the joint distribution of an astronomical number of variables. If this complexity is irreducible, our cognitive apparatus, bound by its inductive bias, may simply be inadequate for fully comprehending consciousness.

Indeed, it’s conceivable that vast swaths of reality might be fundamentally off-limits to human cognition, forever obscured by the constraints of our cognitive architecture. The complexity of these phenomena may defy simplification, making them impervious to our attempts at understanding through abstractions.

As we strive to push the boundaries of understanding, we should remain mindful of the limits imposed by our cognitive capacities and the constraints of our abstractions.

The Unconscious Mind and LLMs

In my next blog post, I’ll explore unconscious cognition, which makes up most of our mental activity (system 1 vs system 2 thinking), and the recent progress in machine learning, particularly transformers and the LLM revolution.

Bernoulli model

Sat, 10 Jun 2023 00:00:00 +0000

likelihood.model R package

Mon, 10 Apr 2023 00:00:00 +0000

The R package likelihood.model provides an API for specifying likelihood models for statistical inference.

The basic likelihood model is a concept: for an object to satisfy it, the object must implement a number of functions (generic methods). The package provides two different implementations of the concept:

likelihood_contr_model is a flexible framework for specifying likelihood models based on the idea of independent likelihood contributions for different types of observations, e.g., right-censored versus exact observations, or other kinds of observations. This model is designed to accommodate more specialized likelihood models, such as series systems with latent components, where the data about the components may be ambiguous, e.g., masked failure causes.
likelihood_name_model provides a convenient wrapper for distribution functions that follow the naming and argument conventions in the R ecosystem, e.g., the distribution norm (normal) has dnorm, pnorm, rnorm, and qnorm, which are respectively the density function, cumulative distribution function, sampler, and quantile function of the normal distribution. These functions also have standard parameter arguments, e.g., pnorm has a lower.tail Boolean parameter: if TRUE it computes the CDF, and otherwise the survival function. Note that this model may be used to provide contributions to likelihood_contr_model.
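The idea of independent likelihood contributions can be sketched in a few lines. The following is a Python illustration of the concept only, not the R package's API: the function names, the `(time, type)` data layout, and the choice of an exponential lifetime model are all assumptions made for the example.

```python
import math

def exact_contribution(t, rate):
    """Log-density of an exponential lifetime observed exactly at time t."""
    return math.log(rate) - rate * t

def right_censored_contribution(t, rate):
    """Log-survival: the lifetime is only known to exceed time t."""
    return -rate * t

# Dispatch table from observation type to its log-likelihood contribution.
CONTRIBUTIONS = {
    "exact": exact_contribution,
    "right_censored": right_censored_contribution,
}

def log_likelihood(observations, rate):
    """Sum independent contributions over (time, type) observations."""
    return sum(CONTRIBUTIONS[kind](t, rate) for t, kind in observations)

data = [(1.2, "exact"), (0.7, "exact"), (2.0, "right_censored")]

# For this simple model the MLE has a closed form:
# (number of exact observations) / (total observed time).
mle = 2 / (1.2 + 0.7 + 2.0)
print(mle, log_likelihood(data, mle))
```

Adding a new observation type in this sketch only requires adding another entry to the dispatch table, which is the design point the contribution-based model is after.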

The package is designed to be used with the algebraic.mle package, which provides a framework for performing maximum likelihood estimation (MLE).

Rate-distorted perfect hash filter

Mon, 10 Apr 2023 00:00:00 +0000

Problem set solutions

Fri, 31 Mar 2023 00:00:00 +0000

I have a fairly broad interest in problem-solving, from problems in statistics to algorithms. Over the years, I’ve accumulated a collection of problem sets, which I will be refining and posting here every now and then.

You can find these problem sets on various topics here. It’s accessible through the menu bar as well. Please note that while I don’t claim expertise in all areas, these materials may still provide valuable insights.

Using GPT-4 to Build a Simple HTML File Search Interface

Fri, 31 Mar 2023 00:00:00 +0000

This blog post was written by GPT-4. See the conversation with GPT-4 that built it here. The interface to browse/search is here. It’s really not fancy, and I’ve never had much of an interest in doing this kind of front-end work, but GPT-4 makes it pretty easy. Btw, I just realized that I probably forgot to reindex the database before submitting, so it doesn’t seem to be able to find that conversation except by browsing to it.

Introduction

In this blog post, I’ll walk you through how I built a simple search and browse interface for a set of HTML files using Flask, Whoosh, and the invaluable help of GPT-4, the AI assistant from OpenAI. Together, we tackled several challenges and created a functional and visually appealing solution.

Setting up the Flask application

With GPT-4’s guidance, I started by setting up a basic Flask application, which served as the foundation for our search and browse interface.

Indexing HTML files with Whoosh

GPT-4 helped me use the Whoosh library to create an index of the HTML files. I specified a base directory for the HTML files and grouped them accordingly. Whoosh took care of indexing the files and provided a powerful search capability.
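Whoosh's own API is not shown here. As a stand-in, the following pure-Python sketch builds a toy inverted index over HTML documents using only the standard library, to illustrate the general idea of a full-text index. The document names and all helper names are hypothetical, and a real index does much more (ranking, stemming, on-disk storage).

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_words(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, html in documents.items():
        for word in html_to_words(html):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents containing every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*[index.get(w, set()) for w in words])

docs = {
    "a.html": "<html><body><h1>Flask search</h1><p>Indexing files</p></body></html>",
    "b.html": "<html><body><p>Browse files</p></body></html>",
}
index = build_index(docs)
print(search(index, "files"), search(index, "flask files"))
```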

Creating search and browse routes

GPT-4 and I implemented search functionality by creating a search route in the Flask app. We also developed a browse route that allowed users to navigate the file hierarchy.

Displaying search results and browsing files

We created templates to display search results, as well as to browse and view the HTML files. We used the Jinja2 template engine and incorporated the Bootstrap framework to style our interface.

Adding search history and popular searches

To enhance the user experience, we implemented a search history feature that showed the last 10 search queries performed. We also added a popular searches section that persisted across user sessions.
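A search history and a popular searches list are both small bookkeeping structures. Here is a minimal in-memory Python sketch using the standard library; the class name `SearchLog` is hypothetical, and persisting the counts across sessions (for example by writing them to a file) is left out.

```python
from collections import Counter, deque

class SearchLog:
    """Tracks the most recent queries and how often each query was made."""

    def __init__(self, history_size=10):
        self.recent = deque(maxlen=history_size)  # oldest entries fall off
        self.counts = Counter()

    def record(self, query):
        query = query.strip().lower()
        if query:
            self.recent.append(query)
            self.counts[query] += 1

    def history(self):
        """Most recent query first."""
        return list(reversed(self.recent))

    def popular(self, n=5):
        """The n most frequently recorded queries."""
        return [q for q, _ in self.counts.most_common(n)]

log = SearchLog()
for q in ["flask", "whoosh", "flask", "jinja2"]:
    log.record(q)
print(log.history(), log.popular(1))
```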

Throughout the process, GPT-4 provided me with valuable advice, helping me debug issues and implement new features. With GPT-4’s assistance, I was able to create a functional search and browse interface for HTML files with ease. Thanks to GPT-4, I now have a practical solution that I can build upon and customize further!

algebraic.mle R package

Wed, 29 Mar 2023 00:00:00 +0000

The R package algebraic.mle provides an algebra over Maximum Likelihood Estimators (MLEs). These estimators possess many desirable, well-defined statistical properties which the package helps you manipulate and utilize.

femtograd R package

Wed, 29 Mar 2023 00:00:00 +0000

The R package femtograd provides a way of doing automatic differentiation. It’s not particularly fast, as it was just an experiment, but I may end up using it in my likelihood.model or algebraic.mle packages.

I will probably optimize it and provide a C++ implementation using Rcpp, but I don’t want to spend too much time on it since there are already nice packages that do this.
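For readers unfamiliar with automatic differentiation, here is a minimal Python sketch of one flavor of it, forward mode with dual numbers. This is not a description of how femtograd works internally, and the names `Dual`, `exp`, and `derivative` are made up for the example.

```python
import math

class Dual:
    """A value together with its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        # Plain numbers are constants, so their derivative is 0.
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def exp(x):
    x = Dual.lift(x)
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)  # chain rule

def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).deriv

f = lambda x: 3 * x * x + exp(2 * x)
print(derivative(f, 0.5))
```

Here the derivative of `f` is 6x + 2 exp(2x), so the printed value can be compared against that formula at x = 0.5.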

Model Selection in Weibull Series Systems

Wed, 29 Mar 2023 00:00:00 +0000

The repo on GitHub contains the preliminary work. It follows from the Future Work section in the paper, which is here. In particular, it explores the homogeneous shape parameter Weibull series system, which is a simplification of the model and results in a system with a lifetime that is itself Weibull distributed.
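To see why the system lifetime is Weibull, here is a short derivation. The notation is introduced here and may differ from the paper's: suppose the $m$ component lifetimes are independent Weibull with common shape $k$ and scales $\lambda_1, \ldots, \lambda_m$, and the system lifetime $T$ is their minimum. Then the system survival function is $$ S_T(t) = \prod_{j=1}^m \exp\left\{-\left(\frac{t}{\lambda_j}\right)^k\right\} = \exp\left\{-\left(\frac{t}{\lambda}\right)^k\right\}, \qquad \lambda = \left(\sum_{j=1}^m \lambda_j^{-k}\right)^{-1/k}, $$ which is the survival function of a Weibull distribution with shape $k$ and scale $\lambda$.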

Shoot me an email at lex@metafunctor.com if you’re interested in collaborating on any projects.

Reliability Estimation in Series Systems

Wed, 29 Mar 2023 00:00:00 +0000

The repo on GitHub contains the code and data for my master’s project in mathematics and statistics at SIUe. The project is titled “Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data”.

The paper citation page is here.

You can view the PDF here. The abstract is as follows:

This paper investigates maximum likelihood techniques to estimate component reliability from masked failure data in series systems. A likelihood model accounts for right-censoring and candidate sets indicative of masked failure causes. Extensive simulation studies assess the accuracy and precision of maximum likelihood estimates under varying sample size, masking probability, and right-censoring time for components with Weibull lifetimes. The studies specifically examine the accuracy and precision of estimates, along with the coverage probability and width of BCa confidence intervals. Despite significant masking and censoring, the maximum likelihood estimator demonstrates good overall performance. The bootstrap yields correctly specified confidence intervals even for small sample sizes. Together, the modeling framework and simulation studies provide rigorous validation of statistical learning from masked reliability data.

It has a companion R package, wei.series.md.c1.c2.c3, which is a narrow library I developed in support of my master’s project at SIUe in mathematics and statistics. It provides a set of functions for fitting Weibull series systems with masked failure data. See the GitHub repo for more.

I’m working on several spinoff papers. See the Future Work section in the paper for more details.

Shoot me an email at lex@metafunctor.com if you’re interested in collaborating on any projects.

wei.series.md.c1.c2.c3

Wed, 29 Mar 2023 00:00:00 +0000

The R package wei.series.md.c1.c2.c3 is a narrow library I developed in support of my master’s project at SIUe in mathematics and statistics. I made a post about the project here. It provides a set of functions for fitting Weibull series systems with masked failure data. See the GitHub repo for more details.

algebraic.dist R package

Sun, 19 Mar 2023 00:00:00 +0000

The R package algebraic.dist provides an algebra over distributions. It’s not fully-formed yet, but I plan on using it for a lot of my future work. For instance, I’ll move a lot of the code in algebraic.mle and likelihood.model to this package.

After that, I want to experiment with using the algebraic.dist to do the following:

Compose distributions such that operations over distributions generate other known distributions.

There are a lot of well-known compositions, such as the minimum of independent exponential distributions being an exponential distribution, or the sum of independent normal distributions being a normal distribution, but there is a very large space of possible compositions that are not as well-known or well-studied that I want to explore.
Let people use an R expression to lazily compose functions of distributions. Simplifying a distribution expression will generate the simplest R expression that represents the same distribution.

Sometimes, this may result in a simple closed-form distribution, like a multivariate normal distribution, but in other cases it may result in a (hopefully simpler) expression that composes multiple distributions and operations over them.

With these R expressions that represent distributions, we can define more operations, like taking the limiting distribution of a sequence of distributions, say the limit as $n \to \infty$ of $\sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n X_i - \mu \right)$ for i.i.d. $X_i$ with mean $\mu$ and finite variance, which is normal by the central limit theorem.
Deduce various properties of these distributions, such as their moments, variances, etc. Sometimes, this may require numerical integration or Monte Carlo methods, but if the expression simplifies to a known distribution, then we can use the known properties of that distribution.
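As a rough illustration of lazily composing and then simplifying distribution expressions, here is a small Python sketch. It is not the algebraic.dist API (which is in R); the class names and the `simplify` function are hypothetical, and every rule assumes the component random variables are independent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exponential:
    rate: float

@dataclass(frozen=True)
class Normal:
    mean: float
    var: float

@dataclass(frozen=True)
class Min:
    args: tuple

@dataclass(frozen=True)
class Sum:
    args: tuple

def simplify(expr):
    """Rewrite an expression over independent distributions to a known form."""
    if isinstance(expr, (Min, Sum)):
        args = tuple(simplify(a) for a in expr.args)
        if isinstance(expr, Min) and all(isinstance(a, Exponential) for a in args):
            # Minimum of independent exponentials: the rates add.
            return Exponential(sum(a.rate for a in args))
        if isinstance(expr, Sum) and all(isinstance(a, Normal) for a in args):
            # Sum of independent normals: means and variances add.
            return Normal(sum(a.mean for a in args), sum(a.var for a in args))
        return type(expr)(args)  # no rule applies; keep the expression lazy
    return expr

print(simplify(Min((Exponential(1.0), Exponential(2.5)))))
print(simplify(Sum((Normal(0.0, 1.0), Exponential(1.0)))))
```

When no rule applies, as in the second call, the expression is returned unevaluated, which is where numerical integration or Monte Carlo methods would take over.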

I have a lot of this code in place in C++, but I want to re-implement it in R so that it’s more accessible to others. I may also implement some of the more interesting compositions in C++ and expose them to R via Rcpp, but I’m not sure yet. I use a lot of templates and metaprogramming in C++, and I’m not sure how well that will translate to Rcpp.

R Package: Dynamic failure rate (DFR) distributions

Wed, 11 Jan 2023 00:00:00 +0000

Note: This is quite unfinished. I plan on fleshing this out when I revisit my master’s project work on estimating latent failure rate distributions. I plan on generalizing the result to allow for arbitrary failure rates and to use the MLE approach to estimate the parameters of the failure rate, since the MLE approach is more general and can be used to estimate the parameters given more complicated data generating processes.

The R package dfr_dist provides an API for specifying and estimating dynamic failure rate distributions. They can depend on the data in any way, as the failure rate is any function of time and any set of predictors, as long as the failure rate satisfies two key properties:

It’s non-negative. It is not meaningful to have a negative failure rate; the failure rate can decrease sometimes, and even go to $0$, though.
Its cumulative hazard has a limit of infinity, $\lim_{t \to \infty} H(t, x_1, \ldots, x_p) = \infty$. If this isn’t satisfied, then the survival function does not decay to $0$, so the lifetime distribution is not a proper distribution.
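
To make the second condition concrete, recall the standard relationships between the failure rate $h$, the cumulative hazard $H$, and the survival function $S$ (predictors are suppressed here for brevity):

$$ H(t) = \int_0^t h(u)\,du, \qquad S(t) = \exp\{-H(t)\}. $$

As an illustrative example that is not taken from the package, the non-negative rate $h(t) = e^{-t}$ gives $H(t) = 1 - e^{-t} \to 1$, so $S(t) \to e^{-1} \approx 0.37$: roughly 37% of the probability mass never fails, and the lifetime distribution is improper. By contrast, a Weibull failure rate $h(t) = (k/\lambda)(t/\lambda)^{k-1}$ with $k, \lambda > 0$ has $H(t) = (t/\lambda)^k \to \infty$, so $S(t) \to 0$ as required.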

This object satisfies all of the requirements of an algebraic.dist and a likelihood model. It is also designed to work well with the algebraic.mle package, which provides a framework for performing maximum likelihood estimation (MLE) and retrieving various statistical properties of the MLEs.

A Boolean Algebra Over Trapdoors

Fri, 17 Jun 2022 00:00:00 +0000

The PDF version of this post is available on GitHub.

Consider the Boolean algebra

$$ A := (\mathcal{P}(X^*), \land, \lor, \neg, \emptyset, X^*) $$

where $\mathcal{P}$ is the powerset, $X$ is the alphabet, and $X^*$ is the free monoid on $X$ (all finite strings over $X$, including the empty string $\epsilon$), which is closed under concatenation,

$$ \# : X^* \to X^* \to X^*. $$

For example, if

$$ X = \{a,b\} $$

then

$$ X^* = \{\epsilon, a, b, aa, ab, ba, bb, aaa, aab, \ldots \} $$

and $\mathcal{P}(X^*)$ is the power set of $X^*$,

$$ \mathcal{P}(X^*) = \{\emptyset, \{\epsilon\}, \{a\}, \{b\}, \{aa\}, \{a,aa\}, \{a,bb\}, \ldots \}. $$

Consider the Boolean algebra

$$ B := (\{0,1\}^n, \&, |, \sim, 0^n, 1^n) $$

and suppose we have a homomorphism

$$ F : A \to B $$

defined in the following way. First, we have a cryptographic hash function

$$ \mathrm{hash} : X^* \to \{0,1\}^n $$

that a priori distributes uniformly over $\{0,1\}^n$, i.e., each element of $X^*$ maps to any given element of $\{0,1\}^n$ with probability $2^{-n}$.

Then the homomorphism $F$ maps each string $a$ in $X^*$ to a bit string in $\{0,1\}^n$ by applying the hash function to $a$ concatenated with a secret $s$, $$ F a := \mathrm{hash}(a \# s). $$

Note #1: Later, we generalize this to mapping each $a$ in $X^*$ to multiple elements in $\{0,1\}^n$ proportional to $1/P[a]$.

Observe that $F$ is one-way, i.e., there is no homomorphism $G$ such that $$ F G B = A. $$

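The morphism $F$ is defined by the following assignments, where $a$ ranges over the elements of $X^*$:

$$ \begin{align*} a &:= \mathrm{hash}(a \# s)\\ \mathrm{and} &:= \&\\ \mathrm{or} &:= |\\ \mathrm{complement} &:= \sim\\ \{\} &:= 0^n\\ X^* &:= 1^n. \end{align*} $$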

Since multiple elements in $X^*$ map to the same element in $\{0,1\}^n$, it is a homomorphism rather than an isomorphism.

What is the probability that two distinct elements in $X^*$ map to the same element in $\{0,1\}^n$? That is to say, what is the probability of collision? Since $F$ distributes uniformly over $\{0,1\}^n$, it is just $$ \Pr\{\text{$x$ and $y$ collide}\} = 2^{-n}. $$ By the complement rule, the probability that they do not collide is $$ \Pr\{\text{$x$ and $y$ do not collide}\} = 1 - 2^{-n}. $$

Next, we define relations on sets. The set membership relation has a characteristic function

 in : X -> 2^X -> bool

which we define as

 F in a b := a \& b == a.

The subset relation has a predicate

 subset : 2^X -> 2^X -> bool

which we define as

 F subset a b := a \& b == a,

just as with the characteristic function, although they have different probabilistic features.
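
The two relations above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the bit strings are $n = 8$ bits long and hand-picked rather than produced by a real hash, a set is represented as the bitwise OR of its members' bit strings, and the names `bits`, `make_set`, `subset_of`, `A`, `B`, and `C` are hypothetical and do not come from the post's own code.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

constexpr std::size_t N = 8;   // bit-string length n (tiny, for illustration)
using bits = std::bitset<N>;

// The image of "or" under F is bitwise OR, so a set is the OR of its members.
bits make_set(std::initializer_list<bits> members)
{
    bits s;                    // all zeros: the image of the empty set
    for (bits const& m : members)
        s |= m;
    return s;
}

// Both "in" and "subset" reduce to the same test: a & b == a.
bool subset_of(bits const& a, bits const& b)
{
    return (a & b) == a;
}

// Hand-picked stand-ins for hash(x # s); a real system would compute these.
const bits A("00000101");
const bits B("00110000");
const bits C("00100100");      // covered by A | B although C is never added
```

With these values, `subset_of(A, make_set({A, B}))` and `subset_of(B, make_set({A, B}))` are both true, as expected. However, `subset_of(C, make_set({A, B}))` is also true even though `C` was never inserted, because every bit set in `C` happens to be set in `A | B`. This is the kind of false positive that gives the membership and subset tests their probabilistic character.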

If $X = \{a,b,c\}$, then $2^X = \{\{\},\{a\},\{b\},\{c\},\{a,b\},\{a,c\},\{b,c\},\{a,b,c\}\}$.

A Boolean index over $X$ is a Boolean algebra over $2^X$ with $\{\} = 0$ and $X = 1$ with the normal set operations. This is the structure that much of the prior work was over.

Note that a type that models power_set<trapdoor<X>> is one in which, given a value A of this type, each element a in A is a trapdoor<X> that can be independently observed. This makes it possible to operate on A as a normal set, with the exception that the mapping from trapdoors to values may not be obvious (although given a history, or a set of sets, frequency analysis or correlation analysis may reveal quite a bit).

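#include <algorithm>
#include <array>
#include <cstddef>
#include <stdexcept>
#include <typeinfo>
#include <variant>

using std::array;
using std::invalid_argument;
using std::size_t;
using std::variant;

// Stand-in for the approximate Boolean used below (an assumption; the original
// type is not shown): a value plus the probability that the value is incorrect.
template <typename T>
struct bernoulli
{
    T value;
    double error;
};
using approximate_bool = bernoulli<bool>;

// Stand-in for a single trapdoor (an assumption; the original type is not
// shown): the hash of one element of X under a secret key.
template <typename X, size_t N>
struct trapdoor
{
    array<char, N> value_hash{};
    array<char, 4> key_hash{};
};

// Apply a binary byte operation element-wise to two equally sized hashes.
template <size_t N, typename Op>
array<char, N> zip_bytes(array<char, N> const & a, array<char, N> const & b, Op op)
{
    array<char, N> out{};
    for (size_t i = 0; i < N; ++i)
        out[i] = static_cast<char>(op(a[i], b[i]));
    return out;
}

template <typename X, size_t N>
struct trapdoor_boolean_algebra
{
    using value_type = X;

    // makes the empty set (all bits zero)
    trapdoor_boolean_algebra() = default;
    trapdoor_boolean_algebra(array<char, N> const & v, array<char, 4> const & k)
        : value_hash(v), key_hash(k) {}
    trapdoor_boolean_algebra(trapdoor_boolean_algebra const &) = default;

    array<char, N> value_hash{};
    array<char, 4> key_hash{};
};

template <typename X, size_t N>
auto make_empty_trapdoor_set()
{
    return trapdoor_boolean_algebra<X,N>();
}

/**
The disjoint union operation is a partial function that is only defined
when the argument sets are disjoint (it is a dependent type). If they are
not disjoint, the operation has undefined behavior.
*/
template <typename X, size_t N>
auto operator+(
    trapdoor_boolean_algebra<X,N> const & x,
    trapdoor_boolean_algebra<X,N> const & y)
{
    if (x.key_hash != y.key_hash)
        throw invalid_argument("secret key mismatch");
    return trapdoor_boolean_algebra<X,N>(
        zip_bytes(x.value_hash, y.value_hash, [](char a, char b) { return a | b; }),
        x.key_hash);
}

// complement: flip every bit of the value hash
template <typename X, size_t N>
auto operator!(
    trapdoor_boolean_algebra<X,N> const & x)
{
    auto v = x.value_hash;
    for (auto & b : v)
        b = static_cast<char>(~b);
    return trapdoor_boolean_algebra<X,N>(v, x.key_hash);
}

// intersection: bitwise AND of the value hashes
template <typename X, size_t N>
auto operator*(
    trapdoor_boolean_algebra<X,N> const & x,
    trapdoor_boolean_algebra<X,N> const & y)
{
    if (x.key_hash != y.key_hash)
        throw invalid_argument("secret key mismatch");
    return trapdoor_boolean_algebra<X,N>(
        zip_bytes(x.value_hash, y.value_hash, [](char a, char b) { return a & b; }),
        x.key_hash);
}

template <typename X, typename Y, size_t N>
auto disjoint_union(
    trapdoor_boolean_algebra<X,N> const & x,
    trapdoor_boolean_algebra<Y,N> const & y)
{
    if (x.key_hash != y.key_hash)
        throw invalid_argument("secret key mismatch");
    return trapdoor_boolean_algebra<variant<X,Y>,N>(
        zip_bytes(x.value_hash, y.value_hash, [](char a, char b) { return a | b; }),
        x.key_hash);
}

// The second field of each result is the probability that the answer is
// incorrect; 0.5 is the placeholder value used in the original code.
template <typename X, size_t N>
bernoulli<bool> empty(trapdoor_boolean_algebra<X,N> const & xs)
{
    auto b = std::all_of(xs.value_hash.begin(), xs.value_hash.end(),
                         [](char c) { return c == 0; });
    return bernoulli<bool>{b, 0.5};
}

// membership test from the text: x is in xs when x & xs == x
template <typename X, size_t N>
bernoulli<bool> contains(
    trapdoor<X,N> const & x,
    trapdoor_boolean_algebra<X,N> const & xs)
{
    if (x.key_hash != xs.key_hash)
        throw invalid_argument("secret key mismatch");
    auto meet = zip_bytes(x.value_hash, xs.value_hash,
                          [](char a, char b) { return a & b; });
    return bernoulli<bool>{meet == x.value_hash, 0.5};
}

// subset test from the text: x is a subset of y when x & y == x
template <typename X, size_t N>
approximate_bool operator<=(
    trapdoor_boolean_algebra<X,N> const & x,
    trapdoor_boolean_algebra<X,N> const & y)
{
    if (x.key_hash != y.key_hash)
        throw invalid_argument("secret key mismatch");
    auto meet = zip_bytes(x.value_hash, y.value_hash,
                          [](char a, char b) { return a & b; });
    return approximate_bool{meet == x.value_hash, 0.5};
}

template <typename X, size_t N>
approximate_bool operator==(
    trapdoor_boolean_algebra<X,N> const & x,
    trapdoor_boolean_algebra<X,N> const & y)
{
    auto b = x.value_hash == y.value_hash && x.key_hash == y.key_hash;
    return approximate_bool{b, 0.5};
}

// Fold the value hash, the key hash, and the identity of X into one word.
template <typename X, size_t N>
size_t hash(trapdoor_boolean_algebra<X,N> const & x)
{
    size_t h = typeid(X).hash_code();
    for (size_t i = 0; i < N; ++i)
        h ^= static_cast<size_t>(static_cast<unsigned char>(x.value_hash[i]))
             << (8 * (i % sizeof(size_t)));
    for (size_t i = 0; i < x.key_hash.size(); ++i)
        h ^= static_cast<size_t>(static_cast<unsigned char>(x.key_hash[i])) << (8 * i);
    return h;
}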

R package: md.tools

Fri, 13 May 2022 00:00:00 +0000

The R package md.tools is a miscellaneous set of tools for working with masked data and common features of masked data. If doing simulation studies, it also supports latent data, i.e., data that is not observed but is generated, and various other features that may be useful for simulation studies.

The tool set takes inspiration from functional programming, with inputs and outputs defined over masked data frames of type tbl_md (or just data frames), making it consistent with the tidyverse way of doing things.

On moral responsibility

Tue, 02 Nov 2021 00:00:00 +0000

Introduction

People throughout history have believed that they belong to a special categorical class called persons. One of the special features of this class is moral agency, which may be based on criteria such as being able to make choices, form intentions, and perform actions.¹ However, if the nature of reality is at odds with such criteria, our moral agency could be reduced to a pleasant fiction.

Moral agents have the capacity to make moral judgments and take actions that comport with morality. While challenges to the proposition that persons are moral agents have come from many different disciplines, few have posed a greater challenge to it than metaphysics. Metaphysics is a discipline in philosophy with a very long history. The word derives from the Greek metá (after) and physiká (physics), and it originally referred simply to the work of Aristotle that came immediately after his work on physics. However, over time the meaning of metá has been reinterpreted to mean something more substantial, like beyond. Thus, metaphysics is the study of that which is beyond physics. And arguably it is the most fundamental of all disciplines, for it studies the categorical structure of reality itself.

There is some disagreement among metaphysicians about this categorical structure. Disagreements are not just over things like the kinds of categories that ought to be recognized, but they are also over things like whether reality even has a categorical structure. That is, do categories represent absolute ontological differences or do they represent the useful ways in which humans organize their experiences? Does reality have an inherent categorical structure (which may someday be discovered) or do minds invent conventional categories to order their experiences?

Moral agency

To say that we are moral agents is to say that we can be held morally responsible for our actions, which can run the gamut from good to bad. For example, people who are around a village during a volcanic eruption may demonstrate good moral conduct if they rescue villagers from the destructive power of the eruption. Alternatively, they may demonstrate bad moral conduct if they do nothing to help the villagers. The volcanic eruption is causally responsible for the destruction of the village, but it is not morally responsible; the volcano has no moral agency.

Under what circumstances are things morally responsible? A theory of moral responsibility might touch upon, at the very least, the following subjects:

The subject of morality itself.
The criteria for being a moral agent.
The circumstances under which moral responsibility can be ascribed.

In due course, after expanding upon each of these points, one might arrive at a satisfactory explanation that accounts for moral responsibility. However, the discipline of metaphysics has not only undermined the notion that persons are morally responsible, but also the notion that reality can even support the possibility of moral responsibility.

Morality

“Morality is simply the attitude we adopt towards people whom we personally dislike.” - Oscar Wilde

The etymology of morality derives from the Latin word moralitas (manner, character, proper behavior), and it can be used a) descriptively to refer to a code of conduct promoted by some group, or b) normatively to refer to a code of conduct that, given specified conditions, would be promoted by any moral agent. In what follows, our primary concern will be of the second normative type where the nature of morality can be explored in greater isolation from cultural norms.

What is the basis for morality? From a behavioral point of view, morality is arguably the study of right and wrong conduct. However, what is the basis of right and wrong? One possibility is that its basis is authoritative prescription, in which case morality is a dictum of might makes right– something is morally good or bad because an authority says so with no further explanation required. For instance, a religious person may believe its basis is the authoritative prescription of a deity (divine command theory). A forceful criticism of this type of moral theory is revealed by asking: is an action good because, say, God commands it, or does God command it because it is good? If it is good because God commands it, then morality is the arbitrary preference of God, but if God commands it because it is good, then morality has a basis independent of God.

Many would contend the basis is essentially rational; persons use reason and experience to inform their conduct, e.g., how can a group best realize common interests? In which case, a sense of right and wrong is, for the most part, the habituation of conduct that promotes cultural dominance or health, i.e., the principle of natural selection applied in the context of social Darwinism. (Imagine a society that did not inculcate respect for property among its citizens, e.g., one in which stealing was not taught as wrong.)

While there are perhaps many other moral theories – some of which may not support the premise of moral responsibility – each must fall within one of two general categories: the conventional, and the intrinsic. On the one hand, if morality is conventional, then persons are only morally responsible by convention and moral agents are not ontologically significant. On the other hand, if morality is categorically intrinsic, then arguably moral agents have an inherent duty to conform to it. But what does it mean to say that morality is categorically intrinsic?

Realism

There are many things (particulars) that are red. For instance, there are red cars, red apples, and red signs. But what makes them red? The philosophical discipline of realism submits that all red particulars are red in virtue of the existence of a Universal– a single abstract entity that is a part of all red particulars. More precisely, Universals are repeatable entities that can be exemplified simultaneously by different particulars, where entities are things like properties, like red, relations, like behind, and kinds, like dog.

Realist: Someone who subscribes to the doctrine of Universals.

Realists argue that attribute agreement (how an entity, like an apple, can have attributes, like the color red) is due to relationships between entities and Universals. Furthermore, they contend that the subject-predicate grammar of sentences reinforces this view. (A complete sentence contains two parts: the subject is what the sentence is about, while the predicate tells something about the subject.) For example, in the sentence “My apple is red,” the predicate “is red” seems to be picking out something (a noun), call it redness, and linking it to the subject, “my apple.” Let us examine how a realist might account for the following two sentences:

My apple is red.
My friend’s apple is red.

In each sentence, there are arguably two Universals being exemplified. First, there is the Universal property, redness, that is being exemplified by the two particulars, my apple and my friend’s apple. Second, there is the Universal kind, apple-ness, that is also being exemplified by my apple and my friend’s apple. The redness being exemplified by the entity, my apple, is numerically identical² with the redness being exemplified by the entity, my friend’s apple. Likewise, the apple-ness that is being exemplified by my apple is numerically identical with the apple-ness that is being exemplified by my friend’s apple.

In the sentence, “This apple is red,” a metaphysical realist maintains that this denotes “This apple exemplifies redness,” where redness is a Universal that inheres in this apple. More generally, a realist claims that any subject-predicate sentence of the form “$a$ is $F$” can be paraphrased as “$a$ exemplifies $F$-ness” where $a$ is an entity (either a particular or a repeatable) and $F$-ness is a Universal (a type, a property, or a relation that can simultaneously inhere in more than one thing). In the same way, a realist could maintain that morality, like color, is a Universal: just as an apple can exemplify a color like redness, a person can exemplify a virtue (a moral excellence) like honesty, e.g., “Bob exemplifies honesty.” However, it turns out that the existence of Universals is one of the most important disputes in the discipline of metaphysics.

Paradox

If “$a$ is $F$” denotes “$a$ exemplifies $F$-ness,” then we can construct a paradox (a statement that contradicts itself). Let $F$ mean “does not exemplify itself,” so that $F$-ness is the Universal property non-selfexemplification. (Examples: redness is non-selfexemplifying; noncorporeal-ness is self-exemplifying.) Thus, for any $a$, the proposition “$a$ exemplifies non-selfexemplification” should be either true or false. But the existence of this property leads immediately to a paradox when we ask whether it exemplifies itself: if it does exemplify itself, then it does not exemplify itself, and if it does not exemplify itself, then it does exemplify itself.

On the one hand, suppose that the property non-selfexemplification, call it $a$, does exemplify itself. Then, since it is a property an entity exemplifies only when that entity does not exemplify itself, $a$ does not exemplify itself. On the other hand, suppose that $a$ does not exemplify itself. Then it has the very property in question, so it does exemplify itself. It is similar to the sentence, “This sentence is a lie.” If it is true, then it is false; and if it is false, then it is true. It is self-contradictory.

To avoid this paradox, restrictions on what constitutes a Universal must be imposed, such as denying that non-selfexemplification is an actual Universal. However, as a result of this, the realist cannot take it for granted that “$a$ is $F$” means that $F$ must be selecting a Universal, $F$-ness. This is not necessarily a problem for morality, but the realist cannot simply claim, without qualification, that if a sentence can take on a subject-predicate form, then the existence of Universals can be assumed to account for things like attribute agreement. For instance, “Socrates is honest” does not necessarily mean that the adjective, honest, is picking out a Universal entity, honesty.

A vicious regress?

There is perhaps a deeper problem with the realist’s account. This problem can be illustrated by pointing out that in the statement, “$a$ exemplifies $F$-ness,” the word exemplifies itself picks out a Universal: a repeatable entity that serves a relational role. (For example, next to in “Socrates is next to Plato” is also a Universal relation.) After all, consider that two entities can enter into a numerically identical asymmetric relationship, exemplification. To explore this further, let us consider the following two sentences:

Socrates exemplifies courage.
Plato exemplifies honesty.

In each sentence above, there is the Universal entity, exemplification, that is being exemplified by the particulars, “Socrates exemplifies courage” and “Plato exemplifies honesty.” The exemplification exemplified by “Socrates exemplifies courage” is numerically identical with the exemplification exemplified by “Plato exemplifies honesty.”

$$ F(n) = \{p_1, p_2, \ldots, p_n\}, $$ a set of $n$ propositions, where $p_1$ = “$a$ exemplifies$_1$ $F$-ness” and, for $i > 1$, $p_i$ = “$\{p_{i-1}\}$ exemplifies$_i$ exemplification.” More generally, if exemplifies picks out a Universal, then $F(n)$ denotes a set of $n$ propositions for any sentence of the form “$a$ is $F$.”

Example:

$F(3)$ generates three propositions, the third of which is $\{\{$ $a$ exemplifies$_1$ $F$-ness $\}$ exemplifies$_2$ exemplification $\}$ exemplifies$_3$ exemplification.

Moreover, $F(\infty)$ denotes an infinite regress of never-ending propositions and the worry is that this regress is vicious. Whether this is the case is disputable, but one presumes there is a right answer and, if so, Realism’s veracity hinges on this issue.

A vicious infinite regress: An infinite regress is vicious if each proposition $p_i$ in the series requires the support of proposition $p_{i+1}$. If this is the case, then the infinite series of regressions required to provide such support can never be finished (unless it provably converges to a final proposition), e.g., $p_1$ requires $p_2$, but $p_2$ requires $p_3$, ad infinitum. Consequently, the realist’s solution to problems like attribute agreement (“$a$ is $F$” $\rightarrow$ “$a$ exemplifies $F$-ness” where $F$-ness $\in$ the set of Universal entities) would not be valid.

Nominalism

There is a competing philosophical doctrine to realism, called nominalism, which disclaims the existence of Universals. Nominalists believe that realism needlessly introduces the strange notion of Universals (contrary to Occam’s Razor: entities should not be multiplied needlessly, and the simpler of two competing theories is to be preferred), exhibits incoherency, and is burdened by a busy ontology (for instance, “$a$ exemplifies $F$-ness” embodies an infinite number of propositions). In its place they propose a supposedly simpler but sufficient account for the apparent categorical structure of reality.

Nominalists submit that only actual particulars (like this book, that apple, and Bob’s car) have independent existence. After all, how could a repeatable entity, like redness, simultaneously inhere in multiple entities, like an apple and a firetruck, such that the redness in the apple is numerically identical to the redness in the firetruck? On this view, the realist is introducing the strange idea that entities with independent existence have non-local, unbounded physical presence. So, where realists claim that the sentence “$a$ is $F$” denotes “$a$ exemplifies $F$-ness,” with $F$-ness a Universal, nominalists counter that “$a$ is $F$” denotes “$a$ is a member of the set of $F$ things.” For instance, “Socrates is courageous” denotes “Socrates is a member of the set of courageous things.” To try to avoid the strangeness of Universals, nominalists introduce sets in their place.

Set: An unordered collection of entities considered as a whole. The identity conditions for sets are such that, if $a$ and $b$ are sets, they are numerically identical iff every entity in $a$ is in $b$ and every entity in $b$ is in $a$. Using mathematical language, $a \equiv b$.

Abstract reference

For simple subject-predicate sentences, the nominalist’s account as already given would seem sufficient. However, how might they interpret a sentence including abstract parts, like “Honesty is a virtue”? The realist would contend that this can be understood to mean, “Honesty exemplifies being a virtue,” with honesty and virtue picking out abstract Universals. But what might nominalists say? One possible interpretation might be, “The set of honest acts is a subset of the set of virtuous acts.” While this gives nominalists a little trouble, their response might seem sufficient. However, their problems do not end here.

Set equivalency

How might a nominalist deal with the supposition that all the things that have hearts also have kidneys and vice versa? In other words, the set of things that have hearts and the set of things that have kidneys have the same members. But according to the identity conditions for sets, that means that the set of things that have hearts is identical to the set of things that have kidneys. Consequently, having a heart is the same thing as having a kidney (the set of things that have hearts $\equiv$ the set of things that have kidneys), which is plainly absurd.

What is the nominalist’s response to this? Generally, they invoke the terminology of possible entities. Instead of “$a$ is $F$” denoting “$a$ is a member of the set of $F$-things,” it denotes “$a$ is a member of the set of possible $F$-things.” The set of possible things that have hearts and the set of possible things that have kidneys do not have the same members; therefore they are not equivalent. Unfortunately, part of the justification for nominalism was that it has a simpler ontology than realism, but the introduction of possible worlds somewhat undermines this effort.

Conventionalism

Even if nominalism did offer a satisfactory account of attribute agreement and subject-predicate discourse, it is nonetheless nominative. Accordingly, there is no entity, honesty; there are only sets of resembling things. The nominalist posits that the sentence, “My friend is honest,” simply means that my friend resembles other persons that are called honest. This does not appear to give ontological significance to honesty. Thus, we come back to where we started: if right and wrong are conventional, then is being a moral agent ontologically significant?

Criteria for moral agency

“I know only that what is moral is what you feel good after and what is immoral is what you feel bad after.” - Ernest Hemingway

What are the criteria for being a moral agent (and thus for being subject to ascriptions of moral responsibility)? By definition, a moral agent is a being who can make moral judgments. Thus, at a minimum, it would seem a moral agent should have the capacity to make rational decisions, that is, to reason about the effects their decisions have on reality. So, how do persons measure up to this criterion? At the very least, they seem to have the capacity to reason about a limited domain of reality, and so are perhaps morally responsible within the limits circumscribed by that domain. But putting all of that aside, one can instead ask, “Do moral agents exist at all?” That is, are there subjects making decisions?

Persistence

To consider this question, let us analyze Chisholm’s essay, Identity through Time. In this essay, he distinguishes two ways in which entities persist.

Intactly persisting entities

In the first way, entities persist in a numerically identical sense such that an entity, $a$, at time $t_1$ is the same as entity, $b$, at time $t_2$ iff the set of parts for entity $a$ at time $t_1$ is equivalent to the set of parts for entity $b$ at time $t_2$. Therefore, if $a \equiv b$, said entity persisted, at a minimum, $|t_2-t_1|$ units of time.

Non-intactly persisting entities

If an entity changes even a single part (even a single atom!), it is not strictly the same. Therefore, since physical entities are in a constant state of flux, when one says physical entity $a$ at time $t_1$ is the same as physical entity $b$ at time $t_2$ (where $t_2 \neq t_1$), one does not mean that entities $a$ and $b$ are numerically identical. Rather, one means that they are conventionally the same.

To clarify, let us consider the thought experiment of the Ship of Theseus, which raises the question of whether a ship which has had all its parts replaced retains the same identity. If the ship at $t_1$ has different parts than the ship at $t_2$, then they are not numerically identical; so, if the ships are identical, a non-intactly persisting sense of identity is being used. Chisholm contends that the question is not “is it the same ship?” but “does it constitute the same ship?” And if it constitutes the same ship, that means it satisfies the Theory of Persistence.

Theory of persistence: A composite entity persists if conditions $1$ and $2$ are satisfied:

The successor composite entity evolved from an antecedent composite entity. Note: Evolved denotes that after each successive change, the antecedent composite must have at least one part in common with the successor composite.
The composite entity must satisfy additional quantifiable criteria, e.g., the criterion that it must have the same sailing schedule (in the case of a ship’s persistence).

In practice, how might this theory provide a solution for the Ship of Theseus dilemma? First, we must determine whether it satisfies, from initial state to final state, the evolution criterion. And second, it must satisfy additional criteria, e.g., that it has the same sailing schedule. For example, if the evolution of the ship undergoes fission (that is, if it evolves into more than one ship), then the evolution criterion by itself cannot determine which ship constitutes the Ship of Theseus. However, if only one of the ships after the fission has the same sailing schedule as the ship before the fission, then according to the “has the same sailing schedule” convention, that ship constitutes the Ship of Theseus.

Persons

It is not even clear how persons should be defined, but providing an unambiguous definition would help to resolve many controversies. For instance, to resolve the dispute over abortion, one could categorically determine when a bundle of cells constitutes a person (and therefore qualifies for legal consideration as a person). There appears to be a consensus that motile sperm cells and ova are not persons, but that newborn babies are. What was the critical turning point at which it went from just being a bundle of cells to being a person who has a bundle of cells? Considerable time and effort could be saved if a definitive categorical account of this problem were given, but as with the pile of sand, such a definition seems problematically conventional.

Chisholm on persons

In light of the Theory of Persistence, what can be said about persons? As with the Ship of Theseus, are persons’ identities also conventional? Or are they like sub-atomic particles, which intactly persist? In Which Physical Thing Am I, Chisholm contends that persons are not conventional. In support of this, he asks you, the reader, to consider a hypothetical situation in which you are about to undergo an operation. The doctor presents you with two options: you can either choose the expensive option, where you are subjected to total anesthesia during the operation, or you can choose the cheaper option. The cheaper option consists of the doctor giving you a pill before the operation to induce complete amnesia, so that during the operation you have no memory of who you are, and giving you a pill after the operation so that everything that transpired during the operation is forgotten.

The question is, would it be reasonable for you to choose the cheap option? Chisholm contends that it would not be reasonable because you would believe that you would be the person enduring the operation. Furthermore, you would not be persuaded that this is not the case by any convention that denies that the person on the table is you. In other words, you would not consider your identity a matter of convention.

Critical response to Chisholm

How might Chisholm’s arguments be disputed? In the situation he presented, perhaps you would believe that you and the person enduring the operation in the cheaper option are the same person. Furthermore, perhaps no convention could dissuade you from this conviction. However, your conviction could simply be based on a convention that you deeply subscribe to. At this point, it does not seem like Chisholm has established that persons warrant special categorical distinction.

Bundle Theory and Ego Theory

In the essay Divided Minds and the Nature of Persons, Parfit contrasts two competing views of persons, Bundle Theory and Ego Theory. Bundle Theory proposes that unity of consciousness cannot be explained by referring to persons; Bundle Theorists maintain that persons only exist as a feature of language. Persons are a convention: they are a label for a bundle of things like thoughts, memories, and sensations that are unified by causal kinds of relations. There are no subjects in which these bundles inhere. But the Ego Theorist proposes that these subjects really do exist, as subjects of experiences, and that they have a bundle of things like thoughts, memories, and sensations.

In support of Bundle Theory, Parfit presents the reader with a hypothetical situation in which a replica of you is created, atom by atom (like the transporter in Star Trek: beam me up, Scotty!). At one end of the spectrum, if one-hundred percent of your atoms are replaced, this is clearly a replica of you. At the other end, if only one percent of your atoms are replaced, this is clearly you. Thus, in between these two extremes there must be a critical turning point where it goes from being you to being a replica of you. However, this appears implausible because any point chosen would seem arbitrary. How could a single atom make a difference? It is similar to the question, When does sand become a pile of sand? It is absurd to think that, at one point it is not a pile, and with the addition of a single grain of sand, it becomes a pile. (Although the definition of “a pile” could be based on structure rather than quantity, e.g., “A pile is a collection of objects laid on top of each other.”)

Ego Theorists have difficulty responding to this question because they believe that persons are subjects of experiences, not just bundles of things. As such, there must be some special unknown thing that makes you who you are that may not be included in the replication. Bundle Theorists have little problem answering these questions, however, because they deny that there is a you– only a bundle of things. If an exact copy of the bundle of things that constitutes you is made, what we choose to call this thing, you or a replica of you, is a matter of convention.

On split-brain studies

Parfit believes that empirical results of split-brain case studies reinforce Bundle Theory. In these studies, patients are shown a placard whose left half is blue and right half is red where each eye is allowed to see only one of the halves. Curiously, the indication is that patients end up expressing two separate streams of consciousness, one in which there is an awareness of seeing only blue and another in which there is an awareness of seeing only red. That seems to indicate that within each of these patients there are not one but two subjects of experience.

Problematically, Ego Theorists define a person as a subject of experience yet patients seem to be expressing two. Consequently, the Ego Theorist must now propose that subjects of experiences need not be persons at all, which seems like a difficult stretch to make. But in Bundle Theory, since persons are just bundles of things it does not seem odd that in this bundle there could be a pair of mental states, one in which there is an awareness of seeing only red, and another in which there is an awareness of seeing only blue. There is no subject in which awareness inheres.

Ascribing moral responsibility

Responsibility: A detachable burden easily shifted to the shoulders of God, Fate, Fortune, Luck or one’s neighbor. In the days of astrology it was customary to unload it upon a star. – Ambrose Bierce, The Devil’s Dictionary, 1911

Under what circumstances can a moral agent’s conduct be ascribed moral responsibility? At a minimum, it would seem to require that its conduct be a product of being able to select, in a way that is unconstrained by causal agencies, a specific action among a set of possible actions– that is, a product of exercising a free will. For instance, an agent’s decision to go shopping could be a product of a free will if it was possible for the agent to have decided not to go shopping. Conversely, if an agent stumbles off a building and free falls towards the ground, the act of free falling is certainly not a product of free will. Under the circumstances, the agent had no choice but to fall.

Four-dimensionalism

It is often said that entities\footnote{Such as persons!} have location and extension in a three-dimensional space. That is, physical entities can be specified in a three-dimensional Cartesian coordinate system (see \Cref{fig:3dcart}), $x$, $y$, and $z$, which provides three spatial dimensions– respectively, length, height, and width, e.g., a cubic entity extends $m$ units along the $x$, $y$, and $z$ axes.

\begin{figure} \caption{Three-dimensional Cartesian coordinate system} \label{fig:3dcart} \includegraphics[scale=0.85,bb = 0 0 200 100, draft, type=eps]{3d_cartesian_coordinate_system.png} \end{figure}

However, our coordinate system is missing a crucial detail: at what time, $t$, will a particular entity be at a particular $x$, $y$, and $z$, and how long will it endure? Therefore, a physical entity’s location and extension are more precisely specified in a four-dimensional coordinate system, $x$, $y$, $z$, and $t$, which provides three spatial dimensions\footnote{Length is $x$, height is $y$, and width is $z$} and one temporal dimension\footnote{Duration is $t$}, e.g., a hyper-cubic entity extends $m$ units along the $x$, $y$, $z$, and $t$ axes, or a person may extend an average\footnote{Since a person’s dimensions (height, etc.) can change over time, a life-time average is being used} of $\approx 0.6$ meters along the $x$-axis, an average of $\approx 1.8$ meters along the $y$-axis, an average of $\approx 0.3$ meters along the $z$-axis, and $\approx 85.6$ years along the $t$-axis.

Four-dimensional space-time has taken on a new-found importance with the advent of modern science. Under this model, when persons observe an entity at a particular time, they observe a single time-slice of it. More precisely, they observe a three-dimensional spatial “slice” (which is perpendicular to the temporal dimension) of a four-dimensional space-time entity. For example, when they observe a two-dimensional spatial slice of a sphere, they observe a disk; when they observe a three-dimensional spatial slice of a hyper-sphere\footnote{A hypersphere is a four-dimensional sphere– the set of points a distance $r$ away from a center point in four-dimensions}, they observe a sphere.

At any time $t$, persons can only observe a three-dimensional spatial “slice” located at time $t$ of a four-dimensional space-time entity\footnote{Actually, persons only see a two-dimensional plane but can infer a third, depth, by using other clues like shading}. However, if they had access to a fifth-dimension, they could possibly rotate in such a way in five-dimensions to instantly observe the entirety\footnote{The entire time line of entities would be viewable as snap-shots} of four-dimensional entities. From this ``God’s eye’’ vantage point, the passage of time would represent a vector through a four-dimensional space-time\footnote{Perhaps some entities experience a different passage of time which could be represented by a different four-dimensional vector}.

Foreknowledge and choice

Is this model of reality incompatible with moral agency? If free agents are immutable four-dimensional space-time entities, then would that not denote that their freedom is illusory? After all, they are unchanging– there does not appear to be an opportunity for choice, there is only the opportunity for existence. But perhaps this notion of timelessness\footnote{The passage of time could possibly be an illusion} is too challenging to our intuitions, so let us instead ask the question, does infallible foreknowledge\footnote{Such as knowledge acquired by seeing a person’s entire time line as a static higher-dimensional structure} remove the possibility for choice? In other words, given that (1) a free agent, given two options a and b, can freely choose between a and b, and (2) there is infallible foreknowledge pertaining to which option, a or b, the free agent will choose, are (1) and (2) compatible?

On the one hand, it seems infallible foreknowledge forces the free agent’s hand: the free agent must act in accordance with the foreknowledge. It has no choice; it cannot do otherwise. On the other hand, infallible knowledge about the future is arguably no more a limitation on choice than infallible knowledge about the past as long as that foreknowledge is not based on causal determination\footnote{Causal determination: the doctrine that everything which happens is caused by antecedent conditions}. And indeed, a four-dimensional space-time is not necessarily causally determined.

Causal Determinism

Supposing that reality is not an immutable four-dimensional space-time, let us turn our attention to the question of causal determination. Causal determinism is the doctrine that all events are the inevitable result of antecedent conditions, including the actions of persons. The totality of existence is a game of cosmic billiards where, given an initial state as input, the final state could be output by a computable function.

Is causal determinism compatible with moral agency? In causal determinism, there is only one possible future; acts of apparent choice are the mechanical expression of a supposed moral agent’s heredity and past environment. Since its future is determined, that would denote that moral responsibility is not possible, for at the very least a moral agent must have the power to make choices between morally good and bad actions.

An argument from absurdity: the coherency of choice

For the sake of argument, assume choice is not an illusion. How then might we offer a rational account of it? But if something is rational, does that not signify that it can be explicated in terms of its antecedent conditions? How else can a physical process be rationally explained if not by cause-and-effect? But if choice can be understood in terms of its antecedent conditions, then it is not free– it was determined by cause-and-effect. This is problematic for those who wish to maintain that choice is a coherent concept.

Concluding thoughts

While metaphysical inquiry has not removed the possibility that people belong to a special categorical class called persons, it has served to undermine the criteria for moral agency, such as being able to make choices. But what would be entailed by the realization that moral agency is not a legitimate concept? Would society collapse in on itself or would we go about our daily lives contented by the fact that, at least it feels like we are free moral agents? That, at the very least, it is a useful fiction?

And more generally, consciousness. ↩︎
$a$ and $b$ are numerically identical iff $a$ and $b$ refer to the same entity ($a \equiv b$). For example, if Jim’s friend is Bob, then Jim’s friend is numerically identical to Bob (Jim’s friend $\equiv$ Bob). ↩︎

Time series analysis of a confidentiality measure for Encrypted Search

Tue, 02 Nov 2021 00:00:00 +0000

Introduction
Encrypted search model
Threat model: known-plaintext attack
Confidentiality measure
- Forecasting model
Data description
Time series analysis of $\{\pi_t\}$
Future work: dynamic regression on co-variates
Conclusion

Introduction

In cloud computing, it is tempting to store confidential data on (untrusted) cloud storage providers. However, a system administrator may be able to compromise the confidentiality of the data, threatening to prevent further adoption of cloud computing and electronic information retrieval in general.

The primary challenge is a trade-off problem between confidentiality and usability of the data stored on untrusted systems. Encrypted Search attempts to resolve this trade-off problem.

Encrypted Search allows authorized search agents to investigate the presence of specific search terms in a confidential target data set, such as a database of encrypted documents, while the contents, especially the meaning of the target data set and search terms, are hidden from any unauthorized personnel, including the system administrators of a cloud server.

Essentially, Encrypted Search enables oblivious search. For instance, a user may search a confidential database stored on an untrusted remote system without other parties being able to determine the user’s information need (and on more sophisticated systems, they are also unable to determine which documents were relevant to the information need).

We denote any untrusted party that has full access to the untrusted remote system (where the confidential data is stored) the adversary.¹

Despite the potential of Encrypted Search, perfect confidentiality is not theoretically possible. There are many ways confidentiality may be compromised. In this paper, we consider an adversary whose primary objective is to comprehend the confidential information needs of the search agents by analyzing their history of encrypted queries.

A simple measure of confidentiality is given by the proportion of queries the adversary is able to comprehend. We consider an adversary that employs a known-plaintext attack. However, since the confidentiality is a function of the history of queries, different histories will result in different levels of confidentiality over time.

We apply time series analysis to estimate the forecast distribution of the confidentiality measure. The forecast distribution provides the framework to estimate important security-related questions such as “what will our mean confidentiality six months from now be?”

We are interested in reasonably medium-term forecasts so that we can plan accordingly for the future, e.g., determining how frequently passwords should be reset to try to maintain a base level of confidentiality. Resetting them too frequently poses an independent set of problems, both from a security and usability standpoint, but failing to reset them when the risk of being compromised is too high defeats the central purpose of Encrypted Search.

Encrypted search model

An information retrieval process begins when a search agent submits a query to an information system, where a query represents an information need. In response, the information system returns a set of relevant objects, such as documents, that satisfy the information need.

An Encrypted Search system may support many different kinds of queries, but the query model we consider is a sequence of words.

The adversary is given by the following definition.

Definition 1 The adversary is an untrusted agent that is able to observe the sequence of queries submitted by authorized search agents.

The objective of the Encrypted Search system is to prevent the adversary from being able to comprehend the sequence of queries.

A hidden query represents a confidential information need of an authorized search agent that is supposed to be incomprehensible to the adversary.

The primary means by which Encrypted Search is enabled is by the use of cryptographic trapdoors as given by the following definition.

Definition 2 Search agents map plaintext search keys to cryptographic hashes, denoted trapdoors.

A trapdoor for a plaintext search key is necessary to allow an untrusted Encrypted Search system to look for the key in a corresponding confidential data set.

The Encrypted Search system uses a simple substitution cipher in which each search key is mapped to a unique trapdoor signature. The simple substitution cipher is denoted by \[\operatorname{h} \colon \mathbb{X} \mapsto \mathbb{Y}\,,\] where $\mathbb{X}$ is the set of plaintext search keys and $\mathbb{Y}$ is the set of trapdoors.

Since $\operatorname{h}$ is one-to-one, it is possible to undo the substitution cipher by some function denoted by \[\operatorname{g} \colon \mathbb{Y} \mapsto \mathbb{X}\] such that \[x = \operatorname{g}(\operatorname{h}(x))\] for every $x \in \mathbb{X}$.
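
As a rough illustration of the pair $\operatorname{h}$ and $\operatorname{g}$, the following sketch builds a keyed hash as the trapdoor function and tabulates its inverse over a known vocabulary. The use of HMAC-SHA256, the secret key, and the vocabulary are assumptions made for illustration; they are not taken from the system described here.

```python
import hashlib
import hmac


def make_trapdoor_fn(secret_key: bytes):
    """Return h: plaintext search key -> trapdoor (assumed here to be HMAC-SHA256)."""
    def h(x: str) -> str:
        return hmac.new(secret_key, x.encode("utf-8"), hashlib.sha256).hexdigest()
    return h


def make_inverse(h, vocabulary):
    """Return g: trapdoor -> plaintext, built by tabulating h over a vocabulary."""
    table = {h(x): x for x in vocabulary}
    return lambda y: table[y]


# Hypothetical vocabulary and key, for illustration only.
vocabulary = ["cloud", "search", "cipher"]
h = make_trapdoor_fn(b"hypothetical-secret")
g = make_inverse(h, vocabulary)

# g undoes h on every plaintext search key in the vocabulary.
print(all(g(h(x)) == x for x in vocabulary))
```

Only a party that can evaluate $\operatorname{h}$ can tabulate $\operatorname{g}$ this way; the adversary considered later has to estimate it from observed ciphers instead.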

In a time series, we have one entity and $T$ measurements of it over time. A random time series is a sequence of random variables \[\{Y_1,Y_2,\ldots,Y_T\},\] typically denoted by $\{Y_t\}$ where $t$ is the time index, which can be continuous or discrete.

The measurements are $d$ dimensional and may be continuous, discrete, or some mixture. Frequently $d=1$, which we denote a univariate time series, and the measurements are continuous.

We use upper-case to denote random variables and lower-case to denote realizations, thus $Y_t$ is a random value and $y_t$ is the realization of $Y_t$.

Thus, a realization of the time series $\{Y_t\}$ is denoted by $\{y_t\}$.

The time series of plaintext keyword searches submitted by the search agents is denoted by $\{x_t\}$. It is a $d=1$ dimensional time series with a discrete time index and a discrete response.

The adversary may not directly observe $\{x_t\}$. Instead, he observes a time series of ciphers.

The cipher $\{c_t\}$ is a discrete time and discrete response time series defined as \[c_t = \operatorname{h}(x_t).\]

Since the time series of plaintext is a priori non-deterministic, we model it as a random time series $\{X_t\}$ whose distribution is specified by the conditional probabilities \[\Pr(X_j = x_j | X_1 = x_1,\ldots,X_{j-1} = x_{j-1}).\] That is to say, our plaintext language model conditions only on the preceding plaintexts and does not incorporate other kinds of information, such as who the search agent is or what time of day it is. In section Future work, we consider extensions of the model.

Since $\{c_t\}$ is a function of $\{x_t\}$, we may model the ciphers as a random time series $\{ C_t \}$ where $C_j = \operatorname{h}(X_j)$.

Threat model: known-plaintext attack

The primary source of information is given by the observable time series of ciphers $\{c_t\}$, which is induced by the unobserved time series of plaintext $\{x_t\}$.

Other potential sources of information, such as side-channel information, are not included in the model we consider in this paper. See section Future work for some preliminary thoughts on this expanded topic.

In our threat model, the adversary is interested in estimating $\{x_t\}$. However, the adversary is only able to observe $\{c_t\}$. Thus, the adversary’s objective is to infer the plaintext from the ciphers using frequency analysis attacks, in particular a known-plaintext attack.

In a known-plaintext attack, the objective of the adversary is to learn how to undo the substitution cipher $\operatorname{h}$ with $\operatorname{g}$.

The inverse substitution cipher $\operatorname{g}$ is not known to the adversary.

A maximum likelihood estimator of $\operatorname{g}$ is given by \[\begin{align*} \hat{\operatorname{g}} = \operatorname{arg\,max}_{\operatorname{g} \in G} \Pr(X_1 &= \operatorname{g}(c_1)) \\ \prod_{t=2}^{T} \Pr(X_t = &\operatorname{g}(c_t) | X_{t-1} = \operatorname{g}(c_{t-1}), \ldots, X_1 = \operatorname{g}(c_1)) \end{align*}\] where $G$ is the set of all possible mapping functions from ciphers $\mathbb{Y}$ to plaintexts $\mathbb{X}$.

If two plaintexts $x,x' \in \mathbb{X}, x \neq x'$, may be exchanged without changing the probability of $\{x_t\}$, then they are indistinguishable and $\hat{\operatorname{g}}$ is inconsistent. However, the adversary does not need to be perfect for confidentiality to be compromised. If some of the plaintexts are not exchangeable, then the adversary may learn something about $\{x_t\}$ by observing $\{c_t\}$.

The greater the uniformity of $\{X_t\}$ the greater the variance of $\hat{\operatorname{g}}$. At the limit of maximum uniformity, where every pair of plaintext is exchangeable, the adversary can learn nothing about $\{x_t\}$ by observing $\{c_t\}$. Natural languages have a high degree of non-uniformity and so the primary concern of the adversary is the divergence between the true distribution and the known-plaintext distribution.

The adversary knows some approximation of $\{X_t\}$.

The known-plaintext distribution may be used to solve an approximation of the MLE $\hat{\operatorname{g}}$.

In a known-plaintext attack, the adversary substitutes the unknown true distribution with the known-plaintext distribution and solves the MLE under this substituted distribution.

Confidentiality measure

We are interested in measuring the degree of confidentiality as given by the following definition.

Definition 3 Given a time series $\{c_{t'}\}$, the confidentiality measure is a time series $\{\pi_t\}$ defined as the fraction of ciphers in $\{c_{t'}\}$ that the adversary successfully maps to plaintext where $t' = N t$. That is, \[\label{eq:accuracy} \pi_t = \frac{\delta_t}{N t}\,,\] where \[\delta_t = \sum_{t'=1}^{N t} [\operatorname{g}(c_{t'}) = \hat{\operatorname{g}}(c_{t'})]\,.\]

Note that $N$ denotes the fact that we take one measurement of the confidentiality every time a multiple of $N$ ciphers are observed.
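
Definition 3 can be sketched as a running fraction of correct decodings that is recorded once every $N$ ciphers. In this minimal sketch the function name and toy inputs are hypothetical, and the adversary's decoding $\hat{\operatorname{g}}(c_{t'})$ is assumed to be given as a fixed list rather than re-estimated over time.

```python
def confidentiality_measure(true_plain, decoded_plain, N):
    """Return [pi_1, pi_2, ...] with pi_t = delta_t / (N * t).

    true_plain[i]    is g(c_i), the true plaintext of the i-th cipher.
    decoded_plain[i] is the adversary's guess for the i-th cipher.
    """
    pis = []
    correct = 0  # running value of delta
    for i, (x, x_hat) in enumerate(zip(true_plain, decoded_plain), start=1):
        correct += (x == x_hat)
        if i % N == 0:  # one measurement every N ciphers
            pis.append(correct / i)
    return pis


# Toy example: four ciphers, one measurement every N = 2 ciphers.
print(confidentiality_measure(list("abab"), list("abba"), N=2))  # [1.0, 0.5]
```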

The measure $\pi_t$ can be understood as the marginal probability that the adversary is able to decode an incoming cipher to plaintext at around time $t$. However, far more revealingly, the adversary may go back through the history of ciphers and decode proportion $\pi_t$ to plaintext.

If we specify that $\pi^*$ is the largest value of the confidentiality measure we are willing to tolerate, i.e., the largest proportion of ciphers the adversary may be allowed to decode, then it is essential that we stop generating $\{c_t\}$ before time $T^*$ where \[T^* = \min \{T : \pi_T > \pi^*\}.\] That is, we stop generating $\{c_{t'}\}$ before the amount of information in it is sufficient for the adversary to decode more than proportion $\pi^*$ of the data. We do not need to stop Encrypted Search queries at time $T^*$; we only need to change the cipher, i.e., substitute the mapping function $\operatorname{h}$ that maps plaintexts to ciphers with some other mapping function, which is typically done by requiring users to change passwords periodically. This is where forecasting $\{\pi_t\}$ plays a central role.
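
Reading $T^*$ as the first time index at which $\pi_T$ exceeds $\pi^*$, the stopping time can be sketched as a simple scan over an observed (or forecast) series. The function name and the example values are hypothetical.

```python
def first_exceedance(pis, pi_star):
    """Return the first 1-based index T with pis[T-1] > pi_star, or None."""
    for t, p in enumerate(pis, start=1):
        if p > pi_star:
            return t
    return None  # the threshold is never exceeded in this series


# Hypothetical series: the threshold 0.30 is first exceeded at T = 3.
print(first_exceedance([0.10, 0.20, 0.35, 0.30], pi_star=0.30))  # 3
```

In practice the series passed in would be a forecast of $\{\pi_t\}$, so that the cipher can be changed before the threshold is reached.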

Forecasting model

As a function of a random time series $\{C_{t'}\}$, we may model $\pi_t$ as being generated by the random time series $\{\Pi_t\}$. If $\pi_{t}$ is not known, i.e., $\Pi_{t}$ has not been observed, then $\Pi_{t}$ is a probability distribution on the measure at time $t$. If $\pi_{1},\pi_{2},\ldots,\pi_{T}$ is given, then $\Pi_{T+h|T}$ is a conditional distribution² known as the $h$-step forecast distribution at time $T$ whose expectation is denoted by $\pi_{T+h|T}$.

Our primary interest is in forecasting an observed time series $\{\pi_t\}$, e.g., if we observe $\{\pi_1,\pi_2,\ldots,\pi_T\}$, we wish to estimate the mean of the $h$-step forecast $\pi_{T+h|T}$. Since $\pi_{T+h|T}$ is not known, we seek an estimator $\hat{\pi}_{T+h|T}$.

Data description

The accuracy $\{\pi_t\}$ of the adversary is the single entity we are observing and we have $T$ measurements of it over logical time.

The confidentiality data $\{\pi_t\}$ depends upon two other time series, the plaintext (keyword searches) $\{x_t\}$ and the ciphers $\{c_t\}$, which Alex Towell generated in 2016 using the following steps:

The parameters of a Bigram language model were estimated from a large corpus of plaintext. (The source of the particular corpus used has been lost.)
The estimated Bigram language model was conditionally sampled from to generate plaintexts $\{x_t\}$.
Each plaintext $x_t$ was cryptographically hashed to a cipher $c_t = \operatorname{h}(x_t)$ to generate ciphers $\{c_t\}$.
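
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the tiny corpus, the restart rule for words with no observed successor, and the truncated SHA-256 hash are all hypothetical stand-ins, since the original corpus and hashing details are not given.

```python
import hashlib
import random
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Estimate bigram counts: counts[prev][cur] = number of times cur follows prev."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts


def sample_bigram(counts, start, length, rng):
    """Sample a plaintext sequence by conditioning each word on the previous one."""
    out = [start]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:  # no observed successor: restart (an assumption)
            out.append(start)
            continue
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


def h(x):
    """Hypothetical trapdoor function: a truncated SHA-256 digest."""
    return hashlib.sha256(x.encode("utf-8")).hexdigest()[:16]


corpus = "the cat sat on the mat and the cat ran".split()  # toy corpus
model = fit_bigram(corpus)
plaintexts = sample_bigram(model, "the", 20, random.Random(0))
ciphers = [h(x) for x in plaintexts]
print(plaintexts[:5], ciphers[:2])
```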

Note that $\{x_t\}$ and $\{c_t\}$ are not the primary time series of interest in our analysis. Rather, our primary interest is in the confidentiality measures $\{\pi_t\}$. To generate this time series, the following steps were taken:

The function $\operatorname{g}$ that maps ciphers to plaintext is estimated after every $N=50$ observations of the cipher time series using an MLE under a unigram language model (some information in the bigram model is not being used by the estimator, which reduces its efficiency) on a different corpus judged to be similar to the one used to generate $\{x_t\}$. Thus, the unigram MLE of $\operatorname{g}$ at time $T$ is given by \[\hat{\operatorname{g}}_T = \operatorname{arg\,max}_{\operatorname{g} \in G} \prod_{t=1}^{T} \hat{\Pr}(X_t = \operatorname{g}(c_t)).\]

Note that $\hat{\operatorname{g}}_T$ is inconsistent since it does not converge in probability to $\operatorname{g}$ as a consequence of the adversary’s estimation of $\Pr(X_t)$ with $\hat{\Pr}(X_t)$.

This inconsistency was motivated out of a desire to be more realistic, since an adversary who is performing the known-plaintext attack cannot in practice know the underlying distribution of $\{x_t\}$ used to generate the keyword searches.
The confidentiality measure at time $t$, denoted by $\pi_t$, is computed using $\hat{\operatorname{g}}_{t}$.
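
One way to picture the unigram estimator is as frequency-rank matching. If $\operatorname{g}$ is restricted to one-to-one mappings (an assumption made here, motivated by $\operatorname{h}$ being one-to-one), the product of unigram probabilities is maximized by pairing the most frequent cipher with the most probable plaintext, the second with the second, and so on. The sketch below is not the original implementation; the names and toy data are hypothetical.

```python
from collections import Counter


def unigram_attack(ciphers, known_plaintext_probs):
    """Estimate g by matching cipher frequency ranks to plaintext probability ranks.

    ciphers:               observed sequence of ciphers c_1, ..., c_T
    known_plaintext_probs: dict plaintext -> estimated unigram probability
    Ties are broken arbitrarily (by first appearance).
    """
    ranked_ciphers = [c for c, _ in Counter(ciphers).most_common()]
    ranked_plain = sorted(known_plaintext_probs,
                          key=known_plaintext_probs.get, reverse=True)
    return dict(zip(ranked_ciphers, ranked_plain))


# Toy example with hypothetical ciphers and a hypothetical known distribution.
observed = ["c1", "c2", "c1", "c3", "c1", "c2"]
known = {"the": 0.5, "cat": 0.3, "mat": 0.2}
g_hat = unigram_attack(observed, known)
print(g_hat)  # {'c1': 'the', 'c2': 'cat', 'c3': 'mat'}
```

Re-running the estimator on a growing prefix of the cipher series, and scoring each estimate against the true mapping, yields one value of $\pi_t$ per run.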

Time series analysis of $\{\pi_t\}$

It seems clear that the adversary’s accuracy at a particular time will be correlated with lagged (previous) values of its accuracy and the closer in time they are the more heavily correlated they will generally be (barring exceptions like seasonality).

We partition the data into a training set and a test set. We will not look at the test set until later when we evaluate the model. Here is a quick glimpse of the training set data:

## [1] 0.358159 0.351208 0.347271 0.346403 0.352666 0.350445

Visualization and stationary transformations

If the time series data can be transformed to meet the stationary conditions, then the ARIMA model for the (correlated) residuals is generally a reasonable choice. A stationary time series satisfies the following conditions:

The mean is not a function of time.
The variance is constant.
The autocorrelation is a function of lag rather than time.

A plot of the training partition of $\{\pi_t\}$ is shown in figure 1.

A non-stationary time series plot.

It appears non-stationary. Plots of the sample ACF and PACF are shown in figure 2.

Highly correlated sample ACF and PACF.

We see that the ACF indicates that $\{\pi_t\}$ has significant autocorrelation. While this is clearly non-stationary, the variance seems constant and thus a transformation to make the variance more uniform, such as a log-transformation, seems unnecessary.

Since there is not necessarily an obvious pattern in the data, we will avoid the use of procedures like fitting a regression model (for detrending) and instead try some order of differencing. Differencing is a non-parametric approach that can often transform a non-stationary time series into a stationary one, where the $d$-th difference of $\{\pi_t\}$ is denoted by $\nabla^d\{\pi_t\}$. Moreover, since it is non-parametric, differencing has the added benefit of being able to dynamically respond to changes in the data, unlike with regression which treats the trend as deterministic.
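
As a small sketch of the operation itself, the $d$-th difference can be computed by repeatedly subtracting each value from its successor. The function name is hypothetical; the example values are the first few training observations printed earlier.

```python
def difference(series, d=1):
    """Return the d-th difference of a sequence (each pass shortens it by one)."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# First differences of the first few training values shown above.
head = [0.358159, 0.351208, 0.347271, 0.346403, 0.352666, 0.350445]
print([round(v, 6) for v in difference(head)])

# A quadratic trend is removed by differencing twice.
print(difference([1, 4, 9, 16], d=2))  # [2, 2]
```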

Time plots of $\nabla \{\pi_t\}$.

In figure 3, we plot the differenced process $\nabla\{\pi_t\}$. We see that the trend has been removed, the values are centered around zero, and the variance is constant. We believe this may be stationary. We perform the augmented Dickey-Fuller test[@noauthor_dickeyfuller_2021] as a more objective measure.

## Augmented Dickey-Fuller Test
## Dickey-Fuller = -20.37, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary

The $p$-value of the Dickey-Fuller hypothesis test is less than $0.01$, which we consider to be very strong evidence against the null hypothesis of non-stationary data. Bolstered by this test, we proceed with trying to find an ARIMA model for the residuals.

ARIMA model selection

There are perhaps three primary reasons we would want to infer a general model for $\{\pi_t\}$: prescription, description, and in our case, prediction.

Guided by the principle of parsimony, we have a bias for simpler models, i.e., Occam’s razor. As justification for this bias, consider the following. Assume there is some unknown process $M$ that generated data $\{\pi_t\}$. If we have a parametric model $M'$ with many degrees of freedom (dimension of parameter space), we may find parameters for it that fit it to $\{\pi_t\}$ with a very small sum of squared residuals.

However, if $M'$ is unnecessarily complex, it is unlikely to generalize very well, i.e., on new data $M$ and $M'$ may diverge significantly. In this case, we say that $M'$ is overfitted to the observed data $\{\pi_t\}$. Of course, if a simpler model cannot even sufficiently model the observed data $\{\pi_t\}$, it is hard to justify as an approximation of $M$. Thus, we have a variance-bias trade-off[@bias_variance].

Since ARIMA models are parameterized by $p$, $d$, and $q$, which respectively specify the order of the autoregression component, the order of the difference, and the order of the moving average component, we have a bias for ARIMA models with relatively small $p$, $d$, and $q$. Note that there are many heuristics that model this bias, such as the Akaike information criterion (AIC), but we will decide upon a subset of candidate models that leans heavily on a more hands-on analysis.

A plot of sample ACF and PACF of the differenced time series $\nabla \{\pi_t\}$ is shown in figure 4.

Sample ACF and PACF of $\nabla \{\pi_t\}$, which appear far more stationary.

Since the ACF cuts off after lag $1$ and the PACF decays exponentially, we speculate that $\nabla\{\pi_t\}$ may be an MA(1) process. We use the EACF plot to try to help determine other possible orders of the ARIMA model:
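
To illustrate what "cuts off after lag $1$" means, the following sketch computes a sample ACF and applies it to a simulated MA(1) series. The coefficient $\theta=-0.57$, the series length, and the seed are hypothetical choices; for an MA(1) process $Y_t = e_t + \theta e_{t-1}$, the theoretical autocorrelation is $\theta/(1+\theta^2)$ at lag $1$ and zero at larger lags.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a sequence."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def simulate_ma1(theta, n, rng):
    """Simulate Y_t = e_t + theta * e_{t-1} with standard normal noise."""
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [e[t] + theta * e[t - 1] for t in range(1, n + 1)]


theta = -0.57  # hypothetical MA(1) coefficient
series = simulate_ma1(theta, 5000, random.Random(1))
acf = sample_acf(series, 3)
print([round(r, 3) for r in acf])          # lag 1 is large, later lags near zero
print(round(theta / (1 + theta ** 2), 3))  # theoretical lag-1 autocorrelation
```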

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o x x x o o o o
## 1 x x o o o o x o o o o o o o
## 2 x x x o o o x o o o o o o o
## 3 x x x x o o o o o o o o o o
## 4 x x x x x o o o o o o o o o
## 5 x x x x x x o o o o o o o o
## 6 x x x x x o x o o o o o o o
## 7 x x x x x x x o o o o o o o

Ignoring the small set of $x$’s above the main diagonal of o’s, we see that $\{\nabla \pi_t\}$ seems to be compatible with $\operatorname{ARMA}(0,1)$, $\operatorname{ARMA}(0,2)$, and $\operatorname{ARMA}(1,2)$. We will analyze these three.

Model construction and evaluation

A good model will yield residuals $\{e_t\}$ that look like zero mean white noise:

They are uncorrelated. If there are correlations, the residuals contain information that may be used to estimate a better model.
They have zero mean. If they have a non-zero mean, then the model (and forecasts) are biased.
They have constant variance.

IMA(1,1)

This is the simplest model of the three, and the simplest ARIMA model that seemed compatible with the data. When we fit the $\operatorname{ARIMA}(0,1,1)$ model to the time series data, we get the following result.

## ma1
## -0.5705635

To assess whether the model results in uncorrelated residuals, we inspect figure 5.

Plots to assess the residuals of $\operatorname{ARIMA}(0,1,1)$.

The histogram looks symmetric about $0$ but the Q-Q suggests a lack of normality in the residuals. More telling, plots of the sample ACF and PACF of the residuals are evidence of high correlation. We reject this model.

IMA(1,2)

When we fit $\operatorname{ARIMA}(0,1,2)$ to the time series data, we get the following results.

## ma1 ma2
## -0.55254743 -0.04124893

To assess whether the model results in uncorrelated residuals, we inspect the plots in figure 6.

Plots to assess the residuals of $\operatorname{ARIMA}(0,1,2)$.

The histogram looks symmetric about $0$ but, again, the Q-Q suggests a lack of normality in the residuals. Note that this is not strictly required. More telling, plots of the sample ACF and PACF of the residuals are evidence of high correlation. We reject this model.

ARIMA(1,1,2)

When we fit the $\operatorname{ARIMA}(1,1,2)$ model to the time series data, we get the following results.

## ar1 ma1 ma2
## 0.8839643 -1.4423574 0.4689439

To assess whether the model results in uncorrelated residuals, we inspect figure 7.

Plots to assess the residuals of $\operatorname{ARIMA}(1,1,2)$.

The histogram looks symmetric about $0$ but, again, the Q-Q suggests a lack of normality in the residuals. However, the sample ACF and PACF seem reasonable. We perform the Ljung-Box hypothesis test (lag = 10) on the residuals of the model for a more objective measure of white noise.

## Box-Ljung test
## data: model.3$residuals
## X-squared = 10.333, df = 7, p-value = 0.1705

The null hypothesis is that the residuals for the model are white noise. The test reports a $p$-value of $0.171$, so we fail to reject the null hypothesis; the residuals are compatible with white noise.
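
For reference, the Ljung-Box statistic is $Q = n(n+2)\sum_{k=1}^{m} r_k^2/(n-k)$, where $r_k$ is the lag-$k$ sample autocorrelation of the residuals and $m$ is the number of lags. The sketch below computes only $Q$; the function name is hypothetical, and the $p$-value step (comparison against a chi-squared distribution, here with $10-3=7$ degrees of freedom as in the output above) is left out.

```python
def ljung_box_q(residuals, lags):
    """Ljung-Box statistic Q = n (n + 2) * sum_{k=1}^{lags} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [v - mean for v in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, lags + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Tiny worked example: r_1 = 0.25, so Q = 4 * 6 * (0.0625 / 3) = 0.5.
print(ljung_box_q([1, 2, 3, 4], lags=1))
```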

We choose this model. Since only one model seemed like a reasonable fit, measures like AIC were not needed.

What follows is a summary of the chosen model.

## ARIMA(1,1,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.8840 -1.4424 0.4689
## s.e. 0.0276 0.0338 0.0266
##
## sigma^2 estimated as 1.088e-05: log likelihood=17177.64
## AIC=-34347.27 AICc=-34347.26 BIC=-34322.1

We present it in the more familiar form (R reports the moving-average coefficients under the sign convention $1 + \theta_1 \operatorname{B} + \theta_2 \operatorname{B}^2$) given by \[(1 - 0.884 \operatorname{B}) \nabla Y_t = (1 - 1.442 \operatorname{B}+ 0.469 \operatorname{B}^2) e_t\] or, equivalently (up to rounding), \[(1 - 0.884 \operatorname{B}) \nabla Y_t = (1 - 0.495 \operatorname{B}) (1 - 0.948 \operatorname{B}) e_t\] where $\{e_t\}$ is zero mean white noise, \[e_t \sim \operatorname{WN}(\mu=0,\sigma=0.0033).\]
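To double-check a factored form of the moving-average polynomial, its roots can be computed directly from the fitted coefficients. The sketch below is in Python rather than R and assumes R's sign convention, $\theta(\operatorname{B}) = 1 + \text{ma1}\,\operatorname{B} + \text{ma2}\,\operatorname{B}^2$; the helper name `quadratic_roots` is illustrative.

```python
import cmath


def quadratic_roots(c0, c1, c2):
    """Roots of c0 + c1*z + c2*z**2 = 0, assuming c2 != 0."""
    disc = cmath.sqrt(c1 * c1 - 4.0 * c2 * c0)
    return ((-c1 + disc) / (2.0 * c2), (-c1 - disc) / (2.0 * c2))


# Coefficients as reported in the model summary.
ma1, ma2 = -1.4424, 0.4689

# Roots of theta(z) = 1 + ma1*z + ma2*z**2.
ma_roots = quadratic_roots(1.0, ma1, ma2)

# theta(B) = (1 - r1*B)(1 - r2*B), where each r is the reciprocal of a root.
ma_factors = sorted((1.0 / z).real for z in ma_roots)

# The MA part is invertible when every root lies outside the unit circle.
invertible = all(abs(z) > 1.0 for z in ma_roots)
```

Under this convention the reciprocals of the roots come out near $0.495$ and $0.948$, both below one in absolute value, so the moving-average part is invertible.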

Forecasting

One of the primary goals of this time series analysis is forecasting, or predicting, the future accuracy of the adversary, i.e., $\hat{\pi}_{T+h|T}$. At $T=5000$, we perform a forecast up to $h=1000$ steps ahead. See figure 8.

Forecasting the future with the training set and test sets superimposed.

The held-out test data remains within the $80\%$ prediction interval most of the time. All things considered, this seems like a reasonable forecast. However, due to what we believe may be a model misspecification, we think the prediction intervals are too wide and we have reason to believe that, in the long run, $\{\pi_t\}$ will decrease in variance and hover around some asymptotic limit. We explore this more in the next section.

Incorporating a priori information

We have a priori knowledge that we wish to incorporate into the model. Assuming the plaintext distribution $\{x_t\}$ is static and the algorithm that converts plaintext to ciphers is fixed, we observe the following:

The accuracy of the adversary, $\pi_t$, is a measure between $0$ and $1$.
Under ideal conditions, acquiring more knowledge by observing a larger sample is not expected to harm the adversary’s accuracy $\pi_t$, in which case the expectation of $\pi_t$ would be a monotonically increasing function that has an asymptotic limit $c \leq 1$.

Of course, at different points in time the adversary’s accuracy may change due to, say, the presence of significant unaccounted covariates.

An ideal model for these axioms may be something like the Gompertz model or even a scaled, relocated, and shifted cumulative distribution function (cdf). However, for the sake of model simplicity, we assume a logarithmic form, which allows us to use linear regression. It does not have an asymptotic limit, but we hypothesize that it is a reasonable approximation for most finite time-horizons of interest.

Thus, we suppose the time series $\{\pi_t\}$ has the functional form \[\pi_t = \beta_0 + \beta_1 \log t.\]

Instead of i.i.d. “errors” (deviations from the mean), we have reason to believe the errors are correlated. We choose to model these errors in the $\operatorname{ARIMA}$ family, such that $\{\Pi_t\}$ is a random process of the form \[\Pi_t = \beta_0 + \beta_1 \log t + \eta_t\] where \[\eta_t \sim \operatorname{ARIMA}(p,d,q).\]

Actually, this is not quite true, since according to [@rob_arimax], “The presence of lagged values [...] means that $\beta_1$ can only be interpreted conditional on the value of previous values of the response variable, which is hardly intuitive.”

No matter, we press on and fit the model to the data, which yields the sought-after ARIMA regression errors for $\{\eta_t\}$ and estimates for the parameters.

## Regression with ARIMA(1,1,2) errors
##
## Coefficients:
## ar1 ma1 ma2 xreg
## 0.8521 -1.4012 0.4355 0.0292
## s.e. 0.0234 0.0277 0.0211 0.0202
##
## sigma^2 estimated as 6.828e-06: log likelihood=45279.87
## AIC=-90549.73 AICc=-90549.73 BIC=-90513.68

We have used the same model as before, except with the dynamic regression on the logarithm of time $t$. It turns out that, for the dynamic regression, $\operatorname{ARIMA}(2,1,2)$ does better and enjoys tighter prediction intervals as well, but it was not significantly better.

To assess whether the model results in uncorrelated residuals, we inspect figure 9.

Plots to conduct stationary assessments

## Box-Ljung test
## data: reg_model$residuals
## X-squared = 10.279, df = 6, p-value = 0.1134

At the $5\%$ significance level, we fail to reject the white noise hypothesis ($p = 0.113$), so this model passes, though not by a wide margin. We see that \[\hat{\Pi}_t = 0.029 \log t + \eta_t\] where \[\eta_t \sim \operatorname{ARIMA}(1,1,2)\] with the above specified estimated coefficients and \[e_t \sim \operatorname{WN}(\mu=0,\sigma=0.0026).\]

We show a time series plot of the model with both the training set (in black) and the test set (in green) superimposed onto it in figure 10.

Forecast distribution of more appropriate theoretical model of the time series.

We have forecasted much further into the future. The forecast seems reasonable, as it follows the subtle positive non-linear trend.

When we compare it to the previous ARIMA model, we see that the subtle positive trend is not captured by that earlier model. We can address this potential shortcoming by forcing the ARIMA model to include a drift term. When we do this, we get the following results.

## ar1 ma1 ma2 drift
## 8.472755e-01 -1.396547e+00 4.324823e-01 4.695287e-06

We see that the drift term is a positive value near $0$, but over a sufficiently long period of time it adds up, as demonstrated by figure 11.

Forecast distribution of the ARIMA model with drift.

The $\operatorname{ARIMA}$ model with drift adds a linear element to the autoregression, which theoretically is not appropriate.

Future work: dynamic regression on co-variates

In our time series analysis, the forecasting model only used lagged values of the confidentiality measure to forecast future values and we made no attempt to discover any other co-variates. Therefore, it extrapolated trends but ignored any other information such as side-channel information that may help or hinder the adversary’s efforts to decode the ciphers.

At a time $t'$, the adversary may learn something about the system other than observing the time series of ciphers, $\{C_t\}$. This information may be incorporated into the time series model through an autoregression that has predictor variables other than just lagged components of the measure on the adversary’s accuracy, $\{\pi_t\}$. The estimated parameters of the dynamic autoregressive model may also be used to explain the effect such predictor variables have on confidentiality.

A potentially interesting model is given by the data \[(t, \pi_t, I_t)\] where $t$ denotes time index, $\pi_t$ denotes the adversary’s accuracy at time $t$, and $I_t$ denotes the information measure of the $t$-th observation, defined as \[I_t = \log_2 \frac{1}{\Pr(\operatorname{g}(c_{t'}))}.\]

Observe that lagged components of $I_t$ may be used to make the regression a function of entropy $H_t$ as well.
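The information measure and its relationship to entropy can be illustrated with a short Python sketch. The function names are illustrative and the probabilities in the test values are made up; nothing here is taken from the cipher data.

```python
import math


def self_information(p):
    """Information, in bits, carried by an observation that has probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


def entropy(probs):
    """Shannon entropy in bits: the expected self-information of a distribution."""
    return sum(p * self_information(p) for p in probs if p > 0.0)
```

A rare observation carries more information than a common one: an observation with probability $1/8$ carries $3$ bits, while one with probability $1/2$ carries only $1$ bit.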

When the entropy is reduced or an informative observation comes in, this may have a larger impact on the time series $\{\pi_t\}$ and ideally we would incorporate this effect into the model.

The information gain does not necessarily need to be related to any of the time series previously mentioned, either. For instance, suppose the adversary, through side-channel information, acquires the knowledge that a certain cipher $c'$ maps to some smaller subset $\mathbb{W} \subset \mathbb{X}$. This also may be modeled as an information gain or entropy reduction, since the distribution of ciphers $\{c_t\}$ has less entropy given this information.

Conclusion

The statistician George Box once wrote, “All models are wrong, some are useful.” If we include drift in the ARIMA model, it eventually predicts impossible futures. More to the point, it is not a good match for the theoretical model, as its bias is a function of time $t$.

The logarithmic model performs better in this regard, as it takes an inordinately long time (1,531,520,000,000,000 steps) to reach impossible values, although the prediction interval reaches them much more quickly. In addition, it more closely matches the features of the theoretical underlying distribution.

That said, there is still a lot to be said for the $\operatorname{ARIMA}(1,1,2)$ model, given its simplicity. It takes a very long time before the simple model starts to look negatively biased.

Recall that if we specify that $\pi^* = 0.44$ is the minimum confidentiality we wish to maintain, a reasonable policy may be to use the latest observation to forecast the future and estimate the first time at which the forecast exceeds $\pi^*$, which is given by \[\hat{T}^* = \min\{T : \hat{\pi}_{T\,|\,11000} > \pi^*\}.\]

$\pi^*$ vs $T^*$.

Inspecting figure 12, we see that $\hat{T}^* \approx 20000$. That is, to keep the forecast accuracy of the adversary from exceeding $\pi^* = 0.44$, a password reset should occur before $T^* \approx 20000$.
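The policy amounts to scanning the point forecasts for the first time at which they exceed the threshold. A minimal Python sketch follows; the function name `first_crossing` and the forecast values are hypothetical and are not the fitted model's output.

```python
def first_crossing(forecasts, threshold):
    """Return the earliest time T whose forecast exceeds threshold, or None.

    forecasts is an iterable of (T, forecast value) pairs.
    """
    for t, value in sorted(forecasts):
        if value > threshold:
            return t
    return None


# Hypothetical point forecasts of the adversary's accuracy (made-up numbers).
toy_forecasts = [
    (12000, 0.410),
    (14000, 0.420),
    (16000, 0.430),
    (18000, 0.435),
    (20000, 0.445),
    (22000, 0.450),
]

reset_before = first_crossing(toy_forecasts, threshold=0.44)
```

Passing the upper bound of a prediction interval instead of the point forecast gives the more pessimistic estimate of $T^*$.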

Interestingly, the prediction intervals are far less forgiving and if we used those as a pessimistic estimator of $T^*$, we would need to do a password reset almost immediately.

To be a useful measure, it would seem that the uncertainty should be lower. Possibly, as we discuss in section Future work, we could incorporate other covariates that help reduce the uncertainty. Or, perhaps we need to impose a more realistic dynamic trend. After all, the logarithm is not a particularly good fit for the data either; it is simply potentially better than the other alternatives.

A system administrator being a typical example.↩︎
$\Pi_{T+h}$ given $\Pi_1 = \pi_1,\ldots,\Pi_T = \pi_T$.↩︎

SIUe - STAT 581 - Problem Set 8

Sat, 30 Oct 2021 08:18:32 -0500

Problem 1
- Part (a)
- Part (b)
- Part (c)
  - Interpretation
- Part (d)
- Part (e)
- Part (f)
- Part (g)
- Part (h)
Problem 2
- Part (a)
- Part (b)
- Part (c)
- Part (d)
  - Interpretation
- Part (e)
Appendix: code

Problem 1

An experiment is designed to test for systematic differences in the hardness measurements provided by two devices (fixed effect, factor $A$). Ten specimens are randomly selected (random effect, factor $B$). Each device is used to make $n=3$ hardness measurements on each specimen. The data is available on Blackboard as an Excel File.

Part (a)

Write the model for this mixed effects design, defining the fixed effect parameters, and the random effect parameters.

The mixed-effects model is given by \[ Y_{i j k} = \mu + \tau_i + \beta_j + (\tau\!\beta)_{i j} + \epsilon_{i j k} \; \begin{cases} i = 1,\ldots,a=2\\ j = 1,\ldots,b=10\\ k = 1,\ldots,n=3 \end{cases} \] where the fixed effect parameters are \[ \tau_1,\ldots,\tau_a \qquad \left(\sum \tau_i = 0\right) \] and the random effect parameters are \[ \sigma_\beta^2, \sigma_{\tau\!\beta}^2, \sigma^2. \]

Part (b)

Create an interaction plot to display the device effect on hardness measurement.

First, we load the data:

library("readxl")
data = read_excel("handout8data.xlsx")
A = as.factor(na.omit(data$device))
B = as.factor(na.omit(data$specimen))
y = na.omit(data$hardness)
a = length(levels(A))
b = length(levels(B))
N = length(y)
n = N / (a * b)
df.A = a-1
df.B = b-1
df.AB = (a-1)*(b-1)
c("a"=a,"b"=b,"n"=n,"N"=N,"df.A"=df.A,"df.B"=df.B,"df.AB"=df.AB)

## a b n N df.A df.B df.AB
## 2 10 3 60 1 9 9

The interaction plot is given by:

# remember that we are testing whether device differences are
# generalizable to a larger population of specimens.
interaction.plot(B,A,y)

Part (c)

Use a mixed model likelihood approach to test for a systematic difference in the measurements of the two devices. Compute the $F_0$ statistic, and the $p$-value. Provide an interpretation, stated in the context of the problem.

We compute the statistic with:

library("lme4")

## Loading required package: Matrix

contrasts(A)=contr.sum
mixed.mod = lmer(y ~ A + (1|B) + (1|A:B))

## boundary (singular) fit: see help('isSingular')

anova(mixed.mod)

## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## A 1 0.41667 0.41667 0.3121

We see that $F_0 = .312$ ($p = .579$).

Interpretation

Based on these observed test statistics, the experiment finds no evidence that factor $A$ (device) has an effect on response $y$ (hardness measurement of specimen).

Part (d)

Compute estimates of the fixed effect parameters.

The estimates of the fixed effect parameters $\{\tau_i\}$ are given by \[ \hat{\tau}_i = \bar{y}_{i \cdot\cdot} - \bar{y}_{\cdot\cdot\cdot} \] for $i=1,\ldots,a=2$. We use the following R code to compute $\hat{\tau}_1$:

coef(summary(mixed.mod))

## Estimate Std. Error t value
## (Intercept) 49.95000000 0.4282105 116.6482481
## A1 0.08333333 0.1491662 0.5586608

Since $\tau_1 + \tau_2 = 0$, this means $\hat{\tau}_1 = .083$ and $\hat{\tau}_2 = -\hat{\tau}_1 = -.083$.

Part (e)

Now write an $F_A$ statistic as a ratio of mean squares.

We use Dr. Neath’s custom function \[ \operatorname{mixed.test} : [\text{fixed factors}] \times [\text{random factors}] \times [\text{responses}] \mapsto [\text{statistics}] \] to compute the $F_A$ statistic:

mixed.test(A,B,y)

## SS df MS
## Fixed Effect A 0.4166667 1 0.4166667
## Random Effect B 99.0166667 9 11.0018519
## Interaction AB 5.4166667 9 0.6018519
## Error 60.0000000 40 1.5000000
## F-test for fixed effect p-value
## 0.6923077 0.4269057
## error.var interaction.var block.var
## 1.5 -0.2993827 1.733333

The $F$-test statistic for the factor $A$ effect is given by \[ F_A = \frac{\operatorname{MS_{A}}}{\operatorname{MS_{A B}}} = .692, \] which has the reference distribution \[ F(\rm{df}_A=a-1,\rm{df}_{A B} = (a-1)(b-1)) = F(1,9). \] Thus, the $p$-value is given by $\Pr\{F(1,9) > F_A\} = .427$.

Part (f)

Write the algebraic formula for the $F_0$ statistic from a block design on the sample means.

\[ F_0 = \frac {b \sum_{i=1}^{a}(\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2/(a-1)} {\sum_{i=1}^{a}\sum_{j=1}^{b}( \bar{Y}_{i j\cdot}- \bar{Y}_{i \cdot\cdot}- \bar{Y}_{\cdot j \cdot}+ \bar{Y}_{\cdot\cdot\cdot} )^2 / [(a-1)(b-1)]}. \]

Part (g)

Show computationally that (e) and (f) lead to the same test statistic.

We must use the means of the response over repeated measurements of the block (specimen) at a given factor level $A$. These values have already been computed for us in handout8data.xlsx:

# the variables below represent the same data, only with repeat measurements
# summarized by their sample mean.
d = as.factor(na.omit(data$d))
s = as.factor(na.omit(data$s))
h.means = na.omit(data$h)

Now, we do the calculations:

# below we run a randomized block design with the sample means as the
# response variables.
# note that the test for a fixed factor effect is equivalent to that
# from the mixed.test function.
rcbd.mod = lmer(h.means ~ d + (1|s))
anova(rcbd.mod)

## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## d 1 0.13945 0.13945 0.6954

We see that this result matches the result in (e) to two decimal places ($0.695$ versus $0.692$), i.e., mixed.test(A,B,y) produced essentially the same test statistic.

Part (h)

Use the result from (g) to argue why interaction mean squares is the appropriate error term.

Repeat measurements are summarized by a sample mean. The test statistic for a block design then leads to the interaction mean squares under a mixed model as the error term.
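The equivalence argued in parts (g) and (h) can be checked numerically on any balanced data set. The following Python sketch uses a small made-up data set (not the hardness data) and an illustrative function name, `two_way_f`. It computes $\operatorname{MS_A}/\operatorname{MS_{AB}}$ from the replicated data and $F_0$ from the block design on the cell means.

```python
def mean(xs):
    return sum(xs) / len(xs)


def two_way_f(cells):
    """cells[i][j] holds the n replicate responses at level i of A and level j of B.

    Returns (MS_A / MS_AB from the replicated data, F_0 from the cell means).
    Assumes a balanced design (the same n in every cell).
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell = [[mean(cells[i][j]) for j in range(b)] for i in range(a)]
    row = [mean(cell[i]) for i in range(a)]                          # Ybar_i..
    col = [mean([cell[i][j] for i in range(a)]) for j in range(b)]   # Ybar_.j.
    grand = mean(row)                                                # Ybar_...

    ss_rows = sum((r - grand) ** 2 for r in row)
    ss_inter = sum(
        (cell[i][j] - row[i] - col[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )

    # Mean squares from the full replicated data.
    ms_a = b * n * ss_rows / (a - 1)
    ms_ab = n * ss_inter / ((a - 1) * (b - 1))

    # Block design on the cell means: treatment MS over residual MS.
    f0 = (b * ss_rows / (a - 1)) / (ss_inter / ((a - 1) * (b - 1)))
    return ms_a / ms_ab, f0


# Made-up data: a = 2 levels of A, b = 3 levels of B, n = 2 replicates.
toy = [
    [[10.0, 12.0], [14.0, 13.0], [9.0, 11.0]],
    [[11.0, 15.0], [13.0, 16.0], [12.0, 10.0]],
]
f_mixed, f_block = two_way_f(toy)
```

The factor $n$ appears in both $\operatorname{MS_A}$ and $\operatorname{MS_{AB}}$ and cancels in the ratio, which is why the two statistics agree.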

Problem 2

A mixed effects design is used to investigate the effects of operator (fixed effect, factor A) and machine (random effect, factor B) on the breaking strength of a synthetic fiber. There are $a=3$ operators under investigation. A random sample of $b=4$ machines is selected, and each operator produces $n=2$ samples on each of the selected machines. The data is available on Blackboard as an Excel File.

Part (a)

State the expected value for each of the mean squares.

\[\begin{align*} E(\operatorname{MS_{A}}) &= \sigma^2 + n \sigma_{\tau \! \beta}^2 + \frac{b n \sum_{i=1}^{a}\tau_i^2}{a-1},\\ E(\operatorname{MS_{B}}) &= \sigma^2 + n \sigma_{\tau \! \beta}^2 + a n \sigma_{\beta}^2,\\ E(\operatorname{MS_{A B}}) &= \sigma^2 + n \sigma_{\tau \! \beta}^2,\\ E(\operatorname{MS_{E}}) &= \sigma^2. \end{align*}\]

Part (b)

Compute unbiased estimates for the random effect parameters.

A = as.factor(na.omit(data$op))
B = as.factor(na.omit(data$mach))
y = na.omit(data$strength)
mixed.test(A,B,y)

## SS df MS
## Fixed Effect A 160.33333 2 80.166667
## Random Effect B 12.45833 3 4.152778
## Interaction AB 44.66667 6 7.444444
## Error 45.50000 12 3.791667
## F-test for fixed effect p-value
## 10.76866 0.01034401
## error.var interaction.var block.var
## 3.791667 1.826389 -0.5486111

The random effect parameters are given by $\sigma^2$, $\sigma_\beta^2$, and $\sigma_{\tau\beta}^2$. Estimators for these parameters are given by \[\begin{align*} \hat\sigma^2 &= \operatorname{MS_{E}} = 3.792,\\ \hat\sigma_{\beta}^2 &= \frac{\operatorname{MS_{B}}-\operatorname{MS_{A B}}}{a n} = -0.549,\\ \hat\sigma_{\tau\!\beta}^2 &= \frac{\operatorname{MS_{A B}}-\operatorname{MS_{E}}}{n} = 1.826. \end{align*}\]
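The arithmetic behind these estimators, and the $F_A$ statistic, can be reproduced from the mean squares in the table. This is a minimal Python sketch with an illustrative function name; the inputs are the mean squares printed by mixed.test above.

```python
def mixed_model_estimates(ms_a, ms_b, ms_ab, ms_e, a, n):
    """Method-of-moments estimates for a balanced two-factor mixed model.

    Returns (F_A, error variance, interaction variance, block variance).
    """
    f_a = ms_a / ms_ab                       # interaction MS is the error term
    var_error = ms_e
    var_interaction = (ms_ab - ms_e) / n
    var_block = (ms_b - ms_ab) / (a * n)
    return f_a, var_error, var_interaction, var_block


# Mean squares for Problem 2, with a = 3 operators and n = 2 replicates.
f_a, var_error, var_interaction, var_block = mixed_model_estimates(
    ms_a=80.166667, ms_b=4.152778, ms_ab=7.444444, ms_e=3.791667, a=3, n=2
)
```

The block variance estimate is negative because $\operatorname{MS_B} < \operatorname{MS_{AB}}$; since a variance cannot be negative, such an estimate is commonly reported as zero.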

Part (c)

Use the result from (a) to argue why interaction mean squares is the appropriate error term.

Under the null hypothesis \[ H_0 : \tau_1 = \cdots = \tau_a = 0, \] $E(\operatorname{MS_{A}}) = \sigma^2 + n \sigma_{\tau \! \beta}^2 + \frac{b n \sum_{i=1}^{a}\tau_i^2}{a-1} = \sigma^2 + n \sigma_{\tau\!\beta}^2$. A valid $F$ ratio requires a denominator whose expected value equals that of the numerator under $H_0$. Since $E(\operatorname{MS_{A B}}) = \sigma^2 + n \sigma_{\tau\!\beta}^2$, $\operatorname{MS_{A B}}$ is the appropriate error term.

Part (d)

Perform a test for operator effects. Compute the $F_A$ statistic, and the $p$-value. Provide an interpretation, stated in the context of the problem.

From part (b), $F_A = \frac{\operatorname{MS_{A}}}{\operatorname{MS_{A B}}} = 10.769$, which has a $p$-value of $.010$.

Interpretation

The experiment finds that factor $A$ (operator) does have an effect on the fiber strength.

Part (e)

Now, use the idea of random factors as an experimental unit to explain why interaction mean squares is the appropriate error term. In particular, comment on how taking repeat measurements on a selected level of a random factor does not increase the pertinent sample size.

Taking repeat measurements at each randomly selected level may serve to increase the measurement accuracy, but does not increase the pertinent sample size. (However, taking repeat measurements does allow us to learn more about the treatment effect at those selected levels, e.g., a therapy may work well for you but is not effective on average.)

Appendix: code

# Mixed-model analysis for a two-factor design with fixed factor A, random
# factor B, and response y. Prints the ANOVA table, the F-test for the fixed
# effect (with the interaction mean square as the error term), and the
# method-of-moments estimates of the variance components.
mixed.test = function(A,B,y)
{
  av=anova(lm(y~A*B))

  # F statistic for the fixed effect: MS_A / MS_AB
  F.a = av$`Mean Sq`[1]/av$`Mean Sq`[3]
  p.value = pf(F.a,df1=av$Df[1],df2=av$Df[3],lower.tail = FALSE)

  # sums of squares, degrees of freedom, and mean squares
  table1 = matrix(c(av$`Sum Sq`[1],av$`Sum Sq`[2],av$`Sum Sq`[3],av$`Sum Sq`[4],
                    av$Df[1],av$Df[2],av$Df[3],av$Df[4],
                    av$`Mean Sq`[1],av$`Mean Sq`[2],av$`Mean Sq`[3],av$`Mean Sq`[4]),nrow = 4)
  dimnames(table1) = list(c("Fixed Effect A","Random Effect B","Interaction AB","Error"),
                          c("SS","df","MS"))
  print(table1)

  table2 = matrix(c(F.a,p.value),nrow = 1)
  dimnames(table2) = list(c(""),c("F-test for fixed effect","p-value"))
  print(table2)

  # variance component estimates derived from the expected mean squares
  a=nlevels(A)
  b=nlevels(B)
  n=length(y) / a / b
  var.hat = av$`Mean Sq`[4]
  var.interaction.hat = (av$`Mean Sq`[3]-av$`Mean Sq`[4])/n
  var.block = (av$`Mean Sq`[2]-av$`Mean Sq`[3])/n/a
  table3 = matrix(c(var.hat,var.interaction.hat,var.block),nrow=1)
  dimnames(table3) = list(c(""),c("error.var","interaction.var","block.var"))
  print(table3)
}

Survey: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures (2009)

Sat, 30 Oct 2021 08:18:32 -0500

Introduction
Accelerated Critical Sections (ACS)
Trade-offs
Comparison: ACS vs SCMP
Conclusion

Introduction

In the paper “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” the authors, Suleman, Mutlu, Qureshi, and Patt, concern themselves with the problem popularly expressed by Amdahl’s law, \[ \operatorname{speedup}(N) = \frac{1}{1 - P + \frac{P}{N}}, \] where $P$ is the fraction of the program that can be executed in parallel and $\operatorname{speedup}$ is the total execution time of a single-threaded version of a program divided by the total execution time of an $N$-threaded implementation of the same program.

Accordingly, the asymptotic speedup of a program as a function of $N$ is determined by the $1-P$ term, i.e., $\operatorname{speedup}(N)$ goes to $1/(1-P)$ as $N$ goes to infinity. Thus, even if 99% of the program can be run in parallel, the maximum possible $\operatorname{speedup}$ is only $1/(1-0.99) = 100$. To achieve scalable performance with increasing thread counts, the sequential portion of code must therefore be aggressively reduced no matter how small it is.
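Amdahl's law is easy to explore numerically. The following Python sketch evaluates the formula and its limit; the function names are illustrative, and $P$ is taken to be the parallelizable fraction of the program.

```python
def amdahl_speedup(p, n):
    """Speedup of an n-threaded run when a fraction p of the work is parallel."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p):
    """Asymptotic speedup as the thread count goes to infinity."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


# Speedup for a 99% parallel program at a few thread counts.
speedups = {n: amdahl_speedup(0.99, n) for n in (1, 8, 32, 1024)}
```

With $P = 0.99$, even $1024$ threads stay well short of the limit of $100$, which is the motivation for shortening the serial portion itself.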

Accelerated Critical Sections (ACS)

The authors point out that a common occurrence of thread serialization is caused by the presence of critical sections, defined as code blocks guarded by locks to enforce mutual exclusion.

To mitigate the effect of critical section serialization (see Figure 1), they propose Accelerated Critical Sections (ACS). Specifically, using an asymmetric multi-core processor in which one of the cores is four times the size of the other cores, use the large core to quickly execute critical sections on behalf of the smaller cores.

Figure 1: Critical section

Trade-offs

There are several key performance trade-offs with respect to ACS.

Trade-off #1

Faster critical section execution versus fewer threads. That is, the large core takes up an area equivalent to 4 small cores; therefore, fewer cores are available for thread execution. However, there are two forces which conspire to mitigate this disadvantage.

First, as the transistor budget increases, the marginal cost of replacing multiple small cores with a single large core decreases. For example, suppose we have an area budget of 32 small cores. We can use that budget to either make a system with 32 small cores, or alternatively a system with 1 large core and 28 small cores. The system with the large core has (not counting the large core in this fraction since it can only be used to dispatch critical sections) $28/32 = 87.5\%$ of the number of cores of the system with 32 small cores. This is much more favorable than if we only had an area budget of 8 small cores, where the corresponding fraction is $4/8 = 50\%$.
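The core-count arithmetic behind this argument can be sketched in a few lines of Python. The names are illustrative; the only input taken from the text is that the large core occupies the area of 4 small cores.

```python
LARGE_CORE_AREA = 4  # area of the large core, in units of small cores


def small_cores_with_large_core(area_budget):
    """Small cores left over after one large core is placed on the chip."""
    return area_budget - LARGE_CORE_AREA


def small_core_fraction(area_budget):
    """Small cores in the asymmetric design relative to the all-small design."""
    return small_cores_with_large_core(area_budget) / area_budget


fractions = {budget: small_core_fraction(budget) for budget in (8, 16, 32)}
```

The fraction of thread-executing cores that is retained rises from one half at a budget of 8 to seven eighths at a budget of 32, so the relative cost of the large core shrinks as the budget grows.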

Second, as critical section contention grows, increasing thread counts provide diminishing (or even negative) returns. Contention tends to grow in line with increasing core counts; therefore, trading multiple small cores for a single large core becomes a better deal as the transistor budget increases. In general, ACS can only improve performance if the gain from executing critical sections faster on the large core outweighs the advantage of having more threads, which becomes ever more likely with increasing transistor budgets.

Trade-off #2

Cache misses caused by private data as opposed to cache misses caused by shared data. In particular, if private data on a small core is referenced inside the critical section, then that data needs to be transferred from the small core to the large core. However, critical sections frequently access shared memory; thus, if we usually only execute code that accesses shared memory on the large core, that will lead to improved cache utilization (reduced cache misses) when accessing shared memory. In general, if accessing shared data is more common than accessing private data in a critical section, which is not unlikely, then the trade-off tips in favor of ACS.

Trade-off #3

The communications overhead incurred by having the large core execute a critical section on behalf of a small core. In particular, when a thread on a small core encounters a critical section, it issues a request to the large core to execute the specified critical section. When the large core receives this request, the request is placed on a queue (the Critical Section Request Buffer, or CSRB) and executed by the large core when it reaches the head of the queue. This communication overhead is avoided if the critical section is simply executed on the local small core. However, this overhead may be mitigated by the fact that if multiple cores are trying to read or write a shared lock, then the overhead of having to synchronize on this lock (e.g., propagating the lock’s state to remote caches when it changes) can be significant. If the lock is usually only accessed by the large core, then the need to synchronize the state of the lock is reduced (fewer cache misses).

Trade-off #4

If a program has many disjoint critical sections, where each critical section can be accessed by multiple threads simultaneously without contention, then executing them exclusively on the large core will cause them to unnecessarily execute sequentially; this is what the authors call “false serialization.” To mitigate this issue, they propose adding additional circuitry to heuristically estimate whether a critical section is being falsely serialized.

The estimation uses the following quantification: count the number of requests in the CSRB for which the lock address for the enqueued request’s critical section is different from the lock address for the incoming request’s critical section. If the count is greater than one, there are at least two independent critical sections already waiting to be executed in the CSRB. Add this count to a counter for the incoming lock address.

Alternatively, if the count is only one, then decrement the lock address’ counter. If a counter reaches a maximum threshold, the ACS is disabled for that lock address. Thus when a thread happens upon that lock, it will try to acquire it and execute the critical section locally. The hope is the false serialization rate is not so high that it overcomes the benefit of faster critical section execution.
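The estimation logic described above can be sketched in software, although in the paper it is a hardware mechanism. The Python class below is only an illustration under several assumptions that the description leaves open: the class and method names, the threshold value, the counter never dropping below zero, a count of zero leaving the counter unchanged, and ACS staying disabled for a lock once the threshold is reached.

```python
from collections import defaultdict


class FalseSerializationEstimator:
    """Per-lock counters that estimate false serialization in the CSRB."""

    def __init__(self, threshold):
        self.threshold = threshold        # hypothetical maximum threshold
        self.counters = defaultdict(int)  # lock address -> counter
        self.disabled = set()             # locks for which ACS is turned off

    def on_request(self, incoming_lock, csrb_locks):
        """Update the counter for incoming_lock; return True if ACS stays enabled.

        csrb_locks lists the lock addresses of requests already queued.
        """
        # Queued requests whose lock differs from the incoming request's lock.
        independent = sum(1 for lock in csrb_locks if lock != incoming_lock)
        if independent > 1:
            self.counters[incoming_lock] += independent
        elif independent == 1:
            # Assumption: the counter is floored at zero.
            self.counters[incoming_lock] = max(0, self.counters[incoming_lock] - 1)
        if self.counters[incoming_lock] >= self.threshold:
            self.disabled.add(incoming_lock)
        return incoming_lock not in self.disabled
```

A thread whose lock has been disabled would then acquire the lock and run the critical section on its own small core.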

Comparison: ACS vs SCMP

With the above trade-offs in mind, the authors set out to see how their proposed ACS system compares to two other kinds of systems: a symmetric CMP (SCMP), a multi-core processor in which each core is uniform, and an asymmetric CMP (ACMP), a multi-core processor in which one of the cores is larger like in the ACS system except that the ACMP system’s large core is not used for accelerating critical sections.

To compare these systems, they compare their respective execution times on a carefully chosen benchmark suite. To make the comparison fair, they must control for any hardware differences that may lead to one system having an advantage for reasons unrelated to their ACS framework, e.g., use identical cores (with a few exceptions since the ACS requires hardware support), memory configurations, transistor budgets, and so forth.

To control for this, they simulate each system in software rather than real silicon and provide them approximately identical designs and constraints. On each system (ACS, SCMP, and ACMP), they run the same benchmark suite on different area budgets (in units of “small-cores”), e.g., in benchmarks run on systems with a budget of 8 small cores, the SCMP system consists of 8 small cores and the ACMP and ACS systems consist of 4 small cores and one large core.

They expect that, for reasons related to the previously mentioned trade-offs (e.g., decreasing marginal costs), as the area budget increases they will see a gradual improvement in the ACS systems compared to the SCMP and ACMP systems.

Additionally, the benchmark suite can be divided into two broad categories: programs that use coarse-grained locking (10 or fewer critical sections) and programs that use fine-grained locking. The two categories stress different aspects of the previously mentioned trade-offs, and they expect that their ACS system will compare more favorably in coarse-grained locking benchmarks (fine-grained locking reduces critical section contention, so benefits from ACS are reduced).

On the coarse-grained benchmarks, the ACS system generally compares better than the other two even when the area budget is only 8 small cores; it wins five out of seven of the benchmarks.

One of the benchmarks it compares less favorably in is a parallelized implementation of quicksort. Quicksort experiences very little contention on critical sections (and thus benefits from higher thread counts) and also frequently accesses private data (which must be transferred to the ACS’s large core).

However, as the area budget increases ACS compares more favorably. Indeed, on systems with an area budget of 32 small cores, the ACS system improves performance by 42% compared to SCMP and by 31% compared to ACMP.

On the fine-grained benchmarks, as the authors expected, the ACS system compares less favorably. On systems with an area budget of only 8 small cores, ACS reduces execution time on average by 20% compared to ACMP but increases execution time by a significant 37% compared to SCMP. Critical section contention was so low that SCMP was able to take effective advantage of its larger core count. However, as the area budget scales up, critical section contention should increase and the ACS system gains some lost ground.

Indeed, when the area budget reaches 32, ACS outperforms SCMP by 17% and ACMP by 13%. In addition, on many of the benchmarks performance on ACS scaled more consistently with core quantity than on SCMP and ACMP. For instance, on the SCMP system it was usually necessary to reduce the optimal thread count to something well below the total number of cores to prevent critical section contention from degrading performance, but with the ACS system improvement would often be seen up to the maximum of 32 cores with room for growth.

Conclusion

ACS outperforms both SCMP and ACMP on all representative (coarse-grained and fine-grained) critical-section benchmarks on processors with 16 or more cores, and it even performs well on more modest processors (especially in the coarse-grained benchmarks). In light of this and the reasonable argument that ACS will continue to extend its lead as core counts and critical-section contention rise hand in hand, ACS is a promising area of future development.

Multiprocessor synchronization: tournament-Peterson lock

Sat, 30 Oct 2021 00:00:00 +0000

Multiprocessor synchronization is a notoriously tricky subject matter. Unlike with a single thread of execution, in a shared-resource system, where resources are shared among multiple independent processors, we must think very hard about how the critical sections, where such shared resources are accessed, are protected.

Definition: Peterson’s algorithm is a concurrent programming algorithm for mutual exclusion that allows two or more processes to share a single-use resource without conflict, using only shared memory for communication.

In this post, we generalize the result and provide a Java-based solution.

Generalization to a power of 2 locks

A way to generalize the two-thread Peterson lock is to arrange a number of 2-thread Peterson locks in a binary tree. Suppose $n$, the number of threads, is a power of two. Each thread is assigned a leaf lock which it shares with one other thread. Each lock treats one thread as thread 0 and the other as thread 1. In the tree-lock’s acquire method, the thread acquires every two-thread Peterson lock from that thread’s leaf to the root. The tree-lock’s release method unlocks each of the 2-thread Peterson locks that thread has acquired, from the root back to its leaf. At any time, a thread can be delayed for a finite duration. (In other words, threads can take naps, or even vacations, but they do not drop dead.)
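The leaf-to-root path that a thread follows can be computed with simple index arithmetic. The Python sketch below assumes one possible layout, which is not necessarily the one used in the Java solution: the $n-1$ locks are numbered like a binary heap (the root is node 1 and the children of node $k$ are $2k$ and $2k+1$), and the function name `lock_path` is illustrative.

```python
def lock_path(thread_id, n):
    """Return the (node, role) pairs a thread acquires, ordered leaf to root.

    node is the heap index of a 2-thread Peterson lock (1 is the root), and
    role (0 or 1) is the identity the thread assumes at that lock.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    if not 0 <= thread_id < n:
        raise ValueError("thread_id must be in range(n)")
    path = []
    position = n + thread_id      # virtual slot just below the leaf locks
    while position > 1:
        role = position % 2       # which side of the parent lock we arrive from
        position //= 2            # move up to the parent lock
        path.append((position, role))
    return path
```

Acquire would walk this list in order, and release would walk it in reverse, from the root back to the leaf. Each thread passes through $\log_2 n$ locks.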

Theorem: The tournament-Peterson lock guarantees mutual exclusion.

Proof: $n$ threads require a tree structure with $n-1$ nodes, $n/2$ of which will be leaf nodes. Each node represents a 2-thread Peterson lock, and each leaf node serves as the first Peterson lock for two particular threads. If thread $A$ acquires the lock for its leaf node, it is promoted to the next layer in the tree. If another thread $B$ is trying to acquire the same leaf lock, it remains there; the state of this leaf node will at this point be flag[B] = true, flag[A] = true, and victim = B. Therefore, thread $B$ will not be promoted to the next level until thread $A$ sets flag[A] = false or victim = A. Other threads $C, D, \ldots$ may also be trying to acquire their respective leaf node locks. Regardless, only half of the $n$ threads can acquire their respective leaf node locks (since each Peterson lock ensures mutual exclusion) and be promoted to the next layer. We repeat this process for the remaining $$ \lceil \log_2 (n/2) \rceil = \lceil \log_2 n - 1 \rceil $$ layers, after which only one thread can remain. The root node represents the final lock, which at most two threads may contend for at any given moment. Of these two threads, the one that acquires this final lock has effectively acquired the tree-lock.

Since each node provides mutual exclusion between two threads, and the tree structure only allows two threads to reach each node, each subtree must ensure mutual exclusion at its root node. Therefore, the root of the entire tree must also ensure mutual exclusion. ∎

Theorem: The tournament-Peterson lock guarantees freedom from deadlock.

Proof: When a thread $A$ releases the tree-lock, it sets its own flag to false in each Peterson lock on the path between the root and its leaf node. Thus, at each node on that path, the other thread that “lost” to $A$ must be able to progress, since it is no longer the case that

 flag[A] == true && victim == ~A

is true. Each such thread therefore escapes its while loop and is promoted to the next level. Therefore, the lock is deadlock free. ∎

Theorem: The tournament-Peterson lock guarantees freedom from starvation.

Proof: Starvation freedom guarantees deadlock freedom, but deadlock freedom does not guarantee starvation freedom. So, while the previous proof shows that deadlock is not possible, it does not show that starvation is impossible, and we consider that case separately. For a lock to be starvation free, each thread trying to acquire the lock must eventually acquire it; for starvation to happen, a thread must be bypassed forever. Each Peterson lock is itself starvation free, so this cannot happen at the level of a single node: at each layer in the tree, the thread that arrives earlier (and is therefore not the victim) must eventually be promoted. Recursively, then, we see that each subtree is starvation free, therefore the entire tree is starvation free. ∎

Tournament-Peterson Lock

Here is the source code solution.

/**
 * @title Tournament-Peterson lock
 *
 * @author Alex Towell (lex@metafunctor.com)
 * @file TournamentPeterson.java
 * @since 1.6
 * @date 2/8/2011
 * @course CS 590-002 Multiprocessor Synchronization
 * @desc Implementation of an n-thread lock using a binary tree of 2-thread
 *       Peterson locks. The number of threads the lock correctly works with
 *       must be specified upon Tournament lock instantiation.
 *
 * @require Lock.java, Peterson.java, ThreadID.java
 *
 * @precondition
 *       each thread must have a ThreadID in the range of 0 to threads-1,
 *       where threads is an integer passed to the constructor.
 *
 * @note I modified the Peterson lock slightly so that its constructor
 *       accepts a single int parameter for determining the flag-thread
 *       mapping.
 */
package TournamentPetersonLock;

/**
 * Tournament tree lock.
 *
 * Tournament tree of Peterson locks to provide mutually exclusive access to
 * a critical section for n threads.
 *
 * Each Peterson lock is a 2-thread lock, so the Tournament tree is a binary
 * tree of such locks which only permits one thread from each of its left and
 * right subtrees to pass to it.
 *
 * @note the number of threads the lock correctly works with must be
 *       specified upon construction of the Tournament lock
 *
 * @precondition each thread using the Tournament tree must have a unique
 *               thread ID in the range of [0, threads-1]
 *
 * @see Lock
 * @see Peterson
 * @see ThreadID
 */
class Tournament implements Lock
{
    /**
     * Construct a Tournament lock for the specified number of threads.
     *
     * @param threads the number of threads to configure the lock for
     */
    public Tournament(int threads)
    {
        this.threads = threads;
        // note: the tree has threads-1 nodes and the root is at index 1
        // (index 0 is unused), so the locks array has size threads
        this.locks = new Peterson[threads];
        // instantiate and configure each Peterson lock
        createLocks(1, threads/2, threads/2);
    }

    /**
     * Used by current thread to request a lock on a critical section protected
     * by this Tournament lock.
     *
     * @precondition thread does not have the lock
     * @postcondition thread acquired the lock
     */
    public void lock()
    {
        // start at the leaf node lock for current thread
        int index = getLeafLock();
        // root is index 1, so exit the while loop when index 0 (index of
        // the unused, uninitialized slot) is reached.
        while (index != 0)
        {
            locks[index].lock();
            index /= 2;
        }
    }

    /**
     * Release a lock on a critical section (releases each of the Peterson
     * locks a thread acquired).
     *
     * @precondition thread has the lock
     * @postcondition thread released the lock
     *
     * @see #unlock(int)
     */
    public void unlock()
    {
        unlock(getLeafLock()); // call unlock(int) on leaf node for thread
    }

    /**
     * Private method helper for unlock().
     *
     * @param index
     *      unlock all nodes along the path from the root to the node at
     *      this index (initially the leaf node of the current thread)
     */
    private void unlock(int index)
    {
        if (index != 0)
        {
            unlock(index/2);
            locks[index].unlock(); // post-order: unlock after recursive call
                                   // to unlock from root down to leaf
        }
    }

    /**
     * Private helper function that creates Peterson locks for subtree rooted
     * at specified index.
     *
     * Each Peterson lock is created such that when a thread calls lock() or
     * unlock(), it will be assigned the correct flag index into each of the
     * Peterson nodal locks of the Tournament tree.
     *
     * @precondition index > 0
     * @postcondition every lock in the subtree rooted at index will be
     *                configured correctly
     *
     * @param index
     *      subtree's rooted index (array-based binary tree) to create locks
     *      for
     * @param lessThan
     *      if a thread arrives at this lock node, any thread ID < lessThan
     *      will be assigned to index 0 in the Peterson lock flag array,
     *      otherwise it will be assigned to index 1
     * @param size
     *      the size of left or right subtree of parent at specified index
     */
    private void createLocks(int index, int lessThan, int size)
    {
        if (index < threads)
        {
            // instantiate Peterson lock at specified node
            locks[index] = new Peterson(lessThan);
            // instantiate the left and right subtrees of this node
            size /= 2;
            createLocks(getLeftChild(index), lessThan - size, size);
            createLocks(getRightChild(index), lessThan + size, size);
        }
    }

    /**
     * Returns the index of the leaf node for current thread.
     *
     * This private helper method will map a ThreadID to the proper leaf
     * node index in the Tournament tree of Peterson locks.
     */
    private int getLeafLock()
    {
        return (threads + ThreadID.get())/2;
    }

    /**
     * Returns the index of the left child node of the parent node at
     * specified index.
     *
     * @param index index of parent node
     * @return int
     *      index of left child node, note that if return value is >= threads
     *      then node at index is a leaf
     */
    private static int getLeftChild(int index)
    {
        return 2*index;
    }

    /**
     * Returns the index of the right child node of the parent node at
     * specified index.
     *
     * @param index index of parent node.
     * @return int
     *      index of right child node, note that if return value is >= threads
     *      then node at index is a leaf
     */
    private static int getRightChild(int index)
    {
        return 2*index+1;
    }

    /**
     * Array of 2-thread Peterson locks.
     *
     * @see Peterson
     */
    private Peterson[] locks;

    /**
     * The number of threads constructed to work with.
     *
     * @see #Tournament(int)
     */
    private int threads;
}
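To make the array layout concrete, here is a worked trace of the index arithmetic used by getLeafLock() and lock(), assuming a hypothetical configuration of $n = 8$ threads (so the tree has $7$ nodes at indices $1$ through $7$, with leaves at indices $4$ through $7$). All divisions are integer divisions, as in the Java code. For the thread with ThreadID $5$:

$$ \text{leaf} = \lfloor (8 + 5)/2 \rfloor = 6, \qquad 6 \to \lfloor 6/2 \rfloor = 3 \to \lfloor 3/2 \rfloor = 1 \to \lfloor 1/2 \rfloor = 0 \ (\text{stop}). $$

So lock() acquires the Peterson locks at indices $6$, $3$, and $1$, in that order, and unlock() releases them in the order $1$, $3$, $6$. Under the same assumption, createLocks assigns lessThan values of $4$ to the root, $2$ and $6$ to nodes $2$ and $3$, and $1, 3, 5, 7$ to leaves $4$ through $7$. Thread $5$ is therefore thread 1 at leaf $6$ (since $5 \not< 5$), thread 0 at node $3$ (since $5 < 6$), and thread 1 at the root (since $5 \not< 4$).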

Review: A Symbolic Representation of Time Series, with Implications for Streaming Algorithms

Sat, 30 Oct 2021 00:00:00 +0000

In [1], the authors present a method for constructing a symbolic (nominal) representation for real-valued time series data. A symbolic representation is desirable because then it becomes possible to use many of the effective algorithms that require symbolic representation, like hashing and Markov models.

The authors claim that one of the most useful time series operations is measuring the similarity between two time series data sets. To do this on the original time series, the Euclidean distance formula can be used. Therefore, for a time series transformation to be useful, distance measures applied to the corresponding transformations should provide some guaranteed lower bound on the true distance. This is a basic requirement for almost all time series algorithms in data mining. Non-symbolic transformations like Discrete Fourier Transform (DFT) and Piecewise Aggregate Approximation (PAA) models have this lower-bounding property. However, the authors claim no previously proposed symbolic representations do, which limits their usefulness.

Additionally, the authors observe that most raw time series data sets have very high dimensionality. This is problematic because time series mining algorithms are $\mathcal{O}(cn)$, where n is the number of dimensions. Therefore, preferably any transformations on the original time series will reduce the dimensionality to a more manageable size. Unfortunately, the authors observe, previously proposed symbolic representations preserve the original time series dimensionality.

Next, the authors present their symbolic representation, SAX (Symbolic Aggregate approXimation), which addresses each of the previously mentioned shortcomings of symbolic representations.

SAX is unique in that it uses an intermediate transformation, PAA, and then nominalizes the PAA representation into a sequence of characters, that is, a string. By using the intermediate PAA representation, SAX enjoys two benefits:

It is able to exploit the dimensionality reducing properties of PAA, and
It provably lower bounds PAA, and PAA provably lower bounds the original time series. Therefore, through the transitivity relationship, SAX provably lower bounds the original time series.

Having the SAX representation offers many benefits in addition to those mentioned previously. For instance, when extracting subsequences of size $n$ from the original time series $T$, a sliding window approach can be used. However, this means $|T| - n + 1$ subsequences must be stored. When working with the SAX representation, however, the symbolic representation generalizes many subsequences such that many subsequences that were distinct in the original time series are identical in SAX. This allows one to compress the subsequence data using techniques like run-length encoding. This compressed form may be quite advantageous, especially if memory constraints are a factor.

To transform a time series into SAX, first the time series must be transformed into PAA. Next, an additional discretizing transformation is applied to the PAA. This is done in such a way as to make each discrete symbol appear with equal probability in the string. This can be easily done if we observe that the values of the normalized time series are approximately normally distributed; we therefore simply need to partition the possible values into $a$ intervals, where each interval has an equal probability of occurring (thus, for example, the interval containing the mean will likely be relatively narrow). Here $a$ is the number of discrete symbols in our alphabet, and larger values of $a$ correspond to more granular approximations. Determining where the intervals begin and end can be easily done, since the distribution is normal. Once the intervals are calculated, the PAA can be discretized into a string of symbols from $\{a, b, c, \ldots\}$, by assigning symbol $a$ to PAA coefficients that fall within the first interval, symbol $b$ to PAA coefficients that fall within the second interval, and so on until every PAA coefficient has been discretized. At this point, the original time series has been transformed into SAX.
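The steps just described can be sketched in a few lines of R. This is a minimal illustration rather than the authors' implementation: the function name sax_string is hypothetical, and the sketch assumes the series length is a multiple of the number of PAA frames w and that the alphabet size a is at most 26.

```r
# Hypothetical helper: convert a numeric series to a SAX word.
# Assumes length(x) is a multiple of w and a <= 26.
sax_string <- function(x, w, a)
{
    z <- (x - mean(x)) / sd(x)              # z-normalize the series
    paa <- colMeans(matrix(z, ncol = w))    # PAA: mean of each of w equal-length frames
    breaks <- qnorm(seq_len(a - 1) / a)     # a-1 breakpoints giving a equiprobable N(0,1) intervals
    letters[findInterval(paa, breaks) + 1]  # first interval -> "a", second -> "b", ...
}

sax_string(c(1, 2, 3, 4, 5, 6, 7, 8), w = 4, a = 3)
```

The breakpoints come from the standard normal quantile function, which is what makes each symbol equiprobable under the normality assumption.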

To show that the SAX distance measure can be made to lower bound the distance on the original time series, the authors first present a distance measure for the PAA, which is similar to the Euclidean distance formula:

$$ \operatorname{DR}(\bar{Q},\bar{C}) = \sqrt{\frac{n}{w} \sum_{i=1}^w (\bar{q}_i-\bar{c}_i )^2}. $$

$\operatorname{DR}$ lower bounds the Euclidean distance formula for the original time series, where $\bar{Q}$ and $\bar{C}$ represent the PAA transformation of time series $Q$ and $C$. With this formula in mind, next they present the distance formula for SAX:

$$ \operatorname{MINDIST}(\hat{Q},\hat{C})=\sqrt{\frac{n}{w} \sum_{i=1}^w \left(\operatorname{dist}(\hat{q}_i,\hat{c}_i)\right)^2}, $$

where $\hat{Q}$ and $\hat{C}$ are the SAX strings of $Q$ and $C$. The authors observe that the above two formulas are identical, except for the presence of the inner $\operatorname{dist}$ function, which operates on pairs of symbols. Since $\operatorname{dist}(\hat{q}_i, \hat{c}_i)$ lower-bounds $|\bar{q}_i-\bar{c}_i|$, $\operatorname{MINDIST}$ must lower-bound $\operatorname{DR}$. And, by transitivity, $\operatorname{MINDIST}$ therefore lower-bounds the distance measure for the original time series.
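The first link in this chain, that $\operatorname{DR}$ lower-bounds the Euclidean distance, is easy to check numerically. The R sketch below uses two made-up series (a sine and a cosine, chosen only for illustration) and a hypothetical helper paa; the second series is called s because c is a built-in function in R.

```r
# PAA: mean of each of w equal-length frames (assumes length(x) is a multiple of w)
paa <- function(x, w) colMeans(matrix(x, ncol = w))

n <- 64
w <- 8
tt <- seq(0, 2 * pi, length.out = n)
q <- sin(tt)        # two illustrative series
s <- cos(2 * tt)

euclid <- sqrt(sum((q - s)^2))
dr <- sqrt(n / w * sum((paa(q, w) - paa(s, w))^2))
c(DR = dr, Euclidean = euclid)   # DR should not exceed the Euclidean distance
```

The bound holds frame by frame: within a frame of $n/w$ points, the sum of squared differences is at least $n/w$ times the squared difference of the frame means.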

The authors then go on to show how the dist function can be a fast table lookup, where the table is indexed by pairs of symbols from $\{a, b, \ldots\}$. Each cell in the table represents the distance between the row symbol and the column symbol, and its value can be calculated with a straightforward formula.
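A sketch of such a lookup table in R follows. The formula itself is not reproduced in this review, so the rule used here is an assumption based on my reading of the paper: identical or adjacent symbols are at distance zero, and otherwise the distance is the gap between the breakpoints that separate the two symbols. The names sax_dist_table and mindist are hypothetical.

```r
# Hypothetical helper: build the a-by-a symbol distance table.
sax_dist_table <- function(a)
{
    b <- qnorm(seq_len(a - 1) / a)   # breakpoints beta_1 < ... < beta_{a-1}
    tab <- matrix(0, a, a, dimnames = list(letters[1:a], letters[1:a]))
    for (r in 1:a) for (k in 1:a)
    {
        # assumed rule: equal or adjacent symbols are at distance 0
        if (abs(r - k) > 1) tab[r, k] <- b[max(r, k) - 1] - b[min(r, k)]
    }
    tab
}

# MINDIST between two SAX words (character vectors) of a length-n series
mindist <- function(qhat, chat, n, tab)
{
    sqrt(n / length(qhat)) * sqrt(sum(tab[cbind(qhat, chat)]^2))
}

tab <- sax_dist_table(4)
round(tab, 2)
mindist(c("a", "b", "d"), c("d", "b", "a"), n = 12, tab = tab)
```

Because the table depends only on the alphabet size, it can be computed once and reused for every pair of strings.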

Next, the authors conduct a number of experiments to empirically test the validity and performance of SAX when compared to the traditional Euclidean distance formula (on the original time series data) and other representations.

For instance, they show the experimental results generated with hierarchical clustering. The results reveal that the clusters generated from the SAX representation (using MINDIST as the distance measure) are the same as the clusters generated from the original time series (using the classical Euclidean distance formula). Other symbolic representations performed poorly by comparison, and the authors suggest that the superior performance of SAX can be attributed to the smoothing effect caused by dimensionality reduction.

They discuss the results of other experiments as well. In each case, SAX performs comparatively well, and in some cases it does better than the Euclidean distance measure on the original time series.

Next, the authors demonstrate the advantages of having a symbolic representation of a time series. For instance, motif discovery (looking for subsequences with the same general pattern) can be accomplished by hashing subsequences into buckets (which requires discrete data), using a random subset of the subsequence as the key, then searching through the buckets. Note that since SAX provides a symbolic representation, it tends to be resilient to noise (it is not over-fitted to the noise) because the discrete values smooth out most of the variations caused by noise.

Finally, the authors present their conclusion. They emphasize that their experimental results demonstrate that SAX is competitive with or superior to other approaches on a variety of classification and clustering problems. Furthermore, they point out that the symbolic nature of SAX opens it up to other domains, like motif discovery, which cannot be done on real-valued data. They close with a few remarks on future directions to pursue with SAX; notably, they are curious to see how well other data mining algorithms (that require discrete representations) do when paired with SAX.

Analysis and Critique

The symbolic nature of SAX opens up the possibility of using many fast and efficient algorithms and data structures (that require discrete representations) developed in other disciplines which are unavailable to real-valued representations, like PAA or DFT. However, SAX does so while still providing a lower bound on the distance measures of the original time series, unlike other symbolic representations discussed in the paper. This suggests that SAX is offering a reasonable approximation (to whatever degree is required) of the original time series, which gives us confidence in its results. SAX allows us to use its simplified, dimensionally reduced transformation in memory-constrained environments, reasonably confident that the results acquired on its reduced data set will be applicable to the larger, original data set. Since time series data sets tend to be quite large, this advantage should not be overlooked.

However, I would argue the more important properties of SAX have more to do with its symbolic nature and less to do with its dimensionally reduced form. Since it offers an “accurate” (according to the MINDIST lower bounding property) symbolic representation, it is able to effectively generalize the time series data points, permitting more sophisticated analysis of time series that is resilient to noise. For instance, the motif discovery algorithm, PROJECTION [2], is quite promising, and has applicability to a wide range of domains. This algorithm is only possible if the time series has been discretized as required by the hashing function.

Unfortunately, in parts the paper was more difficult to read than it needed to be. This is, of course, not uncommon. While I feel I was able to understand their work, at times I had to take a second, third, or even fourth look at the text to adequately grasp the material presented. In some cases, simple fixes like providing better examples would have sufficed. For instance, in their paper, much of the discussion in Section 3 was about how to construct a symbolic representation of a time series. For much of the discussion, they used an alphabet size of three (such that there were only three possible values that a data point could have, a, b, or c). However, the first time they illustrate the dist table (Table 4) for facilitating “distance” measures between symbolic pairs, they use an alphabet of size four. In particular, Figure 5 used an alphabet of size three, and it would have been convenient to be able to compare it to Table 4.

Also, they do not adequately address many of the topics mentioned in their abstract and introduction. For instance, they mention streaming in the title of their paper, and throughout the abstract, but then barely mention it again throughout the remainder of the paper. I strongly believe that SAX is amenable to streaming algorithms, but the authors could have done a better job of demonstrating this.

Conclusion

Overall, despite my criticisms, the paper effectively conveys the main thrust of their accomplishment, SAX, and how it can be put to use. Furthermore, the authors seem to be addressing a rather urgent need (in 2003 and even today), especially in light of the exponentially growing volume of time series data being generated. As the saying goes, we are increasingly data rich but information poor. Thus, we need more effective ways to analyze and find patterns in these huge data sets. The authors of [1] make a convincing case that SAX will help us towards this end.

References

[1] Lin, J., Keogh, E., Lonardi, S. & Chiu, B. (2003). A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. pp. 2-11.

[2] Tompa, M. & Buhler, J. (2001). Finding Motifs Using Random Projections. In Proceedings of the 5th International Conference on Computational Molecular Biology. Montreal, Canada, Apr 22-25. pp. 67-74.

SIUe - Computational Statistics (STAT 575) - Problem Set 1

Sat, 30 Oct 2021 00:00:00 +0000

Problem 1
Problem 2
- Solution
Problem 3
- Solution: part (a)
- Solution: part (b)
Problem 4
- Solution
Problem 5
- Solution
Problem 6
- Solution

Problem 1

Write your own code and find a solution to the equation $x^3 + x - 4 = 0$ using Newton’s method and the secant method. Compare the number of iterations needed for different starting values for the two methods.

Solution: Newton’s method

If we have some function $f : \mathbb{R} \to \mathbb{R}$ and we wish to find a root of $f$, i.e., an $x$ such that $f(x) = 0$, we may use Newton’s method.

We take an initial guess of the root as $x_0$ and try to refine it with a linear approximation of $f$ given by \[ L(f | x_0) = \lambda x.f(x_0) + f'(x_0)(x-x_0). \]

Now, we may approximate a root of $f$ with a root of $L(f|x_0)$, \[ L(f|x_0)(x) = 0, \] which may be rewritten as \[ f(x_0) + f'(x_0)(x-x_0) = 0. \] Solving for a root $x$ of $L(f|x_0)$ we get the result \[ x = x_0 - \frac{f(x_0)}{f'(x_0)}. \]

Hoping that $x$ results in a better approximation of the root of $f$ than $x_0$, we approximate $f$ with $L(f|x)$ and repeat the process.

Generalizing this result, we obtain the iterative procedure \[ x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}. \] We continue this process until some stopping condition is satisfied, e.g., $|x_{i+1} - x_i| < \epsilon$.

Letting $f(x) = x^3 + x - 4$ and $f'(x) = 3x^2 + 1$ and substituting into the above result, we get the result \[ x_{i+1} = x_i - \frac{x_i^3+x_i-4}{3x_i^2+1}. \]
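As a quick sanity check of this update rule, the first two iterations from a starting value of $x_0 = 1$ can be worked by hand: \[ x_1 = 1 - \frac{1^3 + 1 - 4}{3 \cdot 1^2 + 1} = 1 + \frac{2}{4} = 1.5, \qquad x_2 = 1.5 - \frac{1.5^3 + 1.5 - 4}{3 \cdot 1.5^2 + 1} = 1.5 - \frac{0.875}{7.75} \approx 1.387097. \] Any implementation of the iteration should reproduce these values when started from $x_0 = 1$.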

We implement a general procedure for Newton’s method:

newton_method <- function(f,dfdx,x0,eps,debug=TRUE)
{
    n <- 0
    repeat
    {
        x1 <- x0 - f(x0) / dfdx(x0)   # Newton update
        n <- n + 1
        if (debug) { cat("iteration=",n," x=",x1,"\n") }
        if (abs(x1 - x0) < eps)
        {
            break
        }
        x0 <- x1
    }
    # note: returns the previous iterate x0, which is within eps of x1
    list(root=x0,iter=n)
}

We take an initial guess of $x_0 = 1$ and $\epsilon = 1 \times 10^{-6}$ and run the following R code to solve for a root of $f$ using Newton’s method:

f <- function(x) { x^3 + x - 4 }
dfdx <- function(x) { 3*x^2 + 1 }
eps <- 1e-6
x0 <- 1
result <- newton_method(f,dfdx,x0,eps)

## iteration= 1 x= 1.5
## iteration= 2 x= 1.387097
## iteration= 3 x= 1.378839
## iteration= 4 x= 1.378797
## iteration= 5 x= 1.378797

We obtain $x \approx 1.3787967$ after $5$ iterations. When we plug that approximate root into $f$ we obtain the result $f(1.3787967) = 7.3825959\times 10^{-9}$, which is approximately zero.

Solution: Secant method

In Newton’s method, we linearize $f$ using the derivative of $f$. If, instead, we use the secant of $f$ with respect to two inputs $x_i$ and $x_{i+1}$, as given by \[ \frac{f(x_{i+1}) - f(x_i)}{x_{i+1}-x_i}, \] we get the iterative procedure \[ x_{i+2} = x_{i+1} - f(x_{i+1})\frac{x_{i+1}-x_i}{f(x_{i+1}) - f(x_i)}, \] which requires two initial values $x_0$ and $x_1$.
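To illustrate the secant update, take the hypothetical starting pair $x_0 = 0$ and $x_1 = 1$ with $f(x) = x^3 + x - 4$, so that $f(0) = -4$ and $f(1) = -2$. The first two steps are \[ x_2 = 1 - (-2)\frac{1 - 0}{-2 - (-4)} = 1 + \frac{2}{2} = 2, \qquad x_3 = 2 - f(2)\frac{2 - 1}{f(2) - f(1)} = 2 - \frac{6}{6 - (-2)} = 1.25, \] using $f(2) = 6$. The iterates first overshoot the root and then close in on it from the other side.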

We define the secant method as a function given by:

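secant_method <- function(f,x0,x1,eps,debug=TRUE)
{
    n <- 0
    repeat
    {
        # secant update: replace f'(x1) by the slope through (x0,f(x0)) and (x1,f(x1))
        x2 <- x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        n <- n + 1
        if (debug) { cat("iteration=",n," x=",x2,"\n") }
        if (abs(x2-x1) < eps)
        {
            break
        }
        x0 <- x1
        x1 <- x2
    }
    # note: returns the previous iterate x1, which is within eps of x2
    list(root=x1,iter=n)
}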

We let $x_0 = 0$, $x_1 = 1$, and keep everything else the same and run the secant method with the following R code:

x0 <- 0
x1 <- 1
result <- secant_method(f,x0,x1,eps,F)

We obtain $x \approx 1.3787965$ after $7$ iterations. When we plug that approximate root into $f$ we obtain the result $f(1.3787965) = -1.268648\times 10^{-6}$, which is approximately zero.

Note that this is $2$ more iterations than Newton’s method.

Comparison of Newton’s method versus secant method

We run both methods from a grid of $10001$ starting values, evenly spaced over $[0,4]$, to get a better view of how the two methods, Newton and secant, compare over many different initial guesses. For the secant method, the second starting value is $x_0 + 1$.

We generate the data with:

n <- 10000
from <- 0
to <- 4
by <- (to-from)/n
starts <- seq(from=from, to=to, by=by)   # n+1 grid points
newt_sols <- numeric(length(starts))
sec_sols <- numeric(length(starts))
i <- 1
for (x0 in starts)
{
    newt_sols[i] <- newton_method(f,dfdx,x0,eps,F)$iter
    sec_sols[i] <- secant_method(f,x0,x0+1,eps,F)$iter
    i <- i + 1
}

We summarize the results and report them with:

cat("mean iterations\n",
"newton => ", mean(newt_sols), "\n",
"secant => ", mean(sec_sols), "\n")

## mean iterations
## newton => 5.819618
## secant => 7.231077

We see that Newton’s method, on average, requires $1.4114589$ fewer iterations before the stopping condition is satisfied.
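As an independent check on both hand-coded solvers, the root can also be located with base R's uniroot, which uses a bracketing method rather than either of the two iterations above. The interval $[0, 4]$ is my choice; it works because $f$ changes sign over it.

```r
f <- function(x) { x^3 + x - 4 }
# f(0) = -4 < 0 and f(4) = 64 > 0, so [0, 4] brackets a root
check <- uniroot(f, interval = c(0, 4), tol = 1e-10)
check$root
```

The value returned should agree with the Newton and secant estimates to within the tolerance used.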

Problem 2

Poisson regression. The Ache hunting data set has $n = 47$ observations recording the number of monkeys killed over a period of days by each hunter, along with the hunter’s age. It is of interest to estimate and quantify the monkey kill rate as a function of hunter’s age. Hunting prowess confers elevated status among the group, so a natural question is whether hunting ability improves with age, and at which age hunting ability is best.

Hand-code Newton-Raphson in R to fit the Poisson regression model \[ \mathit{monkeys}_i \sim \operatorname{Pois}\left(\exp(\log \mathit{days}_i + \theta_1 + \theta_2 \mathit{age}_i + \theta_3 \mathit{age}_i^2)\right). \]

Feel free to use jacobian and hessian in the numDeriv R package. You may need a set of crude starting values. I ran a linear regression for the “empirical log-rates” and got starting values $(5.99, 0.167, 0.001)$. Feel free to use those. Compare your result with the glm() function in R using

glm(monkeys~age+I(age^2), family="poisson", offset=log(days), data=d)

Solution

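We load the data, build the design matrix, and define the log-likelihood:

d <- read.table("ache.txt", header=TRUE)
n <- length(d$age)
X <- cbind(rep(1,n), d$age, d$age^2)   # columns: intercept, age, age^2
loglike <- function(theta)
{
    # Poisson log-likelihood with offset log(days)
    sum(dpois(d$monkeys,exp(log(d$days)+X%*%theta),log=TRUE))
}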

We generalize the univariate Newton’s method in Problem 1 to the multivariate case. We implement the multivariate Newton-Raphson method with numerical hessian and jacobian with the following R code:

library(numDeriv)
newton_raphson_method <- function(x0,f,eps)
{
    n <- 0
    repeat
    {
        # Newton-Raphson update: x1 = x0 - H(x0)^{-1} grad f(x0)
        x1 <- x0 - solve(hessian(f,x0))%*%t(jacobian(f,x0))
        n <- n + 1
        # report progress every 7th iteration
        if (n %% 7 == 0) { cat("iteration=",n," theta=",x1,"\n") }
        if (max(abs(x1 - x0)) < eps)
        {
            break
        }
        x0 <- x1
    }
    list(root=x1,iter=n)
}

We use the multivariate Newton-Raphson method to find the MLE of $\theta$ in the Poisson regression model:

eps <- 1e-6
theta0 <- c(5.99, 0.167, 0.001) # starting values
theta_mle <- newton_raphson_method(theta0,loglike,eps)$root

## iteration= 7 theta= -1.01133 0.1670442 0.0009996348
## iteration= 14 theta= -7.590958 0.1532685 0.001112005
## iteration= 21 theta= 1.438482 -0.2696711 0.003827083
## iteration= 28 theta= -5.484246 0.1246477 -0.001203418

The MLE of $\theta$ is given by:

theta_mle

## [,1]
## [1,] -5.484245904
## [2,] 0.124647667
## [3,] -0.001203418

We compare the results with the built-in glm() function:

glm(monkeys~age+I(age^2),family="poisson", offset=log(days),data=d)$coefficients

## (Intercept) age I(age^2)
## -5.484245904 0.124647667 -0.001203418

The hand-coded approach and the built-in approach obtain the same point estimate $\hat\theta = (-5.4842, 0.1246, -0.0012)'$.
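The numerical derivatives from numDeriv are convenient, but for this model the score and Hessian have closed forms: with $\mu_i = \mathit{days}_i \exp(x_i'\theta)$, the gradient is $X'(y - \mu)$ and the Hessian is $-X' \operatorname{diag}(\mu) X$. The sketch below uses them in the same Newton-Raphson iteration. Since the Ache data file is not reproduced here, it runs on synthetic data generated only for illustration; the function name poisson_newton, the simulated coefficients, and the zero starting values are all assumptions of the sketch.

```r
# Newton-Raphson for Poisson regression with a log link and an offset,
# using the closed-form score and Hessian instead of numerical derivatives.
poisson_newton <- function(X, y, offset, theta0, eps = 1e-8, max_iter = 100)
{
    theta <- theta0
    for (iter in 1:max_iter)
    {
        mu <- as.vector(exp(offset + X %*% theta))   # fitted means
        score <- t(X) %*% (y - mu)                   # gradient of the log-likelihood
        hess <- -t(X) %*% (mu * X)                   # Hessian: -X' diag(mu) X
        step <- solve(hess, score)
        theta <- theta - as.vector(step)
        if (max(abs(step)) < eps) break
    }
    list(root = theta, iter = iter)
}

# Synthetic data for illustration only (not the Ache data)
set.seed(1)
m <- 200
x <- runif(m, -1, 1)
days <- sample(1:10, m, replace = TRUE)
X <- cbind(1, x, x^2)
y <- rpois(m, days * exp(X %*% c(-1, 0.5, -0.3)))

fit <- poisson_newton(X, y, log(days), theta0 = c(0, 0, 0))
ref <- coef(glm(y ~ x + I(x^2), family = "poisson", offset = log(days)))
cbind(newton = fit$root, glm = ref)
```

The analytic version avoids finite-difference error in the derivatives, at the cost of being specific to this model.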

Problem 3

Logistic and Cauchy distributions are well-suited to the inverse transform method. For each of the following, generate $10{,}000$ random variates using the inverse transform. Compare your program with the built-in R functions rlogis() and rcauchy(), respectively:

Solution: part (a)

Standard logistic distribution \[ F(x) = \frac{1}{1+e^{-x}} \]

Solve for $x$ in \[\begin{align*} u &= F(x)\\ u &= \frac{1}{1+e^{-x}}\\ x &= \log(u/(1-u)). \end{align*}\]

n <- 10000
us <- runif(n)
d1 <- density(log(us/(1-us)))
d2 <- density(rlogis(n))
plot(d1,col="blue",main="comparison of density plots")
lines(d2,col="red")
legend(x="topright",legend=c("inverse method","built-in"),col=c("blue","red"),
pch=c("-","-"))

Solution: part (b)

Standard Cauchy distribution \[ F(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(x) \]

Solve for $x$ in \[\begin{align*} u &= F(x)\\ u &= \frac{1}{2} + \frac{1}{\pi} \arctan(x)\\ x &= \tan(\pi(u-1/2)). \end{align*}\]

n <- 10000
us <- runif(n)
d1 <- tan(pi*(us-0.5))
d2 <- rcauchy(n=n)
d1 <- d1[d1 > -20 & d1 < 20]
d2 <- d2[d2 > -20 & d2 < 20]
c1 <- rgb(0,0,255, max = 255, alpha = 50, names = "blue")
c2 <- rgb(255,0,0, max = 255, alpha = 50, names = "red")
par(mfrow=c(1,2))
hist(d1,col=c1,freq=F,breaks=50,main="inverse-transform method vs built-in")
hist(d2,col=c2,add=T,freq=F,breaks=50)
legend(x="topright",legend=c("inv","builtin"),col=c("blue","red"),pch=c("-","-"))
plot(density(d1), col="blue",main="density plot")
lines(density(d2), col="red")
legend(x="topright",legend=c("inv","builtin"),col=c("blue","red"),pch=c("-","-"))
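Beyond comparing plots, the two hand-derived inverse CDFs can be checked directly against base R's quantile functions qlogis and qcauchy, which compute the same inverses. The seed and sample size below are arbitrary choices for the sketch.

```r
set.seed(42)
u <- runif(10000)
# hand-derived inverse CDFs versus base R's quantile functions
logis_ok  <- all.equal(log(u / (1 - u)), qlogis(u))
cauchy_ok <- all.equal(tan(pi * (u - 0.5)), qcauchy(u))
c(logistic = isTRUE(logis_ok), cauchy = isTRUE(cauchy_ok))
```

Agreement here confirms the algebra of the inversions, independently of any sampling noise in the density plots.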

Problem 4

Generate $10{,}000$ random variables from a $\operatorname{Geometric}(p)$ distribution using Bernoulli trials.

Solution

A random variable $X \sim \operatorname{Geometric}(p)$ is given by the number of i.i.d. trials needed to obtain the first success, where success occurs with probability $p$ on each trial.

Thus, we may simulate this distribution with the following R code:

# simulate n realizations of geometric(p)
rgeo <- function(n,p)
{
    outcomes <- numeric(n)
    for (i in 1:n)
    {
        trials <- 0
        repeat
        {
            trials <- trials + 1
            # one Bernoulli(p) trial; stop at the first success
            if (rbinom(1,1,p) == 1)
            {
                break
            }
        }
        outcomes[i] <- trials
    }
    outcomes
}

When we use this function to draw a sample of $n=10000$ geometrically distributed random variables with $p=0.2$, we obtain:

p <- .2
n <- 10000
sample <- rgeo(n,p)
cat("the mean should be approximately 1/p =", 1/p, " and we obtain a mean of ", mean(sample))

## the mean should be approximately 1/p = 5 and we obtain a mean of 5.0425
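For comparison, here is a sketch of two vectorized alternatives. Base R's rgeom counts the failures before the first success rather than the number of trials, so it must be shifted by one to match the definition used here. The second line applies the inverse transform to the geometric CDF, using $P(X > k) = (1-p)^k$. The seed and sample size are arbitrary.

```r
set.seed(7)
p <- 0.2
n <- 100000
x_builtin <- rgeom(n, p) + 1                      # shift: rgeom counts failures, not trials
x_inverse <- ceiling(log(runif(n)) / log(1 - p))  # inverse transform on the geometric CDF
c(builtin = mean(x_builtin), inverse = mean(x_inverse), theory = 1 / p)
```

Both avoid the explicit loop over Bernoulli trials, which matters for small $p$, where each draw needs many trials.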

Problem 5

Generate random values from a Standard Half Normal distribution with pdf, \[ f(x) = \frac{2}{\sqrt{2 \pi}} e^{-x^2/2} , x > 0. \]

For the candidate pdf, choose the exponential density with rate $1$. Verify that your method works via a plot of the true density, and a histogram of the generated values.

Solution

We are given the density of the standard half-normal distribution, \[ \operatorname{dhalfnormal}(x) = \frac{2}{\sqrt{2 \pi}} e^{-x^2/2}, x > 0. \]

We model this density with the following R code:

# density for standard half-normal
dhalfnormal <- function(x) { 2/sqrt(2*pi)*exp(-x^2/2) }

We use the exponential distribution $\operatorname{EXP}(\lambda=1)$, with density $g$, as the proposal, and thus we first find the constant $c$ satisfying \[ c = \max \left\{ \frac{\operatorname{dhalfnormal}(x)}{\operatorname{dexp}(x|\lambda=1)} \,\middle|\, x > 0 \right\}, \] which is found to be approximately $c = 1.315489247$.
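This constant can also be obtained in closed form, which gives a check on the numerical value. With $g(x) = e^{-x}$, the ratio of the two densities is \[ \frac{\operatorname{dhalfnormal}(x)}{g(x)} = \sqrt{\frac{2}{\pi}}\, e^{x - x^2/2}, \] and the exponent $x - x^2/2$ is maximized at $x = 1$, so \[ c = \sqrt{\frac{2}{\pi}}\, e^{1/2} = \sqrt{\frac{2e}{\pi}} \approx 1.3155. \] Since both densities are normalized, the expected acceptance rate is $1/c \approx 0.76$.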

We implement the standard half-normal sampler, $\operatorname{rhalfnormal}$, using the acceptance-rejection sampling technique with the following R code:

# accept-rejection sampling for standard half-normal
# using exp(rate=1) as the proposal
rhalfnormal <- function(N)
{
    c <- 1.315489247
    xs <- numeric(N)
    k <- 1
    repeat
    {
        x <- rexp(n=1)
        # accept x with probability f(x) / (c * g(x))
        if (runif(n=1) < dhalfnormal(x)/(c*dexp(x)))
        {
            xs[k] <- x
            if (k == N)   # stop only after the N-th sample has been stored
            {
                break
            }
            k <- k + 1
        }
    }
    xs
}

We draw $n=100000$ samples from the standard half-normal distribution and plot a histogram of the sample, with the true density overlaid in red, using the following R code:

n <- 100000
sample <- rhalfnormal(n)
hist(sample,freq=F,breaks=50,main="standard half-normal")
curve(dhalfnormal(x),add=TRUE,col="red")

We see that the histogram is compatible with a sample drawn from the overlaid density.

Problem 6

Use accept-reject to sample from this bimodal density: \[ f(x) \propto 3 e^{-0.5(x+2)^2} + 7 e^{-0.5(x-2)^2} \] The normalizing constant is $25.066$. For your proposal $g(·)$, use a $N(0, 2^2)$ distribution. Verify that your method works via a plot of the true normalized density, and a histogram of the generated values.

Solution

We are given the kernel of the bimodal distribution of interest, \[ \operatorname{ker-bimodal}(x) = 3 e^{-0.5(x+2)^2} + 7 e^{-0.5(x-2)^2}, \] with the normalizing constant $C = 25.0663$ and thus the pdf for the bimodal is given by \[ \operatorname{dbimodal}(x) = \frac{\operatorname{ker-bimodal}(x)}{C}. \]

We model these two functions with the following R code:

# kernel and density for the bimodal distribution
kerbimodal <- function(x) { 3*exp(-0.5*(x+2)^2) + 7*exp(-0.5*(x-2)^2) }
kerbimodal.C <- 25.0663
dbimodal <- function(x) { kerbimodal(x) / kerbimodal.C }

We use the normal distribution $N(\mu=0,\sigma^2=2^2)$, with density $g$, as the proposal, and thus we first find the constant $c$ satisfying \[ c = \max \left\{ \frac{\operatorname{ker-bimodal}(x)}{g(x|\mu=0,\sigma^2=2^2)} \,\middle|\, x \in \mathbb{R} \right\}, \] which is found to be approximately $c = 68.35212$.
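One way to find this constant numerically is with base R's optimize. The ratio has a local maximum near each mode, so a single search over the whole real line could settle on the wrong peak; the sketch below searches each side of zero separately. The split at zero and the search limits of $\pm 6$ are my assumptions, not part of the original solution.

```r
kerbimodal <- function(x) { 3*exp(-0.5*(x+2)^2) + 7*exp(-0.5*(x-2)^2) }
ratio <- function(x) { kerbimodal(x) / dnorm(x, mean=0, sd=2) }

# search for the peak of the ratio on each side of zero
right <- optimize(ratio, interval = c(0, 6), maximum = TRUE)
left  <- optimize(ratio, interval = c(-6, 0), maximum = TRUE)
c(c_right = right$objective, c_left = left$objective)

# implied acceptance rate: normalizing constant of the kernel divided by c
25.0663 / right$objective
```

The larger of the two peaks is the envelope constant. Because the sampler compares against the unnormalized kernel, the acceptance rate is $C/c$ rather than $1/c$.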

We implement the bimodal sampler, $\operatorname{rbimodal}$, using the acceptance-rejection sampling technique with the following R code:

# accept-rejection sampling for bimodal distribution with density dbimodal
# using normal(0,2^2) as the proposal.
rbimodal <- function(N)
{
    c <- 68.35212
    xs <- numeric(N)
    k <- 1
    repeat
    {
        x <- rnorm(n=1,mean=0,sd=2)
        # accept x with probability ker(x) / (c * g(x))
        if (runif(n=1) < kerbimodal(x)/(c*dnorm(x,mean=0,sd=2)))
        {
            xs[k] <- x
            if (k == N)
            {
                break
            }
            k <- k + 1
        }
    }
    xs
}

We draw $n=100000$ samples from the bimodal distribution and plot a histogram of the sample, with the true density overlaid in red, using the following R code:

n <- 100000
sample <- rbimodal(n)
hist(sample,freq=F,breaks=50,main="bimodal")
curve(dbimodal(x),add=TRUE,col="red")

We see that the histogram is compatible with a sample drawn from the overlaid density.

SIUe - Computational Statistics (STAT 575) - Problem Set 2

Sat, 30 Oct 2021 00:00:00 +0000

Problem 1
- Part (a)
- Part (b)
- Part (c)
  - E-step
  - M-step
- Part (d)
Problem 2
- Part (a)
  - E-step
  - M-step
- Part (b)
- Part (c)

Problem 1

Derive the E-M algorithm for right-censored normal data with known variance, say $\sigma^2 = 1$. Consider $Y_i$’s that are i.i.d. from a $N(\theta, 1)$, $i=1,2,\ldots, n$. We observe $(x_1, \ldots, x_n)$ and $(\delta_1, \ldots, \delta_n)$, where $x_i = \min(y_i,c)$, and $\delta_i = I(y_i < c)$. Let $C$ be the total number of censored (incomplete) observations. We denote the missing data as $\{Z_i : \delta_i = 0\}$.

Part (a)

Derive the complete log-likelihood, $\ell(\theta | Y)$.

The unobserved random variates $\{Y_i\}$ are i.i.d. normally distributed, \[ Y_i \sim f_{Y_i}(y | \theta) \] where \[ f_{Y_i}(y | \theta) = (2 \pi)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(y - \theta)^2\right). \]

The likelihood function is therefore \[\begin{align} L(\theta | \{y_i\}) &= \prod_{i=1}^{n} (2 \pi)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(y_i - \theta)^2\right)\\ &= (2 \pi)^{-\frac{n}{2}} \exp\left(-\sum_{i=1}^{n} \frac{1}{2}(y_i - \theta)^2\right). \end{align}\]

Taking the logarithm of $L$, \[\begin{align} \ell(\theta | \{y_i\}) &= \log L(\theta | \{y_i\})\\ &= -\frac{n}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta)^2\\ &= -\frac{n}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^{n} y_i^2 + \theta \sum_{i=1}^{n} y_i - \frac{n}{2} \theta^2. \end{align}\]

Anticipating that we will be maximizing the complete log-likelihood with respect to $\theta$, we put any terms that are not a function of $\theta$ into $k$, obtaining the result \[ \ell(\theta | \{y_i\}) = k + \theta \sum_{i=1}^{n} y_i - \frac{n}{2} \theta^2. \]

Part (b)

Show the conditional expectation \[ E(Y | x, \delta=1, \theta^{(t)}) = x \] and \[ E(Y | x, \delta=0, \theta^{(t)}) = E(Y | Y > x) = \theta^{(t)} + \frac{\phi(x-\theta^{(t)})}{1 - \Phi(x-\theta^{(t)})} \] where $\phi$ and $\Phi$ are the pdf and cdf of the standard normal.

If $\delta = 1$, the observation is uncensored, so $Y$ is known to have realized the value $x$. Since the expectation of a constant $x$ is $x$, we have $E(Y | Y = x) = x$.

If $\delta = 0$, $Y$ is censored, i.e., $Y > x$. To take its expectation, we first need to derive the conditional distribution of $Y$ given $Y > x$ and $\theta^{(t)}$.

The probability $\Pr(Y \leq y | Y > x)$ for $y > x$ is given by \[ \Pr(Y \leq y | Y > x) = \Pr(x < Y \leq y) / \Pr(Y > x) \] which may be rewritten as \[ \Pr(Y \leq y | Y > x) = \frac{F_Y(y | \theta^{(t)}) - F_Y(x | \theta^{(t)})}{1 - F_Y(x | \theta^{(t)})}, \] where $F_{Y}(\cdot | \theta^{(t)})$ is the cdf of the normal distribution with $\sigma=1$ and $\mu=\theta^{(t)}$.

We may rewrite $F_{Y|\theta^{(t)}}$ in terms of the standard normal, \[ F_{Y}(y|\theta^{(t)}) = \Phi(y - \theta^{(t)}), \] and thus we may rewrite the conditional distribution of $Y | Y > x$ as \[ \Pr(Y \leq y | Y > x) = \frac{\Phi(y - \theta^{(t)}) - \Phi(x - \theta^{(t)})}{1 - \Phi(x - \theta^{(t)})} \] and thus after further simplifying, we obtain the cdf of $Y | x$, \[ F_{Y|x}(y|\theta^{(t)}) = 1 - \frac{1- \Phi(y - \theta^{(t)})}{1 - \Phi(x - \theta^{(t)})} \] which has a density given by \[ f_{Y}(y|x,\theta^{(t)}) = \frac{\phi(y - \theta^{(t)})}{1 - \Phi(x - \theta^{(t)})} I(y > x). \]

The expectation of $Y|(x,\theta^{(t)})$ is given by \[\begin{align} E(Y|x,\theta^{(t)}) &= \int_{x}^{\infty} y f_{Y}(y|x,\theta^{(t)}) dy\\ &= \int_{x}^{\infty} y \left(\frac{\phi(y - \theta^{(t)})}{1 - \Phi(x - \theta^{(t)})}\right) dy\\ &= \frac{1}{{1 - \Phi(x - \theta^{(t)})}}\int_{x}^{\infty} y \phi(y - \theta^{(t)}) dy. \end{align}\]

Analytically, this is a tricky integration problem. Certainly, it would be trivial to numerically integrate this to obtain a solution, but we seek a closed-form solution.

I searched online, and discovered an interesting way to tackle this integration problem.

Let $f$ and $F$ respectively denote the pdf and cdf of the normally distributed $Y$. Then, \[ \frac{df}{dy} = -(y - \theta) f(y) \] and \[ \int_{a}^{b} \frac{df}{dy} dy = f(b) - f(a). \]

Then, \[\begin{align} E(Y|x,\theta^{(t)}) &= \frac{1}{1 - F(x)}\int_{x}^{\infty} y f(y) dy\\ &= -\frac{1}{1 - F(x)}\int_{x}^{\infty} - (y -\theta^{(t)}) f(y) dy + \frac{\theta^{(t)}}{1-F(x)}\int_{x}^{\infty} f(y) dy\\ &= -\frac{1}{1 - F(x)}\int_{x}^{\infty} \frac{df}{dy} dy + \frac{\theta^{(t)}}{1-F(x)} (1-F(x))\\ &= -\frac{1}{1 - F(x)}\left(f(\infty) - f(x)\right) + \theta^{(t)}\\ &= \frac{f(x)}{1 - F(x)} + \theta^{(t)}. \end{align}\]

We may rewrite the last line as \[ E(Y|x,\theta^{(t)}) = \theta^{(t)} + \frac{\phi(x-\theta^{(t)})}{1 - \Phi(x-\theta^{(t)})}. \]
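As a sanity check on the closed form, the sketch below compares it with direct numerical integration for one arbitrary pair of values. The values of `theta.t` and `x` are hypothetical and chosen only for illustration.

```r
theta.t <- 0.3   # hypothetical current iterate theta^(t)
x <- 1.2         # hypothetical censoring point

# Closed-form conditional mean E(Y | Y > x).
closed.form <- theta.t + dnorm(x - theta.t) / (1 - pnorm(x - theta.t))

# The same quantity by numerical integration of y * f(y) over (x, Inf),
# divided by the tail probability Pr(Y > x).
tail.mass <- 1 - pnorm(x - theta.t)
numerical <- integrate(function(y) y * dnorm(y - theta.t),
                       lower = x, upper = Inf)$value / tail.mass
```

The two values should agree up to the tolerance of `integrate`.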

Part (c)

Derive the $E$-step and $M$-step using parts (a) and (b). Give the updating equation.

E-step

The $E$-step entails taking the conditional expectation of the complete log-likelihood function $\ell(\theta | \{Y_i\})$ given the observed data $\{x_i\}$ and $\{\delta_i\}$.

\[\begin{align} Q(\theta | \theta^{(t)}) &= E_{Y_i | x_i,\delta_i}\left(\ell(\theta | \{Y_i\})\right)\\ &= E_{Y_i | x_i,\delta_i}\left(k + \theta \sum_{i=1}^{n} Y_i - \frac{n}{2} \theta^2\right)\\ &= k - \frac{n}{2}\theta^2 + \theta \sum_{i=1}^{n} E_{Y_i | x_i,\delta_i}(Y_i). \end{align}\]

We have already solved the expectation of $Y_i$ given $x_i$ and $\delta_i$. We rewrite $Q$ by substituting $E(Y_i | x_i, \delta_i)$ with its previously found solution, \[ Q(\theta | \theta^{(t)}) = k - \frac{n}{2}\theta^2 + \theta \sum_{i=1}^{n} \left[ \delta_i x_i + (1-\delta_i) \left(\theta^{(t)} + \frac{\phi(x_i-\theta^{(t)})}{1 - \Phi(x_i-\theta^{(t)})}\right) \right]. \]

Letting $C = \sum_{i=1}^{n} (1 - \delta_i)$, $R = \sum_{i=1}^{n} \delta_i x_i$, and separating out all terms that are independent of $\theta^{(t)}$, \[ Q(\theta | \theta^{(t)}) = k - \frac{n}{2}\theta^2 + C \theta \theta^{(t)} + R \theta + \theta \sum_{i=1}^{n} \frac{(1-\delta_i)\phi(x_i-\theta^{(t)})}{1 - \Phi(x_i-\theta^{(t)})}. \]

M-step

We wish to solve \[ \theta^{(t+1)} = \operatorname{arg\,max}_{\theta} Q(\theta | \theta^{(t)}). \] by solving \[ \left. \frac{d Q(\theta | \theta^{(t)})}{d \theta} \right\vert_{\theta=\theta^{(t+1)}} = 0, \] which may be written as \[ -n\theta^{(t+1)} + C \theta^{(t)} + R + \sum_{i=1}^{n} \frac{(1-\delta_i)\phi(x_i-\theta^{(t)})}{1 - \Phi(x_i-\theta^{(t)})} = 0. \]

Solving for $\theta^{(t+1)}$ obtains the updating equation \[ \theta^{(t+1)} = \frac{R}{n} + \frac{C}{n} \theta^{(t)} + \frac{1}{n}\sum_{i=1}^{n}\frac{(1-\delta_i)\phi(x_i-\theta^{(t)})}{1 - \Phi(x_i-\theta^{(t)})}. \] where \[ R = \sum_{i=1}^{n} \delta_i x_i \] and \[ C = \sum_{i=1}^{n} (1-\delta_i). \]
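To illustrate the updating equation, here is a minimal sketch that runs it on simulated right-censored data and compares the fixed point with a direct maximization of the observed-data log-likelihood. The simulated data, the true mean, the censoring point, the seed, and all function names are assumptions made for this example and are not part of the problem.

```r
set.seed(1)
n <- 500
theta.true <- 1   # hypothetical true mean
cutoff <- 1.5     # hypothetical censoring point c
y <- rnorm(n, mean = theta.true, sd = 1)
delta <- as.numeric(y < cutoff)   # 1 = observed, 0 = censored
x <- pmin(y, cutoff)

# One application of the updating equation.
em.update <- function(theta, x, delta) {
  xc <- x[delta == 0]        # censored values
  R <- sum(x[delta == 1])    # sum of the uncensored values
  C <- length(xc)            # number of censored values
  hazard <- dnorm(xc - theta) / (1 - pnorm(xc - theta))
  R / length(x) + C / length(x) * theta + sum(hazard) / length(x)
}

theta.em <- mean(x)   # starting value
for (iter in 1:1000) {
  theta.new <- em.update(theta.em, x, delta)
  converged <- abs(theta.new - theta.em) < 1e-10
  theta.em <- theta.new
  if (converged) break
}

# Observed-data log-likelihood: density terms for uncensored values,
# survival terms for censored values.
obs.loglik <- function(theta) {
  sum(dnorm(x[delta == 1], mean = theta, sd = 1, log = TRUE)) +
    sum(pnorm(x[delta == 0], mean = theta, sd = 1,
              lower.tail = FALSE, log.p = TRUE))
}
theta.direct <- optimize(obs.loglik, interval = c(-5, 5), maximum = TRUE)$maximum
```

If the updating equation is correct, `theta.em` and `theta.direct` should agree up to the tolerance of `optimize`.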

Part (d)

Use your algorithm on the V.A. data to find the MLE of $\mu$. Take the log of the event times first and standardize by sample standard deviation. You may simply use the censored data sample mean as your starting value.

In the following R code, we implement the updating equation derived in the previous step. We encapsulate the procedure in a function that takes as arguments an uncensored set, a censored set, a starting value ($\theta^{(1)}$), and an $\epsilon$ value to control the stopping condition.

# assuming the uncensored and censored data are distributed normally,
# we use the EM algorithm to derive an estimator given censored and uncensored
# data. the data are log-transformed and standardized by the sample standard
# deviation before iterating; the result is transformed back at the end.
mean_normal_censored_estimator_em <- function(uncensored,censored,theta,eps=1e-6,debug=T)
{
    dev <- sd(log(c(uncensored,censored)))
    censored <- log(censored) / dev
    uncensored <- log(uncensored) / dev
    theta <- log(theta) / dev
    n <- length(censored) + length(uncensored)
    C <- length(censored)
    R <- sum(uncensored)
    # sum of phi(x - theta) / (1 - Phi(x - theta)) over the censored observations
    s <- function(theta)
    {
        sum <- 0
        for (i in seq_len(C)) # seq_len avoids iterating over 1:0 when C = 0
        {
            num <- dnorm(censored[i],mean=theta,sd=1)
            denom <- 1-pnorm(censored[i],mean=theta,sd=1)
            sum <- sum + (num / denom)
        }
        sum
    }
    i <- 1
    repeat
    {
        theta.new <- R/n + C/n * theta + (1/n)*s(theta)
        if (debug==T) { cat("theta[", i, "] =",theta,", theta[", i+1, "] =",theta.new,"\n") }
        if (abs(theta.new - theta) < eps)
        {
            theta <- theta.new * dev
            theta <- exp(theta)
            return(theta)
        }
        i <- i + 1
        theta <- theta.new
    }
}

We apply this procedure to the indicated data set.

library(MASS) # has VA data
VAs <- subset(VA,prior==0)
censored <- VAs$status == 0
censored_xs <- VAs[censored,c("stime")]
uncensored_xs <- VAs[!censored,c("stime")]
mu <- mean(uncensored_xs)
cat("mean of the uncensored sample is ", mu, ".")

## mean of the uncensored sample is 112.1648 .

sol <- mean_normal_censored_estimator_em(uncensored_xs,censored_xs,mu)

## theta[ 1 ] = 3.857928 , theta[ 2 ] = 3.424258
## theta[ 2 ] = 3.424258 , theta[ 3 ] = 3.415443
## theta[ 3 ] = 3.415443 , theta[ 4 ] = 3.415286
## theta[ 4 ] = 3.415286 , theta[ 5 ] = 3.415283
## theta[ 5 ] = 3.415283 , theta[ 6 ] = 3.415283

sol

## [1] 65.2625

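We see that our estimate, transformed back to the original time scale, is $65.2624985$. (The estimate on the standardized log scale was $\hat{\theta} = 3.415283$.)

This value is somewhat lower than anticipated, which makes me suspect something is wrong with my updating equation. One possible explanation is that, because $\theta$ is estimated on the log scale, the back-transformed value estimates the geometric mean of the event times rather than the arithmetic mean, and a geometric mean is never larger than the corresponding arithmetic mean. If I have the time, I will revisit it.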

Problem 2

Problem 4.2

Part (a)

There are $N=1500$ gay men in the survey sample where $X_i$ denotes the $i$-th person's response to the number of risky sexual encounters he had in the previous $30$ days. Thus, we observe a sample $\vec{X} = (X_1,X_2,\ldots, X_N)$.

We assume there are $3$ groups in the population, denoted by $z=1$, $t=2$, and $p=3$. Group $1$ members report $0$ risky sexual encounters regardless of the truth, where the probability of being a member of group $1$ is denoted by $\alpha$.

Group $2$ members accurately report risky sexual encounters and represent typical behavior, where the probability of being a member of group $2$ is denoted by $\beta$. We assume this group’s number of sexual encounters follows a Poisson distribution with mean $\mu$.

Group $3$ members accurately report risky sexual encounters and represent high-risk behavior, where the probability of being a member of group $3$ is $\gamma = 1-\alpha-\beta$. We assume this group’s number of sexual encounters follows a Poisson distribution with mean $\lambda$.

This represents a finite mixture model with a pdf \[ X_i \sim f(x | \vec{\theta}) = \alpha I(x=0) + \beta \operatorname{POI}(x | \mu) + (1-\alpha-\beta)\operatorname{POI}(x | \lambda) \] with a parameter vector \[ \vec{\theta} = (\alpha,\beta,\mu,\lambda)'. \]

Let the uncertain group that the $i$-th person belongs to be denoted by $Z_i$. If we observe group membership, then \[\begin{align} X_i | Z_i = 1 &\sim I(x=0),\\ X_i | Z_i = 2 &\sim \operatorname{POI}(\mu),\\ X_i | Z_i = 3 &\sim \operatorname{POI}(\lambda), \end{align}\] where \[ Z_i \sim f_{Z_i}(z_i | \vec{\theta}) = \Pr(Z_i=z_i) = \begin{cases} \alpha & z_i = 1,\\ \beta & z_i = 2,\\ \gamma=1-\alpha-\beta & z_i = 3, \end{cases} \] and thus \[ f_{X_i,Z_i}(x_i,z_i | \vec{\theta}) = \alpha I(x_i = 0) I(z_i = 1) + \beta \operatorname{POI}(x_i | \mu) I(z_i = 2) + (1-\alpha-\beta) \operatorname{POI}(x_i | \lambda) I(z_i=3). \]

The likelihood function is thus given by \[ \mathcal{L}(\vec{\theta} | \vec{X}, \vec{Z}) = \prod_{i=1}^{N} f_{X_i,Z_i}(x_i,z_i | \vec{\theta}), \] which may be rewritten as \[ \mathcal{L}(\vec{\theta} | \vec{X}, \vec{Z}) = \left(\prod_{\{i | z_i = 1\}} \alpha I(x_i=0)\right) \left(\prod_{\{i | z_i = 2\}} \beta \frac{\mu^{x_i} e^{-\mu}}{x_i!}\right) \left(\prod_{\{i | z_i = 3\}} \gamma \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\right). \]

We wish to rewrite this so that the data is explicitly represented. First, we do the transformation \[ \mathcal{L}(\vec{\theta} | \vec{X}, \vec{Z}) = \left(\prod_{\{i | z_i = 1, x_i = 0 \}} \alpha\right) \prod_{k=0}^{16} \left(\prod_{\{i | z_i = 2, x_i = k\}} \beta \frac{\mu^{k} e^{-\mu}}{k!}\right) \prod_{k=0}^{16} \left(\prod_{\{i | z_i = 3, x_i = k\}} \gamma \frac{\lambda^{k} e^{-\lambda}}{k!}\right). \]

We let $n_{a,b}$ denote the (unobserved) cardinality of $\{i | z_i = a, x_i = b\}$, thus \[ \mathcal{L}(\vec{\theta} | \{n_{j,k}\}) = \alpha^{n_{1,0}} \prod_{k=0}^{16} \beta^{n_{2,k}} \frac{\mu^{k n_{2,k}} e^{-\mu n_{2,k}}}{(k!)^{n_{2,k}}} \prod_{k=0}^{16} \gamma^{n_{3,k}} \frac{\lambda^{k n_{3,k}} e^{-\lambda n_{3,k}}}{(k!)^{n_{3,k}}} \] is the complete likelihood. The complete log-likelihood is thus \[ \ell(\vec{\theta} | \{n_{j,k}\}) = n_{1,0} \log \alpha + \sum_{k=0}^{16} \log \left(\beta^{n_{2,k}} \frac{\mu^{k n_{2,k}} e^{-\mu n_{2,k}}}{(k!)^{n_{2,k}}}\right) + \sum_{k=0}^{16} \log \left(\gamma^{n_{3,k}} \frac{\lambda^{k n_{3,k}} e^{-\lambda n_{3,k}}}{(k!)^{n_{3,k}}}\right) \] which simplifies to \[ \ell(\vec{\theta} | \{n_{j,k}\}) = n_{1,0} \log \alpha + \sum_{k=0}^{16} \left\{ n_{2,k}(\log \beta + k \log \mu - \mu - \log k!) + n_{3,k}(\log \gamma + k \log \lambda - \lambda - \log k!) \right\}. \]

Anticipating taking $\frac{d \ell}{d \vec{\theta}}$ to solve for the maximum of the log-likelihood, we remove any terms that are not a function of $\vec{\theta}$, resulting in the kernel \[ \ell(\vec{\theta} | \{n_{j,k}\}) = n_{1,0} \log \alpha + \sum_{k=0}^{16} \left\{ n_{2,k}(\log \beta + k \log \mu - \mu) + n_{3,k}(\log \gamma + k \log \lambda - \lambda) \right\}. \]

E-step

The conditional expectation to solve in the EM algorithm is given by \[ Q(\vec{\theta} | \vec{\theta}^{(t)}) = E\!\left(\ell(\vec{\theta} | \{n_{j,k}\}) \mid \{n_j\}, \vec{\theta}^{(t)}\right) \] where $\{n_{j,k}\}$ are random and $\{n_j\}$ and $\vec{\theta}^{(t)}$ are given. We rewrite this as \[ Q(\vec{\theta} | \vec{\theta}^{(t)}) = E \left( n_{1,0} \log \alpha + \sum_{k=0}^{16} \left\{ n_{2,k}(\log \beta + k \log \mu - \mu) + n_{3,k}(\log \gamma + k \log \lambda - \lambda) \right\} \right). \]

Using the linearity of expectations, we rewrite the above to \[ Q(\vec{\theta} | \vec{\theta}^{(t)}) = E(n_{1,0}) \log \alpha + \sum_{k=0}^{16} \left\{ E(n_{2,k})(\log \beta + k \log \mu - \mu) + E(n_{3,k})(\log \gamma + k \log \lambda - \lambda) \right\} \] given $\{n_j\}$ and $\theta^{(t)}$.

Consider $E\!\left(n_{2,k} | \{n_j\}, \theta^{(t)}\right)$. To solve this expectation, we must first derive the distribution of $n_{2,k}$.

Suppose $x_j = k$. Then, by Bayes' rule, the probability that the $j$-th person belongs to group $2$ is given by \[ \Pr(Z_j = 2 | x_j = k) = \Pr(Z_j = 2) \Pr(x_j = k | Z_j = 2) / \Pr(x_j = k). \] We note that $\Pr(x_j = k)$ is the marginal probability $\pi_k(\vec{\theta}) = \alpha I(k=0) + \beta \operatorname{POI}(k | \mu) + \gamma \operatorname{POI}(k | \lambda)$, $\Pr(Z_j = 2)$ is the definition of $\beta$, and $\Pr(x_j = k | Z_j = 2)$ is $f_{X_j|Z_j}(k | Z_j=2) = \operatorname{POI}(k | \mu)$.

Making the substitutions yields the result \[ t_k(\vec{\theta}) = \Pr(Z_j = 2 | x_j = k) = \beta \operatorname{POI}(k | \mu) / \pi_k(\vec{\theta}). \]

Assuming $\{X_i\}$ are i.i.d., the distribution of $n_{2,k}$ given $n_k$ and $\theta^{(t)}$ is binomial with $n_k$ trials and probability of success $t_k(\vec{\theta}^{(t)})$. Thus, \[ E(n_{2,k}) = n_k t_k(\vec{\theta}^{(t)}). \] The same logic holds for $n_{3,k}$ and $n_{1,0}$, with $p_k(\vec{\theta}) = \Pr(Z_j = 3 | x_j = k) = \gamma \operatorname{POI}(k | \lambda) / \pi_k(\vec{\theta})$ and $z_0(\vec{\theta}) = \Pr(Z_j = 1 | x_j = 0) = \alpha / \pi_0(\vec{\theta})$, and thus \[ E(n_{3,k}) = n_k p_k(\vec{\theta}^{(t)}) \] and \[ E(n_{1,0}) = n_0 z_0(\vec{\theta}^{(t)}), \] which means \[ Q(\vec{\theta} | \vec{\theta}^{(t)}) = n_0 z_0(\vec{\theta}^{(t)}) \log \alpha + \sum_{k=0}^{16} \left\{ n_k t_k(\vec{\theta}^{(t)})(\log \beta + k \log \mu - \mu) + n_k p_k(\vec{\theta}^{(t)})(\log \gamma + k \log \lambda - \lambda) \right\}. \]
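The membership probabilities in the E-step can be computed for all $k$ at once. The sketch below does so with `dpois`, using the starting values and the observed counts from Part (b); the variable names (`pi.k`, `z.k`, `t.k`, `p.k`, `expected.counts`) are illustrative and not taken from the original code.

```r
# Starting values from Part (b): (alpha, beta, mu, lambda) = (1/3, 1/3, 2, 3).
alpha <- 1/3; beta <- 1/3; mu <- 2; lambda <- 3
gam <- 1 - alpha - beta
k <- 0:16

# Marginal probability of observing k encounters.
pi.k <- alpha * (k == 0) + beta * dpois(k, mu) + gam * dpois(k, lambda)

# Posterior probability of each group given the response k.
z.k <- alpha * (k == 0) / pi.k      # group 1 (only non-zero at k = 0)
t.k <- beta * dpois(k, mu) / pi.k   # group 2
p.k <- gam * dpois(k, lambda) / pi.k # group 3

# Expected complete-data counts E(n_{1,k}), E(n_{2,k}), E(n_{3,k}).
ns <- c(379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)
expected.counts <- rbind(z = ns * z.k, t = ns * t.k, p = ns * p.k)
```

For each $k$ the three probabilities sum to one, so the expected counts add up to $N = 1500$.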

M-step

We wish to solve \[ \vec{\theta}^{(t+1)} = \operatorname{arg\,max}_{\vec{\theta}} Q(\vec{\theta} | \vec{\theta}^{(t)}). \] by solving \[ \left. \nabla Q(\vec{\theta} | \vec{\theta}^{(t)}) \right\vert_{\vec{\theta}=\vec{\theta}^{(t+1)}} = \vec{0}. \]

We use the Lagrangian to impose the restriction $\alpha + \beta + \gamma = 1$, thus we seek to perform the constrained maximization of \[ Q_l(\vec{\theta},c | \vec{\theta}^{(t)}) = Q(\vec{\theta} | \vec{\theta}^{(t)}) + c(1-\alpha-\beta-\gamma). \]

Thus, when we solve for $\alpha$, \[ \frac{\partial Q_l}{\partial \alpha} = \frac{n_0 z_0(\theta^{(t)})}{\alpha} - c = 0, \] we get the result \[ \alpha^{(t+1)} = \frac{1}{c} n_0 z_0(\theta^{(t)}). \]

Similar results hold for $\beta$ and $\gamma$, obtaining \[ \beta^{(t+1)} = \frac{1}{c} \sum_{k=0}^{16} n_k t_k(\theta^{(t)}). \] and \[ \gamma^{(t+1)} = \frac{1}{c} \sum_{k=0}^{16} n_k p_k(\theta^{(t)}). \]

This does not look too promising until we realize that \[ n_0 z_0(\theta^{(t)}) + \sum_{k=0}^{16} n_k t_k(\theta^{(t)}) + \sum_{k=0}^{16} n_k p_k(\theta^{(t)}) = N. \]

Thus, $c (\alpha^{(t+1)}+\beta^{(t+1)}+\gamma^{(t+1)}) = N$, which means $c = N$ since $\alpha^{(t+1)}+\beta^{(t+1)}+\gamma^{(t+1)} = 1$. Making this substitution obtains the result \[\begin{align} \alpha^{(t+1)} &= \frac{1}{N} n_0 z_0(\theta^{(t)})\\ \beta^{(t+1)} &= \frac{1}{N} \sum_{k=0}^{16} n_k t_k(\theta^{(t)})\\ \gamma^{(t+1)} &= \frac{1}{N} \sum_{k=0}^{16} n_k p_k(\theta^{(t)}). \end{align}\]

Solving for an estimator of $\mu$ at iteration $(t+1)$, \[\begin{align} \left. \frac{\partial Q_l}{\partial \mu} \right\vert_{\mu=\mu^{(t+1)}} &= 0\\ \sum_{k=0}^{16} n_k t_k(\theta^{(t)})(k/\mu^{(t+1)}-1) &= 0\\ \frac{1}{\mu^{(t+1)}} \sum_{k=0}^{16} n_k t_k(\theta^{(t)}) k &= \sum_{k=0}^{16} n_k t_k(\theta^{(t)})\\ \mu^{(t+1)} &= \frac{\sum_{k=0}^{16} k n_k t_k(\theta^{(t)})}{\sum_{k=0}^{16} n_k t_k(\theta^{(t)})}. \end{align}\]

The same derivation essentially follows for $\lambda$, and thus \[ \lambda^{(t+1)} = \frac{\sum_{k=0}^{16} k n_k p_k(\theta^{(t)})}{\sum_{k=0}^{16} n_k p_k(\theta^{(t)})}. \]
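The updating equations above can be written as a single vectorized step. The following is a sketch under the assumption that the observed counts and starting values are those given in Part (b). The names `em.step` and `loglik` are illustrative, and the iteration count of 2000 is arbitrary. It also tracks the observed-data log-likelihood, which EM should never decrease.

```r
ns <- c(379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)
k <- 0:16
N <- sum(ns)

# One EM update of theta = (alpha, beta, mu, lambda).
em.step <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]; mu <- theta[3]; lambda <- theta[4]
  gam <- 1 - alpha - beta
  pi.k <- alpha * (k == 0) + beta * dpois(k, mu) + gam * dpois(k, lambda)
  z.k <- alpha * (k == 0) / pi.k
  t.k <- beta * dpois(k, mu) / pi.k
  p.k <- gam * dpois(k, lambda) / pi.k
  c(sum(ns * z.k) / N,                  # alpha
    sum(ns * t.k) / N,                  # beta
    sum(k * ns * t.k) / sum(ns * t.k),  # mu
    sum(k * ns * p.k) / sum(ns * p.k))  # lambda
}

# Observed-data log-likelihood of the three-component mixture.
loglik <- function(theta) {
  gam <- 1 - theta[1] - theta[2]
  sum(ns * log(theta[1] * (k == 0) + theta[2] * dpois(k, theta[3]) +
               gam * dpois(k, theta[4])))
}

theta <- c(1/3, 1/3, 2, 3)
n.iter <- 2000
ll <- numeric(n.iter)
for (i in 1:n.iter) {
  theta <- em.step(theta)
  ll[i] <- loglik(theta)
}
```

Using `dpois` includes the $1/k!$ factor, which cancels in the membership probabilities, so the updates are unchanged.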

Part (b)

Estimate the parameters of the model, using the observed data.

# we observe n = (n0,n1,...,n16)
ns <- c(379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)
N <- sum(ns)
# theta := (alpha, beta, mu, lambda)'
# note that there is an implicit parameter gamma s.t.
# alpha + beta + gamma = 1
# the initial value assumes each category z, t, or p
# is equally probable, and so we let
# (alpha^(0),beta^(0)) = (1/3,1/3)
# and mu^(0) and lambda^(0) are just arbitrarily chosen to be 2 and 3,
# with the insight that group 3 is more risky than group 2.
theta <- c(1/3,1/3,2,3)
# theta := (alpha, beta, mu, lambda)
# Pi(i,theta) equals i! * Pr(X = i); the 1/i! factor of the Poisson terms is
# omitted since it cancels in the ratios z0, t, and p below.
Pi <- function(i,theta)
{
    res <- 0
    if (i == 0)
        res <- theta[1]
    res <- res + theta[2] * theta[3]^i * exp(-theta[3])
    res <- res + (1 - theta[1] - theta[2]) * theta[4]^i * exp(-theta[4])
    res
}
# Pr(group 1 | x = 0)
z0 <- function(theta)
{
    theta[1] / Pi(0,theta)
}
# Pr(group 2 | x = i)
t <- function(i,theta)
{
    theta[2] * theta[3]^i * exp(-theta[3]) / Pi(i,theta)
}
# Pr(group 3 | x = i)
p <- function(i,theta)
{
    (1-theta[1] - theta[2]) * theta[4]^i * exp(-theta[4]) / Pi(i,theta)
}
# update algorithm, based on EM algorithm
update <- function(theta,ns)
{
    # note: n0 := ns[1] instead of ns[0] since R does not use zero-based indexes
    alpha <- ns[1] * z0(theta) / N
    beta <- 0
    mu_num <- 0
    mu_denom <- 0
    lam_num <- 0
    lam_denom <- 0
    for (i in 0:16)
    {
        ti <- t(i,theta)
        pi <- p(i,theta)
        beta <- beta + ns[i+1] * ti
        mu_num <- mu_num + i * ns[i+1] * ti
        mu_denom <- mu_denom + ns[i+1] * ti
        lam_num <- lam_num + i * ns[i+1] * pi
        lam_denom <- lam_denom + ns[i+1] * pi
    }
    beta <- beta / N
    mu <- mu_num / mu_denom
    lam <- lam_num / lam_denom
    c(alpha,beta,mu,lam)
}
# run a fixed number of EM updates, starting from theta
em <- function(theta,ns,steps=10000,debug=T)
{
    for(i in 1:steps)
    {
        theta <- update(theta,ns)
        if (debug==T)
        {
            if (i %% 1000 == 0) { cat("iteration =",i," theta = (",theta,")'\n") }
        }
    }
    theta
}
# solution theta = (alpha, beta, mu, lambda)
sol <- em(theta,ns,10000,T)

## iteration = 1000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 2000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 3000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 4000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 5000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 6000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 7000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 8000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 9000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 10000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'

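We see that the solution is $(\hat\alpha, \hat\beta, \hat\mu, \hat\lambda)' = (0.1221661, 0.5625419, 1.4674746, 5.9388889)'$, which implies $\hat\gamma = 1 - \hat\alpha - \hat\beta = 0.315292$.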

Part (c)

Estimate the standard errors and pairwise correlations of your parameters, using any available method.

We have chosen to use the Bootstrap method.

# ns = (379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)
# 379 responded 0 encounters
# 299 responded 1 encounters
# 222 responded 2 encounters
# ...
# 1 responded 16 encounters
#
# to resample, we resample from the data set that includes each
# person's response, as determined by ns.
data <- NULL
for (i in 1:length(ns))
{
    data <- append(data,rep((i-1),ns[i]))
}
# convert a vector of individual responses back into counts (n0,...,n16)
make_into_counts <- function(data)
{
    ns <- NULL
    for (i in 0:16)
    {
        ni <- data[data == i]
        l <- length(ni)
        ns <- append(ns,l)
    }
    ns
}
m <- 1000 # bootstrap replicates
em_steps <- 100
# the first row is the estimate from the original data
theta.bs <- em(theta,ns,em_steps,F)
thetas <- rbind(theta.bs)
for (i in 2:m)
{
    indices <- sample(N,N,replace=T)
    resampled <- make_into_counts(data[indices])
    theta.bs <- em(theta,resampled,em_steps,F)
    thetas <- rbind(thetas,theta.bs)
    if (i %% 100 == 0)
    {
        cat("iteration ", i, "\n")
        print(cov(thetas))
    }
}

## iteration 100
## [,1] [,2] [,3] [,4]
## [1,] 0.0003903744 -0.0001438786 0.0017263560 0.001571458
## [2,] -0.0001438786 0.0004087653 0.0004993799 0.001813153
## [3,] 0.0017263560 0.0004993799 0.0152443668 0.015694549
## [4,] 0.0015714585 0.0018131530 0.0156945490 0.040863019
## iteration 200
## [,1] [,2] [,3] [,4]
## [1,] 0.0004023631 -0.0001703146 0.001697344 0.001484011
## [2,] -0.0001703146 0.0004447321 0.000419080 0.001903868
## [3,] 0.0016973442 0.0004190800 0.014398069 0.015896531
## [4,] 0.0014840108 0.0019038683 0.015896531 0.044389192
## iteration 300
## [,1] [,2] [,3] [,4]
## [1,] 0.0003956313 -0.0001727811 0.0016888749 0.00159774
## [2,] -0.0001727811 0.0004628827 0.0003008066 0.00181497
## [3,] 0.0016888749 0.0003008066 0.0137001437 0.01583219
## [4,] 0.0015977404 0.0018149703 0.0158321853 0.04443171
## iteration 400
## [,1] [,2] [,3] [,4]
## [1,] 0.0003966960 -0.0001719888 0.0016451275 0.001548858
## [2,] -0.0001719888 0.0004684167 0.0003622689 0.002003458
## [3,] 0.0016451275 0.0003622689 0.0137218437 0.016201510
## [4,] 0.0015488580 0.0020034577 0.0162015099 0.046175397
## iteration 500
## [,1] [,2] [,3] [,4]
## [1,] 0.0004135674 -0.0001873768 0.0016694824 0.001624097
## [2,] -0.0001873768 0.0004681143 0.0003021341 0.001786532
## [3,] 0.0016694824 0.0003021341 0.0136412731 0.015939754
## [4,] 0.0016240973 0.0017865318 0.0159397544 0.043573619
## iteration 600
## [,1] [,2] [,3] [,4]
## [1,] 0.0004111393 -0.0001937508 0.001653108 0.001534535
## [2,] -0.0001937508 0.0004822347 0.000260584 0.001829480
## [3,] 0.0016531076 0.0002605840 0.013451057 0.015368992
## [4,] 0.0015345352 0.0018294799 0.015368992 0.042992734
## iteration 700
## [,1] [,2] [,3] [,4]
## [1,] 0.0004259534 -0.0001994680 0.0017338084 0.001631373
## [2,] -0.0001994680 0.0004943474 0.0002033533 0.001795465
## [3,] 0.0017338084 0.0002033533 0.0136916465 0.015681755
## [4,] 0.0016313734 0.0017954653 0.0156817555 0.043016583
## iteration 800
## [,1] [,2] [,3] [,4]
## [1,] 0.0004134958 -0.0001932351 0.0016624262 0.001607135
## [2,] -0.0001932351 0.0004925245 0.0002242567 0.001774913
## [3,] 0.0016624262 0.0002242567 0.0131371751 0.015149211
## [4,] 0.0016071351 0.0017749127 0.0151492114 0.042293737
## iteration 900
## [,1] [,2] [,3] [,4]
## [1,] 0.0004139774 -0.0002014407 0.0016479052 0.001583003
## [2,] -0.0002014407 0.0004968611 0.0001793993 0.001649937
## [3,] 0.0016479052 0.0001793993 0.0129764348 0.014710759
## [4,] 0.0015830026 0.0016499365 0.0147107588 0.041445055
## iteration 1000
## [,1] [,2] [,3] [,4]
## [1,] 0.0004196823 -0.0001981646 0.0016697031 0.001606301
## [2,] -0.0001981646 0.0004857090 0.0001668595 0.001625460
## [3,] 0.0016697031 0.0001668595 0.0129729349 0.014645101
## [4,] 0.0016063014 0.0016254599 0.0146451012 0.041083606

cov.bs <- cov(thetas)
cor.bs <- cor(thetas)

The Bootstrap estimator of the covariance matrix is given by

## [,1] [,2] [,3] [,4]
## [1,] 0.0004196823 -0.0001981646 0.0016697031 0.001606301
## [2,] -0.0001981646 0.0004857090 0.0001668595 0.001625460
## [3,] 0.0016697031 0.0001668595 0.0129729349 0.014645101
## [4,] 0.0016063014 0.0016254599 0.0146451012 0.041083606

and the correlation matrix is given by

## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 -0.43891212 0.71558269 0.3868409
## [2,] -0.4389121 1.00000000 0.06647277 0.3638764
## [3,] 0.7155827 0.06647277 1.00000000 0.6343647
## [4,] 0.3868409 0.36387642 0.63436466 1.0000000
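The standard errors asked for in the problem are the square roots of the diagonal of the covariance matrix. The sketch below derives them from the bootstrap covariance matrix printed above, entered by hand, and recomputes the correlations with `cov2cor` as a consistency check. The row and column labels are added here for readability and are not part of the original output.

```r
# Bootstrap covariance matrix as reported above (values copied from the output).
labels <- c("alpha", "beta", "mu", "lambda")
cov.bs <- matrix(c(
   0.0004196823, -0.0001981646, 0.0016697031, 0.001606301,
  -0.0001981646,  0.0004857090, 0.0001668595, 0.001625460,
   0.0016697031,  0.0001668595, 0.0129729349, 0.014645101,
   0.0016063014,  0.0016254599, 0.0146451012, 0.041083606),
  nrow = 4, byrow = TRUE, dimnames = list(labels, labels))

# Standard errors: square roots of the variances on the diagonal.
se.bs <- sqrt(diag(cov.bs))

# Pairwise correlations recomputed from the covariance matrix.
cor.check <- cov2cor(cov.bs)
```

This gives standard errors of roughly $0.020$, $0.022$, $0.114$, and $0.203$ for $\alpha$, $\beta$, $\mu$, and $\lambda$ respectively.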

Let’s also try the inverse of the observed information matrix, that is, the inverse of the negative Hessian of the log-likelihood evaluated at the MLE.

library(numDeriv)
# observed-data log-likelihood of the mixture, theta = (alpha, beta, mu, lambda)
loglike <- function(theta)
{
    s <- 0
    for (x in data)
    {
        s <- s + log(theta[1]*as.numeric(x==0) +
                     theta[2]*dpois(x,theta[3]) +
                     (1-theta[1]-theta[2])*dpois(x,theta[4]))
    }
    s
}
mle <- c(0.1221661,0.5625419,1.4674746,5.9388889)
solve(-hessian(loglike,mle))

## [,1] [,2] [,3] [,4]
## [1,] 0.0003799048 -1.909698e-04 1.438556e-03 0.001184057
## [2,] -0.0001909698 4.657702e-04 7.132638e-05 0.001417409
## [3,] 0.0014385560 7.132638e-05 1.111722e-02 0.011376017
## [4,] 0.0011840568 1.417409e-03 1.137602e-02 0.034664940

SIUe - Computational Statistics (STAT 575) - Problem Set 4

Sat, 30 Oct 2021 00:00:00 +0000

Problem 1

Use the Metropolis-Hastings algorithm to generate $Y \sim \operatorname{GAM}(\alpha, 1)$, where $\alpha > 1$. Note that $\alpha$ need not be an integer. Consider the proposal distribution $g$, which is the density of $\operatorname{GAM}(a,b)$, where $a=\lfloor \alpha \rfloor$ and $b = a/\alpha$.

Part (a)

Implement your accept-reject and Metropolis-Hastings algorithms to get a sample of $10000$ from $Y \sim \operatorname{GAM}(\alpha=2.5,1)$.

Acceptance-rejection sampler

We implement the density and sampler functions, respectively $\operatorname{dgamma1}$ and $\operatorname{accept.reject.rgamma1}$.

Suppose $X \sim \operatorname{GAM}(\alpha,\beta)$ where $\alpha$ is the shape parameter and $\beta$ is the rate parameter. Then, $X$ has a density given by \[ f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}. \]

Suppose $Y \sim \operatorname{GAM}(\alpha,1)$, $\alpha \geq 1$. This is hard to sample from. However, $Z \sim \operatorname{GAM}(a,b)$ where $a = \lfloor \alpha \rfloor$ and $b = a / \alpha$, is easy to sample from, since \[ Z = \sum_{i=1}^{a} X_i \sim \operatorname{GAM}(a,b) \] where $X_i \sim \operatorname{EXP}(b)$.

We may thus use acceptance-rejection sampling to sample $z \sim f_Z$ and then $u \sim U(0,1)$ and accept $z$ as a realization from $f_Y$ if $u \leq \frac{f_Y(z)}{c f_Z(z)}$ where, optimally, \[ c = \max_y \left\{\frac{f_Y(y)}{f_Z(y)}\right\}, \] but in general $c$ must only satisfy $f_Y(y) / (c f_Z(y)) \leq 1$ for all $y$.

Note that we may use their respective kernels instead, since the ratio of the kernels is proportional to $f_Y(y)/f_Z(y)$. The ratio of the kernels is $h(y) = y^{\alpha-a} e^{(b-1)y}$. Thus, we seek $c = \max_y h(y)$ which may be obtained by solving for the input $y^*$ that maximizes $\log h$, which is the stationary point \[ \left. \frac{d}{dy}\log h(y) \right\vert_{y = y^*} = 0, \] and then substituting that into the ratio and simplifying, yielding the result \[ c = h(y^*) = (\alpha/e)^{\alpha-\lfloor \alpha \rfloor}. \] Observe that if $\alpha$ is an integer, then $c = 1$ and $h(y) = 1$, in which case the target distribution and the candidate distribution are identical.
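The stationary point works out to $y^* = (\alpha - a)/(1 - b) = \alpha$. As a quick numerical check of the closed form for $c$, the sketch below maximizes the kernel ratio $h$ with `optimize` for $\alpha = 2.5$. The search interval `c(0, 50)` and the names `opt` and `c.closed` are assumptions made for illustration.

```r
alpha <- 2.5
a <- floor(alpha)
b <- a / alpha

# Ratio of the target kernel to the proposal kernel.
h <- function(y) y^(alpha - a) * exp((b - 1) * y)

# Numerical maximization over an interval assumed wide enough to contain y*.
opt <- optimize(h, interval = c(0, 50), maximum = TRUE)

# Closed-form value of c derived in the text.
c.closed <- (alpha / exp(1))^(alpha - a)
```

Here `opt$maximum` should be close to $\alpha$ and `opt$objective` close to `c.closed`.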

We implement the acceptance-rejection algorithm with the following code:

# The density function for random variates in the family
# GAM(shape=alpha,rate=1)
dgamma1 <- function(x, alpha) {
    dgamma(x, shape = alpha, rate = 1)
}
# An acceptance-rejection sampling procedure for random variates in the family
# GAM(shape=alpha,rate=1)
accept.reject.rgamma1 <- function(n, alpha) {
    a <- floor(alpha)
    rate <- a/alpha
    q <- (alpha/exp(1))^(a - alpha) # q = 1/c
    ys <- vector(length = n)
    for (i in 1:n) {
        repeat {
            y <- sum(rexp(a, rate)) # draw candidate
            if (runif(1) <= q * y^(alpha - a) * exp(y * a/alpha - y)) {
                ys[i] <- y
                break
            }
        }
    }
    ys
}

We sample from $\operatorname{GAM}(2.5,1)$ with the acceptance-rejection method with:

alpha <- 2.5
m <- 10000
accept.reject.samp <- accept.reject.rgamma1(m, alpha)

Metropolis-Hastings algorithm

Here is our implementation of the Metropolis-Hastings algorithm:

# A sampling procedure for random variates in the family
# GAM(shape=alpha,rate=1) using Metropolis-Hastings algorithm
metro.hast.rgamma1 <- function(n, alpha, burn = 0) {
    a <- floor(alpha)
    rate <- a/alpha
    # target density: GAM(shape=alpha,rate=1)
    f <- function(x) {
        dgamma(x, shape = alpha, rate = 1)
    }
    # proposal density: GAM(shape=a,rate=a/alpha)
    g <- function(x) {
        dgamma(x, shape = a, rate = rate)
    }
    m <- n + burn
    ys <- vector(length = m)
    ys[1] <- sum(rexp(a, rate))
    for (i in 2:m) {
        v <- sum(rexp(a, rate)) # draw from g
        u <- ys[i - 1]
        R <- f(v) * g(u)/(f(u) * g(v))
        if (runif(1) <= R) {
            ys[i] <- v
        } else {
            ys[i] <- u
        }
    }
    ys[(burn + 1):m]
}
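Because the proposal does not depend on the current state, the Metropolis-Hastings ratio reduces to $h(v)/h(u)$, where $h(y) = y^{\alpha-a} e^{(b-1)y}$ is the ratio of kernels used in the acceptance-rejection sampler. The sketch below computes the ratio on the log scale, an alternative that avoids evaluating four densities, and checks it against the density-based ratio. The function name `log.ratio` and the test points `u` and `v` are hypothetical.

```r
alpha <- 2.5
a <- floor(alpha)
b <- a / alpha

# log of h(v) / h(u), with h(y) = y^(alpha - a) * exp((b - 1) * y).
log.ratio <- function(v, u) (alpha - a) * (log(v) - log(u)) + (b - 1) * (v - u)

# Density-based ratio, as in the sampler above.
f <- function(x) dgamma(x, shape = alpha, rate = 1)
g <- function(x) dgamma(x, shape = a, rate = b)
u <- 1.7   # hypothetical current state
v <- 3.2   # hypothetical proposed state
R.direct <- f(v) * g(u) / (f(u) * g(v))
R.log <- exp(log.ratio(v, u))
```

With this form, a proposal can be accepted when `log(runif(1)) <= log.ratio(v, u)`.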

We sample using the Metropolis-Hastings algorithm with the following code:

metro.hast.samp <- metro.hast.rgamma1(m, alpha)

Part (b)

Check on mixing and convergence using plots. Run multiple chains and compute the Gelman-Rubin statistic. You may pick any reasonable burn-in.

We plot the histograms with:

par(mfrow = c(1, 2))
hist(accept.reject.samp, freq = F, breaks = 50, main = "acceptance-rejection")
lines(seq(0.01, 15, by = 0.01), dgamma1(seq(0.01, 15, by = 0.01), alpha), col = "blue",
lwd = 2)
hist(metro.hast.samp, freq = F, breaks = 50, main = "metropolis-hastings")
lines(seq(0.01, 15, by = 0.01), dgamma1(seq(0.01, 15, by = 0.01), alpha), col = "blue",
lwd = 2)

Both samplers seem to be compatible with the density.

We plot the sample paths of the Metropolis-Hastings and acceptance-rejection samplers with:

par(mfrow = c(1, 2))
plot(metro.hast.samp, pch = "·", xlab = "t", ylab = "Y", main = "metropolis-hastings")
plot(accept.reject.samp, pch = "·", xlab = "t", ylab = "Y", main = "acceptance-rejection")

Both of these look good, as neither remains at or near the same value for many iterations. They also both quickly move away from their initial values.

We plot the ACFs with:

par(mfrow = c(1, 2))
acf(metro.hast.samp)
acf(accept.reject.samp)

Both samplers seem to have very little autocorrelation. If an uncorrelated sample is extremely important, to be safe, taking every other sample point would probably be sufficient.

We implement the Gelman-Rubin statistic with:

# samps: should be an L x J matrix, where L is the length of the samples and J
# is the number of samples (independent chains).
gelman.rubin <- function(samps) {
    L <- nrow(samps)
    J <- ncol(samps)
    x.bar <- apply(samps, 2, mean)
    B <- var(x.bar) * L             # between-chain variance
    W <- mean(apply(samps, 2, var)) # within-chain variance
    ((L - 1)/L * W + B/L)/W
}
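To see how the statistic behaves, the sketch below applies the same formula to two synthetic cases: chains that all sample the same distribution, and chains that are stuck around different values. The synthetic data, the seed, and the names `mixed` and `stuck` are assumptions made for illustration; the function body repeats the definition above so that the example runs on its own.

```r
gelman.rubin <- function(samps) {
  L <- nrow(samps)
  x.bar <- apply(samps, 2, mean)
  B <- var(x.bar) * L               # between-chain variance
  W <- mean(apply(samps, 2, var))   # within-chain variance
  ((L - 1) / L * W + B / L) / W
}

set.seed(1)
L <- 1000
J <- 4

# Case 1: all chains draw from the same N(0, 1) distribution.
mixed <- matrix(rnorm(L * J), nrow = L, ncol = J)

# Case 2: each chain is shifted by a different offset, mimicking chains
# that have not converged to a common distribution.
stuck <- sweep(mixed, 2, c(0, 1, 2, 3), "+")

R.mixed <- gelman.rubin(mixed)
R.stuck <- gelman.rubin(stuck)
```

The first case should give a value near $1$, while the second should be well above the $1.1$ rule of thumb.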

Next, we compute Gelman-Rubin statistics on the computed independence chains.

chains <- 1000
samps <- matrix(nrow = m, ncol = chains)
for (i in 1:chains) {
    samps[, i] <- metro.hast.rgamma1(m, alpha, burn = 1000)
}
gelman.rubin.stat <- gelman.rubin(samps)
print(gelman.rubin.stat)

## [1] 1.000009

We see that the Gelman-Rubin statistic is given by \[ R = 1.0000087. \]

We adopt the rule of thumb that if $\sqrt{R} < 1.1$, the burn-in and chain length are sufficient. We compute $\sqrt{R}$ to be \[ \sqrt{R} = 1.0000044, \] and thus are satisfied with our burn-in choice and chain length.

Part (c)

Estimate $\operatorname{E}(Y^2)$ using the generated chain. Compare with the estimate you get with acceptance-rejection sampling (Exam 1).

Theoretically, \[ \operatorname{E}(Y^2) = \frac{\Gamma(2+\alpha)}{\Gamma(\alpha)} = \frac{\Gamma(4.5)}{\Gamma(2.5)} = 8.75. \]
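Since $\Gamma(\alpha+2)/\Gamma(\alpha) = \alpha(\alpha+1)$, the theoretical value follows without evaluating the gamma function. The sketch below also computes the Monte Carlo standard error of the sample mean of $Y^2$ for $m = 10000$ independent draws, using $\operatorname{E}(Y^4) = \alpha(\alpha+1)(\alpha+2)(\alpha+3)$. This assumes independent draws, so for the autocorrelated Metropolis-Hastings chain it is only a lower bound; the names `ey2`, `ey4`, and `se.iid` are illustrative.

```r
alpha <- 2.5
m <- 10000

# Second and fourth raw moments of GAM(alpha, 1).
ey2 <- alpha * (alpha + 1)   # E(Y^2)
ey4 <- prod(alpha + 0:3)     # E(Y^4)

# Standard error of mean(Y^2) for m independent draws.
se.iid <- sqrt((ey4 - ey2^2) / m)
```

The standard error is about $0.118$, which gives a scale for judging how close a simulated estimate is to $8.75$.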

We estimate $\operatorname{E}(Y^2)$ using the acceptance-rejection and Metropolis-Hastings by taking the square of each element in the samples they generated and then taking the mean:

tab <- matrix(nrow = 2, ncol = 1)
rownames(tab) <- c("acceptance-rejection", "metropolis-hastings")
colnames(tab) <- c("mean")
tab[1] <- c(mean(accept.reject.samp^2))
tab[2] <- c(mean(metro.hast.samp^2))
knitr::kable(data.frame(tab))

	mean
acceptance-rejection	8.743824
metropolis-hastings	8.703302

Both are quite close to the true value of $8.75$.

Problem 2 (Problem 7.1)

Rework the textbook example. Consider the mixture normal $\delta N(7,0.5^2) + (1-\delta) N(10,0.5^2)$.

Part (a)

Simulate $200$ realizations from the mixture distribution with $\delta = 0.7$. Draw a histogram of these data.

We implement the density and sampler for the mixture distribution with:

dmix <- function(x, delta) {
  delta * dnorm(x, 7, 0.5) + (1 - delta) * dnorm(x, 10, 0.5)
}
rmix <- function(n, delta) {
  xs <- vector(length = n)
  for (i in 1:n) {
    xs[i] <- ifelse(runif(1) < delta, rnorm(1, 7, 0.5), rnorm(1, 10, 0.5))
  }
  xs
}

We generate a sample and plot its histogram with:

n <- 200
delta <- 0.7
data <- rmix(n, delta)
hist(data, freq = F)

Part (b)

Now assume $\delta$ is unknown. Implement independence chain MCMC procedure to simulate from the posterior distribution of $\delta$, using your data from part (a).

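We implement the likelihood and the log-likelihood of $\delta$ with:

# likelihood of delta given the sample xs (zero outside [0, 1])
lmix <- Vectorize(function(delta, xs) {
  if (delta < 0 || delta > 1) {
    return(0)
  }
  p <- 1
  for (x in xs) {
    p <- p * dmix(x, delta)
  }
  p
}, "delta")
# log-likelihood of delta given the sample xs (-Inf outside [0, 1])
logmix <- Vectorize(function(delta, xs) {
  if (delta < 0 || delta > 1) {
    return(-Inf)
  }
  logp <- 0
  for (x in xs) {
    logp <- logp + log(dmix(x, delta))
  }
  logp
}, "delta")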

A sample $\{x_t\}$ drawn from the mixture normal with density $\operatorname{dmix}$ is observed with likelihood $\operatorname{lmix}(\delta|\vec{x})$ with respect to $\delta$ with prior distribution $p(\delta)$. Thus, the posterior distribution is given by \[ p(\delta|\vec{x}) \propto p(\delta) \operatorname{lmix}(\delta|\vec{x}). \]

In the independence chain MCMC, we may use the prior as the proposal density. With target density $f(\delta) = p(\delta|\vec{x})$ and proposal density $g = p$, the Metropolis-Hastings ratio is \[ R = \frac{f(\delta^{*}) g(\delta^{(t)})}{f(\delta^{(t)})g(\delta^{*})} = \frac{p(\delta^{*}|\vec{x}) p(\delta^{(t)})}{p(\delta^{(t)}|\vec{x}) p(\delta^{*})} \] which may be rewritten as \[ R = \frac{p(\delta^{*}) \operatorname{lmix}(\delta^{*}|\vec{x}) p(\delta^{(t)})}{p(\delta^{(t)}) \operatorname{lmix}(\delta^{(t)}|\vec{x}) p(\delta^{*})} = \frac{\operatorname{lmix}(\delta^{*}|\vec{x})}{\operatorname{lmix}(\delta^{(t)}|\vec{x})}. \] That is, the ratio reduces to a likelihood ratio.

Numerical imprecision

Suppose we have a data type $T$ that models real numbers. Since computers are physical, $T$ can only represent a finite set of numbers.

In the likelihood function, \[ \operatorname{lmix} \colon \mathbb{R} \times 2^{\mathbb{R}} \to \mathbb{R}, \] if we model $\mathbb{R}$ with $T$, i.e., \[ \operatorname{lmix} \colon T \times 2^T \to T, \] then if the true value of the likelihood function applied to a sufficiently large sample is some value $p \in (0,\epsilon)$, where $\epsilon$ is the smallest positive number representable by $T$, the best we can do is round $p$ to $0$ or $\epsilon$. As a consequence, the likelihood function typically evaluates to $0$ on any sufficiently large sample.

Suppose $\epsilon = 2^{-K}$. If we use the log-likelihood instead, \[ \operatorname{logmix} \colon T \times 2^T \to T \] then, for instance, $\log_2 \epsilon = -K$, and $-K$ is very likely to be at least approximately representable by $T$, as are the logarithms of much smaller values. We cannot map many of these log-likelihoods back to a likelihood, but as long as we only need to work with log-likelihoods, this is not a problem.
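The underflow is easy to reproduce. The following is a minimal sketch with a hypothetical sample from $N(0,1)$ rather than the mixture data; the seed and sample size are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed
x <- rnorm(5000)                        # hypothetical large sample from N(0, 1)

# Each density value is below 0.4, so the product of 5000 of them underflows.
lik <- prod(dnorm(x))
# The same quantity on the log scale is an ordinary finite number.
loglik <- sum(dnorm(x, log = TRUE))

# Log-likelihoods can still be compared even though both likelihoods are 0.
loglik.at.0 <- sum(dnorm(x, mean = 0, log = TRUE))
loglik.at.1 <- sum(dnorm(x, mean = 1, log = TRUE))
c(lik, loglik, loglik.at.0 - loglik.at.1)
```

On the likelihood scale the ratio of the two candidates would be $0/0$; on the log scale their difference is well defined.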

With the above in mind, we replace the likelihood function with the log-likelihood function to significantly increase the space of samples we can work with.

delta.estimator.ic <- function(n, data, delta0 = runif(1), burn = 0) {
  m <- n + burn
  deltas <- vector(length = m)
  deltas[1] <- delta0
  for (i in 2:m) {
    delta <- runif(1) # draw candidate from prior
    delta.old <- deltas[i - 1]
    log.R <- logmix(delta, data) - logmix(delta.old, data)
    if (log(runif(1)) <= log.R) {
      deltas[i] <- delta
    } else {
      deltas[i] <- delta.old
    }
  }
  deltas[(burn + 1):m]
}

Part (c)

Implement a random walk chain with $\delta^* = \delta^{(t)} + \epsilon_t$ with $\epsilon \sim \operatorname{UNIF}(-1,1)$.

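The only random components are the increments $\epsilon_t \sim \operatorname{UNIF}(-1,1)$, $t = 0, 1, \ldots$. Thus, the proposal density for $\delta^*$ given $\delta^{(t)}$ is \[ g(\delta^* | \delta^{(t)}) = f(\delta^* - \delta^{(t)}) \] where $f$ is the density of $\operatorname{UNIF}(-1,1)$. Since $f$ is symmetric about $0$, the proposal densities cancel in the Metropolis-Hastings ratio, which (with the same uniform prior on $\delta$ as in part (b)) again reduces to a likelihood ratio. Proposals outside $[0,1]$ have log-likelihood $-\infty$ and are always rejected.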

delta.estimator.rw <- function(n, data, delta0 = runif(1), burn = 0) {
  m <- n + burn
  deltas <- vector(length = m)
  deltas[1] <- delta0
  for (i in 2:m) {
    delta.old <- deltas[i - 1]
    delta <- delta.old + runif(1, -1, 1)
    log.R <- logmix(delta, data) - logmix(delta.old, data)
    if (log(runif(1)) <= log.R) {
      deltas[i] <- delta
    } else {
      deltas[i] <- delta.old
    }
  }
  deltas[(burn + 1):m]
}

Part (d)

Reparameterize the problem letting $U = \log\left(\delta/(1-\delta)\right)$ and $U^* = u^{(t)} + \epsilon_t$. Implement a random walk chain with $U$ as in Equation (7.8) page 208.

We implement the logit transform, its inverse, the Jacobian of the inverse, and the reparameterized sampler with:

logit <- function(delta) {
  log(delta/(1 - delta))
}
logit.inv <- function(u) {
  exp(u)/(1 + exp(u))
}
# Jacobian of the inverse transform, d delta / d u
logit.inv.J <- function(u) {
  exp(u)/(1 + exp(u))^2
}
delta.estimator.u.rw <- function(n, data, delta0 = runif(1), burn = 0) {
  m <- n + burn
  u <- vector(length = m)
  u[1] <- logit(delta0)
  for (i in 2:m) {
    u.old <- u[i - 1]
    u.star <- u.old + runif(1, -1, 1)
    # ratio of the posterior kernels of u, including the Jacobian terms
    R <- lmix(logit.inv(u.star), data) * logit.inv.J(u.star) /
      (lmix(logit.inv(u.old), data) * logit.inv.J(u.old))
    if (runif(1) <= R) {
      u[i] <- u.star
    } else {
      u[i] <- u.old
    }
  }
  # map the chain back to the delta scale
  logit.inv(u[(burn + 1):m])
}
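The sampler above forms the ratio on the likelihood scale, which works for $200$ observations but is exposed to the underflow discussed earlier for larger samples. As a hedged sketch (not the code that produced the results in this post), the same ratio can be computed on the log scale. The names `dmix.demo` and `log.post.u` are hypothetical; `plogis` and `dlogis` are base R's logistic distribution function and density, which coincide with the inverse logit and its Jacobian.

```r
# Mixture density as in part (a); repeated so the sketch is self-contained.
dmix.demo <- function(x, delta) {
  delta * dnorm(x, 7, 0.5) + (1 - delta) * dnorm(x, 10, 0.5)
}
# Log posterior kernel of u = logit(delta), assuming a uniform prior on delta:
# log-likelihood at delta(u) plus the log-Jacobian log(d delta / d u).
log.post.u <- function(u, xs) {
  delta <- plogis(u)                        # inverse logit
  sum(log(dmix.demo(xs, delta))) + dlogis(u, log = TRUE)
}

set.seed(7)  # arbitrary seed
xs <- ifelse(runif(50) < 0.7, rnorm(50, 7, 0.5), rnorm(50, 10, 0.5))
u.old <- 0.2   # arbitrary current state
u.star <- 0.9  # arbitrary proposal

log.R <- log.post.u(u.star, xs) - log.post.u(u.old, xs)
# Direct ratio, formed the same way as in delta.estimator.u.rw, for comparison.
R.direct <- prod(dmix.demo(xs, plogis(u.star))) * dlogis(u.star) /
  (prod(dmix.demo(xs, plogis(u.old))) * dlogis(u.old))
c(exp(log.R), R.direct)
```

With `log.R` in hand, the acceptance step would read `log(runif(1)) <= log.R`, mirroring the other two samplers.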

Part (e)

Compare the estimates and convergence behavior of three algorithms.

We do not do a burn-in, since we are interested in seeing how quickly the three methods converge. We only plot chains of length $1000$.

We generate the three chains and tabulate their sample means with:

chain <- 1000
burn <- 0
deltas.ic <- delta.estimator.ic(chain, data, burn = burn)
deltas.rw <- delta.estimator.rw(chain, data, burn = burn)
deltas.u.rw <- delta.estimator.u.rw(chain, data, burn = burn)
tab <- matrix(nrow = 3, ncol = 1)
rownames(tab) <- c("independence chain", "random walk", "reparameterized random walk")
colnames(tab) <- c("mu")
tab[1, ] <- mean(deltas.ic)
tab[2, ] <- mean(deltas.rw)
tab[3, ] <- mean(deltas.u.rw)
knitr::kable(data.frame(tab))

	mu
independence chain	0.6887205
random walk	0.6959496
reparameterized random walk	0.6950262

As the table of estimates shows, all three methods provide a good estimate of $\delta$ (true value $0.7$). Next, we consider their convergence and mixing behavior.

We plot the histograms with:

par(mfrow = c(1, 3))
hist(deltas.ic, freq = F, breaks = 50, main = "independence chain")
hist(deltas.rw, freq = F, breaks = 50, main = "random walk")
hist(deltas.u.rw, freq = F, breaks = 50, main = "reparameterized random walk")

The reparameterized random walk Metropolis chain has the histogram that is most compatible with normality, i.e., a characteristic bell curve with a mode near $\delta = 0.7$. That said, all three histograms are arguably compatible with normality, with approximately the same mean near $\delta = 0.7$.

We plot the sample paths with:

par(mfrow = c(1, 3))
plot(deltas.ic, pch = "·", type = "l", xlab = "t", ylab = "delta", main = "independence chain")
plot(deltas.rw, pch = "·", type = "l", xlab = "t", ylab = "delta", main = "random walk")
plot(deltas.u.rw, pch = "·", type = "l", xlab = "t", ylab = "delta", main = "reparameterized random walk")

We see that the random walk demonstrates relatively poor mixing. It has a high rejection rate (stays at the same level for long periods of time), causing it to explore the support of the likelihood slowly.

The sample path of the independence chain can also be said to demonstrate poor mixing.

The reparameterized random walk exhibits good mixing, moving vigorously around the true value.

We plot the ACFs with:

par(mfrow = c(1, 3))
acf(deltas.ic, main = "independence chain")
acf(deltas.rw, main = "random walk")
acf(deltas.u.rw, main = "reparameterized random walk")

They all decay quickly, but in order of increasing autocorrelation: the reparameterized random walk, the independence chain, and the random walk.

In particular, the reparameterized random walk shows autocorrelation that decays quite rapidly with respect to lag time.

Problem 3

Consider an i.i.d. sample $X_1,\ldots,X_n$ from $N(\mu,\sigma^2)$. Consider the Bayesian analysis to estimate $\mu$ and $\tau = (\sigma^2)^{-1}$. We put prior $\mu \sim N(m,p^{-1})$ and $\tau \sim \operatorname{GAM}(a,b)$.

Part (a)

Write out the posterior distribution of $(\mu,\tau)|\vec{x}$. You may ignore the normalizing constant.

Note that the posterior distribution is given by \[ \pi(\mu,\tau|\vec{x}) = f(\vec{x}|\tau,\mu)f(\tau)f(\mu) / Z \] where $Z = f(\vec{x})$ is the normalizing constant. We may rewrite this as \[ \pi(\mu,\tau|\vec{x}) \propto L(\tau,\mu|\vec{x})f(\tau)f(\mu). \]

The likelihood function of $(\mu,\tau)$ conditioned on the data $\vec{x}$ is given by the product of the normal density on the sample, \[ L(\mu,\tau|\vec{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left\{-\frac{1}{2 \sigma^2}(x_i-\mu)^2\right\} \] which may be simplified to \[ L(\mu,\tau|\vec{x}) = \left(2 \pi \sigma^2\right)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}. \] Observing $\tau = 1/\sigma^2$, we obtain the result \[ L(\mu,\tau|\vec{x}) \propto \tau^{n/2} \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}. \] The prior for $\mu$ is $f(\mu) \propto \exp\left(-\frac{p}{2}(\mu - m)^2\right)$ and the prior for $\tau$ is $f(\tau) \propto \tau^{a-1} \exp(-b \tau)$. Putting it all together, we obtain the result \[ \pi(\mu,\tau|\vec{x}) \propto \tau^{n/2+a-1} \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2-\frac{p}{2}(\mu - m)^2 - b \tau\right\}. \]

Part (b)

Show the posterior conditional distribution of $\mu|(\tau,\vec{x})$ is \[ N\left(\frac{n \tau \bar{x} + p m}{ n \tau + p}, \frac{1}{n \tau + p}\right) \] and the posterior conditional distribution of $\tau|(\mu,\vec{x})$ is \[ \operatorname{GAM}\!\left(a+n/2,b+n/2[s^2+(\mu -\bar{x})^2]\right). \]

Distribution of $\mu|(\tau,\vec{x})$

The conditional distribution of $\mu|(\tau,\vec{x})$ is given by \[ \pi(\mu | \tau,\vec{x}) = \frac{\pi(\mu,\tau|\vec{x})}{\pi(\tau|\vec{x})} \] where \[ \pi(\tau|\vec{x}) = \int_{-\infty}^{\infty} \pi(\mu,\tau|\vec{x}) d\mu. \] Note that $\pi(\tau|\vec{x})$ is not a function of $\mu$, and thus \[ \pi(\mu | \tau,\vec{x}) \propto \pi(\mu,\tau|\vec{x}). \] We discard any factor that is not a function of $\mu$, \[ \pi(\mu | \tau,\vec{x}) \propto \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right\} \exp\left(-\frac{p}{2}(\mu - m)^2\right). \]

Now, we wish to show that this is the kernel of the given normal distribution. We do this by discarding any factors that are not a function of $\mu$ and rewriting the result so that it fits the pattern of a normal kernel \[ \pi(\mu | \tau,\vec{x}) \propto \exp\left\{-\frac{1}{2 k_1}\left(\mu - k_2\right)^2\right\} \] with mean $k_2$ and variance $k_1$.

\[\begin{align} \pi(\mu | \tau,\vec{x}) &\propto \exp\left\{-\frac{\tau}{2}\left[\sum x_i^2 - 2 \mu \sum x_i + n \mu^2\right] - \frac{p}{2}\left[\mu^2 - 2\mu m + m^2\right]\right\}\\ &\propto \exp\left\{-\frac{1}{2}\left[- 2 n \tau \bar{x}\mu + n \tau \mu^2 + p \mu^2 - 2 p m \mu\right]\right\}\\ &\propto \exp\left\{-\frac{1}{2}\left[(n \tau + p)\mu^2 - (2 p m + 2 n \tau \bar{x})\mu\right]\right\}\\ &\propto \exp\left\{-\frac{n \tau + p}{2}\left[\mu^2 - \frac{2(p m + n \tau \bar{x})}{n \tau + p}\mu\right]\right\}. \end{align}\]

Completing the square and then dropping the factor that does not depend on $\mu$, we obtain \[\begin{align} \pi(\mu | \tau,\vec{x}) &\propto \exp\left\{-\frac{n \tau + p}{2}\left(\left[\mu - \frac{p m + n \tau \bar{x}}{n \tau + p}\right]^2 - \left[\frac{p m + n \tau \bar{x}}{n \tau + p}\right]^2\right)\right\}\\ &\propto \exp\left\{-\frac{n \tau + p}{2}\left[\mu - \frac{p m + n \tau \bar{x}}{n \tau + p}\right]^2\right\}. \end{align}\]

Thus, we see that this is the kernel of a normal density with mean \[ \frac{p m + n \tau \bar{x}}{n \tau + p} \] and variance \[ \frac{1}{n\tau + p}. \]

Distribution of $\tau|(\mu,\vec{x})$

The conditional distribution of $\tau|(\mu,\vec{x})$ is given by \[ \pi(\tau | \mu,\vec{x}) = \frac{\pi(\mu,\tau|\vec{x})}{\pi(\mu|\vec{x})} \] where, since $\tau > 0$, \[ \pi(\mu|\vec{x}) = \int_{0}^{\infty} \pi(\mu,\tau|\vec{x}) d\tau. \] Note that $\pi(\mu|\vec{x})$ is not a function of $\tau$, and thus \[ \pi(\tau | \mu,\vec{x}) \propto \pi(\mu,\tau|\vec{x}). \] We discard any factor that is not a function of $\tau$, \[ \pi(\tau | \mu,\vec{x}) \propto \tau^{n/2+a-1} \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2 - b \tau \right\}. \]

We seek to match it to the kernel of the gamma density for $\operatorname{GAM}(\alpha,\beta)$, which is given by \[ \tau^{\alpha-1}\exp(-\beta \tau). \] So, we rewrite the above as \[\begin{align} \pi(\tau | \mu,\vec{x}) &\propto \tau^{n/2+a-1} \exp\left\{-\frac{\tau}{2}\sum(x_i-\mu)^2 - b \tau \right\}\\ &\propto \tau^{n/2+a-1} \exp\left\{-\left(\frac{1}{2}\sum(x_i-\mu)^2 + b\right) \tau \right\}. \end{align}\] Thus, we see that $\alpha = n/2+a$ and $\beta = b + \frac{1}{2}\sum(x_i-\mu)^2$. The $\beta$ is not in the form requested, so we continue, focusing strictly on $\beta$.

We may rewrite $\sum(x_i-\mu)^2$ as \[\begin{align} \sum(x_i-\mu)^2 &= \sum(x_i- \bar{x} + \bar{x} - \mu)^2\\ &= \sum(x_i- \bar{x})^2 + \sum(\bar{x} - \mu)^2 + 2 \sum(x_i - \bar{x})(\bar{x} - \mu)\\ &= \sum(x_i- \bar{x})^2 + n(\bar{x} - \mu)^2 + 2 \left(\sum x_i - n \bar{x}\right)(\bar{x} - \mu). \end{align}\] Since $\sum x_i - n \bar{x} = 0$, we may drop the last term, \[ \sum(x_i-\mu)^2 = \sum(x_i- \bar{x})^2 + n(\bar{x} - \mu)^2. \]

We note that $s^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ and thus \[ \sum(x_i-\mu)^2 = n(s^2 + (\mu-\bar{x})^2), \] which shows that $\beta = b + \frac{n}{2}(s^2 + (\mu-\bar{x})^2)$ and $\alpha = n/2+a$ and is thus the kernel of \[ \operatorname{GAM}\!\left(n/2+a,b + n(s^2 + (\mu-\bar{x})^2)/2\right). \]
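The decomposition can be checked numerically. This is a small sketch on a hypothetical sample with an arbitrary value of $\mu$; it also shows that the identity relies on $s^2$ having divisor $n$, whereas R's `var` uses divisor $n-1$.

```r
set.seed(3)  # arbitrary seed
x <- rnorm(200, mean = 5, sd = 2)   # hypothetical sample
mu <- 4.2                           # arbitrary value of mu
n <- length(x)

s2 <- mean((x - mean(x))^2)         # divisor n, as in the derivation
lhs <- sum((x - mu)^2)
rhs <- n * (s2 + (mu - mean(x))^2)
# Using var(x) (divisor n - 1) in place of s2 overshoots slightly.
rhs.var <- n * (var(x) + (mu - mean(x))^2)
c(lhs, rhs, rhs.var)
```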

Part (c)

First, generate some “observed” sample data using $x = \texttt{rnorm}(200, \mu = 5, \sigma = 2)$. Hand-code a Gibbs sampler algorithm to sample $(\mu,\tau)$ from the posterior using $x$. You may take prior parameters $a = 0.0001$, $b = 0.0001$, $p = 0.0001$, $m = 0$. Use the estimated posterior mean and compare your estimates with the true parameters $\mu = 5$ and $\tau = 0.25$.

We generate the sample with:

x <- rnorm(200, mean = 5, sd = 2)

We implement the Gibbs sampling with the function:

mu.tau.gibbs <- function(n, x, burn = 1000, theta0 = NULL, p = 1e-04, m = 0, a = 1e-04,
                         b = 1e-04) {
  x.mu <- mean(x)
  # s^2 with divisor n, matching the derivation in part (b); var(x) divides by n - 1
  x.s2 <- mean((x - x.mu)^2)
  x.n <- length(x)
  # draw mu from its conditional distribution given tau
  rmu <- function(tau) {
    mean <- (x.n * tau * x.mu + p * m)/(x.n * tau + p)
    var <- 1/(x.n * tau + p)
    rnorm(1, mean = mean, sd = sqrt(var))
  }
  # draw tau from its conditional distribution given mu
  rtau <- function(mu) {
    rgamma(1, shape = a + x.n/2, rate = b + x.n * (x.s2 + (mu - x.mu)^2)/2)
  }
  # initial state drawn from the priors; the prior variance of mu is 1/p
  prior <- function() {
    c(rnorm(1, m, sqrt(1/p)), rgamma(1, a, b))
  }
  N <- n + burn
  thetas <- matrix(nrow = N, ncol = 2)
  if (is.null(theta0)) {
    thetas[1, ] <- prior()
  } else {
    thetas[1, ] <- theta0
  }
  for (i in 1:(N - 1)) {
    tau.new <- rtau(thetas[i, 1])
    mu.new <- rmu(tau.new)
    thetas[i + 1, ] <- c(mu.new, tau.new)
  }
  # discard the burn-in and summarize with posterior means
  thetas <- thetas[(burn + 1):N, ]
  mu.est <- mean(thetas[, 1])
  tau.est <- mean(thetas[, 2])
  sigma.est <- sqrt(1/tau.est)
  list(theta.dist = thetas, mu.est = mu.est, tau.est = tau.est, sigma.est = sigma.est)
}

We use the Gibbs sampler to estimate $(\mu,\tau)$ with:

# set up hyper-parameters
a <- 1e-04
b <- 1e-04
m <- 0
p <- 1e-04
res <- mu.tau.gibbs(1e+05, x, burn = 10000, p = p, a = a, m = m, b = b)
mu.est <- round(as.numeric(res$mu.est), digits = 4)
tau.est <- round(as.numeric(res$tau.est), digits = 4)
var.est <- round(1/tau.est, digits = 4)
c(mu.est, tau.est)

## [1] 4.8905 0.2633

We estimate $(\mu,\tau)$ to be $(4.8905, 0.2633)'$. Therefore, we estimate that the data is being sampled from the normal distribution \[\begin{equation} X_i \sim N(\mu=4.8905,\sigma^2=3.7979). \end{equation}\]

Additional analysis

Out of curiosity, we decided to plot the marginals of the sample superimposed with their respective conditional densities with:

x.mu <- mean(x)
x.n <- length(x)
x.s2 <- mean((x - x.mu)^2)  # divisor n, matching the derivation in part (b)
mu.dist <- res$theta.dist[, 1]
tau.dist <- res$theta.dist[, 2]
hist(mu.dist, freq = F, breaks = 50, main = "marginal of mu", xlab = "mu")
mu.grid <- seq(4, 6, by = 0.01)
lines(mu.grid, dnorm(mu.grid, mean = (x.n * tau.est * x.mu + p * m)/(x.n * tau.est + p),
  sd = sqrt(1/(x.n * tau.est + p))))

hist(tau.dist, freq = F, breaks = 50, main = "marginal of tau", xlab = "tau")
tau.grid <- seq(0.15, 0.4, by = 0.01)
lines(tau.grid, dgamma(tau.grid, shape = a + x.n/2,
  rate = b + x.n * (x.s2 + (mu.est - x.mu)^2)/2))

SIUe - STAT 581 - Problem Set 9

Sat, 30 Oct 2021 00:00:00 +0000

Problem 1
Problem 2
- Part (a)
- Part (b)
- Part (c)
  - Interpretation
- Part (d)
- Part (e)
- Part (f)
Appendix: Code

Problem 1

The surface finish of metal parts made on $a=4$ machines is under investigation. Each machine can be run by one of $b=3$ operators. Because of the location of the machines, operators are specific to a particular machine. Therefore, a nested design with fixed factors is used. Each operator produces $n=2$ samples. The data is available on Blackboard as an Excel file.

Part (a)

Explain the difference between crossed factors and nested factors.

Factors $A$ and $B$ are crossed in an experimental design if the levels of $B$ are the same at each level of $A$.

Factor $B$ is nested within factor $A$ if the levels of $B$ are different for each of the levels of factor $A$.

Part (b)

Write the model for a nested design. Provide algebraic formulas for the estimates $\hat{\tau}_i$ and $\hat{\beta}_{j(i)}$.

Given that $A$ and $B$ are fixed factors, the model is \[ Y_{i j k} = \mu + \tau_i + \beta_{j(i)} + \epsilon_{i j k} \quad \begin{cases} i = 1,\ldots,a\\ j = 1,\ldots,b\\ k = 1,\ldots,n \end{cases} \] where $\{\tau_i\}$ ($\sum_i \tau_i = 0$) with $\operatorname{df_{A}}=a-1$ and $\{\beta_{j(i)}\}$ ($\sum_j \beta_{j(i)} = 0$ for $i=1,\ldots,a$) with $\operatorname{df_{B(A)}} = a(b-1)$ are fixed effects.

Estimators are given by \[ \hat{\tau}_i = \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot} \] and \[ \hat{\beta}_{j(i)} = \bar{Y}_{i j \cdot} - \bar{Y}_{i \cdot\cdot}. \]

We see that the levels of factor $A$ are compared using $\{\hat{\tau}_i\}$, while the levels of factor $B$ are compared only within the same level of $A$, e.g., at the $i$-th level of $A$, the levels of $B$ are compared using $\{\hat{\beta}_{j(i)}\}_{j=1}^{b}$.
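The estimators can be computed directly from cell and row means. The sketch below uses synthetic data laid out like the problem ($a=4$, $b=3$, $n=2$), not the course data; the variable names and the simulated response are assumptions for illustration.

```r
set.seed(11)  # arbitrary seed
a <- 4; b <- 3; n <- 2                      # layout from the problem statement
A <- factor(rep(1:a, each = b * n))         # machine
B <- factor(rep(rep(1:b, each = n), a))     # operator label within machine
y <- rnorm(a * b * n, mean = 50, sd = 10)   # synthetic response

grand <- mean(y)                            # Ybar_...
mean.A <- as.vector(tapply(y, A, mean))     # Ybar_i..
mean.AB <- tapply(y, list(A, B), mean)      # Ybar_ij. as an a x b matrix

tau.hat <- mean.A - grand
beta.hat <- sweep(mean.AB, 1, mean.A, "-")  # row i holds beta.hat_j(i), j = 1..b
tau.hat
beta.hat
```

In a balanced design these estimates satisfy the sum-to-zero restrictions automatically: `tau.hat` sums to zero, and each row of `beta.hat` sums to zero.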

Part (c)

Compute the $F_A$ statistic for testing factor $A$ effects, and the $F_{B(A)}$ statistic for testing nested factor $B$ effects. Compute the $p$-values. Provide an overall interpretation, stated in the context of the problem.

For fixed factor $A$ and fixed factor $B$ nested in $A$, \[ F_A = \operatorname{MS_{A}} / \operatorname{MS_{E}} \] and \[ F_{B(A)} = \operatorname{MS_{B(A)}} / \operatorname{MS_{E}} \] where \[\begin{align*} \operatorname{MS_{A}} &= \frac{b n \sum_{i=1}^{a} \hat{\tau}_i^2 }{a-1},\\ \operatorname{MS_{B(A)}} &= \frac{n \sum_{i=1}^{a} \sum_{j=1}^{b} \hat{\beta}_{j(i)}^2}{a(b-1)},\\ \operatorname{MS_{E}} &= \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{i j k} - \bar{Y}_{i j \cdot})^2}{a b (n-1)}. \end{align*}\]

Under the null model $\tau_1 = \cdots = \tau_a = 0$, \[ F_A \sim F(\operatorname{df_{A}},\operatorname{df_{E}}) = F(a-1,a b (n-1)) \] and under $\beta_{j(i)} = 0$ for all $i,j$, \[ F_{B(A)} \sim F(\operatorname{df_{B(A)}},\operatorname{df_{E}}) = F(a(b-1),a b (n-1)). \]
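As a sanity check on the formulas, the mean squares can be computed by hand and compared against a fitted model. This is a sketch on synthetic balanced data (not the course data); all variable names are illustrative assumptions.

```r
set.seed(12)  # arbitrary seed
a <- 4; b <- 3; n <- 2
A <- factor(rep(1:a, each = b * n))
B <- factor(rep(rep(1:b, each = n), a))
y <- rnorm(a * b * n, mean = 50, sd = 10)   # synthetic response

mean.A <- as.vector(tapply(y, A, mean))
mean.AB <- tapply(y, list(A, B), mean)
tau.hat <- mean.A - mean(y)
beta.hat <- sweep(mean.AB, 1, mean.A, "-")
# Cell mean Ybar_ij. matched to each observation.
cell.mean <- mean.AB[cbind(as.integer(A), as.integer(B))]

ms.A <- b * n * sum(tau.hat^2) / (a - 1)
ms.BA <- n * sum(beta.hat^2) / (a * (b - 1))
ms.E <- sum((y - cell.mean)^2) / (a * b * (n - 1))
F.A <- ms.A / ms.E
F.BA <- ms.BA / ms.E
p.A <- pf(F.A, a - 1, a * b * (n - 1), lower.tail = FALSE)
p.BA <- pf(F.BA, a * (b - 1), a * b * (n - 1), lower.tail = FALSE)

av <- anova(lm(y ~ A/B))   # reference fit with B nested in A
cbind(by.hand = c(ms.A, ms.BA, ms.E), anova = av$`Mean Sq`)
```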

We perform these computations in R with:

library("readxl")
data = read_excel("handout9data.xlsx")
A = as.factor(na.omit(data$mchine))
B = as.factor(na.omit(data$operator))
y = na.omit(data$surface)
# use contrasts to define parameter restrictions for the fixed effects in the
# model
contrasts(A)=contr.sum
contrasts(B)=contr.sum
# to fit a model with a nested fixed effect, we use the / notation within the
# aov command.
nested.mod = aov(y ~ A/B)
summary(nested.mod) # compute F statistics for factor effects

## Df Sum Sq Mean Sq F value Pr(>F)
## A 3 3618 1205.9 14.271 0.000291 ***
## A:B 8 2818 352.2 4.168 0.013408 *
## Residuals 12 1014 84.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that $F_A = 14.271$ ($p$-value $< .001$) and $F_{B(A)} = 4.168$ ($p$-value $= .013$).

Interpretation

The experiment finds evidence that the machine has an effect on surface finish. It also finds evidence that the operators within a machine have an effect on surface finish.

Part (d)

Compute estimates for each of the effect parameters. Identify which machine performs best, and which operator performs best on each machine. (Higher scores of response are preferred.)

Comparing machines

# compute parameter estimates
estimates = dummy.coef(nested.mod)
estimates$A

## 1 2 3 4
## 2.833333 17.333333 -3.333333 -16.833333

library("multcomp")
# The following commands are used to perform pairwise comparisons between machines.
# The glht command should be familiar from previous sections on multiple comparisons.
compare.A = glht(nested.mod,linfct = mcp(A="Tukey"))
c.m = summary(compare.A,test=adjusted("none"))
c.m

##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = y ~ A/B)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 2 - 1 == 0 14.500 5.307 2.732 0.01819 *
## 3 - 1 == 0 -6.167 5.307 -1.162 0.26785
## 4 - 1 == 0 -19.667 5.307 -3.706 0.00300 **
## 3 - 2 == 0 -20.667 5.307 -3.894 0.00213 **
## 4 - 2 == 0 -34.167 5.307 -6.438 3.22e-05 ***
## 4 - 3 == 0 -13.500 5.307 -2.544 0.02576 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)

cld(c.m)

## 1 2 3 4
## "a" "b" "a" "c"

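In the compact letter display, machines that share a letter are not significantly different: machines 1 and 3 share the letter $a$, machine 2 has the letter $b$, and machine 4 has the letter $c$. Combined with the estimated effects, machine 4 is lower than machines 1 and 3, which in turn are lower than machine 2.

So, we have machine 4 as the worst, machines 1 and 3 as indistinguishable from each other and better than machine 4, and machine 2 as the best, i.e., machine 4 < machines 1 and 3 < machine 2.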

Comparing machine $\times$ operator

library("lsmeans")
# comparisons only involving operators (B) within machines (A)
x.mod = aov(y~A*B)
lsmeans(x.mod, pairwise ~ A:B, adjust="none")

Okay, there was a lot there (so I suppressed the output), but we’re only looking at comparisons of operators within each machine. Each comparison is labeled by machine and operator, e.g., "1 1 - 1 2" is operator 1 minus operator 2 on machine 1:

comparison	estimate	$p$-value
1 1 - 1 2	-13.5	0.1677
1 1 - 1 3	19.0	0.0610
1 2 - 1 3	32.5	0.0041
2 1 - 2 2	13.5	0.1677
2 1 - 2 3	23.5	0.0252
2 2 - 2 3	10.0	0.2980
3 1 - 3 2	27.0	0.0124
3 1 - 3 3	30.0	0.0068
3 2 - 3 3	3.0	0.7500
4 1 - 4 2	-3.5	0.7100
4 1 - 4 3	-10.0	0.2980
4 2 - 4 3	-6.5	0.4930

The best operator in machine $1$ is operator $2$.

The best operator in machine $2$ is operator $1$. (There is some evidence that operators $1$ and $2$ may have the same effect, though).

The best operator in machine $3$ is operator $1$.

The best operator in machine $4$ is operator $3$. This last one is the most problematic, since the data is compatible with all of them having the same effect, but if we had to choose one, it would be operator $3$.

Part (e)

Explain why it is not possible to directly compare operators across machines in the above design.

We cannot directly compare operators (factor $B$ levels) across machines (factor $A$ levels) since we only have data for operator performance with respect to a particular machine ($A$ level).

However, we can compare operators within the same machine, or compare operator $\times$ machine combinations.

Problem 2

A nested design is used to study the burning rate of propellant from three production processes (fixed effect, factor $A$). Four batches of propellant are randomly selected from each of the processes (random effect, factor $B$), and $n=3$ determinations of burning rate are made on each batch. The data is available on Blackboard as an Excel file.

Part (a)

Provide the algebraic formulas for $\operatorname{MS_{A}}$, $\operatorname{MS_{B(A)}}$, and $\operatorname{MS_{E}}$.

\[\begin{align*} \operatorname{MS_{A}} &= \frac{b n \sum_{i=1}^{a} \hat{\tau}_i^2 }{a-1},\\ \operatorname{MS_{B(A)}} &= \frac{n \sum_{i=1}^{a} \sum_{j=1}^{b} \hat{\beta}_{j(i)}^2}{a(b-1)},\\ \operatorname{MS_{E}} &= \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{i j k} - \bar{Y}_{i j \cdot})^2}{a b (n-1)}. \end{align*}\]

Part (b)

State the expected value for each of the mean squares.

\[\begin{align*} E(\operatorname{MS_{A}}) &= \sigma^2 + n \sigma_{\beta}^2 + \frac{b n}{a-1} \sum_{i=1}^{a} \tau_i^2,\\ E(\operatorname{MS_{B(A)}}) &= \sigma^2 + n \sigma_{\beta}^2,\\ E(\operatorname{MS_{E}}) &= \sigma^2. \end{align*}\]

Part (c)

Test for differences between the production processes. Write the $F_A$ statistic as a ratio of mean squares. Compute $F_A$ and the $p$-value. Provide an interpretation, stated in the context of the problem.

Under the null model $H_0 : \tau_1 = \cdots = \tau_a = 0$, $\operatorname{MS_{A}}$ and $\operatorname{MS_{B(A)}}$ have the same expected value, and thus an appropriate test statistic for testing $H_0$ is \[ F_A = \frac{\operatorname{MS_{A}}}{\operatorname{MS_{B(A)}}}. \]

We load the data in R with:

A = as.factor(na.omit(data$proc))
B = as.factor(na.omit(data$batc))
y = na.omit(data$burn_rate)

We compute $F_A$ in R with:

nested.test(A,B,y)

## SS df MS
## Fixed Effect A 676.0556 2 338.02778
## Random Effect B(A) 2077.5833 9 230.84259
## Error 454.0000 24 18.91667
## F-test for fixed effect p-value
## 1.464322 0.2814697
## error.var B.var
## 18.91667 70.64198

We see that $F_A = 1.464$ ($p$-value $= .281$).

Interpretation

The experiment does not find evidence that the production process has an effect on the burning rate of propellant.

Part (d)

Explain why $\operatorname{MS_{E}}$ is the incorrect error term to use when the nested factor is random. In particular, comment on the pertinent sample size.

I touched on this in part (c) ($\operatorname{MS_{A}}$ and $\operatorname{MS_{B(A)}}$ have the same expected value under the null model $\tau_1 = \cdots = \tau_a = 0$).

In addition, we may also think of the batch as the experimental unit and thus the appropriate error term is then a measure of batch variance.

Finally, taking repeat measurements on each random factor level does not increase the pertinent sample size.

Part (e)

Illustrate how evidence in favor of a process effect would be overstated if $\operatorname{MS_{E}}$ is used when computing the test statistic.

Under the null model, $H_0 : \tau_1 = \cdots = \tau_a = 0$, $E(\operatorname{MS_{A}}) = \sigma^2 + n \sigma_{\beta}^2$, and $E(\operatorname{MS_{E}}) = \sigma^2$. The reference distribution $F(\operatorname{df_{A}},\operatorname{df_{E}})$ assumes the numerator and denominator have the same expected value under $H_0$, thus the test statistic \[ \frac{\operatorname{MS_{A}}}{\operatorname{MS_{E}}} \] will be inflated and consequently overstates the effect of the process (factor $A$).

More generally, since $n \sigma_{\beta}^2 > 0$, even if $H_0$ is not the case, the significance of the evidence in favor of a process effect will necessarily be overstated.
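The overstatement can be made concrete with the mean squares reported by nested.test in part (c). The sketch below only redoes the arithmetic from those printed values; the variable names are illustrative.

```r
# Mean squares and degrees of freedom copied from the nested.test output in part (c).
ms.A <- 338.02778; df.A <- 2
ms.B <- 230.84259; df.B <- 9
ms.E <- 18.91667;  df.E <- 24

# Appropriate test: batch-to-batch variation as the error term.
F.correct <- ms.A / ms.B
p.correct <- pf(F.correct, df.A, df.B, lower.tail = FALSE)
# Inappropriate test: measurement error as the error term.
F.wrong <- ms.A / ms.E
p.wrong <- pf(F.wrong, df.A, df.E, lower.tail = FALSE)
c(F.correct = F.correct, p.correct = p.correct, F.wrong = F.wrong, p.wrong = p.wrong)
```

Dividing by $\operatorname{MS_{E}}$ inflates the statistic by more than a factor of ten and would lead to rejecting $H_0$ at any conventional level, even though the appropriate test gives a $p$-value of about $.281$.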

Part (f)

Compute estimates of the batch variance and the measurement variance.

An estimator for the measurement variance (within batches) is $\hat{\sigma}^2 = \operatorname{MS_{E}}$ and an estimator for the batch variance (between batches) is given by $\hat{\sigma}_{\beta}^2 = \frac{\operatorname{MS_{B(A)}} - \operatorname{MS_{E}}}{n}$.

In part (c), nested.test showed the outputs \[ \hat{\sigma}^2 = 18.917 \] and \[ \hat{\sigma}_{\beta}^2 = 70.642. \]
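For reference, the same estimates follow from the printed mean squares by direct arithmetic. This short sketch uses the values from the part (c) output and $n = 3$ determinations per batch.

```r
# Values copied from the nested.test output in part (c).
ms.B <- 230.84259   # MS_B(A)
ms.E <- 18.91667    # MS_E
n <- 3              # determinations per batch

sigma2.hat <- ms.E                     # measurement (within-batch) variance
sigma2.beta.hat <- (ms.B - ms.E) / n   # batch (between-batch) variance
c(sigma2.hat, sigma2.beta.hat)
```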

Appendix: Code

# Suppose we want R to give us the results we developed in the notes.
# Below is a user defined function to perform those computations.
nested.test = function(A,B,y)
{
  av=anova(lm(y~A/B))
  ss.A = av$`Sum Sq`[1]
  ss.B = av$`Sum Sq`[2]
  ss.error = av$`Sum Sq`[3]
  df.A = av$Df[1]
  df.B = av$Df[2]
  df.error = av$Df[3]
  ms.A = ss.A / df.A
  ms.B = ss.B / df.B
  ms.error = ss.error / df.error
  F.a = ms.A / ms.B
  p.value = pf(F.a,df1=df.A,df2=df.B,lower.tail = FALSE)
  table1 = matrix(c(ss.A,ss.B,ss.error,
                    df.A,df.B,df.error,
                    ms.A,ms.B,ms.error),nrow = 3)
  dimnames(table1) = list(c("Fixed Effect A","Random Effect B(A)","Error"),
                          c("SS","df","MS"))
  print(table1)
  table2 = matrix(c(F.a,p.value),nrow = 1)
  dimnames(table2) = list(c(""),c("F-test for fixed effect","p-value"))
  print(table2)
  a=nlevels(A)
  b=nlevels(B)
  n=length(y) / a / b
  var.hat = ms.error
  var.B.hat = (ms.B - ms.error) / n
  table3 = matrix(c(var.hat,var.B.hat),nrow=1)
  dimnames(table3) = list(c(""),c("error.var","B.var"))
  print(table3)
}

Estimating How Confidential Encrypted Searches Are Using Moving Average Bootstrap Method

Sun, 07 Apr 2019 00:00:00 +0000

Encrypted Search - Enabling Standard Information Retrieval Techniques for Several New Secure Index Types While Preserving Confidentiality Against an Adversary With Access to Query Histories and Secure Index Contents

Fri, 01 May 2015 00:00:00 +0000

Homomorphic computational extensions

Introduction

We consider homomorphisms motivated by computational concerns: they transform inefficient or lossy computations over some original domain $T$ into equivalent computations over a conceptually equivalent group $T^*$ that supports a restricted set of operations.

If the original problem can be solved using these restricted operations, then we may transform $T$ into $T^*$ and efficiently perform the computations. Sometimes, the entire solution cannot be transformed back to $T$, but the restricted set of functions or operations may still be sufficient, e.g., evaluating $a + c < b + c$ even though $a+c$ or $b+c$ may not be in the domain of $T$.

Motivation

Suppose you have a value type $T$ (e.g., $\color{tan}\texttt{double}$ or something more fancy), but when you apply some operations to it, undesirable behavior is produced, e.g., $\color{tan}\texttt{double}$ overflows due to multiplications or loses too much precision due to round-off error.

In the case of a numeric type like $\color{tan}\texttt{double}$, one option is to use a big number library. However, this approach has some disadvantages:

You may not want to have a dependency on some external library.
The big number implementation may be too inefficient.

Assume the computational basis of $T$ is $F(T) = \{*,/,+,-,<,==\}$, which can be used to relatively efficiently implement other operations like $\sin$.

Suppose we map $T$ to a modified type $\hat{T}$ such that the computational basis $F(\hat{T})$ is a proper subset of $F(T)$, say $F(\hat{T}) = \{*,/,<,==\}$. In exchange for this restriction, the undesired behavior may be avoidable, say multiplications and divisions almost never overflow or underflow. If we do not require the operations dropped from the basis (here, $+$ and $-$), this is a good exchange.

Alternatively, if the computational basis $F(T)$ is needed, it may or may not be possible to map $\hat{T}$ back to $T$ without any loss of information.

Example

Suppose we try to compute $a!/b!$ (ratio of factorials). The end result may be a value in the domain of $T$, but intermediate values (e.g., $a!$) may only be in the domain of $\hat{T}$. In this case, we can map the final result back to $T$ without any loss.

The logarithm is a natural candidate for this mapping: it maps $\color{tan}\texttt{double}$ to $\color{tan}\widehat{\texttt{double}}$ with the restricted basis

$$ \{*,/,<,==,=\} $$

such that any of these operations almost never fail and can be performed very efficiently, both in terms of time and space complexity.
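The factorial-ratio example can be tried directly. The sketch below is in R, to match the other examples on this site rather than the C++ that follows, and uses base R's `lfactorial`; the values of `a` and `b` are arbitrary assumptions.

```r
a <- 200
b <- 198

# In the original domain, both factorials overflow to Inf, so the ratio is lost.
naive <- suppressWarnings(factorial(a) / factorial(b))
# In the log domain, division becomes subtraction, and the final result
# maps back to an ordinary double without trouble.
via.log <- exp(lfactorial(a) - lfactorial(b))
c(naive, via.log, a * (a - 1))
```

The intermediate values $a!$ and $b!$ exist only in the log domain, but the final ratio $a(a-1)$ is comfortably representable after mapping back.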

Code

#include <cassert>
#include <cmath>

template <typename T = double>
struct lg
{
    T k; // stores log(x)
    lg(lg const &) = default;
    // default constructs the multiplicative identity
    lg() : k(T(0)) {}
    lg(T x) : k(std::log(x)) { assert(0 < x); }
    // helper added here (hypothetical name): wraps a value that is already a
    // logarithm, so that it is not passed through lg(T) and logged again
    static lg from_log(T k)
    {
        lg r;
        r.k = k;
        return r;
    }
    // operator to convert (back) to type T.
    operator T() const { return std::exp(k); }
};
template <typename T>
auto operator*(lg<T> x, lg<T> y) { return lg<T>::from_log(x.k + y.k); }
template <typename T>
auto operator/(lg<T> x, lg<T> y) { return lg<T>::from_log(x.k - y.k); }
template <typename T>
auto operator<(lg<T> x, lg<T> y) { return x.k < y.k; }
template <typename T>
auto operator==(lg<T> x, lg<T> y) { return x.k == y.k; }

| | observe `id` | observe `not` | observe `true` | observe `false` |
|---|---|---|---|---|
| latent `id` | \(p_{1 1}\) | \(p_{1 2}\) | \(p_{1 3}\) | \(p_{1 4}\) |
| latent `not` | \(p_{2 1}\) | \(p_{2 2}\) | \(p_{2 3}\) | \(p_{2 4}\) |
| latent `true` | \(p_{3 1}\) | \(p_{3 2}\) | \(p_{3 3}\) | \(p_{3 4}\) |
| latent `false` | \(p_{4 1}\) | \(p_{4 2}\) | \(p_{4 3}\) | \(p_{4 4}\) |

| | observe `true` | observe `false` |
|---|---|---|
| latent `true` | TPR \(p\) | FNR \(1-p\) |
| latent `false` | FPR \(1-p\) | TNR \(p\) |

metafunctor.com

Gave a SLUUG talk "Demystifying Large Language Models (LLMs) on Linux: From Theory to Application"

Approximations of Solomonoff Induction

Fine-Tuning Tiny LLMs for ElasticSearch DSL

Master's project: Reliability Estimation in Series Systems

Various LLM Research Projects

Entropy Maps

Rate distortion: Bernoulli maps

Bernoulli set-indicator function

Boolean Bernoulli as constant function

Conditional distribution of latent function given observed function

Algorithms

Two-level hash function evaluation

Oblivious entropy maps

Reliability Estimation in Series Systems: Maximum Likelihood Techniques for Right-Censored and Masked Failure Data

The Bernoulli Model: A Closer Look at the Boolean Bernoulli Model

Motivation

Bernoulli Boolean

Binary Channels

False Positives and Negatives

Prediction

Inducing Bernoulli types

Noisy Turing machines: noisy logic gates

Case 1: The Correct Output Is True

Case 2: The Correct Output Is False Given x1 = true and x2 = false

Case 3: The Correct Output Is False Given x1 = false and x2 = true

Case 4: The Correct Output Is False Given x1 = false and x2 = false

Summary

Uses and limits of abstractions

Limited working memories

Incomplete information

Emergent behavior

Abstractions as cognitive scaffolds

Conclusion

Working memory as an inductive bias

Working Memory and Inductive Bias

The Limits of Our Understanding

The Unconscious Mind and LLMs

Bernoulli model

likelihood.model R package

Rate-distorted perfect hash filter

Problem set solutions

Using GPT-4 to Build a Simple HTML File Search Interface

Introduction

Setting up the Flask application

Indexing HTML files with Whoosh

Creating search and browse routes

Displaying search results and browsing files

Adding search history and popular searches

algebraic.mle R package

femtograd R package

Model Selection in Weibull Series Systems

Reliability Estimation in Series Systems

wei.series.md.c1.c2.c3

algebraic.dist R package

R Package: Dynamic failure rate (DFR) distributions

A Boolean Algebra Over Trapdoors

R package: md.tools

On moral responsibility

Introduction

Morality

Realism

Nominalism

Criteria for moral agency

Persistence

Persons

Ascribing moral responsibility

Four-dimensionalism

Causal Determinism

Concluding thoughts

Time series analysis of a confidentiality measure for Encrypted Search

Introduction

Encrypted search model

Threat model: known-plaintext attack

Confidentiality measure

Forecasting model

Data description

Time series analysis of \(\{\pi_t\}\)