Alex Towell (atowell@siue.edu)
Refer to the data from Exercise 3.15
A designed experiment is conducted to study the concentration of a solution ([\(y\)]{.math .inline}) over time ([\(x\)]{.math .inline}) in hours.
Problem 1
#setwd("/home/spinoza/filetopia/gdrive/alex/college/grad_math_classes/stat482_regression_analysis_fa2021/hw/hw5")
data = read.table('CH03PR15.txt')
names(data) = c("conc","hours")
head(data)
## conc hours
## 1 0.07 9
## 2 0.09 9
## 3 0.08 9
## 4 0.16 7
## 5 0.17 7
## 6 0.21 7
reg.mod = lm(conc ~ hours, data=data)
reg.mod
##
## Call:
## lm(formula = conc ~ hours, data = data)
##
## Coefficients:
## (Intercept) hours
## 2.575 -0.324
Our estimate of the linear regression function is given by [\[ \hat{Y} = 2.575 - 0.324 x. \]]{.math .display}
Additional analysis
A simple visual analysis shows that this first-order linear regression model is unable to capture the variability in the concentration with respect to hours.
par(mfrow=c(1,2))
plot(xlab="hour",ylab="concentration",data$hours,data$conc,col="blue",pch='*')
lines(data$hours,reg.mod$fitted.values,col="red")
legend(4,3,legend=c("data","regression"),col=c("blue","red"),pch=c('*','-'))
plot(data$hours,reg.mod$residuals,pch='*',xlab="hour",ylab="residual")
A plot of the regression function versus the data shows that it is biased. A plot of the residuals shows a clear pattern, suggesting that we could do better with a more complex model that is able to capture the systematic variability in the data.
u5.hat = predict(reg.mod, newdata=data.frame(hours=5))
u5.hat
## 1
## 0.9553333
We see that [\(\hat{\mu}_5 = \hat{\operatorname{E}}(Y|X=5) = 0.9553333\)]{.math .inline}. This estimate is not compatible with the data, and thus we would not predict that the mean concentration of the solution at [\(5\)]{.math .inline} hours is 0.9553333.
Variance
A measure of the precision of an estimator is given by its variance. A primary advantage of the linear regression estimator of the mean [\[ \hat{\mu}_5 = b_0 + 5 b_1, \]]{.math .display} compared to the saturated model estimator [\[ \bar{y}_5 = \operatorname{avg}\left(\{y_i : x_i=5\}\right) \]]{.math .display} is that [\(\hat{\mu}_5\)]{.math .inline} has smaller variance, [\(\operatorname{V}(\hat{\mu}_5) < \operatorname{V}(\bar{y}_5)\)]{.math .inline}, i.e., it is a more precise estimator.
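As a quick sketch of this claim (assuming the data frame `data` from above is still in the workspace): under a common error variance [\(\sigma^2\)]{.math .inline}, [\(\operatorname{V}(\hat{\mu}_5) = \sigma^2 (1/n + (5-\bar{x})^2/S_{xx})\)]{.math .inline} while [\(\operatorname{V}(\bar{y}_5) = \sigma^2/n_5\)]{.math .inline}, so it suffices to compare the multipliers of [\(\sigma^2\)]{.math .inline}.

```r
# Sketch, assuming `data` from above is in the workspace.
x = data$hours
n = length(x)
Sxx = sum((x - mean(x))^2)
mult.reg = 1/n + (5 - mean(x))^2 / Sxx   # sigma^2 multiplier, regression estimator
mult.sat = 1 / sum(x == 5)               # sigma^2 multiplier, saturated estimator
c(regression = mult.reg, saturated = mult.sat)
```

With the balanced design here ([\(\bar{x}=5\)]{.math .inline}), the regression multiplier is [\(1/15\)]{.math .inline} versus [\(1/3\)]{.math .inline} for the saturated estimator.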
Information
The variance of an estimator of [\(\operatorname{E}(Y|X=5)\)]{.math .inline} decreases as the information we have about [\(\operatorname{E}(Y|X=5)\)]{.math .inline} increases. One way in which the information increases is by increasing the sample size. In the case of [\(\bar{y}_5\)]{.math .inline}, we are only using the information in the subset [\(\{(x_i,y_i) : x_i = 5\}\)]{.math .inline}, but in the case of [\(\hat{\mu}_5\)]{.math .inline}, we are using the information in the entire sample [\(\{(x_i,y_i)\}\)]{.math .inline}.
Bias
A measure of the accuracy of an estimator is its bias. If the linear regression model is an appropriate model for the data generating process, it will have a small bias. Moreover, if the true data generating process is given by [\[ Y_i | x_i \overset{\text{indep}}{\sim} \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2) \]]{.math .display} then the estimator of the mean response [\[ \hat{\operatorname{E}}(Y|x) = b_0 + b_1 x \]]{.math .display} is the UMVU estimator for the mean response of the data generating process.
Additional commentary
This is not strictly related to the question posed by this problem set but is rather for my own instruction. Feel free to ignore, but any feedback is quite welcome.
This has to do with a distinction between estimating the data generating process and performing forecasts or predictions with appropriate confidence intervals.
Assuming [\[ Y_i|x_i \overset{\text{indep}}{\sim} \mathcal{N}(\beta_0+\beta_1 x_i, \sigma^2), \]]{.math .display} the UMVU estimator of the data generating process [\(Y|x\)]{.math .inline} is given by [\[ \widehat{Y|x} \sim \mathcal{N}(b_0 + b_1 x, \hat{\sigma}^2) \]]{.math .display} where [\(\hat{\sigma}^2 = \operatorname{MS_E}\)]{.math .inline} and [\((b_0,b_1)'\)]{.math .inline} is the least squares estimator of [\((\beta_0,\beta_1)'\)]{.math .inline}.
Observe that [\(\widehat{Y|x}\)]{.math .inline} has a sampling distribution since [\(b_0\)]{.math .inline}, [\(b_1\)]{.math .inline}, and [\(\hat{\sigma}^2\)]{.math .inline} are functions of the random sample [\(\{(x_i,Y_i) : i=1,\ldots,n\}\)]{.math .inline}. Thus, if we wished to, say, perform an individual forecast of [\(Y|x\)]{.math .inline}, confidence intervals must incorporate this sampling variance as described in a previous homework. However, the estimator [\(\widehat{Y|x}\)]{.math .inline} remains the most accurate and precise estimator for the conditional random variable [\(Y|x\)]{.math .inline}.
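As a small illustration (assuming `reg.mod` from above is still in the workspace), the plug-in variance estimate [\(\hat{\sigma}^2 = \operatorname{MS_E}\)]{.math .inline} used in the estimated conditional distribution can be pulled directly from the fitted model:

```r
# Sketch, assuming `reg.mod` from above is in the workspace.
sigma2.hat = sum(reg.mod$residuals^2) / reg.mod$df.residual   # MS_E = SS_E / (n - 2)
sigma2.hat
```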
The prediction problem [\(Y_{h(\text{new})}|x_h\)]{.math .inline} and the estimator of the conditional distribution [\(Y|x\)]{.math .inline} answer different questions whose distinction, at first glance, may not even be recognized. Even now, I am still not convinced my analysis is correct. It seems a bit slippery. Tread carefully!
Problem 2
In the saturated model, the mean response at level [\(x_i\)]{.math .inline} is [\(\mu_i = \operatorname{E}(Y|X=x_i)\)]{.math .inline}, but we require at least one of the inputs to have [\(n_i > 1\)]{.math .inline} replications so that the pure error can be estimated.
There are [\(c=5\)]{.math .inline} levels of the input variable, [\(x_1=1,x_2=3,x_3=5,x_4=7,x_5=9\)]{.math .inline}, where the [\(i\)]{.math .inline}-th input variable has [\(n_i=3\)]{.math .inline} replications for a total of [\(n=\sum_{i=1}^{c} n_i = 3(5) = 15\)]{.math .inline} observations.
The saturated model is given by [\[ Y_{i j} = \mu_i + \epsilon_{i j} \]]{.math .display} where [\(\epsilon_{i j} \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)\)]{.math .inline} for [\(i=1,\ldots,5\)]{.math .inline} and [\(j=1,\ldots,3\)]{.math .inline}.
We may also write the reduced model, the linear regression model, as [\[ Y_{i j} = \beta_0 + \beta_1 x_i + \epsilon_{i j}. \]]{.math .display}
The sum of squared errors with respect to the full (saturated) model is given by [\[ \operatorname{SS_{PE}}= \operatorname{SS_E}(F) = \sum_{i=1}^{c}\sum_{j=1}^{n_i} (Y_{i j} - \bar{Y}_i)^2 \]]{.math .display} which simplifies to [\[ \operatorname{SS_{PE}}= \sum_{i=1}^{c} (n_i-1) s_i^2, \]]{.math .display} where [\(\operatorname{SS_{PE}}\)]{.math .inline} is called the pure error sum of squares, which has [\(\operatorname{df_E}(F) = \sum_{i=1}^{c} (n_i-1) = n-c = 15-5 = 10\)]{.math .inline} degrees of freedom.
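As a sketch (assuming the data frame `data` from above is still in the workspace), [\(\operatorname{SS_{PE}}\)]{.math .inline} can be computed both from its definition and from the group-variance identity:

```r
# Sketch, assuming `data` from above is in the workspace.
ybar = ave(data$conc, data$hours)     # group mean Ybar_i, repeated per observation
ss.pe = sum((data$conc - ybar)^2)     # sum_ij (Y_ij - Ybar_i)^2
ss.pe.alt = sum(tapply(data$conc, data$hours,
                       function(y) (length(y) - 1) * var(y)))  # sum_i (n_i - 1) s_i^2
c(ss.pe, ss.pe.alt)                   # both equal the full-model RSS, 0.1574
```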
The sum of squared errors with respect to the reduced (linear regression) model is given by [\[ \operatorname{SS_E}(R) = \sum_{i=1}^{c}\sum_{j=1}^{n_i} (Y_{i j} - \hat{Y}_i)^2 \]]{.math .display} where [\(\hat{Y}_i = b_0 + b_1 x_i\)]{.math .inline}. This is the sum of squares [\(\operatorname{SS_E}\)]{.math .inline} as previously defined, which has [\(\operatorname{df_E}(R) = n-2 = 15-2 = 13\)]{.math .inline} degrees of freedom.
The sum of squares lack of fit is given by [\[ \operatorname{SS_{LF}}= \operatorname{SS_E}(R) - \operatorname{SS_E}(F) = \sum_{i=1}^{c} n_i(\bar{Y}_i - \hat{Y}_i)^2, \]]{.math .display} which has [\(\operatorname{df_E}(R) - \operatorname{df_E}(F) = (n-2) - (n-c) = c-2 = 3\)]{.math .inline} degrees of freedom.
Intuitively, if [\(\operatorname{SS_{LF}}\)]{.math .inline} is small, the difference between the saturated model and the reduced model is small, which suggests that the reduced model is a good fit. Since the reduced model, the linear regression model, uses more of the information in the sample, we prefer it over the saturated model for the previously stated reasons.
We formalize this intuition with the hypothesis test [\[\begin{align*} H_0 &: \operatorname{E}(Y|x) = \beta_0 + \beta_1 x\\ H_A &: \operatorname{E}(Y|x) \neq \beta_0 + \beta_1 x \end{align*}\]]{.math .display} and the general linear test statistic [\[\begin{align*} F_{\text{LF}}^* &= \frac{\operatorname{SS_{LF}}/ (\operatorname{df_E}(R)-\operatorname{df_E}(F))}{\operatorname{SS_{PE}}/\operatorname{df_E}(F)}\\ &= \frac{\operatorname{SS_{LF}}/ (c-2)}{\operatorname{SS_{PE}}/(n-c)}\\ &= \frac{\operatorname{MS_{LF}}}{\operatorname{MS_{PE}}}, \end{align*}\]]{.math .display} which under the null hypothesis is distributed as [\(F(c-2,n-c)\)]{.math .inline}. The [\(p\)]{.math .inline}-value of the test statistic is thus given by [\[ \Pr\{F(c-2,n-c) \geq F_{\operatorname{LF}}^*\}. \]]{.math .display}
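Before handing the computation to `anova()`, the statistic can be assembled by hand from its defining sums of squares (a sketch, assuming `data` and `reg.mod` from above are still in the workspace):

```r
# Sketch, assuming `data` and `reg.mod` from above are in the workspace.
n.lev = length(unique(data$hours))            # c = 5 levels
n = nrow(data)                                # n = 15 observations
ybar = ave(data$conc, data$hours)             # saturated-model fitted values
ss.pe = sum((data$conc - ybar)^2)             # SS_E(F)
ss.e = sum(reg.mod$residuals^2)               # SS_E(R)
ss.lf = ss.e - ss.pe                          # SS_LF
F.lf = (ss.lf / (n.lev - 2)) / (ss.pe / (n - n.lev))
p.val = pf(F.lf, n.lev - 2, n - n.lev, lower.tail = FALSE)
c(F = F.lf, p = p.val)                        # matches the anova() output
```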
data$fact.hours = as.factor(data$hours)
full.mod = lm(conc ~ fact.hours - 1, data=data)
anova(reg.mod,full.mod)
## Analysis of Variance Table
##
## Model 1: conc ~ hours
## Model 2: conc ~ fact.hours - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 2.9247
## 2 10 0.1574 3 2.7673 58.603 1.194e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the ANOVA table provides all the necessary information. We represent it in the more familiar lack-of-fit form with the following table:

| Source      | SS     | df | MS      | [\(F^*\)]{.math .inline} |
|-------------|--------|----|---------|--------|
| Lack of fit | 2.7673 | 3  | 0.92243 | 58.603 |
| Pure error  | 0.1574 | 10 | 0.01574 |        |

We see that [\(F_{\operatorname{LF}}^*=58.603\)]{.math .inline}, which has a [\(p\)]{.math .inline}-value under the null hypothesis of [\(1.194 \times 10^{-6}\)]{.math .inline}.
Interpretation
At a [\(p\)]{.math .inline}-value of approximately [\(1.2 \times 10^{-6}\)]{.math .inline}, the data are not compatible with the linear regression model at any reasonable significance level. We will need a more flexible model.
plot(data$fact.hours,
full.mod$fitted.values,
type = "b",
ylab = "concentration",
xlab = "hours",
ylim = c(min(data$conc),max(data$conc)))
abline(reg.mod$coefficients,col="red")
points(data$fact.hours,data$conc,pch=16)