
Regression Analysis - STAT 482 - Exam 1


Alex Towell (atowell@siue.edu)

Problem 1

Experience with a certain type of plastic indicates that a relationship exists between the hardness (measured in Brinell units) of items molded from the plastic (response variable [\(y\)]{.math .inline} ) and the elapsed time (measured in hours) since the termination of the molding process (input variable [\(x\)]{.math .inline}). Sixteen batches of the plastic were made, and from each batch one test item was molded. Each test item was randomly assigned to one of the four predetermined time levels. The data is available on Blackboard as a csv file.

Part (a)

State the simple linear regression model.

Reproduce

[\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]]{.math .display} where [\[\begin{align*} \epsilon_1,\ldots,\epsilon_n &\overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)\\ Y_i &= \text{the $i$-th plastic hardness response (in Brinell units)}\\ x_i &= \text{the $i$-th elapsed molding time (in hours)}. \end{align*}\]]{.math .display}

Part (b)

Provide an interpretation for the regression parameter [\(\beta_1\)]{.math .inline}, stated in the context of the problem.

Reproduce

[\(\beta_1\)]{.math .inline} is the change in mean plastic hardness (in Brinell units) per [\(1\)]{.math .inline}-hour increase in elapsed molding time.

Part (c)

Compute an interval estimate for [\(\beta_1\)]{.math .inline}.

Output

dat = read.csv("./exam1-1.csv")
dat
##    time hardness
## 1    16      195
## 2    16      203
## 3    16      196
## 4    16      206
## 5    24      223
## 6    24      217
## 7    24      217
## 8    24      209
## 9    32      232
## 10   32      235
## 11   32      224
## 12   32      233
## 13   40      257
## 14   40      249
## 15   40      250
## 16   40      249
reg.mod = lm(hardness ~ time, data=dat)
confint(reg.mod)
##                  2.5 %     97.5 %
## (Intercept) 157.299114 174.300886
## time          1.813919   2.392331

Reproduce

A [\(95\%\)]{.math .inline} confidence interval for [\(\beta_1\)]{.math .inline} is [\([1.814,2.392]\)]{.math .inline}.
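This interval can be checked by hand. From the fitted model, [\(\hat\beta_1 \approx 2.1031\)]{.math .inline} with standard error [\(s\{\hat\beta_1\} = \sqrt{\ms{E}/\sos{X}} \approx 0.1348\)]{.math .inline}, and [\(t(.975\,;\,14) = 2.1448\)]{.math .inline}, so [\[ \hat\beta_1 \pm t(.975\,;\,14)\, s\{\hat\beta_1\} = 2.1031 \pm 2.1448(0.1348) = [1.814, 2.392]. \]]{.math .display}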

Part (d)

Compute a confidence interval for [\(\mu_h\)]{.math .inline} at [\(x_h = 40\)]{.math .inline} hours.

Output

x.h = 40
predict(reg.mod,data.frame(time=x.h),interval="confidence")
##       fit      lwr      upr
## 1 249.925 245.5966 254.2534
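
Reproduce

A [\(95\%\)]{.math .inline} confidence interval for [\(\mu_h\)]{.math .inline} at [\(x_h = 40\)]{.math .inline} is [\([245.60, 254.25]\)]{.math .inline}. As a hand check, using [\(\ms{E} \approx 23.27\)]{.math .inline}, [\(\bar{x} = 28\)]{.math .inline}, and [\(\sos{X} = 1280\)]{.math .inline} computed from the data, [\[ s^2\{\hat\mu_h\} = \ms{E}\left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sos{X}}\right) = 23.27\left(\frac{1}{16} + \frac{144}{1280}\right) \approx 4.07, \]]{.math .display} so the interval is [\(\hat\mu_h \pm t(.975\,;\,14)\, s\{\hat\mu_h\} = 249.925 \pm 2.1448(2.018)\)]{.math .inline}, agreeing with the output.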

Part (e)

Compute a prediction interval for [\(Y_{h(\text{new})}\)]{.math .inline} at [\(x_h = 40\)]{.math .inline} hours.

Output

predict(reg.mod,data.frame(time=x.h),interval="predict")
##       fit      lwr      upr
## 1 249.925 238.7092 261.1408
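
Reproduce

A [\(95\%\)]{.math .inline} prediction interval for [\(Y_{h(\text{new})}\)]{.math .inline} at [\(x_h = 40\)]{.math .inline} is [\([238.71, 261.14]\)]{.math .inline}. The prediction variance adds the item-to-item variance [\(\ms{E}\)]{.math .inline} to the variance of the estimated mean, [\[ s^2\{\text{pred}\} = \ms{E}\left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sos{X}}\right) = 23.27(1.175) \approx 27.35, \]]{.math .display} so the interval is [\(249.925 \pm 2.1448(5.229)\)]{.math .inline}, agreeing with the output and necessarily wider than the confidence interval in part (d).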

Part (f)

Explain the difference between a confidence interval and a prediction interval, stated in the context of the problem.

Reproduce

A confidence interval is for the mean plastic hardness (in Brinell units) for all items molded with an elapsed time of [\(40\)]{.math .inline} hours.

A prediction interval is for the plastic hardness (in Brinell units) of a single new item molded with an elapsed time of [\(40\)]{.math .inline} hours; it accounts for item-to-item variation in addition to the uncertainty in the estimated mean, so it is wider than the confidence interval.

Part (g)

State the equations for [\(E(\ms{R})\)]{.math .inline} and [\(E(\ms{E})\)]{.math .inline}.

Reproduce

[\[\begin{align*} E(\ms{R}) &= \sigma^2 + \sos{X} \beta_1^2\\ E(\ms{E}) &= \sigma^2. \end{align*}\]]{.math .display}

Part (h)

Use part (g) to explain a motivation behind the [\(F\)]{.math .inline} test for input effects.

Reproduce

Note that [\[ F^* = \frac{\ms{R}}{\ms{E}}, \]]{.math .display} so if [\(\beta_1 \approx 0\)]{.math .inline}, then [\(E(\ms{R}) \approx E(\ms{E})\)]{.math .inline} and [\(F^*\)]{.math .inline} should be near [\(1\)]{.math .inline}. A small [\(F^*\)]{.math .inline} indicates data compatible with the null model. A large [\(F^*\)]{.math .inline} indicates data supporting the alternative model, since by part (g) [\(E(\ms{R})\)]{.math .inline} exceeds [\(E(\ms{E})\)]{.math .inline} by [\(\sos{X}\beta_1^2 > 0\)]{.math .inline} when [\(\beta_1 \neq 0\)]{.math .inline}.

Part (i)

Compute the [\(F^*\)]{.math .inline} statistic and the [\(p\)]{.math .inline}-value. Provide an interpretation of the result, stated in the context of the problem.

Output

anova(reg.mod)
## Analysis of Variance Table
## 
## Response: hardness
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## time       1 5661.6  5661.6  243.27 3.033e-10 ***
## Residuals 14  325.8    23.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reproduce

Since the [\(p\)]{.math .inline}-value [\(= 3.03 \times 10^{-10} \approx 0\)]{.math .inline}, the data is not compatible with the no-effect model. We adopt the model which includes elapsed molding time as a predictor of hardness.

Part (j)

Compute the coefficient of determination [\(r^2\)]{.math .inline}. Provide an interpretation of the result, stated in the context of the problem.

Output

summary(reg.mod)$r.squared
## [1] 0.9455819

Reproduce

We estimate that about [\(95\%\)]{.math .inline} (more precisely, [\(94.6\%\)]{.math .inline}) of the variation in hardness is explained by elapsed molding time.
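
This value follows directly from the ANOVA decomposition in part (i): [\[ r^2 = \frac{\sos{R}}{\sos{R} + \sos{E}} = \frac{5661.6}{5661.6 + 325.8} \approx .9456. \]]{.math .display}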

Part (k)

State the full model (saturated model) in testing for regression model fit.

Reproduce

There are [\(c=4\)]{.math .inline} levels of the input (predictor) variable [\(x\)]{.math .inline} (time) where the [\(i\)]{.math .inline}-th level of [\(x\)]{.math .inline} has [\(n_i = 4\)]{.math .inline} replications for [\(i=1,\ldots,c\)]{.math .inline}.

The saturated model is given by [\[ Y_{i j} = \mu_i + \epsilon_{i j} \begin{cases} i = 1,\ldots,c\\ j = 1,\ldots,n_i. \end{cases} \]]{.math .display} where [\[\begin{align*} \epsilon_{i j} &\overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)\\ \mu_i &= \text{mean of response at the $i$-th level of $x$}\\ Y_{i j} &= \text{$j$-th response at the $i$-th level of $x$}. \end{align*}\]]{.math .display}

Part (l)

Compute [\(\sos{PE}\)]{.math .inline} and [\(\sos{LF}\)]{.math .inline} for testing the fit of the linear regression model.

Output

dat$fact.time = as.factor(dat$time)

sat.mod = lm(hardness ~ fact.time - 1, data=dat)
anova(reg.mod,sat.mod)
## Analysis of Variance Table
## 
## Model 1: hardness ~ time
## Model 2: hardness ~ fact.time - 1
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     14 325.82                           
## 2     12 299.75  2    26.075 0.5219 0.6062

Reproduce

From the output, we see that [\(\sos{PE} = 299.75\)]{.math .inline} and [\(\sos{LF} = 26.075\)]{.math .inline}.
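
As a consistency check, the pure error and lack of fit sums of squares decompose the regression model's error sum of squares from part (i): [\[ \sos{E} = \sos{PE} + \sos{LF} = 299.75 + 26.075 = 325.825. \]]{.math .display}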

Part (m)

Compute the [\(F_{LF}\)]{.math .inline} statistic and the [\(p\)]{.math .inline}-value. Provide an interpretation of the result, stated in the context of the problem.

Output

See output for part (l).

Reproduce

From the output, we see that [\[ F_{LF} = \frac{\sos{LF}/(c-2)}{\sos{PE}/(n-c)} = \frac{\sos{LF}/(4-2)}{\sos{PE}/(16-4)} = \frac{26.075/2}{299.75/12} = .5219 \]]{.math .display} which has a [\(p\)]{.math .inline}-value [\(= .606\)]{.math .inline}.

Since the [\(p\)]{.math .inline}-value is so large (equivalently, the [\(F_{LF}\)]{.math .inline} value is small), the data is compatible with the null model (the linear regression model); we find no evidence of lack of fit.

Part (n)

Create a plot comparing the fitted values from the regression model with the fitted values from the saturated model.

Output

plot(sort(dat$time),
     sat.mod$fitted.values[order(dat$time)],
     type = "b",
     ylab = "hardness",
     xlab = "elapsed time",
     ylim = c(min(dat$hardness),max(dat$hardness)))
abline(reg.mod$coefficients,col="red")
points(dat$time,dat$hardness,pch=16)


Part (o)

Explain an advantage of using the regression estimate [\(\hat{\mu}_{40}\)]{.math .inline} instead of the sample mean [\(\bar{y}_{40}\)]{.math .inline} and explain when this advantage will lead to a more accurate estimator.

Reproduce

Advantage of [\(\hat\mu_{40}\)]{.math .inline} over [\(\bar{y}_{40}\)]{.math .inline}: the linear regression estimate of a mean response has smaller variance than the saturated-model estimate, since it pools information from all [\(16\)]{.math .inline} observations rather than only the [\(4\)]{.math .inline} replicates at [\(x = 40\)]{.math .inline}. When more accurate: if the linear regression model is appropriate (no lack of fit), then [\(\hat\mu_{40}\)]{.math .inline} also has negligible bias, so its smaller variance makes it the more accurate estimator.

Problem 2

Measurements were made on men involved in a physical fitness course. The input variables under consideration are age (in years), weight (in kgs), time to run 1.5 miles (in minutes), heart rate while resting (in bpm), heart rate while running, and maximum heart rate. The goal is to determine which input variables are needed to model the oxygen uptake rate (ml/kg body weight per minute). The data is available on Blackboard as a csv file.

Part (a)

Describe the goal of the discrepancy function approach to model selection. How is the best model defined? What are the two sources of model error?

The goal of model selection is to choose the “best” model from a candidate class of models.

We define the best model as the one that provides the most accurate estimation of [\(E(Y_i|x_i) = \mu_i\)]{.math .inline} at the observed input levels [\(\underbar{x}_1,\ldots,\underbar{x}_n\)]{.math .inline}.

The two sources of model error are model misspecification (bias) and parameter estimation (variance).

Part (b)

Plot the [\(C_p\)]{.math .inline} statistic against the model dimension. Plot the [\(C_p\)]{.math .inline} statistic against the candidate models. Which variables are included in the selected model?

Output

library(leaps)
dat2 = read.csv("./exam1-2.csv")
dat2
##    age weight runtime rstpulse runpulse maxpulse      oxy
## 1   44  89.47   11.37       62      178      182 41.05617
## 2   40  75.07   10.07       62      185      185 48.68738
## 3   44  85.84    8.65       45      156      168 57.72057
## 4   42  68.15    8.17       40      166      172 59.51438
## 5   38  89.02    9.22       55      178      180 52.34193
## 6   47  77.45   11.63       58      176      176 45.30553
## 7   40  75.98   11.95       70      176      180 44.56676
## 8   43  81.19   10.85       64      162      170 50.19773
## 9   44  81.42   13.08       63      174      176 40.44744
## 10  38  81.87    8.63       48      170      186 54.99462
## 11  44  73.03   10.13       45      168      168 49.84048
## 12  45  87.66   14.03       56      186      192 37.94971
## 13  45  66.45   11.12       51      176      176 47.01984
## 14  47  79.15   10.60       47      162      164 47.77988
## 15  54  83.12   10.33       50      166      170 46.16785
## 16  49  81.42    8.95       44      180      185 51.31369
## 17  51  69.63   10.95       57      168      172 50.17241
## 18  51  77.91   10.00       48      162      168 51.36121
## 19  48  91.63   10.25       48      162      164 50.38991
## 20  49  73.37   10.08       67      168      168 46.90032
## 21  57  73.37   12.63       58      174      176 38.82112
## 22  54  79.38   11.17       62      156      165 42.99687
## 23  52  76.32    9.63       48      164      166 45.64238
## 24  50  70.87    8.92       48      146      155 53.98943
## 25  51  67.25   11.08       48      172      172 42.68997
## 26  54  91.63   12.88       44      168      172 37.45437
## 27  51  73.71   10.47       59      186      188 44.42336
## 28  57  59.08    9.93       49      148      155 54.26127
## 29  49  76.32    9.40       56      186      188 47.82479
## 30  48  61.24   11.50       52      170      176 48.86472
## 31  52  82.78   10.50       53      170      172 39.39456
colnames(dat2) = c("x1","x2","x3","x4","x5","x6","y")
head(dat2)
##   x1    x2    x3 x4  x5  x6        y
## 1 44 89.47 11.37 62 178 182 41.05617
## 2 40 75.07 10.07 62 185 185 48.68738
## 3 44 85.84  8.65 45 156 168 57.72057
## 4 42 68.15  8.17 40 166 172 59.51438
## 5 38 89.02  9.22 55 178 180 52.34193
## 6 47 77.45 11.63 58 176 176 45.30553
full.mod = regsubsets(y ~ ., data=dat2)
sel = summary(full.mod)
plot(sel$cp,xlab="#(variables)",ylab="Cp",type="l",main= "Cp vs dimension")
star = which.min(sel$cp)
points(star,sel$cp[star],col="red",pch=20,cex=2)


plot(full.mod,scale="Cp")


Reproduce

We select the model which includes [\(5\)]{.math .inline} input variables, [\[\begin{align*} x_1 &= \text{age},\\ x_2 &= \text{weight},\\ x_3 &= \text{run time},\\ x_5 &= \text{run pulse},\\ x_6 &= \text{max pulse}. \end{align*}\]]{.math .display}
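
For reference, the [\(C_p\)]{.math .inline} statistic estimates the total (bias plus variance) discrepancy of a candidate model with [\(p\)]{.math .inline} parameters by comparing its error sum of squares to the full model's mean squared error: [\[ C_p = \frac{\sos{E}(p)}{\ms{E}(\text{full})} - (n - 2p). \]]{.math .display} A model with little bias has [\(C_p \approx p\)]{.math .inline}, and we select the candidate minimizing [\(C_p\)]{.math .inline}.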

Part (c)

Test the selected model against its best competitor having fewer parameters. (Compute the test statistic and the [\(p\)]{.math .inline}-value.) How does discrepancy based model selection compare to [\(p\)]{.math .inline}-value based selection?

Output

m.R = lm(y ~ x1+x2+x3+x5, data=dat2)
m.F = lm(y ~ x1+x2+x3+x5+x6, data=dat2)
anova(m.R,m.F)
## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x5
## Model 2: y ~ x1 + x2 + x3 + x5 + x6
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 148.08                           
## 2     25 136.50  1    11.582 2.1211 0.1577

Reproduce

The next best reduced linear regression model discards [\(x_6\)]{.math .inline} (max pulse). So, we are testing [\[ H_0 : \beta_6 = 0 \]]{.math .display} by comparing the model [\(m_F\)]{.math .inline} that includes [\(x_6\)]{.math .inline} against the reduced model [\(m_R\)]{.math .inline} that discards [\(x_6\)]{.math .inline}. See the ANOVA output.

We see that [\(F^* = 2.121\)]{.math .inline} with a [\(p\)]{.math .inline}-value [\(= .158\)]{.math .inline}.
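
This statistic can be reproduced from the ANOVA output: [\[ F^* = \frac{\bigl(\sos{E}(R) - \sos{E}(F)\bigr)/1}{\ms{E}(F)} = \frac{148.08 - 136.50}{136.50/25} = \frac{11.582}{5.460} \approx 2.121. \]]{.math .display}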

In the hypothesis test, we find the data to be compatible with the reduced model, since the [\(F^*\)]{.math .inline} statistic has a large [\(p\)]{.math .inline}-value; according to this test, we would choose [\(m_R\)]{.math .inline}. Discrepancy-based model selection, by contrast, found the best model to be [\(m_F\)]{.math .inline}. Thus, the rule for adding predictors to a model is less stringent for discrepancy-based selection than for [\(p\)]{.math .inline}-value based hypothesis testing.