
Regression Analysis - STAT 482 - Problem Set 4


Alex Towell (atowell@siue.edu)

Problem 1

Refer to the data from Exercise 1.20

Data has been collected on [\(45\)]{.math .inline} calls for routine maintenance. The goal is to explore the relationship between the number of copiers serviced [\((x)\)]{.math .inline} and the time in minutes spent to complete the service [\((y)\)]{.math .inline}.

In a confidence band, the error of interest is the worst-case error over the entire input space [\(x_1,\ldots,x_n\)]{.math .inline}, not the error at a single point. By the Working–Hotelling result, [\[ \max_x \frac{\left(\hat{y}(x)-\mu(x)\right)^2}{\hat{\operatorname{V}}(\hat{y}(x))} \sim 2 F(2,n-2). \]]{.math .display}

Thus, a confidence band estimate for [\(\mu(x) = \beta_0 + \beta_1 x\)]{.math .inline} for all [\(x\)]{.math .inline} is defined by the functions [\[ L(x) = \hat{y}(x) - c \sqrt{\hat{\operatorname{V}}(\hat{y}(x))} \]]{.math .display} and [\[ U(x) = \hat{y}(x) + c \sqrt{\hat{\operatorname{V}}(\hat{y}(x))} \]]{.math .display} where [\[ c = \sqrt{2 F(1-\alpha,2,n-2)}. \]]{.math .display}

call.data = read.table('CH01PR20.txt')
colnames(call.data) = c("time","copiers")
call.mod = lm(time ~ copiers, data=call.data)

b0 = call.mod$coefficients[1]; names(b0) = NULL  # intercept estimate
b1 = call.mod$coefficients[2]; names(b1) = NULL  # slope estimate

e = call.mod$residuals
n = length(e)
sse = sum(e^2)    # error sum of squares
dfe = n-2         # error degrees of freedom
mse = sse / dfe   # mean squared error

x.all = 1:10                  # grid of x values for the band
x.sample = call.data$copiers
x.bar = mean(x.sample)
x.star = x.sample - x.bar
ssx = sum(x.star^2)           # sum of squares of x about its mean

# Working-Hotelling band limits at x.h = 3
x.h = 3
y.h = b0 + b1*x.h
y.h.l = y.h - sqrt(2*qf(.95,2,dfe))*sqrt(mse*(1/n+(x.h-x.bar)^2/ssx))
y.h.u = y.h + sqrt(2*qf(.95,2,dfe))*sqrt(mse*(1/n+(x.h-x.bar)^2/ssx))

c(y.h.l,y.h.u)
## [1] 40.27853 48.77265

We see that the confidence band at [\(x_h = 3\)]{.math .inline} is given by [\[ [40.2785284, 48.7726465]. \]]{.math .display}

As we widen the scope of the estimation/prediction from a single point to the entire regression line, the probability of observing data incompatible with a fixed set of limits increases. Thus, the limits must be widened to maintain the stated confidence level.

I was amused by the quote “The really unusual day would be one where nothing unusual happens.”

A [\(95\%\)]{.math .inline} confidence band is given by an upper and lower limit such that, with repeated sampling, [\(95\%\)]{.math .inline} of the time the entire true regression line [\(\mu(x)\)]{.math .inline} lies within these limits. This contrasts with a confidence interval, which covers the mean response at only a single point. Thus, a standard CI is too narrow for simultaneous coverage and must be appropriately widened.
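To see how much wider the band is, we can compare the Working–Hotelling multiplier [\(c = \sqrt{2 F(1-\alpha; 2, n-2)}\)]{.math .inline} with the pointwise two-sided [\(t\)]{.math .inline} multiplier for these data ([\(n = 45\)]{.math .inline}, so [\(\operatorname{df_E} = 43\)]{.math .inline}); a minimal sketch:

```r
# Working-Hotelling band multiplier vs. pointwise t multiplier (alpha = .05)
n = 45                                # sample size from Problem 1
c.band  = sqrt(2*qf(.95, 2, n-2))     # simultaneous band multiplier
c.point = qt(.975, n-2)               # pointwise CI multiplier
c(band = c.band, pointwise = c.point) # the band multiplier is larger
```

The band multiplier exceeds the pointwise one, which is exactly the widening needed to pay for simultaneous coverage over all [\(x\)]{.math .inline}.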

# Working-Hotelling confidence band over the grid of x values
y.hat = b0 + b1*x.all
y.lower = y.hat - sqrt(2*qf(.95,2,dfe))*sqrt(mse*(1/n+(x.all-x.bar)^2/ssx))
y.upper = y.hat + sqrt(2*qf(.95,2,dfe))*sqrt(mse*(1/n+(x.all-x.bar)^2/ssx))
y.all = matrix(c(y.hat,y.lower,y.upper),ncol=3)

colnames(y.all) = c("mean.y","lower.limit","upper.limit")
#cbind(x.all,y.all)

matplot(x.all,y.all,type="l",lty=1,col=c("black","red","red"),
        xlab = "copiers",ylab = "time")

# pointwise confidence intervals from predict(), for comparison with the band
call.ci = predict(call.mod,data.frame(copiers=x.all),interval = "confidence")
call.ci = cbind(x.all,call.ci)
#call.ci

y.lwr = call.ci[,3]
y.upp = call.ci[,4]
points(x.all,y.lwr,type="l",lty=2,col="blue")
points(x.all,y.upp,type="l",lty=2,col="blue")

Figure: fitted regression line (black) with the Working–Hotelling confidence band (red, solid) and pointwise confidence limits (blue, dashed).

Problem 2

Refer to the data from Exercise 1.27

A person’s muscle mass is expected to decrease with age. To explore this relationship in women, a nutritionist randomly selected [\(15\)]{.math .inline} women from each [\(10\)]{.math .inline}-year age group, beginning with age [\(40\)]{.math .inline} and ending with age [\(79\)]{.math .inline}. The input variable [\(x\)]{.math .inline} is age (in years), and the response variable [\(y\)]{.math .inline} is muscle mass (in muscle-mass units).

Recall that [\(\operatorname{E}(\operatorname{MS_E}) = \sigma^2\)]{.math .inline}. The expected value of the [\(\operatorname{MS_R}\)]{.math .inline} is [\[ \operatorname{E}(\operatorname{MS_R}) = \sigma^2 + \operatorname{SS_X}\beta_1^2. \]]{.math .display}

If [\(\beta_1 \approx 0\)]{.math .inline}, then [\(\operatorname{E}(\operatorname{MS_R}) \approx \operatorname{E}(\operatorname{MS_E})\)]{.math .inline}.
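A quick simulation with made-up data (assuming [\(\beta_1 = 0\)]{.math .inline} and [\(\sigma^2 = 1\)]{.math .inline}) illustrates that [\(\operatorname{MS_R}\)]{.math .inline} and [\(\operatorname{MS_E}\)]{.math .inline} have roughly the same expectation under the no-effect model:

```r
set.seed(1)                      # reproducible made-up example
x = rep(1:10, 5); n = length(x)
msr = mse = numeric(2000)
for (i in 1:2000) {
  y = 5 + rnorm(n)               # beta1 = 0, sigma^2 = 1
  a = anova(lm(y ~ x))
  msr[i] = a$`Mean Sq`[1]
  mse[i] = a$`Mean Sq`[2]
}
c(mean(msr), mean(mse))          # both averages near sigma^2 = 1
```

When [\(\beta_1 \ne 0\)]{.math .inline}, the [\(\operatorname{SS_X}\beta_1^2\)]{.math .inline} term inflates [\(\operatorname{MS_R}\)]{.math .inline} while [\(\operatorname{MS_E}\)]{.math .inline} stays near [\(\sigma^2\)]{.math .inline}, which is what the [\(F\)]{.math .inline}-test exploits.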

The test statistic is given by [\[ F^* = \frac{\operatorname{MS_R}}{\operatorname{MS_E}} \]]{.math .display} where a small [\(F^*\)]{.math .inline} indicates data compatible with the null model and a large [\(F^*\)]{.math .inline} indicates data not compatible with the null model.

If [\(\beta_1 = 0\)]{.math .inline} (the no-effect model), then [\(F^* \sim F(1,n-2)\)]{.math .inline}. The [\(p\)]{.math .inline}-value is given by [\[ \Pr\left[F(1,n-2) > F^*\right]. \]]{.math .display}

We compute these results with:

mass.data = read.table('CH01PR27.txt')
colnames(mass.data) = c("mass","age")
mass.mod = lm(mass ~ age, data=mass.data)
anova(mass.mod)
## Analysis of Variance Table
## 
## Response: mass
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## age        1 11627.5 11627.5  174.06 < 2.2e-16 ***
## Residuals 58  3874.4    66.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we see that [\(\operatorname{MS_R}= 11627.486\)]{.math .inline}, [\(\operatorname{MS_E}= 66.8\)]{.math .inline}, and [\(F^* = \operatorname{MS_R}/ \operatorname{MS_E}= 174.06\)]{.math .inline}. This value of the test statistic has a [\(p\)]{.math .inline}-value less than [\(2.2 \times 10^{-16}\)]{.math .inline}, effectively zero.
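The tail probability behind this [\(p\)]{.math .inline}-value can be recovered directly from the [\(F\)]{.math .inline} distribution, using the observed [\(F^* = 174.06\)]{.math .inline} with [\(1\)]{.math .inline} and [\(58\)]{.math .inline} degrees of freedom:

```r
# tail probability of F(1, 58) beyond the observed F* = 174.06
pf(174.06, 1, 58, lower.tail = FALSE)
```

The result is astronomically small, consistent with the `< 2.2e-16` reported by `anova`.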

Interpretation: Since the data are not compatible with the no-effect model, we reject it and adopt the model that includes age as a predictor.

The coefficient of determination is the proportion of variation in the response [\(Y\)]{.math .inline} that is explained by input [\(x\)]{.math .inline}, [\[ r^2 = \frac{\operatorname{SS_R}}{\operatorname{SSTO}}. \]]{.math .display}

We compute the coefficient of determination with:

summary(mass.mod)$r.squared
## [1] 0.7500668
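As a sanity check on the definition [\(r^2 = \operatorname{SS_R}/\operatorname{SSTO}\)]{.math .inline}, with made-up data the value from `summary()` agrees with the ratio computed from the ANOVA sums of squares:

```r
set.seed(2)                       # illustrative made-up data
x = 1:20
y = 3 - 0.5*x + rnorm(20)
mod = lm(y ~ x)
ss = anova(mod)$`Sum Sq`          # c(SSR, SSE)
r2 = ss[1] / sum(ss)              # SSR / SSTO, since SSTO = SSR + SSE
all.equal(r2, summary(mod)$r.squared)
```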

We estimate that approximately [\(75\%\)]{.math .inline} of the variation in muscle mass is explained by age.

Since the slope estimate is negative, the sample correlation is [\(r = -\sqrt{r^2} \approx -0.87\)]{.math .inline}; the two variables appear to be strongly (negatively) correlated.
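Note that [\(\sqrt{r^2}\)]{.math .inline} gives only the magnitude of the correlation; in simple linear regression the sign of [\(r\)]{.math .inline} matches the sign of the slope. A small check with made-up data:

```r
set.seed(3)                       # illustrative made-up data
x = 1:30
y = 10 - 0.8*x + rnorm(30)        # negative slope, like mass vs. age
mod = lm(y ~ x)
# attach the slope's sign to sqrt(r^2)
r = sign(coef(mod)[2]) * sqrt(summary(mod)$r.squared)
all.equal(unname(r), cor(x, y))
```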