Applicants for graduate school are classified according to department, sex, and admission status. A goal of the study is to determine the role an applicant’s sex plays in the determination of admission status.
# department and sex are defined through indicator variables
grad.data <- data.frame(dep=c(0,0,1,1),
sex=c(0,1,0,1),
yes=c(235,38,122,103),
no=c(35,7,93,69))
# define row sample sizes
grad.data$n = grad.data$yes + grad.data$no
# define the interaction variable
grad.data$dep.sex = grad.data$dep * grad.data$sex
print(grad.data)
## dep sex yes no n dep.sex
## 1 0 0 235 35 270 0
## 2 0 1 38 7 45 0
## 3 1 0 122 93 215 0
## 4 1 1 103 69 172 1
Problem 1
\fbox{\begin{minipage}{.9\textwidth} Define a main effects logistic regression model having two binary input variables. Include notation for the design matrix , and the parameter vector . Provide an interpretation for each of the effect parameters in , stated in the context of the problem. \end{minipage}}
The data is cross-sectional data given by
where .
The response variables are given by
The explanatory indicator variables and are respectively given by
and
Then, denotes the probability of being admitted,
That is, models probability as a function of .
The probability model is given by
The main effects model is given by
where the parameter vector of dimension is given by
.
The parameter is the partial effect of department (non-science - science) on admission and the parameter is the partial effect of sex (female - male) on admission.
In the main effects model , we assume the effect of is the same at both levels of and vice versa, i.e.,
and
The logistic regression model for is given by
such that if we solve for we get the result
The design matrix is given by $$ \mat{X}
\begin{pmatrix} 1 & 0 & 0\ 1 & 0 & 1\ 1 & 1 & 0\ 1 & 1 & 1 \end{pmatrix}. $$
Since we are dealing with Boolean input variables , we only need to count the number of times each discrete input in the design matrix occurs, say for the -th row, denoted by , and let the corresponding denote the number of times the response was \emph{yes} for .
Then, the probabilistic model for is the binomial distribution,
and we estimate by maximizing
Problem 2
\fbox{\begin{minipage}{.9\textwidth} Provide notation for the design matrix and parameter vector for the saturated model . Provide a brief description of an interaction effect. \end{minipage}}
For the saturated model (interaction model), we must include all possible effects and interactions, which means relative to the main effects model we add account for interaction effects between and , i.e.,
where the parameter vector of dimension is given by
.
The interaction effect between department and sex ( and ) occurs when the effect of an input depends on the level of the input of the other input. In the main effects model , we assume the effect of is the same at both levels of and vice versa, i.e., \begin{align*} \theta(\text{science}:\text{non-science} ;|; \text{male}) &= e^{\beta_1},\ \theta(\text{science}:\text{non-science} ; | ;\text{female}) &= e^{\beta_1+\beta_3},\ \theta(\text{male}:\text{female} ;|; \text{science}) &= e^{\beta_2},\ \theta(\text{male}:\text{female} ; | ; \text{non-science}) &= e^{\beta_2+\beta_3}. \end{align*}
The design matrix for is given by
$$ \mat{X_S} = \begin{pmatrix} 1 & 0 & 0 & 0\\ 1 & 0 & 1 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 1 & 1 \end{pmatrix}. $$Problem 3
\fbox{\begin{minipage}{.9\textwidth} For each of the models , , , provide notation for the design matrix and a brief description of the model effects, stated in the context of the problem. \end{minipage}}
The design matrix for , the independence model, is given by
$$ \mat{X_O} = \begin{pmatrix} 1\\ 1\\ 1\\ 1 \end{pmatrix}. $$This model assumes that neither department nor sex has an effect on admission, i.e., .
The design matrix for is given by
$$ \mat{X_1} = \begin{pmatrix} 1 & 0\\ 1 & 0\\ 1 & 1\\ 1 & 1 \end{pmatrix}. $$This model considers only the marginal effect of department on admission, i.e., .
The design matrix for is given by
$$ \mat{X_2} = \begin{pmatrix} 1 & 0\\ 1 & 1\\ 1 & 0\\ 1 & 1 \end{pmatrix}. $$This model considers only the marginal effect of sex on admission, i.e., .
Problem 4
\fbox{\begin{minipage}{.9\textwidth} Compute the deviance statistic , and give degrees of freedom , for each of the models from the grad school data. Provide a general form for the statistic , and the degrees of freedom for the reference chi-square distribution, for testing a reduced model against a full model . \end{minipage}}
is called the deviance for testing model goodness-of-fit.
The likelihood ratio statistic for testing a reduced model against a full model is given by \begin{align*} G^2(M_R,|,M_S) &= [-2 L_R]-[-2 L_F]\ &= \left([-2 L_R]-[-2 L_S]\right)-\left([-2 L_F]-[-2 L_S] \right)\ &= G^2(M_R,|,M_S)-G^2(M_F,|,M_S)\ &= D(M_R) - D(M_F). \end{align*}
The degree of freedom is given by
and so the reference distribution is with .
# fit each of the candidate models
m.s = glm(yes/n ~ dep+sex+sex*dep,weights=n,family=binomial,data=grad.data)
m.12 = glm(yes/n ~ dep+sex,weights=n,family=binomial,data=grad.data)
m.1 = glm(yes/n ~ dep,weights=n,family=binomial,data=grad.data)
m.2 = glm(yes/n ~ sex,weights=n,family=binomial,data=grad.data)
m.0 = glm(yes/n ~ 1,weights=n,family=binomial,data=grad.data)
# create a table for candidate model deviances
# the rows represent the model, the deviance, and the degrees of freedom
deviance.table=matrix(c(
0,m.12$deviance,m.1$deviance,m.2$deviance,m.0$deviance,
0,m.12$df.residual,m.1$df.residual,m.2$df.residual,m.0$df.residual),
nrow=5)
dimnames(deviance.table) = list(c("MS","M","M1","M2","MO"),c("deviance","delta.df"))
print(deviance.table)
## deviance delta.df
## MS 0.0000000 0
## M 0.4589638 1
## M1 0.6036551 2
## M2 67.8794451 2
## MO 73.1962870 3
Problem 5
\fbox{\begin{minipage}{.9\textwidth} Compute the likelihood statistic for testing reduced model against full model from the grad school data, and provide an interpretation in the context of the problem. \end{minipage}}
# test the main effects model against the interaction (saturated) model
anova(m.12,m.s,test="LRT")
## Analysis of Deviance Table
##
## Model 1: yes/n ~ dep + sex
## Model 2: yes/n ~ dep + sex + sex * dep
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1 0.45896
## 2 0 0.00000 1 0.45896 0.4981
The statistic , which has a -value of . Thus, we conlcude the main effects model (no interaction on inputs) is compatible with the observed data.
Problem 6
\fbox{\begin{minipage}{.9\textwidth} Compute the likelihood statistic for testing reduced model against full model from the grad school data, and provide an interpretation in the context of the problem. \end{minipage}}
We are testing for a marginal effect of (sex), ignoring the effects of (department).
# test for the input 2 effect using a marginal effect test
anova(m.0,m.2,test="LRT")
## Analysis of Deviance Table
##
## Model 1: yes/n ~ 1
## Model 2: yes/n ~ sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3 73.196
## 2 2 67.879 1 5.3168 0.02112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The statistic with has a -value around . Thus, we conlcude that the data supports the model where sex has a marginal effect on admission.
Problem 7
\fbox{\begin{minipage}{.9\textwidth} Compute the likelihood statistic for testing reduced model against full model from the grad school data, and provide an interpretation in the context of the problem. Include an explanation of how this test differs from that of the previous problem. \end{minipage}}
We are testing for a partial effect of (sex), adjusting for the effect of (department).
anova(m.1,m.12,test="LRT")
## Analysis of Deviance Table
##
## Model 1: yes/n ~ dep
## Model 2: yes/n ~ dep + sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2 0.60366
## 2 1 0.45896 1 0.14469 0.7037
The statistic with has , thus we conclude that when we adjust for (department), the data supports the model where (sex) does not have a significant effect.
How are we to interpret the results? We have shown that sex has a marginal effect on admission, but its effect when adjusting for department is neglibible.
We make the following observation. Males are more likely to choose science, and science department is more likely to accept. Females are more likely to choose non-science, and non-science department is less likely to accept. Thus, we see that overall, females are less likely to be admitted, thus showing that sex has a marginal effect on admission. However, the probability of admission is the same for both sexes when we adjust for department.
Thus, model (department) is the major effect.
Problem 8
\fbox{\begin{minipage}{.9\textwidth} Compute estimates of the response probabilities based on model from the grad school data, and provide an interpretation in the context of the problem. \end{minipage}}
# compute probability estimates under model M1
pred.1 = predict(m.1,type = "response")
prob.table = matrix(c(pred.1),nrow = 4)
dimnames(prob.table) = list(c("science male","science female",
"non-science male","non-science female"),
c("prob.1"))
print(prob.table,digits = 4)
## prob.1
## science male 0.8667
## science female 0.8667
## non-science male 0.5814
## non-science female 0.5814
Under model , we estimate the probability of admission into the science department is and we estimate probability of admission into the non-science department is .
Note that under this model, sex has no effect on the probability of admission. This model is justified on the basis of the data and the previous tests we conducted.