
Discrete Multivariate Analysis - STAT 579 - Problem Set 10

Supplemental material

We refactored the code for independence testing into a function that returns all relevant calculations given a matrix of cross-sectional data.

```r
# csdata is cross-sectional data (observed counts) entered as an IxJ matrix.
# Tests for independence P(X,Y) = P(X)P(Y) using the X^2 statistic.
independence_test_X2 <- function(csdata) {
  # define the dimensions of the table
  I = dim(csdata)[1]
  J = dim(csdata)[2]

  # compute the overall sample size, the row sample sizes, and the column sample sizes
  total = sum(csdata)
  row.sum = apply(csdata, 1, sum)
  col.sum = apply(csdata, 2, sum)

  # use matrix algebra to compute the table of expected cell counts
  expected = matrix(row.sum) %*% t(matrix(col.sum)) / total
  dimnames(expected) = dimnames(csdata)

  # compute the X^2 statistic, degrees of freedom, and p-value
  X2 = sum((csdata - expected)^2 / expected)
  df = (I - 1) * (J - 1)
  p.value.X2 = pchisq(X2, df, lower.tail = FALSE)

  # return computed values as a list
  list(expected_counts = expected,
       estimate = expected / total,
       X2 = X2, df = df, p.value = p.value.X2)
}
```

We also refactored the code for independence testing under the assumption of a monotonic association (the $\gamma$ correlation).

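A sketch of such a function is given below, consistent with how `independence_test_gamma` is used in Problem 2 (returning $\hat\gamma$, its standard error, $Z^*$, and a two-sided $p$-value). The standard error uses the standard Goodman and Kruskal large-sample variance estimate; the exact original implementation may differ.

```r
# Tests for independence against a monotonic association using the
# Goodman-Kruskal gamma statistic. csdata is an IxJ matrix of observed
# counts with ordered row and column categories.
independence_test_gamma <- function(csdata) {
  I <- nrow(csdata)
  J <- ncol(csdata)
  con <- dis <- matrix(0, I, J)
  for (i in 1:I) {
    for (j in 1:J) {
      # counts concordant / discordant with an observation in cell (i, j)
      con[i, j] <- sum(csdata[row(csdata) < i & col(csdata) < j]) +
                   sum(csdata[row(csdata) > i & col(csdata) > j])
      dis[i, j] <- sum(csdata[row(csdata) < i & col(csdata) > j]) +
                   sum(csdata[row(csdata) > i & col(csdata) < j])
    }
  }
  P <- sum(csdata * con)   # twice the number of concordant pairs
  Q <- sum(csdata * dis)   # twice the number of discordant pairs
  gamma <- (P - Q) / (P + Q)

  # Goodman-Kruskal large-sample standard error of gamma-hat
  se <- sqrt(16 * sum(csdata * (Q * con - P * dis)^2) / (P + Q)^4)

  z <- gamma / se
  list(gamma = gamma, se = se, z = z,
       p.value = 2 * pnorm(abs(z), lower.tail = FALSE))
}
```

Applied to the Problem 1 table, this returns $\hat\gamma \approx 0.36$; for a $2 \times 2$ table it reduces to the familiar $\hat\gamma = (n_{11}n_{22} - n_{12}n_{21})/(n_{11}n_{22} + n_{12}n_{21})$.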

Problem 1

The observed data is cross-sectional with a general model given by $$\{n_{ij}\} \sim \operatorname{MULT}(n, \{\pi_{ij}\}).$$

We consider a simpler model where $X$ and $Y$ are independent, i.e., $\Pr(X,Y) = \Pr(X)\Pr(Y)$. Our task is then to measure how compatible the observed data is with the null hypothesis $$H_0 : \pi_{ij} = \pi_{i+}\pi_{+j}.$$ We prefer $H_0$ to a more general model, which requires estimating (and interpreting) more parameters, whose estimates will therefore have greater variance.

We perform a chi-square test for independence. The test statistic is given by $$X^2 = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}},$$ where $\hat{m}_{ij} = n \hat{\pi}_{ij0}$ and $\hat{\pi}_{ij0} = \frac{n_{i+}n_{+j}}{n^2}$.

Under the null model, $X^2$ is asymptotically distributed $\chi^2$ with $\mathrm{df} = (4-1)(4-1) = 9$ degrees of freedom.

We compute the $X^2$ statistic and $p$-value with the following R code:

```r
obs = matrix(c(7, 7, 2, 3,
               2, 8, 3, 7,
               1, 5, 4, 9,
               2, 8, 9, 14), nrow = 4, byrow = TRUE)
colnames(obs) <- c("never/occasionally", "fairly often",
                   "very often", "almost always")
rownames(obs) <- colnames(obs)

print(obs)
```

```
##                    never/occasionally fairly often very often almost always
## never/occasionally                  7            7          2             3
## fairly often                        2            8          3             7
## very often                          1            5          4             9
## almost always                       2            8          9            14
```
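As a quick check of the expected-count formula $\hat{m}_{ij} = n_{i+}n_{+j}/n$, the margins and the expected table can also be computed directly with `outer()` (the table is repeated here so the snippet runs on its own):

```r
obs <- matrix(c(7, 7, 2, 3,
                2, 8, 3, 7,
                1, 5, 4, 9,
                2, 8, 9, 14), nrow = 4, byrow = TRUE)

n <- sum(obs)              # overall sample size: 91
row.sum <- rowSums(obs)    # n_{i+}: 19 20 19 33
col.sum <- colSums(obs)    # n_{+j}: 12 28 18 33

# expected counts under independence: m-hat_{ij} = n_{i+} * n_{+j} / n
expected <- outer(row.sum, col.sum) / n
expected[1, 1]             # 19 * 12 / 91, approximately 2.51
```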

```r
x2_test <- independence_test_X2(obs)
```

The observed $X_0^2$ is 16.955 and the $p$-value is 0.049. We consider this $p$-value to be moderate evidence against the null model. Stated differently, the observed data is moderately incompatible with the independence model.
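As a sanity check, base R's built-in `chisq.test` gives the same statistic, degrees of freedom, and $p$-value (it emits a warning here because some expected counts are below 5; the continuity correction only applies to $2 \times 2$ tables, so none is used):

```r
obs <- matrix(c(7, 7, 2, 3,
                2, 8, 3, 7,
                1, 5, 4, 9,
                2, 8, 9, 14), nrow = 4, byrow = TRUE)

# Pearson chi-square test of independence
out <- chisq.test(obs)
out$statistic   # X-squared, approximately 16.955
out$parameter   # df = 9
out$p.value     # approximately 0.049
```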

For completeness, the estimates of $\pi_{ij}$ under the independence model (with no additional assumptions) for the given cross-sectional data are:


```
##                    never/occasionally fairly often very often almost always
## never/occasionally              0.028        0.064      0.041         0.076
## fairly often                    0.029        0.068      0.043         0.080
## very often                      0.028        0.064      0.041         0.076
## almost always                   0.048        0.112      0.072         0.132
```

Problem 2

The observed data is cross-sectional with a general model given by $$\{n_{ij}\} \sim \operatorname{MULT}(n, \{\pi_{ij}\}).$$

We consider a simpler model using $\gamma$, $$H_0 : \gamma = 0,$$ i.e., $\gamma$ is zero if $X$ and $Y$ are independent.

The test statistic is given by $$Z^* = \frac{\hat\gamma - 0}{\hat\sigma(\hat\gamma)},$$ which is distributed $\mathcal{N}(0,1)$ under the null model.

We compute $Z^*$ and the $p$-value with the following R code:

```r
gamma_test <- independence_test_gamma(obs)
```

The observed $Z^*$ is 3.207 and the $p$-value is 0.001. We consider this $p$-value to be very strong evidence against the null model. That is, the observed data provides very strong evidence against the independence model in favor of a monotonic association.

For completeness, the estimate for $\gamma$ is $\hat\gamma = 0.36$.

Problem 3

Estimates from simpler models have smaller variances than estimates from more complicated models (at the cost of bias if the simpler model does not hold).
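To illustrate with a small hypothetical simulation (not part of the assigned data): when independence actually holds, the independence-model estimate of a cell probability, $\hat\pi_{ij} = n_{i+}n_{+j}/n^2$, pools information across the whole table and so has smaller sampling variance than the saturated estimate $n_{ij}/n$.

```r
set.seed(1)
# True cell probabilities satisfy independence, so both models are correct.
p.row <- c(0.3, 0.7)
p.col <- c(0.4, 0.6)
pi.true <- outer(p.row, p.col)   # pi_11 = 0.12
n <- 200
sims <- 2000
sat <- ind <- numeric(sims)
for (s in 1:sims) {
  counts <- matrix(rmultinom(1, n, as.vector(pi.true)), nrow = 2)
  sat[s] <- counts[1, 1] / n                           # saturated: n_11 / n
  ind[s] <- sum(counts[1, ]) * sum(counts[, 1]) / n^2  # independence: n_1+ n_+1 / n^2
}
var(sat)  # larger sampling variance
var(ind)  # smaller sampling variance
```

Both estimators are centered near the true $\pi_{11} = 0.12$, but the independence-model estimate varies less across simulated tables.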