SIUe - Computational Statistics (STAT 575) - Problem Set 2
Problem 1
Derive the EM algorithm for right-censored normal data with known variance, say
$\sigma^2 = 1$. Consider $Y_i$'s that are i.i.d. from a $N(\theta, \sigma^2)$, $i = 1, \ldots, n$. We observe $x_i = \min(y_i, c_i)$ and $\delta_i$, where $c_i$ is the censoring point of the $i$-th observation, and $\delta_i = 1\{y_i \le c_i\}$ indicates an uncensored observation. Let $C$ be the total number of censored (incomplete) observations. We denote the missing data as $Z = (Z_1, \ldots, Z_C)$.
Part (a)
Derive the complete-data log-likelihood,
$\ln L_c(\theta \mid \boldsymbol{x}, \boldsymbol{z})$.
The unobserved random variates $Z_1, \ldots, Z_C$ are the latent event times of the censored observations; together with the uncensored observations they constitute the complete data.
The likelihood function is therefore
$$L_c(\theta \mid \boldsymbol{x}, \boldsymbol{z}) = \prod_{i : \delta_i = 1} \phi(x_i - \theta) \prod_{j=1}^{C} \phi(z_j - \theta),$$
where $\phi$ denotes the standard normal pdf. Taking the logarithm of $L_c$,
$$\ln L_c(\theta \mid \boldsymbol{x}, \boldsymbol{z}) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i : \delta_i = 1}(x_i - \theta)^2 - \frac{1}{2}\sum_{j=1}^{C}(z_j - \theta)^2.$$
Anticipating that we will be maximizing the complete log-likelihood with respect
to $\theta$, we may discard the additive constant $-\frac{n}{2}\ln(2\pi)$, which does not depend on $\theta$.
Part (b)
Show the conditional expectation
$$E\!\left(Z_j \mid Z_j > c_j\right) = \theta + \frac{\phi(c_j - \theta)}{1 - \Phi(c_j - \theta)},$$
where $\phi$ and $\Phi$ are the pdf and cdf of the standard normal.
The distribution of a censored observation, conditional on being censored, is a truncated normal.
If $Z_j \sim N(\theta, 1)$, then conditional on $Z_j > c_j$ its density is
$$f(z \mid Z_j > c_j) = \frac{\phi(z - \theta)}{1 - \Phi(c_j - \theta)}, \qquad z > c_j.$$
The probability of censoring in the denominator follows because $P(Z_j > c_j) = 1 - \Phi(c_j - \theta)$.
We may rewrite the conditional expectation as an integral against this truncated density.
The expectation of $Z_j$ given that it is censored is therefore
$$E\!\left(Z_j \mid Z_j > c_j\right) = \frac{1}{1 - \Phi(c_j - \theta)} \int_{c_j}^{\infty} z\, \phi(z - \theta)\, dz.$$
Analytically, this is a tricky integration problem. It would be trivial to integrate numerically, but we seek a closed-form solution.
I searched online and discovered an interesting way to tackle this integration problem.
Let $u = z - \theta$, so that $du = dz$ and the lower limit of integration becomes $c_j - \theta$.
Then,
$$\int_{c_j}^{\infty} z\, \phi(z - \theta)\, dz = \int_{c_j - \theta}^{\infty} (u + \theta)\, \phi(u)\, du = \theta\left[1 - \Phi(c_j - \theta)\right] + \int_{c_j - \theta}^{\infty} u\, \phi(u)\, du.$$
Since $u\,\phi(u) = -\phi'(u)$, the remaining integral evaluates to $\phi(c_j - \theta)$. We may rewrite the last line as
$$E\!\left(Z_j \mid Z_j > c_j\right) = \frac{\theta\left[1 - \Phi(c_j - \theta)\right] + \phi(c_j - \theta)}{1 - \Phi(c_j - \theta)} = \theta + \frac{\phi(c_j - \theta)}{1 - \Phi(c_j - \theta)}.$$
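As a quick sanity check (not part of the derivation), we can compare the closed form against numerical integration; the values of c0 and theta0 below are arbitrary test values.
# sanity check: closed-form truncated-normal mean vs. numerical integration
# (c0 and theta0 are arbitrary test values)
c0 <- 1.5; theta0 <- 0.7
closed <- theta0 + dnorm(c0 - theta0) / (1 - pnorm(c0 - theta0))
numerical <- integrate(function(z) z * dnorm(z - theta0), lower = c0, upper = Inf)$value /
  (1 - pnorm(c0 - theta0))
closed - numerical # should be numerically zero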
Part (c)
Derive the E-step and M-step using parts (a) and (b). Give the updating equation.
E-step
The E-step requires the expected complete log-likelihood given the observed data and the current iterate $\theta^{(t)}$,
$$Q(\theta \mid \theta^{(t)}) = E\!\left[\ln L_c(\theta \mid \boldsymbol{x}, \boldsymbol{Z}) \,\middle|\, \boldsymbol{x}, \theta^{(t)}\right] = -\frac{1}{2}\sum_{i : \delta_i = 1}(x_i - \theta)^2 - \frac{1}{2}\sum_{j=1}^{C} E\!\left[(Z_j - \theta)^2 \,\middle|\, Z_j > c_j, \theta^{(t)}\right] + \text{constant}.$$
We have already solved the expectation of $Z_j$ given that it is censored in part (b).
Letting
$$z_j^{(t)} = E\!\left(Z_j \mid Z_j > c_j, \theta^{(t)}\right) = \theta^{(t)} + \frac{\phi(c_j - \theta^{(t)})}{1 - \Phi(c_j - \theta^{(t)})}$$
denote the imputed value of the $j$-th censored observation, only this first moment is needed below, since $\partial Q / \partial \theta$ is linear in the $Z_j$'s.
M-step
We wish to solve
$$\frac{\partial}{\partial \theta} Q(\theta \mid \theta^{(t)}) = \sum_{i : \delta_i = 1}(x_i - \theta) + \sum_{j=1}^{C}\left(z_j^{(t)} - \theta\right) = 0.$$
Solving for $\theta$ yields the updating equation
$$\theta^{(t+1)} = \frac{1}{n}\left[\sum_{i : \delta_i = 1} x_i + \sum_{j=1}^{C} z_j^{(t)}\right] = \frac{R}{n} + \frac{C}{n}\,\theta^{(t)} + \frac{1}{n}\sum_{j=1}^{C} \frac{\phi(c_j - \theta^{(t)})}{1 - \Phi(c_j - \theta^{(t)})},$$
where $R = \sum_{i : \delta_i = 1} x_i$ is the sum of the uncensored observations.
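Before turning to the V.A. data, here is a minimal sketch of this update on simulated data; everything below is hypothetical (true mean 2, common censoring point 2.5).
# hypothetical illustration: simulate right-censored N(2, 1) data and
# iterate the updating equation; theta should settle near the true mean
set.seed(1)
y <- rnorm(200, mean = 2, sd = 1)
cens <- y > 2.5                  # censor everything above c = 2.5
x_unc <- y[!cens]                # fully observed values
c_cens <- rep(2.5, sum(cens))    # censoring points of the censored cases
n <- length(y); R <- sum(x_unc); C <- sum(cens)
theta <- mean(x_unc)             # naive starting value
for (k in 1:50)
  theta <- R/n + (C/n) * theta +
    (1/n) * sum(dnorm(c_cens - theta) / (1 - pnorm(c_cens - theta)))
theta # should be near 2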
Part (d)
Use your algorithm on the V.A. data to find the MLE of
$\theta$. Take the log of the event times first and standardize by the sample standard deviation. You may simply use the censored-data sample mean as your starting value.
In the following R code, we implement the updating equation derived in the
previous step.
We encapsulate the procedure into a function that takes a censored set, an
uncensored set, a starting value $\theta^{(0)}$, and a convergence tolerance $\epsilon$.
# assuming the event times are log-normally distributed, we use the EM
# algorithm to estimate the mean of the standardized log event times
# given censored and uncensored data, then back-transform the result.
mean_normal_censored_estimator_em <- function(uncensored, censored, theta,
                                              eps = 1e-6, debug = TRUE)
{
  # take logs and standardize by the sample standard deviation
  dev <- sd(log(c(uncensored, censored)))
  censored <- log(censored) / dev
  uncensored <- log(uncensored) / dev
  theta <- log(theta) / dev

  n <- length(censored) + length(uncensored)
  C <- length(censored)   # number of censored observations
  R <- sum(uncensored)    # sum of the uncensored observations

  # s(theta) = sum over censored observations of the inverse Mills ratio
  # phi(c_j - theta) / (1 - Phi(c_j - theta)) derived in part (b)
  s <- function(theta)
  {
    sum(dnorm(censored, mean = theta, sd = 1) /
        (1 - pnorm(censored, mean = theta, sd = 1)))
  }

  i <- 1
  repeat
  {
    # updating equation derived in part (c)
    theta.new <- R/n + C/n * theta + (1/n) * s(theta)
    if (debug) { cat("theta[", i, "] =", theta, ", theta[", i+1, "] =", theta.new, "\n") }
    if (abs(theta.new - theta) < eps)
    {
      # undo the standardization and back-transform to the time scale
      return(exp(theta.new * dev))
    }
    i <- i + 1
    theta <- theta.new
  }
}
We apply this procedure to the indicated data set.
library(MASS)                      # has the VA data
VAs <- subset(VA, prior == 0)      # patients with no prior therapy
censored <- VAs$status == 0        # status == 0 marks a censored time
censored_xs <- VAs[censored, "stime"]
uncensored_xs <- VAs[!censored, "stime"]
mu <- mean(uncensored_xs)
cat("mean of the uncensored sample is ", mu, ".")
## mean of the uncensored sample is 112.1648 .
sol <- mean_normal_censored_estimator_em(uncensored_xs,censored_xs,mu)
## theta[ 1 ] = 3.857928 , theta[ 2 ] = 3.424258
## theta[ 2 ] = 3.424258 , theta[ 3 ] = 3.415443
## theta[ 3 ] = 3.415443 , theta[ 4 ] = 3.415286
## theta[ 4 ] = 3.415286 , theta[ 5 ] = 3.415283
## theta[ 5 ] = 3.415283 , theta[ 6 ] = 3.415283
sol
## [1] 65.2625
We see that our estimate of $\theta$, back-transformed to the original time scale, is about 65.3.
This is noticeably lower than the uncensored sample mean of 112.16, but much of the gap is expected: the algorithm estimates the mean of the log event times, and exponentiating that estimate yields a geometric-mean-type quantity, which is never larger than the arithmetic mean. Still, if I have the time, I will revisit the updating equation to rule out an error.
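To make the point concrete, compare the geometric and arithmetic means of the uncensored sample (the EM estimate also adjusts the censored values upward, so it need not match either exactly):
# exp(mean(log(x))) is the geometric mean, which the AM-GM inequality
# guarantees is no larger than the arithmetic mean(x)
exp(mean(log(uncensored_xs)))
mean(uncensored_xs)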
Problem 2

Problem 4.2
Part (a)
There are $N = 1500$ survey respondents in total, with $n_i$ of them reporting $i$ encounters, $i = 0, 1, \ldots, 16$.
We assume there are three latent groups of respondents.
Group $z$ (the zero group, with probability $\alpha$) reports $0$ encounters with certainty.
Group $t$ (the typical group, with probability $\beta$) reports counts distributed $\text{Poisson}(\mu)$, and group $p$ (the riskier group, with probability $1 - \alpha - \beta$) reports counts distributed $\text{Poisson}(\lambda)$.
This represents a finite mixture model with a pdf
$$\pi_i(\theta) = \alpha\,1\{i = 0\} + \beta\,\frac{\mu^i e^{-\mu}}{i!} + (1 - \alpha - \beta)\,\frac{\lambda^i e^{-\lambda}}{i!},$$
where $\theta = (\alpha, \beta, \mu, \lambda)$.
Let the uncertain group that the $j$-th respondent belongs to be encoded by latent indicators for membership in $z$, $t$, or $p$.
The likelihood function is thus given by
$$L(\theta \mid \boldsymbol{n}) \propto \prod_{i=0}^{16} \pi_i(\theta)^{n_i}.$$
We wish to rewrite this so that the missing group-membership data is explicitly represented.
First, we do the transformation from observed counts to complete-data counts.
We let $n_{z,0}$, $n_{t,i}$, and $n_{p,i}$ denote the unobserved numbers of respondents in groups $z$, $t$, and $p$ reporting $i$ encounters, so that $n_{z,0} + n_{t,0} + n_{p,0} = n_0$ and $n_{t,i} + n_{p,i} = n_i$ for $i \ge 1$.
Anticipating taking the conditional expectation, we write the complete-data log-likelihood as
$$\ln L_c(\theta) = n_{z,0} \ln \alpha + \sum_{i=0}^{16} n_{t,i}\left[\ln \beta + i \ln \mu - \mu\right] + \sum_{i=0}^{16} n_{p,i}\left[\ln(1 - \alpha - \beta) + i \ln \lambda - \lambda\right] + \text{constant},$$
where the constant absorbs the $-\ln i!$ terms, which do not involve $\theta$.
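As a quick sanity check on the mixture pmf (a sketch using arbitrary test parameters, not fitted values), the probabilities should sum to one:
# sanity check: the mixture pmf sums to (numerically) one;
# arbitrary test parameters, and mass beyond i = 100 is negligible
pmf <- function(i, alpha, beta, mu, lambda)
  alpha * (i == 0) + beta * dpois(i, mu) + (1 - alpha - beta) * dpois(i, lambda)
sum(pmf(0:100, 1/3, 1/3, 2, 3)) # approximately 1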
E-step
The conditional expectation to solve in the EM algorithm is given by
$$Q(\theta \mid \theta^{(t)}) = E\!\left[\ln L_c(\theta) \,\middle|\, \boldsymbol{n}, \theta^{(t)}\right].$$
Using the linearity of expectations, we rewrite the above so that each unobserved count is replaced by its conditional expectation given the observed data.
Consider a respondent who reports $i$ encounters, and suppose the current iterate is $\theta^{(t)}$. By Bayes' rule, the conditional probabilities of belonging to each group are
$$z_0(\theta) = \frac{\alpha}{\pi_0(\theta)}, \qquad t_i(\theta) = \frac{\beta\, \mu^i e^{-\mu} / i!}{\pi_i(\theta)}, \qquad p_i(\theta) = \frac{(1 - \alpha - \beta)\, \lambda^i e^{-\lambda} / i!}{\pi_i(\theta)}.$$
Making the substitutions yields the result
$$E\!\left[n_{z,0} \mid \boldsymbol{n}, \theta^{(t)}\right] = n_0\, z_0(\theta^{(t)}), \qquad E\!\left[n_{t,i} \mid \boldsymbol{n}, \theta^{(t)}\right] = n_i\, t_i(\theta^{(t)}), \qquad E\!\left[n_{p,i} \mid \boldsymbol{n}, \theta^{(t)}\right] = n_i\, p_i(\theta^{(t)}).$$
Assuming these expected counts in place of the unobserved ones, $Q(\theta \mid \theta^{(t)})$ takes the same form as $\ln L_c(\theta)$ above.
M-step
We wish to solve
$$\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}).$$
We use the Lagrangian to impose the restriction $\alpha + \beta + \gamma = 1$, where $\gamma$ is the probability of group $p$:
$$\mathcal{L} = Q(\theta \mid \theta^{(t)}) + \eta\,(1 - \alpha - \beta - \gamma).$$
Thus, when we solve for $\alpha$ we obtain $\alpha = E[n_{z,0} \mid \boldsymbol{n}, \theta^{(t)}] / \eta$.
Similar results hold for $\beta$ and $\gamma$: each equals its group's total expected count divided by $\eta$.
This does not look too promising until we realize that summing the three equations and applying the constraint forces $\eta = N$, since the expected group counts must total the number of respondents.
Thus,
$$\alpha^{(t+1)} = \frac{n_0\, z_0(\theta^{(t)})}{N}, \qquad \beta^{(t+1)} = \frac{1}{N}\sum_{i=0}^{16} n_i\, t_i(\theta^{(t)}).$$
Solving an estimator for $\mu$ by setting $\partial Q / \partial \mu = 0$ gives
$$\mu^{(t+1)} = \frac{\sum_{i=0}^{16} i\, n_i\, t_i(\theta^{(t)})}{\sum_{i=0}^{16} n_i\, t_i(\theta^{(t)})}.$$
The same derivation essentially follows for $\lambda$, with $p_i$ in place of $t_i$:
$$\lambda^{(t+1)} = \frac{\sum_{i=0}^{16} i\, n_i\, p_i(\theta^{(t)})}{\sum_{i=0}^{16} n_i\, p_i(\theta^{(t)})}.$$
Part (b)
Estimate the parameters of the model, using the observed data.
# we observe n = (n0, n1, ..., n16): ns[i+1] respondents reported i encounters
ns <- c(379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)
N <- sum(ns)

# theta := (alpha, beta, mu, lambda)'
# note that there is an implicit parameter gamma s.t. alpha + beta + gamma = 1.
# the initial value assumes each category z, t, or p is equally probable,
# so we let (alpha^(0), beta^(0)) = (1/3, 1/3); mu^(0) and lambda^(0) are
# arbitrarily chosen to be 2 and 3, with the insight that group p is riskier
# than group t.
theta <- c(1/3, 1/3, 2, 3)

# mixture pmf pi_i(theta), up to the factor 1/i!, which is omitted
# because it cancels in the ratios z0, t, and p below
Pi <- function(i, theta)
{
  res <- 0
  if (i == 0)
    res <- theta[1]
  res <- res + theta[2] * theta[3]^i * exp(-theta[3])
  res + (1 - theta[1] - theta[2]) * theta[4]^i * exp(-theta[4])
}

# posterior probability of membership in group z given a count of 0
z0 <- function(theta)
{
  theta[1] / Pi(0, theta)
}

# posterior probability of membership in group t given a count of i
t <- function(i, theta)
{
  theta[2] * theta[3]^i * exp(-theta[3]) / Pi(i, theta)
}

# posterior probability of membership in group p given a count of i
p <- function(i, theta)
{
  (1 - theta[1] - theta[2]) * theta[4]^i * exp(-theta[4]) / Pi(i, theta)
}
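# sanity check (not part of the derivation above): the posterior membership
# probabilities should sum to one for any observed count. At i = 0 all three
# groups are possible; for i >= 1 the z group is impossible, so t and p
# alone should sum to one.
z0(theta) + t(0, theta) + p(0, theta) # equals 1
t(5, theta) + p(5, theta)             # equals 1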
# update step of the EM algorithm: map theta^(t) to theta^(t+1)
update <- function(theta, ns)
{
  N <- sum(ns) # total respondents (computed locally rather than globally)
  # note: n0 := ns[1] instead of ns[0] since R does not use zero-based indexes
  alpha <- ns[1] * z0(theta) / N
  beta <- 0
  mu_num <- 0
  mu_denom <- 0
  lam_num <- 0
  lam_denom <- 0
  for (i in 0:16)
  {
    ti <- t(i, theta)
    pi <- p(i, theta)
    beta <- beta + ns[i+1] * ti
    mu_num <- mu_num + i * ns[i+1] * ti
    mu_denom <- mu_denom + ns[i+1] * ti
    lam_num <- lam_num + i * ns[i+1] * pi
    lam_denom <- lam_denom + ns[i+1] * pi
  }
  beta <- beta / N
  mu <- mu_num / mu_denom
  lam <- lam_num / lam_denom
  c(alpha, beta, mu, lam)
}
# run the EM algorithm for a fixed number of steps
em <- function(theta, ns, steps = 10000, debug = TRUE)
{
  for (i in 1:steps)
  {
    theta <- update(theta, ns)
    if (debug && i %% 1000 == 0)
    {
      cat("iteration =", i, " theta = (", theta, ")'\n")
    }
  }
  theta
}
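# optional variant (a sketch, not used below): iterate until successive
# parameter vectors agree to within a tolerance instead of running a
# fixed number of steps
em_tol <- function(theta, ns, eps = 1e-10, max_steps = 10000)
{
  for (i in 1:max_steps)
  {
    theta.new <- update(theta, ns)
    done <- max(abs(theta.new - theta)) < eps
    theta <- theta.new
    if (done) break
  }
  theta
}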
# solution theta = (alpha, beta, mu, lambda)
sol <- em(theta, ns, 10000, TRUE)
## iteration = 1000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 2000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 3000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 4000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 5000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 6000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 7000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 8000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 9000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
## iteration = 10000 theta = ( 0.1221661 0.5625419 1.467475 5.938889 )'
We see that the solution is $\hat\theta = (\hat\alpha, \hat\beta, \hat\mu, \hat\lambda) \approx (0.1222, 0.5625, 1.4675, 5.9389)$.
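As an informal goodness-of-fit check (not requested by the problem), we can compare the expected counts under the fitted mixture with the observed counts; the pmf below re-introduces the $1/i!$ factor via dpois.
# expected counts under the fitted mixture, to compare against ns
fitted_pmf <- sol[1] * (0:16 == 0) + sol[2] * dpois(0:16, sol[3]) +
  (1 - sol[1] - sol[2]) * dpois(0:16, sol[4])
round(N * fitted_pmf)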
Part (c)
Estimate the standard errors and pairwise correlations of your parameters, using any available method.
We have chosen to use the bootstrap method, resampling respondents with replacement and re-running the EM algorithm on each resample.
# ns = (379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)
# 379 responded 0 encounters
# 299 responded 1 encounter
# 222 responded 2 encounters
# ...
# 1 responded 16 encounters
#
# to resample, we resample from the data set that includes each
# person's response, as determined by ns.
data <- rep(0:16, times = ns)

# convert a vector of individual responses back into counts (n0, ..., n16)
make_into_counts <- function(data)
{
  ns <- NULL
  for (i in 0:16)
  {
    ns <- append(ns, sum(data == i))
  }
  ns
}
m <- 1000       # bootstrap replicates
em_steps <- 100 # EM iterations per replicate
theta.bs <- em(theta, ns, em_steps, FALSE)
thetas <- rbind(theta.bs)
for (i in 2:m)
{
  indices <- sample(N, N, replace = TRUE)
  resampled <- make_into_counts(data[indices])
  theta.bs <- em(theta, resampled, em_steps, FALSE)
  thetas <- rbind(thetas, theta.bs)
  if (i %% 100 == 0)
  {
    cat("iteration ", i, "\n")
    print(cov(thetas))
  }
}
## iteration 100
## [,1] [,2] [,3] [,4]
## [1,] 0.0003903744 -0.0001438786 0.0017263560 0.001571458
## [2,] -0.0001438786 0.0004087653 0.0004993799 0.001813153
## [3,] 0.0017263560 0.0004993799 0.0152443668 0.015694549
## [4,] 0.0015714585 0.0018131530 0.0156945490 0.040863019
## iteration 200
## [,1] [,2] [,3] [,4]
## [1,] 0.0004023631 -0.0001703146 0.001697344 0.001484011
## [2,] -0.0001703146 0.0004447321 0.000419080 0.001903868
## [3,] 0.0016973442 0.0004190800 0.014398069 0.015896531
## [4,] 0.0014840108 0.0019038683 0.015896531 0.044389192
## iteration 300
## [,1] [,2] [,3] [,4]
## [1,] 0.0003956313 -0.0001727811 0.0016888749 0.00159774
## [2,] -0.0001727811 0.0004628827 0.0003008066 0.00181497
## [3,] 0.0016888749 0.0003008066 0.0137001437 0.01583219
## [4,] 0.0015977404 0.0018149703 0.0158321853 0.04443171
## iteration 400
## [,1] [,2] [,3] [,4]
## [1,] 0.0003966960 -0.0001719888 0.0016451275 0.001548858
## [2,] -0.0001719888 0.0004684167 0.0003622689 0.002003458
## [3,] 0.0016451275 0.0003622689 0.0137218437 0.016201510
## [4,] 0.0015488580 0.0020034577 0.0162015099 0.046175397
## iteration 500
## [,1] [,2] [,3] [,4]
## [1,] 0.0004135674 -0.0001873768 0.0016694824 0.001624097
## [2,] -0.0001873768 0.0004681143 0.0003021341 0.001786532
## [3,] 0.0016694824 0.0003021341 0.0136412731 0.015939754
## [4,] 0.0016240973 0.0017865318 0.0159397544 0.043573619
## iteration 600
## [,1] [,2] [,3] [,4]
## [1,] 0.0004111393 -0.0001937508 0.001653108 0.001534535
## [2,] -0.0001937508 0.0004822347 0.000260584 0.001829480
## [3,] 0.0016531076 0.0002605840 0.013451057 0.015368992
## [4,] 0.0015345352 0.0018294799 0.015368992 0.042992734
## iteration 700
## [,1] [,2] [,3] [,4]
## [1,] 0.0004259534 -0.0001994680 0.0017338084 0.001631373
## [2,] -0.0001994680 0.0004943474 0.0002033533 0.001795465
## [3,] 0.0017338084 0.0002033533 0.0136916465 0.015681755
## [4,] 0.0016313734 0.0017954653 0.0156817555 0.043016583
## iteration 800
## [,1] [,2] [,3] [,4]
## [1,] 0.0004134958 -0.0001932351 0.0016624262 0.001607135
## [2,] -0.0001932351 0.0004925245 0.0002242567 0.001774913
## [3,] 0.0016624262 0.0002242567 0.0131371751 0.015149211
## [4,] 0.0016071351 0.0017749127 0.0151492114 0.042293737
## iteration 900
## [,1] [,2] [,3] [,4]
## [1,] 0.0004139774 -0.0002014407 0.0016479052 0.001583003
## [2,] -0.0002014407 0.0004968611 0.0001793993 0.001649937
## [3,] 0.0016479052 0.0001793993 0.0129764348 0.014710759
## [4,] 0.0015830026 0.0016499365 0.0147107588 0.041445055
## iteration 1000
## [,1] [,2] [,3] [,4]
## [1,] 0.0004196823 -0.0001981646 0.0016697031 0.001606301
## [2,] -0.0001981646 0.0004857090 0.0001668595 0.001625460
## [3,] 0.0016697031 0.0001668595 0.0129729349 0.014645101
## [4,] 0.0016063014 0.0016254599 0.0146451012 0.041083606
cov.bs <- cov(thetas)
cor.bs <- cor(thetas)
The bootstrap estimator of the covariance matrix (rows and columns ordered as $\alpha, \beta, \mu, \lambda$) is given by
## [,1] [,2] [,3] [,4]
## [1,] 0.0004196823 -0.0001981646 0.0016697031 0.001606301
## [2,] -0.0001981646 0.0004857090 0.0001668595 0.001625460
## [3,] 0.0016697031 0.0001668595 0.0129729349 0.014645101
## [4,] 0.0016063014 0.0016254599 0.0146451012 0.041083606
and the correlation matrix is given by
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 -0.43891212 0.71558269 0.3868409
## [2,] -0.4389121 1.00000000 0.06647277 0.3638764
## [3,] 0.7155827 0.06647277 1.00000000 0.6343647
## [4,] 0.3868409 0.36387642 0.63436466 1.0000000
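The implied bootstrap standard errors are the square roots of the diagonal of the covariance matrix, roughly $0.020$, $0.022$, $0.114$, and $0.203$ for $\alpha$, $\beta$, $\mu$, and $\lambda$ respectively:
# bootstrap standard errors of (alpha, beta, mu, lambda)
sqrt(diag(cov.bs))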
As a check on the bootstrap, let's also compute the observed information matrix, i.e., the negative Hessian of the observed-data log-likelihood at the MLE; its inverse estimates the covariance matrix.
library(numDeriv) # provides hessian()

# observed-data log-likelihood of theta = (alpha, beta, mu, lambda),
# given the individual responses in `data`
loglike <- function(theta)
{
  s <- 0
  for (x in data)
  {
    s <- s + log(theta[1] * as.numeric(x == 0) +
                 theta[2] * dpois(x, theta[3]) +
                 (1 - theta[1] - theta[2]) * dpois(x, theta[4]))
  }
  s
}

# MLE from part (b); invert the observed information matrix
mle <- c(0.1221661, 0.5625419, 1.4674746, 5.9388889)
solve(-hessian(loglike, mle))
## [,1] [,2] [,3] [,4]
## [1,] 0.0003799048 -1.909698e-04 1.438556e-03 0.001184057
## [2,] -0.0001909698 4.657702e-04 7.132638e-05 0.001417409
## [3,] 0.0014385560 7.132638e-05 1.111722e-02 0.011376017
## [4,] 0.0011840568 1.417409e-03 1.137602e-02 0.034664940
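Converting this to standard errors and correlations makes the comparison with the bootstrap direct (a minimal sketch that recomputes and stores the matrix just printed; cov2cor is from base R's stats package):
# standard errors and correlations implied by the observed information
cov.hess <- solve(-hessian(loglike, mle))
sqrt(diag(cov.hess)) # compare with sqrt(diag(cov.bs))
cov2cor(cov.hess)    # compare with cor.bs
The Hessian-based standard errors, roughly $0.019$, $0.022$, $0.105$, and $0.186$, agree reasonably well with the bootstrap values, which is reassuring.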