Skip to main content

Computational Statistics - SIUe - STAT 575 - Exam 1

Problem 1

Part (1) (10 points)

Use accept-reject algorithm to generate a truncate standard Normal distribution, with density

f(x)ex2/2I(x>2). f(x) ∝ e^{-x^2/2}I(x > 2).

Consider the proposed distribution g(x)=2e2(x2)I(x>2)g(x) = 2 e^{−2(x−2)}I(x > 2).

Part (a)

Sample from gg using inverse-transform method.

Solution

First, we must find the cdf GG of gg. One way to derive GG is to notice that gg is the density of the shifted exponential, i.e., if Sexp(λ=2)S \sim \exp(\lambda=-2), then X=2+SX = 2 + S has a cdf

G(x)=P(Xx)=P(2+Sx)=P(Sx2)=FS(x2)==I(x>2)(1exp(2(x2))). \begin{align*} G(x) &= P(X \leq x)\\ &= P(2+S \leq x)\\ &= P(S \leq x-2)\\ &= F_S(x-2)\\ &= &= I(x > 2)(1 - \exp(-2(x-2))). \end{align*}

As a quick proof that GG is the cdf of gg, note that dG/dx=gdG/dx = g.

Alternatively, the cdf GG is defined as

G(x)=xg(s)ds=I(x>2)2x2e2(s2)ds=I(x>2)e42x2e2sds=I(x>2)e42x(2)e2sds=I(x>2)e4(e2s2x)=I(x>2)e4(e2xe4)=I(x>2)(1e4e2x)=I(x>2)(1e2(x2)), \begin{align*} G(x) &= \int_{-\infty}^{x} g(s) ds\\ &= I(x > 2) \int_{2}^{x} 2 e^{−2(s−2)} ds\\ &= I(x > 2) e^4 \int_{2}^{x} 2 e^{−2 s} ds\\ &= -I(x > 2) e^4 \int_{2}^{x} (-2) e^{−2 s} ds\\ &= -I(x > 2) e^4 \left(e^{-2 s}|_2^x\right)\\ &= -I(x > 2) e^4 \left(e^{-2 x}-e^{-4}\right)\\ &= I(x > 2) (1 - e^4 e^{-2 x})\\ &= I(x > 2) (1 - e^{-2(x-2)}), \end{align*}

which obtains the same result. Either way, we solve for xx in

u=G(x)u=1e2(x2)1u=e2(x2)log(1u)=2(x2)12log(1u)=x2, \begin{align*} u &= G(x)\\ u &= 1 - e^{-2(x-2)}\\ 1-u &= e^{-2(x-2)}\\ \log(1-u) &= -2(x-2)\\ -\frac{1}{2}\log(1-u) &= x-2, \end{align*}

and thus

x=12log(1u)+2 x = -\frac{1}{2}\log(1-u) + 2

where uu is drawn from UNIF(0,1)\rm{UNIF}(0,1). That is, if UUNIF(0,1)U \sim \rm{UNIF}(0,1) then

X=12log(1U)+2 X = -\frac{1}{2}\log(1-U) + 2

is a random variable whose density is gg.

Part (b)

Use accept-reject algorithm to generate a sample of 1000010000 from f. Choose a cc so that f(y)/[cg(y)]1f(y)/[c g(y)] ≤ 1. Verify the generated sample via a plot of the true normalized density, and a histogram of the generated values.

Solution

We wish to find a cc such that f(y)/[cg(y)]1f(y)/[c g(y)] ≤ 1. However, we do not need ff, we only need the kernel of ff, which has already been given. So, ideally, we find the smallest cc that satisifes the inequality

k(y)g(y)1 \frac{k(y)}{g(y)} \leq 1

where kk is the kernel of ff, which is given by c=maxk(y)/g(y)c = \max{k(y)/g(y)}. We may solve this analytically by finding

y=arg maxyk(y)/g(y) y^* = \argmax_y k(y)/g(y)

and then let $c = k(y^)/g(y^)$.

First, we let

h(y)=k(y)/g(y)=ey2/22e2(y2). h(y) = k(y)/g(y) = \frac{e^{-y^2/2}}{2e^{-2(y-2)}}.

We can find the value that maximizes hh by finding the value that maximizes logh\log h, which is given by

logh(y)=logey2/22e2(y2)=12y2+2(y2). \log h(y) = \log \frac{e^{-y^2/2}}{2e^{-2(y-2)}} = -\frac{1}{2}y^2 + 2(y-2).

The value $y^thatmaximizes that maximizes \log hsatisfies satisfies dloghdyy=y=y+2=0, \frac{d \log h}{d y}\biggr|_{y=y^*} = -y^* + 2 = 0, whichmeans which means y^=2maximizes maximizes \log h$ and thus

c=h(2)=k(2)g(2)=12e2. \begin{align*} c = h(2) &= \frac{k(2)}{g(2)}\\ &= \frac{1}{2e^2}. \end{align*}

Thus, we sample yy from gg and uu from UNIF(0,1)\rm{UNIF}(0,1) and accept the sample as being drawn from ff if

uk(y)g(y)=e(y2)2/2. u \leq \frac{k(y)}{g(y)} = e^{-(y-2)^2/2}.

Here is the code that implements our sampling method for ff using acceptance-rejection sampling:

# the proposed density we can easily sample from
g <- function(y)
{
  2*exp(-2*(y-2))
}

# inverse transform method for sampling from g
rg <- function(n)
{
  -0.5*log(1-runif(n))+2
}

rf <- function(n)
{
  data <- vector(length=n)
  for (i in 1:n)
  {
    repeat
    {
      y <- rg(1)
      u <- runif(1)
      if (runif(1) <= exp(-(y-2)^2/2))
      {
        data[i] <- y
        break
      }
    }
  }
  data
}

k <- function(y)
{
  exp(-y^2/2)
}

f <- function(y)
{
  # normalizing constant
  Z <- 17.5358
  Z*k(y)
}

To verify the sampling method, we drawn n=10000n=10000 samples from ff and plot its histogram along with a plot of its density function ff.

data <- rf(10000)
ys <- seq(2,10,by=.1)
hist(data,freq=F,breaks=50,main="f")
lines(x=ys,f(ys),col="red") 

plot of chunk unnamed-chunk-2

The histogram seems compatible with the pdf ff.

Problem 2 (5 points)

Implement your accept-reject algorithm to get a sample of 10000 from Gamma(2.5,1)\mathrm{Gamma}(2.5, 1). Verify that your method works via a plot of the true normalized density, and a histogram of the generated values.

Solution

This is not the approach you were looking for, but on the test I did poorly on this section.

Be that as it may, here is essentially my answer on the test, except that the solution is so slow that I decreased cc so that, at the extreme tail of the distribution, f(y)/cg(y)>1f(y) / c g(y) > 1.

alpha <- 2.5
n <- 10000
c <- 10
a <- floor(alpha)
b <- a / alpha

ys <- vector(length=n)
for (i in 1:n)
{
  repeat
  {
    y <- rgamma(n=1,shape=a,scale=b)
    u <- runif(1)
    P <- dgamma(y,shape=alpha,scale=1) / (c*dgamma(y,shape=a,scale=b))
    if (u < P)
    {
      ys[i] <- y
      break
    }
  }
}

hist(ys,freq=F,breaks=50)
lines(x=seq(.01,10,by=.05),y=dgamma(seq(.01,10,by=.05),shape=alpha,scale=1),col="red")

plot of chunk unnamed-chunk-3

Problem 3 (30 points)

Considers 197197 animals randomly divided into four categories (four phenotypes) as follows: X=(x1,x2,x3,x4)TX = (x1, x2, x3, x4)^T with cell probabilities (1/2+θ/4,(1θ)/4,(1θ)/4,θ/4)T(1/2 + θ/4,(1 − θ)/4,(1 − θ)/4, θ/4)^T. We observe X=(125,18,20,34)TX = (125, 18, 20, 34)^T.

Part (a)

Hand code your algorithm using Newton-Ralphson to find the maximum likelihood estimator of θθ directly from the observed likelihood. Compare your result with what you get using the built-in optim()\mathrm{optim}() function in R.

Solution

In part 1 of the exam, the log-likelihood given the data was determined to be

(θx)=x1log(12+θ4)+x2log(1θ4)+x3log(1θ4)+x4log(θ4). \ell(\theta | \vec{x}) = x_1 \log\left(\frac{1}{2}+\frac{\theta}{4}\right) + x_2 \log\left(\frac{1-\theta}{4}\right) + x_3 \log\left(\frac{1-\theta}{4}\right) + x_4 \log\left(\frac{\theta}{4}\right).

We derived the updating equation for the MLE of θ\theta to be

θ(t+1)=θ(t)d/dθd2/dθ2 \theta^{(t+1)} = \theta^{(t)} - \frac{d \ell / d \theta}{d^2 \ell / d \theta^2}

where

ddθ=x1θ+2+x2θ1+x3θ1+x4θ \frac{d \ell}{d \theta} = \frac{x_1}{\theta+2} + \frac{x_2}{\theta-1} + \frac{x_3}{\theta-1} + \frac{x_4}{\theta}

and

d2dθ2=x1(θ+2)2x2(θ1)2x3(θ1)2x4θ2. \frac{d^2 \ell}{d \theta^2} = -\frac{x_1}{(\theta+2)^2} - \frac{x_2}{(\theta-1)^2} - \frac{x_3}{(\theta-1)^2} - \frac{x_4}{\theta^2}.

Instead of taking the time to simplify this expression, we will just substitute these derivations into the updating equation in the following R code:

l1 <- function(theta,x)
{
  x[1]/(theta+2) + (x[2]+x[3])/(theta-1) + x[4]/theta
}

l2 <- function(theta,x)
{
  -x[1]/(theta+2)^2 - (x[2]+x[3])/(theta-1)^2 - x[4]/theta^2
}

theta_mle <- function(x, start = 0.5, eps = 1e-6)
{
  i <- 0
  theta0 <- start
  theta1 <- NULL
  repeat
  {
    theta1 <- theta0 - l1(theta0,x) / l2(theta0,x)
    if (abs(theta1-theta0) < eps) { break }
    theta0 <- theta1
    i <- i + 1
  }
  list(mle=theta1,iterations=i)
}

We invoke the updating equations on the given data to estimate θ\theta with the following R code:

# observed data
x <-c(125, 18, 20, 34)
sol <- theta_mle(x=x, start=0.5, eps=1e-6)
mle <- sol$mle
mle_iterations <- sol$iterations

We see that the MLE converges to a solution around 0.62682150.6268215 after 33. We compare this result with the built-in procedure, optim\rm{optim}:

# the function to maximize, the log-likelihood function
loglike <- function(theta)
{
  x[1]*log(0.5+theta/4) + x[2]*log(0.25*(1-theta)) + x[3]*log(0.25*(1-theta)) + x[4]*log(0.25*theta)
}

# since optim finds the value that minimizes the function, we provide it with
# the negative of the log-likelihood.
optim(0.5,function(theta) { -loglike(theta) },lower=0.1,upper=0.9,method="L-BFGS-B")$par
## [1] 0.626821

The value optim\mathrm{optim} found is approximately the same. It was fussy with explicitly providing upper and lower bounds. This warrants further investigation, but for another time.

Part (b)

Implement the E-M algorithm to find the the maximum likelihood estimator of θθ with the ``augmented’’ data with missing information Z=y2Z = y_2. Compare your result with part (a).

Solution

We split the first category into two,

y=(y1,y2,y3,y4,y5)T \vec{y} = (y_1, y_2, y_3, y_4, y_5)^T

with cell probabilities (1/2,θ/4,(1θ)/4,(1θ)/4,θ/4)T(1/2, θ/4,(1 − θ)/4,(1 − θ)/4, θ/4)^T.

So, y1y_1 and y2y_2 are \emph{unobserved}. The pdf is given by

f(yθ)i=15πi(θ)yi f(\vec{y}|\theta) \propto \prod_{i=1}^{5} \pi_i(\theta)^{y_i}

where

π(θ)=(1/2,θ/4,(1θ)/4,(1θ)/4,θ/4)T. \vec{\pi}(\theta) = (1/2, θ/4,(1 − θ)/4,(1 − θ)/4, θ/4)^T.

Thus, the complete log-likelihood is given by

logf(yθ)=i=15yilogπi(θ) \log f(\vec{y} | \theta) = \sum_{i=1}^{5} y_i \log \pi_i(\theta)

which may be written as

logf(yθ)=k(y)+y2log(θ)+y3log(1θ)+y4log(1θ)+y5logθ. \log f(\vec{y} | \theta) = k(\vec{y}) + y_2 \log(\theta) + y_3 \log(1-\theta) + y_4 \log(1-\theta) + y_5 \log \theta.

E-step

The function we seek to maximize QQ is thus given by

Q(θθ(t))=EZx,θ(t)(logf(yθ)), Q(\theta|\theta^{(t)}) = E_{Z|\vec{x},\theta^{(t)}}(\log f(\vec{y}|\theta)),

which is given by

Q(θθ(t))=EZx,θ(t)(y2log(θ)+y3log(1θ)+y4log(1θ)+y5logθ) Q(\theta|\theta^{(t)})= E_{Z|\vec{x},|\theta^{(t)}}\left(y_2 \log(\theta) + y_3 \log(1-\theta) + y_4 \log(1-\theta) + y_5 \log \theta\right)

which by the property of linearity of expectations is equivalent to

Q(θθ(t))=EZx,θ(t)(y2)log(θ)+EZx,θ(t)(y3)log(1θ)+EZx,θ(t)(y4)log(1θ)+EZx,θ(t)(y5)logθ. \begin{split} Q(\theta|\theta^{(t)}) = E_{Z|\vec{x},|\theta^{(t)}}(y_2) & \log(\theta) + E_{Z|\vec{x},|\theta^{(t)}}(y_3) \log(1-\theta) +\\ & E_{Z|\vec{x},|\theta^{(t)}}(y_4) \log(1-\theta) + E_{Z|\vec{x},|\theta^{(t)}}(y_5) \log \theta. \end{split}

Note that y3y_3, y4y_4, and y5y_5 are respectively observed to be x2x_2,x3x_3,x4x_4 and thus, for instance,

EZx(y3)=x2. E_{Z|\vec{x}}(y_3) = x_2.

The missing data Z=y2Z = y_2 is distributed

Z(x,θ(t))BIN(x1,θ(t)2+θ(t)). Z | (\vec{x},\theta^{(t)}) \sim \rm{BIN}(x_1, \frac{\theta^{(t)}}{2+\theta^{(t)}}).

and thus

EZx,θ(t)(y2)=x1θ(t)2+θ(t). E_{Z|\vec{x},|\theta^{(t)}}(y_2) = \frac{x_1 \theta^{(t)}}{2+\theta^{(t)}}.

Putting it all together, we may rewrite QQ as

Q(θθ(t))=x1θ(t)2+θ(t)log(θ)+x2log(1θ)+x3log(1θ)+x4logθ. Q(\theta|\theta^{(t)}) = \frac{x_1 \theta^{(t)}}{2+\theta^{(t)}} \log(\theta) + x_2 \log(1-\theta) + x_3 \log(1-\theta) + x_4 \log \theta.

M-step

In the M-step, we seek

θ(t+1)=arg maxθQ(θθ(t)), \theta^{(t+1)} = \argmax_{\theta} Q(\theta|\theta^{(t)}),

which can be found by solving

dQ(θθ(t))dθθ=θ(t+1)=0. \frac{d Q(\theta|\theta^{(t)})}{d \theta}\biggr|_{\theta = \theta^{(t+1)}} = 0.

Since the final solution is a little noisy when we leave x\vec{x} symbolic, we are going to finish the solution with the observed values of x\vec{x}. So,

dQdθ=125θ(t)2+θ(t)381θ34θ=0 \frac{d Q}{d \theta} = \frac{125 \theta^{(t)}}{2+\theta^{(t)}} - \frac{38}{1-\theta} - \frac{34}{\theta} = 0

has the solution, when we replace θ\theta with θ(t+1)\theta^{(t+1)},

θ(t+1)=159θ(t)+68197θ(t)+144. \theta^{(t+1)} = \frac{159 \theta^{(t)} + 68}{197 \theta^{(t)} + 144}.

Iterating update equation until some stopping condition that signifies convergence may then be done. We denote the converged solution of the EM algorithm by

θ^EM=limnθ(t). \hat\theta_{\mathrm{EM}} = \lim_n \theta^{(t)}.

We implement the EM algorithm in the following R code:

theta0 <- .5
theta1 <- NULL
i <- 1
repeat
{
  theta1 <- (159*theta0 + 68)/(197*theta0 + 144)
  cat("theta[",i,"] = ", theta1,"\n")
  if (abs(theta1-theta0) < 1e-6) { break }
  theta0 <- theta1
  i <- i + 1
}
## theta[ 1 ] =  0.6082474 
## theta[ 2 ] =  0.6243211 
## theta[ 3 ] =  0.6264889 
## theta[ 4 ] =  0.6267773 
## theta[ 5 ] =  0.6268156 
## theta[ 6 ] =  0.6268207 
## theta[ 7 ] =  0.6268214
print(theta1)
## [1] 0.6268214

We see the standard MLE θ^mle\hat\theta_{\rm{mle}} and θ^EM\hat\theta_{\rm{EM}} both obtain the same solution up to 66 decimal places when using the same starting value and stopping condition. However, θ^EM\hat\theta_{\rm{EM}} required more iterations before convergence, which was expected given that the EM algorithm is of linear order while the MLE using Newton-raphson is of quadratic order. However, the EM algorithm does have the benefit of a less complex updating equation.

Part (c)

Find the standard error of the MLE using either a numerical calculation of the inverse of the observed fisher’s information, i.e [l(ˆθ)]1[−l”(ˆθ)]^{−1}, or using Louis’ Method.

Solution

We already have the second derivative of the log-likelihood, so we choose to use the observed Fisher information evaluated at θ=θmle\theta = \theta_{\rm{mle}}.

obs_fisher <- -l2(mle,x)
var_mle <- 1 / obs_fisher
sqrt(var_mle)
## [1] 0.05146735

We see that an estimate of standard error of the MLE is sd(θ^mle)=0.0514673\mathrm{sd}(\hat{\theta}_{\rm{mle}}) = 0.0514673.

Problem 4 (10 points )

Consider the density f(x)3e0.5(x+2)2+7e0.5(x2)2f(x) ∝ 3e^{−0.5(x+2)^2} + 7e^{−0.5(x−2)^2} (problems 6, Homework 1).

Part (a)

Compute the exact normalizing constant, both in closed form using π\pi (pencil and paper), and also to 5 decimal places (by evaluating the exact version in R).

Solution

It is given that

f(x)ker(x) f(x) \propto \ker(x)

i.e., f(x)=Cker(x)f(x) = C \ker(x), where

ker(x)=3e0.5(x+2)2+7e0.5(x2)2 \rm{ker}(x) = 3 e^{-0.5(x+2)^2} + 7 e^{-0.5(x-2)^2}

and CC is the normalizing constant that satisfies the equation

1Cker(x)dx=1. \int_{-\infty}^{\infty} \frac{1}{C} \ker(x) dx = 1.

Solving for CC yields the result

C=3e0.5(x+2)2dx+7e0.5(x2)2dx. C = \int_{-\infty}^{\infty} 3 e^{-0.5(x+2)^2} dx + \int_{-\infty}^{\infty} 7 e^{-0.5(x-2)^2} dx.

Each mode of the bimodal distribution resembles a normal distribution with σ=1\sigma = 1. We may rewrite the above to

C=32πe0.5(x+2)22πdx+72πe0.5(x2)22πdx C = 3 \sqrt{2 \pi} \int_{-\infty}^{\infty} \frac{e^{-0.5(x+2)^2}}{\sqrt{2 \pi}} dx + 7 \sqrt{2 \pi} \int_{-\infty}^{\infty} \frac{e^{-0.5(x-2)^2}}{\sqrt{2 \pi}} dx

which is equivalent to

C=32πϕ(x+2)dx+72πϕ(x2)dx. C = 3 \sqrt{2 \pi} \int_{-\infty}^{\infty} \phi(x+2) dx + 7 \sqrt{2 \pi} \int_{-\infty}^{\infty} \phi(x-2) dx.

Since ϕ\phi is a pdf, each integral evaluates to 11, and thus

C=32π+72π=102π C = 3 \sqrt{2 \pi} + 7 \sqrt{2 \pi} = 10 \sqrt{2 \pi}

is the normalizing constant, which is 25.06628 to 55 decimal places.

Using R’s numerical integrator, we get the result:

C <- 10*sqrt(2*pi)
ker <- function(x) { 3*exp(-0.5*(x+2)^2) + 7*exp(-0.5*(x-2)^2) }
res <- integrate(ker,lower = -Inf, upper = Inf)
print(res)
## 25.06628 with absolute error < 4e-05

Part (b)

Approximate this integral with a Simpson’s rule to three decimal places. What is the effective range and number of subintervals required?

Solution

Simpson’s rule is implemented by the following R code. Note that we did not bother to optimize it.

# simpson : numerical integrator applying simpson's rule to n subintervals
# over the range (a,b).
#
# arguments;
#   f : the function to integrate
#   a : the lower-bound
#   b : the upper-bound. (a,b) together discrete the range.
#   n : the number of subintervals to partition the range
#
# we evenly partition the range into n subintervals of size (b-a)/n
# and apply simson's rule to each subinterval. then, we accumulate these
# values and return the result.
simpson <- function(f, a, b, n)
  {
    h <- (b-a)/n
    s <- 0
    x <- a
    for (i in 1:(n/2))
    {
      s <- s + f(x) + 4 * f(x+h) + f(x+2*h)
      x <- x + 2*h
    }
    s*h/3
}

To test effective range and subintervals, we decided to make a program that exhaustively searches for the minimum range and number of subintervals over a discrete set of points. In particular, we search for a range of the form (r,r)(-r,r) that is symmetric and an even nn. It is not perfect, but it seems like a reasonable way to estimate these requirements.

Here is the R code:

R <- NULL
N <- NULL
found <- F
for (r in 1:200)
{
  for (n in 1:100)
  {
    res <- simpson(ker,-r/10,r/10,2*n)
    if (abs(res - C) < 0.001)
    {
      R <- r/10
      N <- 2*n
      cat("r = +-", R, ", n = ", N, ", C = ", res, "\n")
      found <- T
      break
    }
  }
  if (found) { break }
}
## r = +- 5.9 , n =  16 , C =  25.06603

The minimum range (r,r)(-r,r), r>0r > 0, when divided into nn subintervals, that is the same as the true value 102π10 \sqrt{2\pi} to 33 decimal places is given by r=5.9r = 5.9 and n=16n = 16.