Setting up the scene

We are given some i.i.d. data, \({\bf y}=(y_1, \ldots, y_n)\), and assume that \[y_i \sim \mbox{Gamma}(\alpha,\beta), \quad i = 1,\ldots, n.\] We wish to estimate the parameters of the Gamma distribution, \((\alpha, \beta)\), within a Bayesian framework.

Exercise 1: Writing down the posterior density of interest

Priors

Since we are operating within a Bayesian framework we need to assign prior distributions to the parameters of interest, \(\alpha\) and \(\beta\). Both parameters are positive, so a natural distribution to represent our prior beliefs about them is a Gamma distribution. We will also assume that a priori (i.e. before we see the data) the parameters are independent. Hence: \[\alpha \sim \mbox{Gamma}(\lambda_{\alpha}, \nu_{\alpha})\] \[\beta \sim \mbox{Gamma}(\lambda_{\beta}, \nu_{\beta})\]

The corresponding pdfs of these Gamma distributions are: \[\pi(\alpha) = \frac{\nu_{\alpha}^{\lambda_{\alpha}}}{\Gamma(\lambda_\alpha)} \alpha^{\lambda_{\alpha}-1} \exp\{-\alpha \nu_\alpha\} \propto \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \]

\[\pi(\beta) = \frac{\nu_{\beta}^{\lambda_{\beta}}}{\Gamma(\lambda_\beta)} \beta^{\lambda_{\beta}-1} \exp\{-\beta \nu_\beta\} \propto \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \} \]

Likelihood function

The likelihood function, denoted by \(\pi({\bf y}|\alpha, \beta)\), is a product of the densities \(f(y_i|\alpha, \beta)\) since we have i.i.d. data: \[L(\alpha,\beta) = \pi({\bf y}|\alpha, \beta) = \prod_{i=1}^{n}f(y_i|\alpha, \beta) = \prod_{i=1}^{n}\frac{\beta^\alpha}{\Gamma(\alpha)} y_i^{\alpha - 1} \exp\{-\beta y_i\}\] It is easy to see that the likelihood function is equal to: \[ \pi({\bf y}|\alpha, \beta) = \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \left(\prod_{i=1}^{n} y_i^{\alpha-1}\right) \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\}\]
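
Taking logarithms gives the expression we will evaluate in the R code below: \[\log \pi({\bf y}|\alpha, \beta) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^{n}\log y_i - \beta\sum_{i=1}^{n}y_i\]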

Full (Joint) Posterior Distribution

The density of the posterior distribution is, up to proportionality, the product of the likelihood function and the prior densities: \[\pi(\alpha, \beta|{\bf y}) \propto \pi({\bf y}|\alpha, \beta) \pi(\alpha) \pi(\beta)\] Using the expressions above we derive (up to proportionality) the density of the joint posterior distribution of the parameters given the data: \[\pi(\alpha, \beta|{\bf y}) \propto \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \prod_{i=1}^{n} y_i^{\alpha-1} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\} \times \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \times \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \}\]

Exercise 2: Derive the Full Conditionals

We would like to draw samples from this distribution using MCMC. The first thing we need to do is to write down the full conditional distributions of the parameters:

Full Conditionals

To derive the full conditional density of one of the parameters, say \(\alpha\), we look at the joint posterior density and collect the terms that involve \(\alpha\); everything else is treated as a constant: \[ \pi(\alpha|\beta, {\bf y}) \propto \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \left(\prod_{i=1}^{n} y_i^{\alpha-1}\right) \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \] Similarly, the density of the full conditional posterior distribution of \(\beta\) given everything else is: \[ \pi(\beta|\alpha, {\bf y}) \propto \beta^{n\alpha} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\} \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \} = \beta^{n\alpha + \lambda_\beta -1} \exp\left\{-\beta\left(\sum_{i=1}^{n}y_i + \nu_{\beta} \right) \right\} \]

It is easy to recognise that this is the density of a Gamma distribution with shape parameter \(n\alpha + \lambda_\beta\) and rate parameter \(\sum_{i=1}^{n}y_i + \nu_{\beta}\), i.e.:

\[\beta|\alpha, {\bf y} \sim \mbox{Ga}\left(n\alpha + \lambda_\beta, \sum_{i=1}^{n}y_i + \nu_{\beta} \right)\]

Therefore for \(\beta\) we can sample directly from its full conditional distribution (a Gibbs step). The full conditional of \(\alpha\), on the other hand, is not of any standard form, so we will update \(\alpha\) with a Metropolis-Hastings step.

Exercise 3: Write an MCMC algorithm

A pseudo-code for implementing a Metropolis-Hastings (within Gibbs) algorithm:

  1. Initialise \(\alpha\) and \(\beta\);
  2. Update \(\alpha\) with a random-walk Metropolis step targeting \(\pi(\alpha|\beta, {\bf y})\);
  3. Sample \(\beta\) directly from its full conditional \(\pi(\beta|\alpha, {\bf y})\);
  4. Go to 2.

R-code

We will now write some R functions which will enable us to implement the above algorithm on the computer.

We first write a function to compute the log of the Gamma density with parameters \(\lambda\) and \(\nu\), evaluated at \(\theta\), up to an additive constant:

log.gamma.density <- function(theta, lambda, nu) {
   out <- (lambda-1)*log(theta) - theta*nu
   return(out)
}

Needless to say, one can use dgamma( ... , log=TRUE) instead.
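
For example, we can check (with arbitrary parameter values; any positive values would do) that the two agree up to an additive constant:

# the difference with dgamma(..., log = TRUE) should be constant in theta
theta <- c(0.5, 1, 2)
log.gamma.density(theta, lambda = 2, nu = 3) - dgamma(theta, shape = 2, rate = 3, log = TRUE)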

We now write a function that takes as arguments the data (a vector), the number of MCMC iterations (a scalar), the tuning standard deviation of the random-walk Metropolis proposal (a scalar), and the values of the parameters of the prior distributions (scalars):

mcmc <- function(y, iter, sigma.rw, lambda.alpha, lambda.beta, nu.alpha, nu.beta) {
   
   # find n from the data (ie the number of observations)
   n <- length(y)
   
   # compute the sum of the observations and the sum of the log of the observations
   sum.y <- sum(y)
   sum.log.y <- sum(log(y))
   
   # first create a table which will store the samples; first column will be alpha, second will be beta
   out <- matrix(NA, nrow=iter, ncol=2)
   
   # initial values
   alpha.cur <- 1
   beta.cur <- 1
   out[1,] <- c(alpha.cur, beta.cur)
   
   # mcmc loop starts here
   for (i in 2:iter) {
      
      ###############
      # update alpha (assume beta is fixed)
      ###############
      
      # propose a new value for alpha
      alpha.can <- rnorm(1, alpha.cur, sigma.rw)
      
      # if it is negative reject straight away else compute the M-H ratio
      if (alpha.can > 0) {
         
           # evaluate the loglikelihood at the current values of alpha.
          loglik.cur <- n*alpha.cur*log(beta.cur) - n*lgamma(alpha.cur) + (alpha.cur-1)*sum.log.y 

         # compute the log-likelihood at the candidate value of alpha
         loglik.can <- n*alpha.can*log(beta.cur) - n*lgamma(alpha.can) + (alpha.can-1)*sum.log.y 
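         # note: the term -beta.cur*sum.y does not involve alpha, so it cancels in the M-H ratio and is omitted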
   
         
         # log prior densities
         log.prior.alpha.cur <- log.gamma.density(alpha.cur, lambda.alpha, nu.alpha)
         log.prior.alpha.can <- log.gamma.density(alpha.can, lambda.alpha, nu.alpha)
         
         logpi.cur <- loglik.cur + log.prior.alpha.cur
         logpi.can <- loglik.can + log.prior.alpha.can
         
         # M-H ratio
         
         # draw from a U(0,1)
         u <- runif(1)
         
         if (log(u) < logpi.can - logpi.cur) {
            alpha.cur <- alpha.can
         }
      }
      
      ###############
      # update beta
      ###############
      
      beta.cur <- rgamma(1, alpha.cur*n + lambda.beta, rate = sum.y + nu.beta)  
    
      # store the samples
      out[i,] <- c(alpha.cur, beta.cur)
      
   }
   
   return(out)
}

Exercise 4: Test the MCMC algorithm

Suppose now that we have some data and we want to fit this model, i.e. we assume that these data are a random sample from a Gamma distribution whose parameters we want to estimate.

First let us generate some data to test our code:

data <- rgamma(100, 4, 2)

Use the mcmc function to draw samples from the posterior distribution of the parameters assuming that \(\lambda_{\alpha} = \lambda_\beta = 1\) and \(\nu_{\alpha} = \nu_{\beta} = 0.001\):

res <- mcmc(y = data, iter = 10000, sigma.rw = 1, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)

Let’s look at the output. First, trace plots to ensure that the chains mix well:

par(mfrow=c(1,2))
plot(res[,1],type='l', main=expression(paste(alpha)))
plot(res[,2],type='l', main=expression(paste(beta)))
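
It is also useful to check how often the random-walk proposal for \(\alpha\) is accepted. A crude estimate (counting consecutive identical values as rejections) is:

# rough acceptance rate of the alpha updates
mean(diff(res[,1]) != 0)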

Exclude the first 1,000 iterations as burn-in:

res <- res[-c(1:1000),]

Next, let us draw some histograms:

par(mfrow=c(1,2))
hist(res[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
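
We can also summarise the marginal posteriors numerically, for example with posterior means and equal-tailed 95% credible intervals:

# posterior means and 95% credible intervals for alpha (column 1) and beta (column 2)
colMeans(res)
apply(res, 2, quantile, probs = c(0.025, 0.975))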

An advantage of a Bayesian approach is that we can explore the joint posterior distribution of the parameters:

par(mfrow=c(1,1))
plot(res, xlab=expression(alpha), ylab=expression(beta))

It is obvious that there is a very strong correlation between the two parameters (shape and rate).
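
We can quantify this with the sample correlation of the posterior draws:

# sample correlation between the draws of alpha and beta
cor(res[,1], res[,2])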

Exercise 5

We use the same code as above to fit the model to the Campylobacter data:

# read the data 
y <- c(rep(2,2), rep(3, 6), rep(4, 11), rep(5, 3), rep(7,7))

# use the MCMC code above
res <- mcmc(y = y, iter = 10000, sigma.rw = 1, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)

# trace plots 
par(mfrow=c(1,2))
plot(res[,1],type='l', main=expression(paste(alpha)))
plot(res[,2],type='l', main=expression(paste(beta)))

# burn in
res <- res[-c(1:1000),]

# marginal posteriors
par(mfrow=c(1,2))
hist(res[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))

There are a number of ways to assess the goodness of fit; one way to do this is as follows:

alpha.hat <- median(res[,1])
alpha.hat
## [1] 8.443669
beta.hat <- median(res[,2])
beta.hat
## [1] 1.901505
hist(y, prob=TRUE, main="")
curve(dgamma(x, shape=alpha.hat, rate=beta.hat), from =1.5, to = 8, add=TRUE, col=2)
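
Another possible check (one of many, sketched here) is a simple posterior predictive comparison: simulate replicated datasets from posterior draws of \((\alpha, \beta)\) and compare, say, their sample means with the observed sample mean:

# posterior predictive check (a sketch): replicated sample means vs the observed mean
idx <- sample(nrow(res), 1000)
pred.means <- sapply(idx, function(j) mean(rgamma(length(y), shape = res[j,1], rate = res[j,2])))
hist(pred.means, prob=TRUE, main="Posterior predictive sample means")
abline(v = mean(y), col = 2, lwd = 2)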

Exercise 6

Block-update

Let us first have a look at the joint posterior of \(\alpha\) and \(\beta\):

plot(res[,1], res[,2])

The plot reveals a clear positive correlation between the two parameters. Therefore, it makes sense to propose from a bivariate Normal distribution where the correlation between the two components (variables) is relatively high.

library(mvtnorm)

# block-update mcmc algorithm 
mcmc.block <- function(y, iter, Sigma, lambda.alpha, lambda.beta, nu.alpha, nu.beta) {
   
   # find n from the data (ie the number of observations)
   n <- length(y)
   
   # compute the sum of the observations and the sum of the log of the observations
   sum.y <- sum(y)
   sum.log.y <- sum(log(y))
   
   # first create a table which will store the samples; first column will be alpha, second will be beta
   out <- matrix(NA, nrow=iter, ncol=2)
   
   # initial values
   alpha.cur <- 1
   beta.cur <- 1
   
   out[1,] <- c(alpha.cur, beta.cur)
   
   # mcmc loop starts here
   for (i in 2:iter) {
      
      
      
      #######################
      # update alpha and beta
      #######################
      
      alpha.beta.can <- rmvnorm(1, c(alpha.cur, beta.cur), Sigma)
      
      alpha.can <- alpha.beta.can[1]
      beta.can <- alpha.beta.can[2]
      
      
      # if either of them is negative, reject straight away else compute the M-H ratio
      if (all(alpha.beta.can > 0)) {
         
          # evaluate the loglikelihood at the current values of alpha.
          loglik.cur <- n*alpha.cur*log(beta.cur) - n*lgamma(alpha.cur) + (alpha.cur-1)*sum.log.y - beta.cur*sum.y

         # compute the log-likelihood at the candidate value of alpha
         loglik.can <- n*alpha.can*log(beta.can) - n*lgamma(alpha.can) + (alpha.can-1)*sum.log.y - beta.can * sum.y
   
         
         # log prior densities
         log.prior.alpha.cur <- log.gamma.density(alpha.cur, lambda.alpha, nu.alpha)
         log.prior.alpha.can <- log.gamma.density(alpha.can, lambda.alpha, nu.alpha)
         
         log.prior.beta.cur <- log.gamma.density(beta.cur, lambda.beta, nu.beta)
         log.prior.beta.can <- log.gamma.density(beta.can, lambda.beta, nu.beta)
         
         # log target densities
         logpi.cur <- loglik.cur + log.prior.alpha.cur + log.prior.beta.cur
         logpi.can <- loglik.can + log.prior.alpha.can + log.prior.beta.can
         
         # M-H ratio
         
         # draw from a U(0,1)
         u <- runif(1)
         
         if (log(u) < logpi.can - logpi.cur) {
            alpha.cur <- alpha.can
            beta.cur <- beta.can
         }
      }
      
      # store the samples
      out[i,] <- c(alpha.cur, beta.cur)
      
   }
   
   return(out)
}

Let us run this algorithm:

# find the sample variance-covariance matrix and use that in the proposal
Sigma.hat <- cov(res)

# use the MCMC code above
res.block <- mcmc.block(y = y, iter = 10000, Sigma = Sigma.hat, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)

# trace plots 
par(mfrow=c(1,2))
plot(res.block[,1],type='l', main=expression(paste(alpha)))
plot(res.block[,2],type='l', main=expression(paste(beta)))

# burn in
res.block <- res.block[-c(1:1000),]

# marginal posteriors
par(mfrow=c(1,2))
hist(res.block[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res.block[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))

Let us check that this new sampler is drawing from the same distribution as the first sampler.

par(mfrow=c(1,2))
hist(res.block[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
lines(density(res[,1]),col=2, lwd=2)
hist(res.block[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
lines(density(res[,2]),col=2, lwd=2)
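
A crude way to compare how often the two samplers move is via their empirical acceptance rates (again counting consecutive identical values of \(\alpha\) as rejections):

# rough acceptance rates: single-site random walk vs block update
mean(diff(res[,1]) != 0)
mean(diff(res.block[,1]) != 0)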

Independence sampler

The idea behind this sampler is to approximate the joint posterior distribution of \(\alpha\) and \(\beta\) with a bivariate Normal distribution and use that distribution within an MCMC algorithm.
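
In other words, we use a Laplace-type approximation: if \((\hat{\alpha}, \hat{\beta})\) denotes the posterior mode and \(H\) the Hessian of the log posterior density evaluated at the mode, then \[\pi(\alpha, \beta | {\bf y}) \approx \mbox{N}\!\left( (\hat{\alpha}, \hat{\beta}), \; (-H)^{-1} \right).\]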

First, let us define a function which returns the log posterior density \[\pi(\alpha, \beta|{\bf y}) \propto \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \prod_{i=1}^{n} y_i^{\alpha-1} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\} \times \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \times \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \}\]

log.post.density <- function(theta, lambda.alpha, lambda.beta, nu.alpha, nu.beta, y) {
   
   alpha <- theta[1]
   beta <- theta[2]
   
   if (alpha > 0 && beta > 0) {
   
      # find n from the data (ie the number of observations)
      n <- length(y)
      
      # compute the sum of the observations and the sum of the log of the observations
      sum.y <- sum(y)
      sum.log.y <- sum(log(y))
      
      # log-likelihood
      loglik <- n*alpha*log(beta) - n*lgamma(alpha) + (alpha-1)*sum.log.y - beta*sum.y
      log.priors <- log.gamma.density(alpha, lambda.alpha, nu.alpha) + log.gamma.density(beta, lambda.beta, nu.beta)
      
      # put it all together
      res <- loglik + log.priors
   }
   else {
      res <- -10^100
   }
   
   # return output
   return(res)
}
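
As a quick sanity check, we can evaluate this function at an arbitrary point (the values (4, 2) below are purely illustrative):

# evaluate the (unnormalised) log posterior density at an arbitrary point
log.post.density(c(4, 2), lambda.alpha = 1, lambda.beta = 1, nu.alpha = 0.001, nu.beta = 0.001, y = y)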

We then use optim to find the mode of the posterior density

res.MAP <- optim(c(1,1), log.post.density, lambda.alpha = 1, lambda.beta = 1, nu.alpha  = 0.001, nu.beta = 0.001, y=y, control=list(fnscale=-1), hessian=TRUE)
res.MAP
## $par
## [1] 8.079134 1.802244
## 
## $value
## [1] -53.13166
## 
## $counts
## function gradient 
##      133       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##           [,1]      [,2]
## [1,] -3.820777  16.09105
## [2,] 16.091055 -72.13333

Therefore, the Normal approximation to the joint posterior distribution will have mean vector (8.079134, 1.802244) and variance-covariance matrix equal to the inverse of the negative Hessian matrix, i.e.:

res.mean <- res.MAP$par
res.mean
## [1] 8.079134 1.802244
res.var.cov <- solve(-res.MAP$hessian)
res.var.cov
##           [,1]      [,2]
## [1,] 4.3236678 0.9644969
## [2,] 0.9644969 0.2290172

Let us now write the algorithm:

# independence sampler mcmc algorithm 
mcmc.indep <- function(y, iter, mu, Sigma, lambda.alpha, lambda.beta, nu.alpha, nu.beta) {
   
   # find n from the data (ie the number of observations)
   n <- length(y)
   
   # compute the sum of the observations and the sum of the log of the observations
   sum.y <- sum(y)
   sum.log.y <- sum(log(y))
   
   # first create a table which will store the samples; first column will be alpha, second will be beta
   out <- matrix(NA, nrow=iter, ncol=2)
   
   # initial values
   alpha.cur <- 1
   beta.cur <- 1
   
   out[1,] <- c(alpha.cur, beta.cur)
   
   # mcmc loop starts here
   for (i in 2:iter) {
      
      
      
      #######################
      # update alpha and beta
      #######################
      
      alpha.beta.can <- rmvnorm(1, mu, Sigma)
      
      alpha.can <- alpha.beta.can[1]
      beta.can <- alpha.beta.can[2]
      
      
      # if either of them is negative, reject straight away else compute the M-H ratio
      if (all(alpha.beta.can > 0)) {
         
          # evaluate the loglikelihood at the current values of alpha.
          loglik.cur <- n*alpha.cur*log(beta.cur) - n*lgamma(alpha.cur) + (alpha.cur-1)*sum.log.y - beta.cur*sum.y

         # compute the log-likelihood at the candidate value of alpha
         loglik.can <- n*alpha.can*log(beta.can) - n*lgamma(alpha.can) + (alpha.can-1)*sum.log.y - beta.can * sum.y
   
         
         # log prior densities
         log.prior.alpha.cur <- log.gamma.density(alpha.cur, lambda.alpha, nu.alpha)
         log.prior.alpha.can <- log.gamma.density(alpha.can, lambda.alpha, nu.alpha)
         
         log.prior.beta.cur <- log.gamma.density(beta.cur, lambda.beta, nu.beta)
         log.prior.beta.can <- log.gamma.density(beta.can, lambda.beta, nu.beta)
         
         # log target densities
         logpi.cur <- loglik.cur + log.prior.alpha.cur + log.prior.beta.cur
         logpi.can <- loglik.can + log.prior.alpha.can + log.prior.beta.can
         
         # the proposal is NOT symmetric any more, so we need to include the log of the proposal ratio q(current)/q(candidate)
         log.q.num <- dmvnorm(c(alpha.cur, beta.cur), mean=mu, sigma = Sigma, log=TRUE)
         log.q.den <- dmvnorm(c(alpha.can, beta.can), mean=mu, sigma = Sigma, log=TRUE)
         
         
         # M-H ratio
         
         # draw from a U(0,1)
         u <- runif(1)
         
         if (log(u) < logpi.can - logpi.cur + log.q.num - log.q.den) {
            alpha.cur <- alpha.can
            beta.cur <- beta.can
         }
      }
      
      # store the samples
      out[i,] <- c(alpha.cur, beta.cur)
      
   }
   
   return(out)
}

Let us run this algorithm:

# use the MCMC code above
res.indep <- mcmc.indep(y=y, iter = 10000, mu = res.MAP$par, Sigma = res.var.cov, lambda.alpha = 1, lambda.beta = 1,  nu.alpha = 0.001, nu.beta = 0.001)

# trace plots 
par(mfrow=c(1,2))
plot(res.indep[,1],type='l', main=expression(paste(alpha)))
plot(res.indep[,2],type='l', main=expression(paste(beta)))

# burn in
res.indep <- res.indep[-c(1:1000),]

# marginal posteriors
par(mfrow=c(1,2))
hist(res.indep[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res.indep[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
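
For an independence sampler a high acceptance rate suggests that the Normal approximation is a good match to the posterior. A crude estimate:

# rough acceptance rate of the independence sampler
mean(diff(res.indep[,1]) != 0)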

Are all algorithms sampling from the same distribution?

Let us run them all for much longer on the same (Campylobacter) data:

res <- mcmc(y = y, iter = 100000, sigma.rw = 1, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)

res <- res[-c(1:1000),]

Sigma.hat <- cov(res)

res.block <- mcmc.block(y = y, iter = 100000, Sigma = Sigma.hat, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)

res.block <- res.block[-c(1:10000),]

res.MAP <- optim(c(1,1), log.post.density, lambda.alpha = 1, lambda.beta = 1, nu.alpha  = 0.001, nu.beta = 0.001, y=y, control=list(fnscale=-1), hessian=TRUE)

res.MAP
## $par
## [1] 8.079134 1.802244
## 
## $value
## [1] -53.13166
## 
## $counts
## function gradient 
##      133       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##           [,1]      [,2]
## [1,] -3.820777  16.09105
## [2,] 16.091055 -72.13333
res.var.cov <- solve(-res.MAP$hessian)

res.indep <- mcmc.indep(y=y, iter = 100000, mu = res.MAP$par, Sigma = res.var.cov, lambda.alpha = 1, lambda.beta = 1,  nu.alpha = 0.001, nu.beta = 0.001)

Draw the posterior densities:

par(mfrow=c(1,2))

plot(density(res[,1]), xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
lines(density(res.block[,1]),col=2)
lines(density(res.indep[,1]),col=4)

plot(density(res[,2]), xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
lines(density(res.block[,2]),col=2)
lines(density(res.indep[,2]),col=4)
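
Finally, a quick numerical comparison of the three samplers; if they are all sampling from the same posterior, the posterior means and standard deviations should agree closely:

# posterior means and standard deviations of (alpha, beta) for each sampler
comp <- rbind(single.site = c(colMeans(res), apply(res, 2, sd)),
              block       = c(colMeans(res.block), apply(res.block, 2, sd)),
              indep       = c(colMeans(res.indep), apply(res.indep, 2, sd)))
colnames(comp) <- c("mean.alpha", "mean.beta", "sd.alpha", "sd.beta")
comp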