We are given some i.i.d data, \({\bf y}=(y_1, \ldots, y_n)\), and assume that \[y_i \sim \mbox{Gamma}(\alpha,\beta), \quad i = 1,\ldots, n.\] We wish to estimate the parameters of the Gamma distribution \(\alpha, \beta)\) within a Bayesian framework.
Since we are operating within a Bayesian framework we need to assign prior distributions to the parameters of interest \(\alpha\) and \(\beta\). Both parameters are positive so a natural to distribution to represent our prior beliefs them is a Gamma distribution. We will also assume that a priori (ie before we see the data) that the parameters are independent. Hence: \[\alpha \sim \mbox{Gamma}(\lambda_{\alpha}, \nu_{\alpha})\] \[\beta \sim \mbox{Gamma}(\lambda_{\beta}, \nu_{\beta})\]
The corresponding pdfs of these Gamma distributions are: \[\pi(\alpha) = \frac{\nu_{\alpha}^{\lambda{\alpha}}}{\Gamma(\lambda_\alpha)} \alpha^{\lambda_{\alpha}-1} \exp\{-\alpha \nu_\alpha\} \propto \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \]
\[\pi(\beta) = \frac{\nu_{\beta}^{\lambda{\beta}}}{\Gamma(\lambda_\beta)} \beta^{\lambda_{\beta}-1} \exp\{-\beta \nu_\beta\} \propto \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \} \]
The likelihood function, denoted by \(\pi(\bf{y}|(\alpha, \beta))\) is a product of the densities \(f(y_i|\alpha, \beta)\) since we have i.i.d. data: \[L(\alpha,\beta) = \pi({\bf y}|\alpha, \beta) = \prod_{i=1}^{n}f(y_i|\alpha, \beta) = \prod_{i=1}^{n}\frac{\beta^\alpha}{\Gamma(\alpha)} y_i^{\alpha - 1} \exp\{-\beta y_i\}\] It is easy to see that the likelihood function is equal to: \[ \pi({\bf y}|\alpha, \beta) = \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \prod_{i=1}^{n} y_i^{\alpha-1} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\}\]
The density of the posterior distribution is up to proportionality the product of the likelihood functions times the prior densities: \[\pi(\alpha, \beta|{\bf y}) \propto \pi({\bf y}|\alpha, \beta) \pi(\alpha) \pi(\beta)\] Using the expression above we derive (up to proportionality) the density of the joint posterior distribution of the parameters given the data: \[\pi(\alpha, \beta|{\bf y}) \propto \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \prod_{i=1}^{n} y_i^{\alpha-1} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\} \times \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \times \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \}\]
We would like to draw samples from this distribution using MCMC. The first thing we need to do, is to write down the full conditional distribution of the parameters:
To derive the full conditional density of the distribution of one of the parameters, say, \(\alpha\), we just look at the joint posterior density and put together the terms that only involve \(\alpha\). Everything else is treated as a constant: \[ \pi(\alpha|\beta, {\bf y}) \propto \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \left(\prod_{i=1}^{n} y_i^{\alpha-1}\right) \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \] Similarly, the density of full conditional posterior distribution of \(\beta\) given everything else is given as: \[ \pi(\beta|\alpha, {\bf y}) \propto \beta^{n\alpha} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\} \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \} = \beta^{n\alpha + \lambda_\beta -1} \exp\left\{-\beta\left(\sum_{i=1}^{n}y_i + \nu_{\beta} \right) \right \} \]
It is easy to recognise that this a density from a Gamma distribution with parameters \(n\alpha + \lambda_\beta\) and \(\sum_{i=1}^{n}y_i + \nu_{\beta}\), i.e.:
\[\beta|\alpha, {\bf y} \sim \mbox{Ga}\left(n\alpha + \lambda_\beta, \sum_{i=1}^{n}y_i + \nu_{\beta} \right)\]
Therefore for \(\beta\) we can sample directly from its conditional distribution.
A pseudo-code for implementing a Metropolis Hastings algorithm:
We will now write some R
functions which will enable us
to implement the above algorithm in the computer.
We first write functions to compute the (log) gamma density with parameters \(\lambda\) and \(\nu\) for a random variable \(\theta\) up to proportionality:
log.gamma.density <- function(theta, lambda, nu) {
out <- (lambda-1)*log(theta) - theta*nu
return(out)
}
Needless to say that one can use dgamma( ... , log=TRUE)
instead.
We now write a function that takes as arguments the data (vector), the number of MCMC iterations (scalar), the tuning varianaces for the random walk Metropolis (vector), and the values of the parameters for the prior distributios (scalars):
mcmc <- function(y, iter, sigma.rw, lambda.alpha, lambda.beta, nu.alpha, nu.beta) {
# find n from the data (ie the number of observations)
n <- length(y)
# compute the sum of the observations and the sum of the log of the observations
sum.y <- sum(y)
sum.log.y <- sum(log(y))
# first create a table which will store the samples; first column will be alpha, second will be beta
out <- matrix(NA, nrow=iter, ncol=2)
# initial values
alpha.cur <- 1
beta.cur <- 1
out[1,] <- c(alpha.cur, beta.cur)
# mcmc loop starts here
for (i in 2:iter) {
###############
# update alpha (assume beta is fixed)
###############
# propose a new value for alpha
alpha.can <- rnorm(1, alpha.cur, sigma.rw)
# if it is negative reject straight away else compute the M-H ratio
if (alpha.can > 0) {
# evaluate the loglikelihood at the current values of alpha.
loglik.cur <- n*alpha.cur*log(beta.cur) - n*lgamma(alpha.cur) + (alpha.cur-1)*sum.log.y
# compute the log-likelihood at the candidate value of alpha
loglik.can <- n*alpha.can*log(beta.cur) - n*lgamma(alpha.can) + (alpha.can-1)*sum.log.y
# log prior densities
log.prior.alpha.cur <- log.gamma.density(alpha.cur, lambda.alpha, nu.alpha)
log.prior.alpha.can <- log.gamma.density(alpha.can, lambda.alpha, nu.alpha)
logpi.cur <- loglik.cur + log.prior.alpha.cur
logpi.can <- loglik.can + log.prior.alpha.can
# M-H ratio
# draw from a U(0,1)
u <- runif(1)
if (log(u) < logpi.can - logpi.cur) {
alpha.cur <- alpha.can
}
}
###############
# update beta
###############
beta.cur <- rgamma(1, alpha.cur*n + lambda.beta, rate = sum.y + nu.beta)
# store the samples
out[i,] <- c(alpha.cur, beta.cur)
}
return(out)
}
Suppose now that we have some data and we want to fit this model, i.e. that these data a random sample from a Gamma distribution whose parameters we wanto to estimate.
First let us generate some idea to test our code:
data <- rgamma(100, 4, 2)
Use the mcmc
function to draw samples from the posterior
distribution of the parameters assuming that \(\lambda_{\alpha} = \lambda_\beta = 1\) and
\(\nu_{\alpha} = \nu_{\beta} =
0.001\):
res <- mcmc(y = data, iter = 10000, sigma.rw = 1, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)
Let’s look at the output. First, trace plots to ensure that the chains mix well:
par(mfrow=c(1,2))
plot(res[,1],type='l', main=expression(paste(alpha)))
plot(res[,2],type='l', main=expression(paste(beta)))
Exlude the first 1,000 observations as burn in:
res <- res[-c(1:1000),]
Next, let us draw some histograms:
par(mfrow=c(1,2))
hist(res[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
An advantage of a Bayesian approach that we can explore the the joint distribution of the parameters
par(mfrow=c(1,1))
plot(res, xlab=paste(expression(alpha)), ylab=paste(expression(beta)))
It is obvious that there is a very strong correlation between the two paramaters (shape and scale).
We use the same code above to fit the model to the Cambylobacter data:
# read the data
y <- c(rep(2,2), rep(3, 6), rep(4, 11), rep(5, 3), rep(7,7))
# use the MCMC code above
res <- mcmc(y = y, iter = 10000, sigma.rw = 1, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)
# trace plots
par(mfrow=c(1,2))
plot(res[,1],type='l', main=expression(paste(alpha)))
plot(res[,2],type='l', main=expression(paste(beta)))
# burn in
res <- res[-c(1:1000),]
# marginal posteriors
par(mfrow=c(1,2))
hist(res[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
There are a number of ways to assess the goodness of fit; one way do this is as follows:
alpha.hat <- median(res[,1])
alpha.hat
## [1] 8.443669
beta.hat <- median(res[,2])
beta.hat
## [1] 1.901505
hist(y, prob=TRUE, main="")
curve(dgamma(x, shape=alpha.hat, rate=beta.hat), from =1.5, to = 8, add=TRUE, col=2)
Let us first have a look at the joint posterior of \(\alpha\) and \(\beta\)
plot(res[,1], res[,2])
The plot reveals a clear positive correlation between the two parameters. Therefore, it makes sense to propose from a bivariate Normal distribution where the correlation between the two components (variables) is relatively high.
library(mvtnorm)
# block-update mcmc algorithm
mcmc.block <- function(y, iter, Sigma, lambda.alpha, lambda.beta, nu.alpha, nu.beta) {
# find n from the data (ie the number of observations)
n <- length(y)
# compute the sum of the observations and the sum of the log of the observations
sum.y <- sum(y)
sum.log.y <- sum(log(y))
# first create a table which will store the samples; first column will be alpha, second will be beta
out <- matrix(NA, nrow=iter, ncol=2)
# initial values
alpha.cur <- 1
beta.cur <- 1
out[1,] <- c(alpha.cur, beta.cur)
# mcmc loop starts here
for (i in 2:iter) {
#######################
# update alpha and beta
#######################
alpha.beta.can <- rmvnorm(1, c(alpha.cur, beta.cur), Sigma)
alpha.can <- alpha.beta.can[1]
beta.can <- alpha.beta.can[2]
# if either of them is negative, reject straight away else compute the M-H ratio
if ( sum(alpha.beta.can > 0)==length(alpha.beta.can)) {
# evaluate the loglikelihood at the current values of alpha.
loglik.cur <- n*alpha.cur*log(beta.cur) - n*lgamma(alpha.cur) + (alpha.cur-1)*sum.log.y - beta.cur*sum.y
# compute the log-likelihood at the candidate value of alpha
loglik.can <- n*alpha.can*log(beta.can) - n*lgamma(alpha.can) + (alpha.can-1)*sum.log.y - beta.can * sum.y
# log prior densities
log.prior.alpha.cur <- log.gamma.density(alpha.cur, lambda.alpha, nu.alpha)
log.prior.alpha.can <- log.gamma.density(alpha.can, lambda.alpha, nu.alpha)
log.prior.beta.cur <- log.gamma.density(beta.cur, lambda.beta, nu.beta)
log.prior.beta.can <- log.gamma.density(beta.can, lambda.beta, nu.beta)
# log target densities
logpi.cur <- loglik.cur + log.prior.alpha.cur + log.prior.beta.cur
logpi.can <- loglik.can + log.prior.alpha.can + log.prior.beta.can
# M-H ratio
# draw from a U(0,1)
u <- runif(1)
if (log(u) < logpi.can - logpi.cur) {
alpha.cur <- alpha.can
beta.cur <- beta.can
}
}
# store the samples
out[i,] <- c(alpha.cur, beta.cur)
}
return(out)
}
Let us run this algorithm:
#find the sample variance covariance matrix that use that in the proposal
Sigma.hat <- cov(res)
# use the MCMC code above
res.block <- mcmc.block(y = y, iter = 10000, Sigma = Sigma.hat, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)
# trace plots
par(mfrow=c(1,2))
plot(res.block[,1],type='l', main=expression(paste(alpha)))
plot(res.block[,2],type='l', main=expression(paste(beta)))
# burn in
res.block <- res.block[-c(1:1000),]
# marginal posteriors
par(mfrow=c(1,2))
hist(res.block[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res.block[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
Let us check that this new sampler is drawing from the same distribution as the first sampler.
par(mfrow=c(1,2))
hist(res.block[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
lines(density(res[,1]),col=2, lwd=2)
hist(res.block[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
lines(density(res[,2]),col=2, lwd=2)
The idea behind this sampler is to approximate the joint posterior distribution of \(\alpha\) and \(\beta\) with a bivariate Normal distribution and use that distribution within an MCMC algorithm.
First, let us define a function which returns the log posterior density \[\pi(\alpha, \beta|{\bf y}) \propto \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n} \prod_{i=1}^{n} y_i^{\alpha-1} \exp\left\{-\beta\sum_{i=1}^{n}y_i\right\} \times \alpha^{\lambda_\alpha-1} \exp\{-\alpha \nu_\alpha \} \times \beta^{\lambda_\beta-1} \exp\{-\beta \nu_\beta \}\]
log.post.density <- function(theta, lambda.alpha, lambda.beta, nu.alpha, nu.beta, y) {
alpha <- theta[1]
beta <- theta[2]
if (alpha > 0 && beta > 0) {
# find n from the data (ie the number of observations)
n <- length(y)
# compute the sum of the observations and the sum of the log of the observations
sum.y <- sum(y)
sum.log.y <- sum(log(y))
# log-likelihood
loglik <- n*alpha*log(beta) - n*lgamma(alpha) + (alpha-1)*sum.log.y - beta*sum.y
log.priors <- log.gamma.density(alpha, lambda.alpha, nu.alpha) + log.gamma.density(alpha, lambda.alpha, nu.alpha)
# put it all together
res <- loglik + log.priors
}
else {
res <- -10^100
}
# return output
return(res)
}
We then use optim
to find the mode of the posterior
density
res.MAP <- optim(c(1,1), log.post.density, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 0.001, nu.beta = 0.001, y=y, control=list(fnscale=-1), hessian=TRUE)
res.MAP
## $par
## [1] 8.079134 1.802244
##
## $value
## [1] -53.13166
##
## $counts
## function gradient
## 133 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] -3.820777 16.09105
## [2,] 16.091055 -72.13333
Therefore, the Normal approximation to the joint posterior distribution will have mean 8.0791335, 1.8022438 and variance covariance matrix the inverse of the negative Hessian matrix, i.e.
res.mean <- res.MAP$par
res.mean
## [1] 8.079134 1.802244
res.var.cov <- solve(-res.MAP$hessian)
res.var.cov
## [,1] [,2]
## [1,] 4.3236678 0.9644969
## [2,] 0.9644969 0.2290172
Let us now write the algotithm
# independence sampler mcmc algorithm
mcmc.indep <- function(y, iter, mu, Sigma, lambda.alpha, lambda.beta, nu.alpha, nu.beta) {
# find n from the data (ie the number of observations)
n <- length(y)
# compute the sum of the observations and the sum of the log of the observations
sum.y <- sum(y)
sum.log.y <- sum(log(y))
# first create a table which will store the samples; first column will be alpha, second will be beta
out <- matrix(NA, nrow=iter, ncol=2)
# initial values
alpha.cur <- 1
beta.cur <- 1
out[1,] <- c(alpha.cur, beta.cur)
# mcmc loop starts here
for (i in 2:iter) {
#######################
# update alpha and beta
#######################
alpha.beta.can <- rmvnorm(1, mu, Sigma)
alpha.can <- alpha.beta.can[1]
beta.can <- alpha.beta.can[2]
# if either of them is negative, reject straight away else compute the M-H ratio
if ( sum(alpha.beta.can > 0)==length(alpha.beta.can)) {
# evaluate the loglikelihood at the current values of alpha.
loglik.cur <- n*alpha.cur*log(beta.cur) - n*lgamma(alpha.cur) + (alpha.cur-1)*sum.log.y - beta.cur*sum.y
# compute the log-likelihood at the candidate value of alpha
loglik.can <- n*alpha.can*log(beta.can) - n*lgamma(alpha.can) + (alpha.can-1)*sum.log.y - beta.can * sum.y
# log prior densities
log.prior.alpha.cur <- log.gamma.density(alpha.cur, lambda.alpha, nu.alpha)
log.prior.alpha.can <- log.gamma.density(alpha.can, lambda.alpha, nu.alpha)
log.prior.beta.cur <- log.gamma.density(beta.cur, lambda.beta, nu.beta)
log.prior.beta.can <- log.gamma.density(beta.can, lambda.beta, nu.beta)
# log target densities
logpi.cur <- loglik.cur + log.prior.alpha.cur + log.prior.beta.cur
logpi.can <- loglik.can + log.prior.alpha.can + log.prior.beta.can
# NOT symmetric any more ------> need to compute the log(qratio)
log.q.num <- dmvnorm(c(alpha.cur, beta.cur), mean=mu, sigma = Sigma, log=TRUE)
log.q.den <- dmvnorm(c(alpha.can, beta.can), mean=mu, sigma = Sigma, log=TRUE)
# M-H ratio
# draw from a U(0,1)
u <- runif(1)
if (log(u) < logpi.can - logpi.cur + log.q.num - log.q.den) {
alpha.cur <- alpha.can
beta.cur <- beta.can
}
}
# store the samples
out[i,] <- c(alpha.cur, beta.cur)
}
return(out)
}
Let us run this algorithm:
# use the MCMC code above
res.indep <- mcmc.indep(y=y, iter = 10000, mu = res.MAP$par, Sigma = res.var.cov, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 0.001, nu.beta = 0.001)
# trace plots
par(mfrow=c(1,2))
plot(res.indep[,1],type='l', main=expression(paste(alpha)))
plot(res.indep[,2],type='l', main=expression(paste(beta)))
# burn in
res.indep <- res.indep[-c(1:1000),]
# marginal posteriors
par(mfrow=c(1,2))
hist(res.indep[,1], prob=TRUE, col=4, xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
hist(res.indep[,2], prob=TRUE, col =4, xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
Run them all for much longer for some simulated data
data.new <- rgamma(500, 4, 2)
res <- mcmc(y = y, iter = 100000, sigma.rw = 1, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)
res <- res[-c(1:1000),]
Sigma.hat <- cov(res)
res.block <- mcmc.block(y = y, iter = 100000, Sigma = Sigma.hat, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 1e-3, nu.beta = 1e-3)
res.block <- res.block[-c(1:10000),]
res.MAP <- optim(c(1,1), log.post.density, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 0.001, nu.beta = 0.001, y=y, control=list(fnscale=-1), hessian=TRUE)
res.MAP
## $par
## [1] 8.079134 1.802244
##
## $value
## [1] -53.13166
##
## $counts
## function gradient
## 133 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] -3.820777 16.09105
## [2,] 16.091055 -72.13333
res.var.cov <- solve(-res.MAP$hessian)
res.indep <- mcmc.indep(y=y, iter = 100000, mu = res.MAP$par, Sigma = res.var.cov, lambda.alpha = 1, lambda.beta = 1, nu.alpha = 0.001, nu.beta = 0.001)
Draw the posterior densities:
par(mfrow=c(1,2))
plot(density(res[,1]), xlab=expression(paste(alpha)), main=expression(paste("Posterior distribution of ", alpha)))
lines(density(res.block[,1]),col=2)
lines(density(res.indep[,1]),col=4)
plot(density(res[,2]), xlab=expression(paste(beta)), main=expression(paste("Posterior distribution of ", beta)))
lines(density(res.block[,2]),col=2)
lines(density(res.indep[,2]),col=4)