Stan for Bayesian Analysis

Bayesian analysis has been growing in popularity among ecologists recently, largely due to accessible books such as Models for Ecological Data: An Introduction, Introduction to WinBUGS for Ecologists, and Bayesian Methods for Ecology. Most ecologists with limited programming background have chosen to implement Bayesian analyses using WinBUGS, OpenBUGS, or JAGS, all of which can be run through R packages. JAGS is the best of these, works on most operating systems, and is flexible enough for most Bayesian analyses in ecology. However, recently due to inherent inefficiencies in these programs, which all implement MCMC using a Gibbs sampler (version of the Metropolis-Hastings Algorithm), a group of statisticians and computer programmers got together to solve these problems. The result was Stan.

Stan uses a version of a Hamiltonian Monte Carlo, specifically the No U-Turn (NUTS) sampler. NUTS in combination with Stan being written in C++ makes it faster than the above-mentioned Gibbs samplers, particularly for complex problems. Models should converge faster and in cases when Gibbs samplers can’t converge at all, Stan might provide a solution. For those of us who don’t work in C++ there is a bit of a learning curve but there is a nice R package (rstan) for implementation, a great manual, and a very active users group.

I decided to go through Marc Kery’s excellent book, and run all of the examples using Stan via rstan for comparison. Apparently, I am not the only one. You can find an example from Chapter 7 (t-test) on Shige’s useful blog. I recently finished translating chapter 8 into Stan code. Running the model was relatively easy, but it took me a while to figure out how to do all the post-processing model assessment (e.g. plots, posterior predictive checks, etc.). Now that I’ve done it though, it should be fairly easy to repeat. You can find the code below for running the model for Chapter 8, exercise 3 below and the full code for the chapter and exercises as a word document here: Kery_Chp8_Stan

bm8_3 <- '
data{
int n;
vector[n] y;
vector[n] x;
}

parameters{ 
real alpha; // these will be monitored and saved
real beta;
real sigma;
}

model{
vector[n] mu; // mu is defined in the model so won't be reported

// Priors
alpha ~ normal(0, 100);
beta ~ normal(0, 100);
sigma ~ uniform(0, 30);

// Likelihood
mu y ~ normal(mu, sigma);
}
'
data_8_3 <- list(n = length(mmd.grass$year), y = mmd.grass$mmd, 
                 x = mmd.grass$year)

fit.8.3 <- stan(model_code = bm8_3, data = data_8_3,
  chains = 3,
  iter = 50000,
  warmup = 25000,
  thin = 1)

print(fit.8.3, dig = 3)

traceplot(fit.8.3)

R script to calculate QIC for Generalized Estimating Equation (GEE) Model Selection

[UPDATE: I have a new, improved version of the QIC code and an example of running it here]

Generalized Estimating Equations (GEE) can be used to analyze longitudinal count data; that is, repeated counts taken from the same subject or site. This is often referred to as repeated measures data, but longitudinal data often has more repeated observations. Longitudinal data arises from studies in virtually all branches of science. In psychology or medicine, repeated measurements are taken on the same patients over time. In sociology, schools or other social distinct groups are observed over time. In my field, ecology, we frequently record data from the same plants or animals repeated over time. Furthermore, the repeated measures don’t have to be separated in time. A researcher could take multiple tissue samples from the same subject at a given time. I often repeatedly visit the same field sites (e.g. same patch of forest) over time. If the data are discrete counts of things (e.g. number of red blood cells, number of acorns, number of frogs), the data will generally follow a Poisson distribution.

Longitudinal count data, following a Poisson distribution, can be analyzed with Generalized Linear Mixed Models (GLMM) or with GEE. I won’t get into the computational or philosophical differences between conditional, subject-specific estimates associated with GLMM and marginal, population-level estimates obtained by GEE in this post. However, if you decide that GEE is right for you (I have a paper in preparation comparing GLMM and GEE), you may also want to compare multiple GEE models. Unlike GLMM, GEE does not use full likelihood estimates, but rather, relies on a quasi-likelihood function. Therefore, the popular AIC approach to model selection don’t apply to GEE models. Luckily, Pan (2001) developed an equivalent QIC for model comparison. Like AIC, it balances the model fit with model complexity to pick the most parsimonious model.

Unfortunately, there is currently no QIC package in R for GEE models. geepack is a popular R package for GEE analysis. So, I wrote the short R script below to calculate Pan’s QIC statistic from the output of a GEE model run in geepack using the geese function. It currently employs the Moore-Penrose Generalized Matrix Inverse through the MASS package. I left in my original code using the identity matrix but it is preceded by a pound sign so it doesn’t run. [edition: April 10, 2012] The input for the QIC function needs to come from the geeglm function (as opposed to “geese”) within geepack.

I hope you find it useful. I’m still fairly new to R and this is one of my first custom functions, so let me know if you have problems using it or if there are places it can be improved. If you decide to use this for analysis in a publication, please let me know just for my own curiosity (and ego boost!).

#####################################################################################
# QIC for GEE models
# Daniel J. Hocking
# 07 February 2012
# Refs:
  # Pan (2001)
  # Liang and Zeger (1986)
  # Zeger and Liang (1986)
  # Hardin and Hilbe (2003)
  # Dornmann et al 2007
  # # http://www.unc.edu/courses/2010spring/ecol/562/001/docs/lectures/lecture14.htm
#####################################################################################
# Poisson QIC for geese{geepack} output
# Ref: Pan (2001)
QIC.pois.geeglm <-function(model.R, model.indep){
  library(MASS)
  # Fitted and observed values for quasi likelihood
  mu.R <- model.R$fitted.values
  # alt: X <- model.matrix(model.R)
      #  names(model.R$coefficients) <- NULL
      #  beta.R <- model.R$coefficients
      #  mu.R <- exp(X %*% beta.R)
  y <- model.R$y

  # Quasi Likelihood for Poisson
  quasi.R <- sum((y*log(mu.R))- mu.R)# poisson()$dev.resids - scale and weights = 1
 
  # Trace Term (penalty for model complexity)
  AIinverse<- ginv(model.indep$geese$vbeta.naiv)# Omega-hat(I) via Moore-Penrose 
generalized inverse of a matrix in MASS package
  # Alt: AIinverse <- solve(model.indep$geese$vbeta.naiv) # solve via identity
  Vr<- model.R$geese$vbeta
  trace.R <- sum(diag(AIinverse%*%Vr))
  px <- length(mu.R)# number non-redunant columns in design matrix

  # QIC
  QIC <-(-2)*quasi.R +2*trace.R
  QICu<-(-2)*quasi.R +2*px    # Approximation assuming model structured correctly 
  output <- c(QIC,QICu, quasi.R, trace.R, px)
  names(output)<- c('QIC','QICu','Quasi Lik','Trace','px')
  output}