I recently gave a talk at the Ecological Society of America (ESA) annual meeting in Portland, OR and a poster presentation at the World Congress of Herpetology meeting in Vancouver, BC, Canada. Both presentations were comparing generalized linear mixed models (GLMM) and generalized estimating equations (GEE) for analyzing repeated count data. I advocate for using GEE over the more common GLMM to analyze longitudinal count (or binomial) data when the specific subjects (sites as random effects) are not of special interest. The overall confidence intervals are much smaller in the GEE models and the coefficient estimates are averaged over all subjects (sites). This means the interpretation of coefficients is the log change in Y for each 1 unit change in X on average (averaged across subjects). Below you can see my two presentations for more details.
I am comparing estimates from subject-specific GLMMs and population-average GEE models as part of a publication I am working on. As part of this, I want to visualize predictions of each type of model including 95% confidence bands.
First I had to make a new data set for prediction. I could have compared fitted values with confidence intervals but I am specifically interested in comparing predictions for particular variables while holding others constant. For example, soil temperature is especially important for salamanders, so I am interested in the predicted effects of soil temperature from the different models. I used the expand.grid and model.matrix functions in R to generate a new data set where soil temperature varied from 0 to 30 C. The other variables were held constant at their mean levels during the study. Because of the nature of the contrast argument in the model.matrix function, I had to include more than one level of the factor “season”. I then removed all season except spring. In effect I am asking, what is the effect of soil temperature on salamander activity during the spring when the other conditions are constant (e.g. windspeed = 1.0 m/s, rain in past 24 hours = This code is based on code from Ben Bolker via http://glmm.wikidot.com
# Compare Effects of SoilT with 95% CIs formula(glmm1) newdat.soil <- expand.grid( SoilT = seq(0, 30, 1), RainAmt24 = mean(RainAmt24), RH = mean(RH), windspeed = mean(windspeed), season = c("spring", "summer", "fall"), droughtdays = mean(droughtdays), count = 0 ) newdat.soil$SoilT2 <- newdat.soil$SoilT^2 # Spring newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ] mm = model.matrix(terms(glmm1), newdat.soil)
Next I calculated the 95% confidence intervals for both the GLMM and GEE models. For the GLMM the plo and phi are the low and high confidence intervals for the fixed effects assuming zero effect of the random sites. tlo and thi account for the uncertainty in the random effects.
newdat.soil$count = mm %*% fixef(glmm1) pvar1 <- diag(mm %*% tcrossprod(vcov(glmm1),mm)) tvar1 <- pvar1+VarCorr(glmm1)$plot newdat.soil <- data.frame( newdat.soil , plo = newdat.soil$count-2*sqrt(pvar1) , phi = newdat.soil$count+2*sqrt(pvar1) , tlo = newdat.soil$count-2*sqrt(tvar1) , thi = newdat.soil$count+2*sqrt(tvar1) ) mm.geeEX = model.matrix(terms(geeEX), newdat.soil) newdat.soil$count.gee = mm.geeEX %*% coef(geeEX) tvar1.gee <- diag(mm.geeEX %*% tcrossprod(geeEX$geese$vbeta, mm.geeEX)) newdat.soil <- data.frame( newdat.soil , tlo.gee = newdat.soil$count-2*sqrt(tvar1.gee) , thi.gee = newdat.soil$count+2*sqrt(tvar1.gee) )
The standard error of the fixed effects are larger in the GEE model than in the GLMM, but when the variation associated with the random effects are accounted for, the uncertainty (95% CI) around the estimates is greater in the GLMM. This is especially evident when the estimated values are large since the random effects are exponential on the original scale. This can be seen in the below plots
Although this plot does the job, it isn’t an efficient use of space, nor is it easy to compare exactly where the different lines fall. It would be nice to plot everything on one set of axes. The only trouble is that all the lines could be difficult to see just using solid and dashed/dotted lines. To help with this, I combine the plots but added color and shading using the polygon function. The code and plot are below
plot(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee), xlab = "Soil temperature (C)", ylab = 'Predicted salamander observations', type = 'l', ylim = c(0, 25)) polygon(c(newdat.soil.spring$SoilT , rev(newdat.soil.spring$SoilT)) , c(exp(newdat.soil.spring$thi.gee) , rev(exp(newdat.soil.spring$tlo.gee))) , col = 'grey' , border = NA) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi.gee), type = 'l', lty = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo.gee), type = 'l', lty = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee), type = 'l', lty = 1, col = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count), col = 1) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi), type = 'l', lty = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo), type = 'l', lty = 2)
Now you can directly compare the results of the GLMM and GEE models. The predicted values (population-averaged) for the GEE is represented by the red line, while the average (random effects = 0, just fixed effects) from the GLMM are represented by the solid black line. The dashed lines represent the 95% confidence intervals for the GLMM and the shaded area is the 95% confidence envelope for the GEE model. As you can see, the GEE has much higher confidence in it’s prediction of soil temperature effects on salamander surface activity than the GLMM model. This would not be apparent without visualizing the predictions with confidence intervals because the standard errors of the fixed effects were lower in the GLMM than in the GEE. This is because the SEs in the GEE include the site-level (random effect) variation while the GLMM SEs of the covariates do not include this variation and are interpreted as the effect of a change of 1 X on Y at a given site.
I have been starting to analyze some data I have of repeated counts of salamanders from 5 plots over 4 years. I am trying to develop a predictive model of salamander nighttime surface activity as a function of weather variables. The repeated counting leads to the need for Generalized Linear Mixed Models (GLMM). Count data often results in data that are best described with a Poisson distribution, hence the “generalized” term. Because the counts were repeated on the same plots, plot needs to be considered a random effect. If the plot term was not included in this way it would suggest that all the counts were independent but in reality counts on one plot over time are likely to have some correlation that needs to be accounted for to avoid pseudoreplication. So I am stuck with a GLMM. The problem with GLMM in a frequentist statistical framework is that they are difficult to analyze. Bolker and colleagues give the best overview of the analysis process and it’s challenges in: Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution. They do have an online supplement to that paper that provides a workthrough example complete with R code using the lme4 package. I HIGHLY recommend everyone read Bolker’s paper if considering using GLMMs. Personally, I like the idea of analyzing GLMMs with Bayesian statistics rather than traditional frequentist stats. Below are a few emails that I’ve recently been exchanging with colleagues regarding GLMM. Let me know what you think.
Question About Selection of Correlated Predictor Variables and Model Selection:
How much correlation among independent variables is too much in a GLMM? If I have correlation in the variables does it affect the interpretation or model selection?
Answer from a Statistician Friend:
0.8 and above is high and often one variable can be replaced by the other, and
both are not necessary in the model.
Below 0.7 typically both variables are needed for a good model fit.
I usually use stepAIC (from the MASS package in R) for model selection.
The difficulty comes in interpreting the regression coefficients: with correlation in the predictor variables, the variable that appears first
on the model statement usually gets the larger absolute value, whereas
the other variable has a smaller (in absolute value) coefficient.
Remember the interpretation of regression coefficients: the change
in the response per unit increase GIVEN ALL THE OTHER VARIABLEs IN THE
If you want coefficients that represent “additive” contributions to the
variation in the response (regardless of the order in which predictors
appear in the model statement), and if you have considerable multicollinearity
you might want to consider doing a principal component regression with all
or perhaps with only a subgroup of similar predictor variables.
As with most issues in statistics, there is not a clear-cut hard-fact simple
answer. Live would be simpler if there was….
Question of GLMM Bayesian Approach:
Hey Dan – I’m using GLMM b/c I have a repeated-measures design, count data response (negative binomial distribution), etc. I’m finding admb in R is doing the job – and I read the article you mentioned a few months back, when I started considering GLMMs…
I have never worked with Bayesian stats and wouldn’t even know where to begin. Do you have any recommendations for overview reading, and can I analyze a repeated-measures design (i.e., is there a way to cope with random factors)?
My data sounds very similar to yours. I usually use lmer in the lme4 package. Right now I am just essentially copying the code in Bolker et al 2009 from the online supplements in the TREE paper previously mentioned. I have never see the admb package and will have to check it out. I’ve tried glmmPQL and glmmML but there are more examples in lmer and it’s Splus predecessor. I am annoyed that in Zuur et al. “Mixed Effects Models and Extensions in Ecology with R” they don’t spend much time on model assumptions or model comparison. I feel like they show users how to do the analysis but not how to evaluate it. Pinheiro and Bates do a better job in “Mixed-Effects Models in S and S-Plus” but they focus on linear mixed models and non-linear mixed models and less on GLMM. Plus the code is similar to but differs enough from R that it can be challenging to use at times. The “SAS for Mixed Models” book is good but SAS isn’t free and the code isn’t as transparent to me. Plus it doesn’t have good graphics so I prefer R.
Anyway, Bayesian stats have their own can of worms but I find it more intuitively appealing and I like the transparency in the code using WinBUGS (no Mac version) called from R. There are two very good, practical books to get started. McCarthy presents a good overview and introduction to bayesian stats in “Bayesian Methods for Ecology” but the examples don’t get very advanced. Personally I recommend getting that from the library and reading the first few chapters. I would then buy Marc Kery’s excellent book, “Introduction to WinBUGS for Ecologists.” It is very well written and has a wider range of examples that typically relate to many animal ecology studies. Clark and Gelfand have a decent modeling book with Bayesian analysis in R examples but it’s more ecosystem/environmentally oriented than animal ecology.
Bayesian analysis treats all factors sort of like random variables from population distributions. Therefore there is not need for explicit random vs. fixed delineation. You get estimates and credibility intervals for all variables. You can essentially write the same GLMM model and then analyze it in a Bayesian framework. The big difference in the philosophy behind frequentist vs Bayesian statistics. Bayesians use prior information (even noninformative priors contain information on the underlying distributions). Some scientists are opposed to this but for various reasons that I won’t go into now, I like it. Some people do want a sensitivity analysis to go along with Bayesian analysis to determine the influence of the priors. I might go as far as to say that in the case of GLMM type data Bayesian statistics are more sound (robust?) than frequentist methods but they differ significantly from a philosophical standpoint.
Anyway, I hope that helps.