# Category Archives: Modeling

## Review of an Information Theoretic Approach to Ecology and Evolution

There are always those papers that you mean to read but just sit on your desktop or in your To_Read folder forever. Grueber et al. 2011 was one of those for me and I finally got around to reading it.

Grueber, Nakagawa, Laws, and Jamieson. 2011. Multimodel inference in ecology and evolution: challenges and solutions. Journal of Evolutionary Biology. 24:699-711.

The Information Theoretic (IT) approach, and AIC in particular, has become so pervasive in ecology that feels almost compulsory for studies outside very controlled laboratory experiments. However, it seems like many authors and reviewers either don’t believe in the value of other approaches or don’t understand the limitations of it’s use, especially with respect to complex statistical modeling.

I constantly struggle both philosophically and practically with how to best approach analyses of field surveys. I want to develop models that both explain observed patterns for increased understanding and have the power to predict unobserved points in time and space (i.e. sites not surveyed and future conditions of monitored and unmonitored sites). I frequently use linear and generalized linear mixed models as well as more complex hierarchical models. These are areas of rapid statistical development, so getting appropriately fitting models with sparse ecological data adds to the practical challenges, regardless of philosophical desires.

The general idea in an IT approach is to balance model fit with model complexity. Generally, a more complex model will describe the data better (high fit, high complexity). However, if you describe the data perfectly, it is unlikely to have good predictive power because some of the model parameters will only apply to the data collected at those locations at those times. Hence the desire to have a simpler model that still describes the data well. Despite the desire for predictive models, few ecologists actually test the predictive power of their models. I won’t say more about that now, but will refer the reader to posts by Brian McGill on the Dynamic Ecology blog for thoughtful discussion of this topic.

The balancing of fit and complexity sounds great but it is much more difficult in practice (as are most things). When too many models are compared, especially without *a priori* formulation, it is common to get spurious results. If people are going to try every combination of variables from the most complex global model then model average, I hope the resulting model is validated on independent data to ensure that the model is useful. An extreme alternative to this approach is one often taken by Bayesians. Just develop a sensible biological model and estimate the parameters. Don’t worry about the “best” model but rather about the parameter estimates and their uncertainties.

For those interested in an IT approach and want to learn more about the practical uses, Grueber et al. (2011) provide a great resource. I can’t believe I waited this long to read it. **Box 1** provides an nice overview of the different Information Criteria (e.g. AIC, BIC, AICc, DIC). **Table 2** is really a great overview of the practical issues and tentative solutions. They point out that,

Translating biological hypotheses into statistical models is likely to remain the most difficult aspect of using an IT approach…because of the complexity of biological processes.

I agree but also think this is the most important part of the process. Significant time should be spent on this step and it’s generally helpful to talk through the hypotheses with colleagues (perfect for lab meetings). Model averaging should be avoided when completing models cannot be combined to form a biologically relevant model.

One interesting point the authors make is to, **“Always fit [random] slope if possible, otherwise use just the intercept”**. I would love to hear what people think about this. In the past, I had avoided fitting many random slopes to avoid model complexity and because I often had trouble thinking how the effect would vary by subject (often survey site). However, more recently, I’ve been including random slopes to differentiate variation in the effect of a parameter from uncertainty (SE) of the effect (fixed effect coefficient). The authors point out that including random slopes reduces the incidence of Type I and II errors and reduces the chance of overconfident estimates.

Another interesting point is whether to do exploratory plots or not. The authors are in favor of it, but note that IT advocates such as Burnham and Anderson (2002: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach)

oppose any data exploration because it results in *post hoc* creation of statistical models and therefore associated biological processes. I generally do a fair number of exploratory plotting.

Grueber and colleagues recommend generating a model set from all possible submodels of the global model, assuming that all the submodels are biological plausible. After this though, they provide a large number of caveats and cautions. This remains an area in need of further research.

I am curious about how they would generate all submodels with inclusion of random slopes whenever possible. I have generally followed Zuur and colleagues recommendations of putting in all fixed effects parameters (most complex, over-parametrized global model) then select random effects via AIC holding the fixed effects constant. Then reduce the complexity of the fixed effects, although this method limits the fixed effects that can be removed to those without random slopes. It can also be a problem if the global model has convergence problems. I’d love to hear how you proceed with model selection in mixed and other hierarchical models. Let me know in the comments.

Some take home points:

- Use a 10:1 subject-to-predictor ratio in multiple regression
- Generally avoid retaining a focal parameter of interest in all models, especially when interested in model averaging.
- Recommend model averaging but not the full set of models. Their tentative solution to which to average is to exclude models from the set that are more complex versions of those with lower AICc, but with caution.
- The zero method should be used for model averaging when the aim of the study is to determine which factors have the strongest effect on the response variable.
- Recommend standardizing input variables with a mean of zero and
**a standard deviation of 0.5 (traditionally 1) to allow the standardization of binary and categorical dummy variables**

Overall this paper provides a great overview and good recommendations for using an Information Theoretic approach in ecology. Hopefully it also indicates that AIC isn’t perfect and doesn’t invalidate other approaches to scientific understanding in ecology. For those who use a lot of mixed models, Zuur et al. (2009) provide valuable guidance as well. Although we all want specific rules to follow, model development and selection remains nearly as much an art as a science. This paper would make great lab group reading and I hope it stimulates a healthy discussion in ecology and evolution circles.

## GEE QIC update

Here is improved code for calculating QIC from geeglm in geepack in R (original post). Let me know how it works. I haven’t tested it much, but is seems that QIC may select overparameterized models. In the code below, I had to replace <- with = because wordpress didn’t except <- within code or pre tags. It should still work just fine.

Here is a quick example of how to run this function. First, highlight and run the code below in R. That will save the function in your workspace. Then run your gee model using geeglm in geepack (package available from CRAN). Next, run QIC(your_gee_model) and you get the QIC. You can then repeat this with alternative a priori models. Below the function is an example using data available as part of geepack.

**[UPDATE: IMPROVED CODE AND EXTENSIONS ARE NOW AVAILABLE ON https://github.com/djhocking/qicpack INCLUDING AS AN R PACKAGE]**

############################################################ # QIC for GEE models # Daniel J. Hocking ############################################################ QIC = function(model.R) { library(MASS) model.indep = update(model.R, corstr = "independence") # Quasilikelihood mu.R = model.R$fitted.values y = model.R$y type = family(model.R)$family quasi.R = switch(type, poisson = sum((y*log(mu.R)) - mu.R), gaussian = sum(((y - mu.R)^2)/-2), binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)), Gamma = sum(-y/(mu.R - log(mu.R))), stop("Error: distribution not recognized")) # Trace Term (penalty for model complexity) omegaI = ginv(model.indep$geese$vbeta.naiv) # Omega-hat(I) via Moore-Penrose generalized inverse of a matrix in MASS package #AIinverse = solve(model.indep$geese$vbeta.naiv) # solve via indenity Vr = model.R$geese$vbeta trace.R = sum(diag(omegaI %*% Vr)) px = length(mu.R) # number non-redunant columns in design matrix # QIC QIC = 2*(trace.R - quasi.R) [EDIT: original post was missing '*'] #QICu = (-2)*quasi.R + 2*px # Approximation assuming model structured correctly output = c(QIC, quasi.R, trace.R, px) names(output) = c('QIC', 'Quasi Lik', 'Trace', 'px') return(output) }

Here’s an example you can run in R.

library(geepack) data(dietox) dietox$Cu = as.factor(dietox$Cu) mf = formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3))) gee1 = geeglm(mf, data = dietox, id = Pig, family = poisson, corstr = "ar1") gee1 summary(gee1) mf2 = formula(Weight ~ Cu * Time + I(Time^2) + I(Time^3)) gee2 = geeglm(mf2, data = dietox, id = Pig, family = poisson, corstr = "ar1") summary(gee2) anova(gee2) anova(gee1, gee2) mf3 = formula(Weight ~ Cu + Time + I(Time^2)) gee3 = geeglm(mf3, data = dietox, id = Pig, family = poisson, corstr = "ar1") gee3.I = update(gee3, corstr = "independence") gee3.Ex = update(gee3, corstr = "exchangeable") sapply(list(gee1, gee2, gee3, gee3.I, gee3.Ex), QIC)

In the output of this model it suggests that model gee1 is the best model. I have some concerns that QIC will almost inevitably choose the most complex model. More testing with simulated data will be necessary.

[,1] [,2] [,3] [,4] [,5] QIC -333199.7 -333188.0 -333187.5 -333181.8 -333153.6 Quasi Lik 166623.0 166622.7 166620.4 166622.2 166615.4 Trace 23.2 28.7 26.6 31.3 38.6 px 861.0 861.0 861.0 861.0 861.0

You will get warnings when running this model because it uses a Poisson distribution for continuous data. I will work on finding a better example in the future before I make this available as an R package.

## Conference Presentations

I recently gave a talk at the Ecological Society of America (ESA) annual meeting in Portland, OR and a poster presentation at the World Congress of Herpetology meeting in Vancouver, BC, Canada. Both presentations were comparing generalized linear mixed models (GLMM) and generalized estimating equations (GEE) for analyzing repeated count data. I advocate for using GEE over the more common GLMM to analyze longitudinal count (or binomial) data when the specific subjects (sites as random effects) are not of special interest. The overall confidence intervals are much smaller in the GEE models and the coefficient estimates are averaged over all subjects (sites). This means the interpretation of coefficients is the log change in Y for each 1 unit change in X on average (averaged across subjects). Below you can see my two presentations for more details.

## Plotting 95% Confidence Bands in R

I am comparing estimates from subject-specific GLMMs and population-average GEE models as part of a publication I am working on. As part of this, I want to visualize predictions of each type of model including 95% confidence bands.

First I had to make a new data set for prediction. I could have compared fitted values with confidence intervals but I am specifically interested in comparing predictions for particular variables while holding others constant. For example, soil temperature is especially important for salamanders, so I am interested in the predicted effects of soil temperature from the different models. I used the expand.grid and model.matrix functions in R to generate a new data set where soil temperature varied from 0 to 30 C. The other variables were held constant at their mean levels during the study. Because of the nature of the contrast argument in the model.matrix function, I had to include more than one level of the factor “season”. I then removed all season except spring. In effect I am asking, what is the effect of soil temperature on salamander activity during the spring when the other conditions are constant (e.g. windspeed = 1.0 m/s, rain in past 24 hours = This code is based on code from Ben Bolker via http://glmm.wikidot.com

# Compare Effects of SoilT with 95% CIs formula(glmm1) newdat.soil <- expand.grid( SoilT = seq(0, 30, 1), RainAmt24 = mean(RainAmt24), RH = mean(RH), windspeed = mean(windspeed), season = c("spring", "summer", "fall"), droughtdays = mean(droughtdays), count = 0 ) newdat.soil$SoilT2 <- newdat.soil$SoilT^2 # Spring newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ] mm = model.matrix(terms(glmm1), newdat.soil)

Next I calculated the 95% confidence intervals for both the GLMM and GEE models. For the GLMM the plo and phi are the low and high confidence intervals for the fixed effects assuming zero effect of the random sites. tlo and thi account for the uncertainty in the random effects.

newdat.soil$count = mm %*% fixef(glmm1) pvar1 <- diag(mm %*% tcrossprod(vcov(glmm1),mm)) tvar1 <- pvar1+VarCorr(glmm1)$plot[1] newdat.soil <- data.frame( newdat.soil , plo = newdat.soil$count-2*sqrt(pvar1) , phi = newdat.soil$count+2*sqrt(pvar1) , tlo = newdat.soil$count-2*sqrt(tvar1) , thi = newdat.soil$count+2*sqrt(tvar1) ) mm.geeEX = model.matrix(terms(geeEX), newdat.soil) newdat.soil$count.gee = mm.geeEX %*% coef(geeEX) tvar1.gee <- diag(mm.geeEX %*% tcrossprod(geeEX$geese$vbeta, mm.geeEX)) newdat.soil <- data.frame( newdat.soil , tlo.gee = newdat.soil$count-2*sqrt(tvar1.gee) , thi.gee = newdat.soil$count+2*sqrt(tvar1.gee) )

The standard error of the fixed effects are larger in the GEE model than in the GLMM, but when the variation associated with the random effects are accounted for, the uncertainty (95% CI) around the estimates is greater in the GLMM. This is especially evident when the estimated values are large since the random effects are exponential on the original scale. This can be seen in the below plots

Although this plot does the job, it isn’t an efficient use of space, nor is it easy to compare exactly where the different lines fall. It would be nice to plot everything on one set of axes. The only trouble is that all the lines could be difficult to see just using solid and dashed/dotted lines. To help with this, I combine the plots but added color and shading using the polygon function. The code and plot are below

plot(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee), xlab = "Soil temperature (C)", ylab = 'Predicted salamander observations', type = 'l', ylim = c(0, 25)) polygon(c(newdat.soil.spring$SoilT , rev(newdat.soil.spring$SoilT)) , c(exp(newdat.soil.spring$thi.gee) , rev(exp(newdat.soil.spring$tlo.gee))) , col = 'grey' , border = NA) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi.gee), type = 'l', lty = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo.gee), type = 'l', lty = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee), type = 'l', lty = 1, col = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count), col = 1) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi), type = 'l', lty = 2) lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo), type = 'l', lty = 2)

Now you can directly compare the results of the GLMM and GEE models. The predicted values (population-averaged) for the GEE is represented by the red line, while the average (random effects = 0, just fixed effects) from the GLMM are represented by the solid black line. The dashed lines represent the 95% confidence intervals for the GLMM and the shaded area is the 95% confidence envelope for the GEE model. As you can see, the GEE has much higher confidence in it’s prediction of soil temperature effects on salamander surface activity than the GLMM model. This would not be apparent without visualizing the predictions with confidence intervals because the standard errors of the fixed effects were lower in the GLMM than in the GEE. This is because the SEs in the GEE include the site-level (random effect) variation while the GLMM SEs of the covariates do not include this variation and are interpreted as the effect of a change of 1 X on Y * at a given site*.