# Blog Archives

## Review of an Information Theoretic Approach to Ecology and Evolution

There are always those papers that you mean to read but just sit on your desktop or in your To_Read folder forever. Grueber et al. 2011 was one of those for me and I finally got around to reading it.

Grueber, Nakagawa, Laws, and Jamieson. 2011. Multimodel inference in ecology and evolution: challenges and solutions. Journal of Evolutionary Biology. 24:699-711.

The Information Theoretic (IT) approach, and AIC in particular, has become so pervasive in ecology that feels almost compulsory for studies outside very controlled laboratory experiments. However, it seems like many authors and reviewers either don’t believe in the value of other approaches or don’t understand the limitations of it’s use, especially with respect to complex statistical modeling.

I constantly struggle both philosophically and practically with how to best approach analyses of field surveys. I want to develop models that both explain observed patterns for increased understanding and have the power to predict unobserved points in time and space (i.e. sites not surveyed and future conditions of monitored and unmonitored sites). I frequently use linear and generalized linear mixed models as well as more complex hierarchical models. These are areas of rapid statistical development, so getting appropriately fitting models with sparse ecological data adds to the practical challenges, regardless of philosophical desires.

The general idea in an IT approach is to balance model fit with model complexity. Generally, a more complex model will describe the data better (high fit, high complexity). However, if you describe the data perfectly, it is unlikely to have good predictive power because some of the model parameters will only apply to the data collected at those locations at those times. Hence the desire to have a simpler model that still describes the data well. Despite the desire for predictive models, few ecologists actually test the predictive power of their models. I won’t say more about that now, but will refer the reader to posts by Brian McGill on the Dynamic Ecology blog for thoughtful discussion of this topic.

The balancing of fit and complexity sounds great but it is much more difficult in practice (as are most things). When too many models are compared, especially without *a priori* formulation, it is common to get spurious results. If people are going to try every combination of variables from the most complex global model then model average, I hope the resulting model is validated on independent data to ensure that the model is useful. An extreme alternative to this approach is one often taken by Bayesians. Just develop a sensible biological model and estimate the parameters. Don’t worry about the “best” model but rather about the parameter estimates and their uncertainties.

For those interested in an IT approach and want to learn more about the practical uses, Grueber et al. (2011) provide a great resource. I can’t believe I waited this long to read it. **Box 1** provides an nice overview of the different Information Criteria (e.g. AIC, BIC, AICc, DIC). **Table 2** is really a great overview of the practical issues and tentative solutions. They point out that,

Translating biological hypotheses into statistical models is likely to remain the most difficult aspect of using an IT approach…because of the complexity of biological processes.

I agree but also think this is the most important part of the process. Significant time should be spent on this step and it’s generally helpful to talk through the hypotheses with colleagues (perfect for lab meetings). Model averaging should be avoided when completing models cannot be combined to form a biologically relevant model.

One interesting point the authors make is to, **“Always fit [random] slope if possible, otherwise use just the intercept”**. I would love to hear what people think about this. In the past, I had avoided fitting many random slopes to avoid model complexity and because I often had trouble thinking how the effect would vary by subject (often survey site). However, more recently, I’ve been including random slopes to differentiate variation in the effect of a parameter from uncertainty (SE) of the effect (fixed effect coefficient). The authors point out that including random slopes reduces the incidence of Type I and II errors and reduces the chance of overconfident estimates.

Another interesting point is whether to do exploratory plots or not. The authors are in favor of it, but note that IT advocates such as Burnham and Anderson (2002: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach)

oppose any data exploration because it results in *post hoc* creation of statistical models and therefore associated biological processes. I generally do a fair number of exploratory plotting.

Grueber and colleagues recommend generating a model set from all possible submodels of the global model, assuming that all the submodels are biological plausible. After this though, they provide a large number of caveats and cautions. This remains an area in need of further research.

I am curious about how they would generate all submodels with inclusion of random slopes whenever possible. I have generally followed Zuur and colleagues recommendations of putting in all fixed effects parameters (most complex, over-parametrized global model) then select random effects via AIC holding the fixed effects constant. Then reduce the complexity of the fixed effects, although this method limits the fixed effects that can be removed to those without random slopes. It can also be a problem if the global model has convergence problems. I’d love to hear how you proceed with model selection in mixed and other hierarchical models. Let me know in the comments.

Some take home points:

- Use a 10:1 subject-to-predictor ratio in multiple regression
- Generally avoid retaining a focal parameter of interest in all models, especially when interested in model averaging.
- Recommend model averaging but not the full set of models. Their tentative solution to which to average is to exclude models from the set that are more complex versions of those with lower AICc, but with caution.
- The zero method should be used for model averaging when the aim of the study is to determine which factors have the strongest effect on the response variable.
- Recommend standardizing input variables with a mean of zero and
**a standard deviation of 0.5 (traditionally 1) to allow the standardization of binary and categorical dummy variables**

Overall this paper provides a great overview and good recommendations for using an Information Theoretic approach in ecology. Hopefully it also indicates that AIC isn’t perfect and doesn’t invalidate other approaches to scientific understanding in ecology. For those who use a lot of mixed models, Zuur et al. (2009) provide valuable guidance as well. Although we all want specific rules to follow, model development and selection remains nearly as much an art as a science. This paper would make great lab group reading and I hope it stimulates a healthy discussion in ecology and evolution circles.

## R script to calculate QIC for Generalized Estimating Equation (GEE) Model Selection

**[UPDATE: IMPROVED CODE AND EXTENSIONS ARE NOW AVAILABLE ON https://github.com/djhocking/qicpack INCLUDING AS AN R PACKAGE]**

Generalized Estimating Equations (GEE) can be used to analyze longitudinal count data; that is, repeated counts taken from the same subject or site. This is often referred to as repeated measures data, but longitudinal data often has more repeated observations. Longitudinal data arises from studies in virtually all branches of science. In psychology or medicine, repeated measurements are taken on the same patients over time. In sociology, schools or other social distinct groups are observed over time. In my field, ecology, we frequently record data from the same plants or animals repeated over time. Furthermore, the repeated measures don’t have to be separated in time. A researcher could take multiple tissue samples from the same subject at a given time. I often repeatedly visit the same field sites (e.g. same patch of forest) over time. If the data are discrete counts of things (e.g. number of red blood cells, number of acorns, number of frogs), the data will generally follow a Poisson distribution.

Longitudinal count data, following a Poisson distribution, can be analyzed with Generalized Linear Mixed Models (GLMM) or with GEE. I won’t get into the computational or philosophical differences between conditional, subject-specific estimates associated with GLMM and marginal, population-level estimates obtained by GEE in this post. However, if you decide that GEE is right for you (I have a paper in preparation comparing GLMM and GEE), you may also want to compare multiple GEE models. Unlike GLMM, GEE does not use full likelihood estimates, but rather, relies on a quasi-likelihood function. Therefore, the popular AIC approach to model selection don’t apply to GEE models. Luckily, Pan (2001) developed an equivalent QIC for model comparison. Like AIC, it balances the model fit with model complexity to pick the most parsimonious model.

Unfortunately, there is currently no QIC package in R for GEE models. geepack is a popular R package for GEE analysis. So, I wrote the short R script below to calculate Pan’s QIC statistic from the output of a GEE model run in geepack using the geese function. It currently employs the Moore-Penrose Generalized Matrix Inverse through the MASS package. I left in my original code using the identity matrix but it is preceded by a pound sign so it doesn’t run. *[edition: April 10, 2012] The input for the QIC function needs to come from the geeglm function (as opposed to “geese”) within geepack.*

I hope you find it useful. I’m still fairly new to R and this is one of my first custom functions, so let me know if you have problems using it or if there are places it can be improved. If you decide to use this for analysis in a publication, please let me know just for my own curiosity (and ego boost!).

```
###################################################################################### QIC for GEE models# Daniel J. Hocking# 07 February 2012# Refs:
# Pan (2001)
# Liang and Zeger (1986)
# Zeger and Liang (1986)
# Hardin and Hilbe (2003)
# Dornmann et al 2007
# # http://www.unc.edu/courses/2010spring/ecol/562/001/docs/lectures/lecture14.htm###################################################################################### Poisson QIC for geese{geepack} output# Ref: Pan (2001)
QIC.pois.geeglm <-function(model.R, model.indep){
library(MASS)
# Fitted and observed values for quasi likelihood
mu.R <- model.R$fitted.values
# alt: X <- model.matrix(model.R)
# names(model.R$coefficients) <- NULL
# beta.R <- model.R$coefficients
# mu.R <- exp(X %*% beta.R)
y <- model.R$y
# Quasi Likelihood for Poisson
quasi.R <- sum((y*log(mu.R))- mu.R)# poisson()$dev.resids - scale and weights = 1
```

```
# Trace Term (penalty for model complexity)
AIinverse<- ginv(model.indep$geese$vbeta.naiv)# Omega-hat(I) via Moore-Penrose
```

```
generalized inverse of a matrix in MASS package
# Alt: AIinverse <- solve(model.indep$geese$vbeta.naiv) # solve via identity
Vr<- model.R$geese$vbeta
trace.R <- sum(diag(AIinverse%*%Vr))
px <- length(mu.R)# number non-redunant columns in design matrix
# QIC
QIC <-(-2)*quasi.R +2*trace.R
QICu<-(-2)*quasi.R +2*px # Approximation assuming model structured correctly
output <- c(QIC,QICu, quasi.R, trace.R, px)
names(output)<- c('QIC','QICu','Quasi Lik','Trace','px')
output}
```