Category Archives: GEE

GEE QIC update

Here is improved code for calculating QIC from geeglm in geepack in R (original post). Let me know how it works. I haven’t tested it much, but is seems that QIC may select overparameterized models. In the code below, I had to replace <- with = because wordpress didn’t except <- within code or pre tags. It should still work just fine.

Here is a quick example of how to run this function. First, highlight and run the code below in R. That will save the function in your workspace. Then run your gee model using geeglm in geepack (package available from CRAN). Next, run QIC(your_gee_model) and you get the QIC. You can then repeat this with alternative a priori models. Below the function is an example using data available as part of geepack.

[UPDATE: IMPROVED CODE AND EXTENSIONS ARE NOW AVAILABLE ON https://github.com/djhocking/qicpack INCLUDING AS AN R PACKAGE]

############################################################
# QIC for GEE models
# Daniel J. Hocking
############################################################
QIC = function(model.R) {
      library(MASS)
      model.indep = update(model.R, corstr = "independence")
      # Quasilikelihood
      mu.R = model.R$fitted.values
      y = model.R$y
      type = family(model.R)$family
      quasi.R = switch(type,
                    poisson = sum((y*log(mu.R)) - mu.R),
                    gaussian = sum(((y - mu.R)^2)/-2),
                    binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
                    Gamma = sum(-y/mu.R - log(mu.R)), # original updated:sum(-y/(mu.R - log(mu.R))),
stop("Error: distribution not recognized")) # Trace Term (penalty for model complexity) omegaI = ginv(model.indep$geese$vbeta.naiv) # Omega-hat(I) via Moore-Penrose generalized inverse of a matrix in MASS package #AIinverse = solve(model.indep$geese$vbeta.naiv) # solve via indenity Vr = model.R$geese$vbeta trace.R = sum(diag(omegaI %*% Vr)) px = length(mu.R) # number non-redunant columns in design matrix # QIC QIC = 2*(trace.R - quasi.R) [EDIT: original post was missing '*'] #QICu = (-2)*quasi.R + 2*px # Approximation assuming model structured correctly output = c(QIC, quasi.R, trace.R, px) names(output) = c('QIC', 'Quasi Lik', 'Trace', 'px') return(output) }

Here’s an example you can run in R.

library(geepack)

data(dietox)
dietox$Cu = as.factor(dietox$Cu)
mf = formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 = geeglm(mf, data = dietox, id = Pig, family = poisson, corstr = "ar1")
gee1
summary(gee1)

mf2 = formula(Weight ~ Cu * Time + I(Time^2) + I(Time^3))
gee2 = geeglm(mf2, data = dietox, id = Pig, family = poisson, corstr = "ar1")
summary(gee2)
anova(gee2)
anova(gee1, gee2)

mf3 = formula(Weight ~ Cu + Time + I(Time^2))
gee3 = geeglm(mf3, data = dietox, id = Pig, family = poisson, corstr = "ar1")
gee3.I = update(gee3, corstr = "independence")
gee3.Ex = update(gee3, corstr = "exchangeable")

sapply(list(gee1, gee2, gee3, gee3.I, gee3.Ex), QIC)

In the output of this model it suggests that model gee1 is the best model. I have some concerns that QIC will almost inevitably choose the most complex model. More testing with simulated data will be necessary.

               [,1]      [,2]      [,3]      [,4]      [,5]
QIC       -333199.7 -333188.0 -333187.5 -333181.8 -333153.6
Quasi Lik  166623.0  166622.7  166620.4  166622.2  166615.4
Trace          23.2      28.7      26.6      31.3      38.6
px            861.0     861.0     861.0     861.0     861.0

You will get warnings when running this model because it uses a Poisson distribution for continuous data. I will work on finding a better example in the future before I make this available as an R package.

Conference Presentations

I recently gave a talk at the Ecological Society of America (ESA) annual meeting in Portland, OR and a poster presentation at the World Congress of Herpetology meeting in Vancouver, BC, Canada. Both presentations were comparing generalized linear mixed models (GLMM) and generalized estimating equations (GEE) for analyzing repeated count data. I advocate for using GEE over the more common GLMM to analyze longitudinal count (or binomial) data when the specific subjects (sites as random effects) are not of special interest. The overall confidence intervals are much smaller in the GEE models and the coefficient estimates are averaged over all subjects (sites). This means the interpretation of coefficients is the log change in Y for each 1 unit change in X on average (averaged across subjects). Below you can see my two presentations for more details.

WCH2012-poster-Hocking

ESA–ppt-Presentation-2012-Hocking

Plotting 95% Confidence Bands in R

I am comparing estimates from subject-specific GLMMs and population-average GEE models as part of a publication I am working on. As part of this, I want to visualize predictions of each type of model including 95% confidence bands.

First I had to make a new data set for prediction. I could have compared fitted values with confidence intervals but I am specifically interested in comparing predictions for particular variables while holding others constant. For example, soil temperature is especially important for salamanders, so I am interested in the predicted effects of soil temperature from the different models. I used the expand.grid and model.matrix functions in R to generate a new data set where soil temperature varied from 0 to 30 C. The other variables were held constant at their mean levels during the study. Because of the nature of the contrast argument in the model.matrix function, I had to include more than one level of the factor “season”. I then removed all season except spring. In effect I am asking, what is the effect of soil temperature on salamander activity during the spring when the other conditions are constant (e.g. windspeed = 1.0 m/s, rain in past 24 hours =  This code is based on code from Ben Bolker via http://glmm.wikidot.com

# Compare Effects of SoilT with 95% CIs
formula(glmm1)
newdat.soil <- expand.grid(
SoilT = seq(0, 30, 1),
RainAmt24 = mean(RainAmt24),
RH = mean(RH),
windspeed = mean(windspeed),
season = c("spring", "summer", "fall"),
droughtdays = mean(droughtdays),
count = 0
)
newdat.soil$SoilT2 <- newdat.soil$SoilT^2

# Spring
newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ]

mm = model.matrix(terms(glmm1), newdat.soil)

Next I calculated the 95% confidence intervals for both the GLMM and GEE models. For the GLMM the plo and phi are the low and high confidence intervals for the fixed effects assuming zero effect of the random sites. tlo and thi account for the uncertainty in the random effects.

newdat.soil$count = mm %*% fixef(glmm1)
pvar1 <- diag(mm %*% tcrossprod(vcov(glmm1),mm))
tvar1 <- pvar1+VarCorr(glmm1)$plot[1]
newdat.soil <- data.frame(
newdat.soil
, plo = newdat.soil$count-2*sqrt(pvar1)
, phi = newdat.soil$count+2*sqrt(pvar1)
, tlo = newdat.soil$count-2*sqrt(tvar1)
, thi = newdat.soil$count+2*sqrt(tvar1)
)
mm.geeEX = model.matrix(terms(geeEX), newdat.soil)
newdat.soil$count.gee = mm.geeEX %*% coef(geeEX)
tvar1.gee <- diag(mm.geeEX %*% tcrossprod(geeEX$geese$vbeta, mm.geeEX))
newdat.soil <- data.frame(
newdat.soil
, tlo.gee = newdat.soil$count-2*sqrt(tvar1.gee)
, thi.gee = newdat.soil$count+2*sqrt(tvar1.gee)
)

The standard error of the fixed effects are larger in the GEE model than in the GLMM, but when the variation associated with the random effects are accounted for, the uncertainty (95% CI) around the estimates is greater in the GLMM. This is especially evident when the estimated values are large since the random effects are exponential on the original scale. This can be seen in the below plots

Although this plot does the job, it isn’t an efficient use of space, nor is it easy to compare exactly where the different lines fall. It would be nice to plot everything on one set of axes. The only trouble is that all the lines could be difficult to see just using solid and dashed/dotted lines. To help with this, I combine the plots but added color and shading using the polygon function. The code and plot are below

plot(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
xlab = "Soil temperature (C)",
ylab = 'Predicted salamander observations',
type = 'l',
ylim = c(0, 25))
polygon(c(newdat.soil.spring$SoilT
, rev(newdat.soil.spring$SoilT))
, c(exp(newdat.soil.spring$thi.gee)
, rev(exp(newdat.soil.spring$tlo.gee)))
, col = 'grey'
, border = NA)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi.gee),
type = 'l',
lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo.gee),
type = 'l',
lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
type = 'l',
lty = 1,
col = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count),
col = 1)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi),
type = 'l',
lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo),
type = 'l',
lty = 2)

GLMM vs GEE plot with 95% confidence intervals

Now you can directly compare the results of the GLMM and GEE models. The predicted values (population-averaged) for the GEE is represented by the red line, while the average (random effects = 0, just fixed effects) from the GLMM are represented by the solid black line. The dashed lines represent the 95% confidence intervals for the GLMM and the shaded area is the 95% confidence envelope for the GEE model. As you can see, the GEE has much higher confidence in it’s prediction of soil temperature effects on salamander surface activity than the GLMM model. This would not be apparent without visualizing the predictions with confidence intervals because the standard errors of the fixed effects were lower in the GLMM than in the GEE. This is because the SEs in the GEE include the site-level (random effect) variation while the GLMM SEs of the covariates do not include this variation and are interpreted as the effect of a change of 1 X on Y at a given site.

Follow

Get every new post delivered to your Inbox.

Join 54 other followers