GEE QIC update


Here is improved code for calculating QIC from geeglm in geepack in R (original post). Let me know how it works. I haven’t tested it much, but is seems that QIC may select overparameterized models. In the code below, I had to replace <- with = because wordpress didn’t except <- within code or pre tags. It should still work just fine.

Here is a quick example of how to run this function. First, highlight and run the code below in R. That will save the function in your workspace. Then run your gee model using geeglm in geepack (package available from CRAN). Next, run QIC(your_gee_model) and you get the QIC. You can then repeat this with alternative a priori models. Below the function is an example using data available as part of geepack.

[UPDATE: IMPROVED CODE AND EXTENSIONS ARE NOW AVAILABLE ON https://github.com/djhocking/qicpack INCLUDING AS AN R PACKAGE]

############################################################
# QIC for GEE models
# Daniel J. Hocking
############################################################
QIC = function(model.R) {
      library(MASS)
      model.indep = update(model.R, corstr = "independence")
      # Quasilikelihood
      mu.R = model.R$fitted.values
      y = model.R$y
      type = family(model.R)$family
      quasi.R = switch(type,
                    poisson = sum((y*log(mu.R)) - mu.R),
                    gaussian = sum(((y - mu.R)^2)/-2),
                    binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
                    Gamma = sum(-y/mu.R - log(mu.R)), # original updated:sum(-y/(mu.R - log(mu.R))),
stop("Error: distribution not recognized")) # Trace Term (penalty for model complexity) omegaI = ginv(model.indep$geese$vbeta.naiv) # Omega-hat(I) via Moore-Penrose generalized inverse of a matrix in MASS package #AIinverse = solve(model.indep$geese$vbeta.naiv) # solve via indenity Vr = model.R$geese$vbeta trace.R = sum(diag(omegaI %*% Vr)) px = length(mu.R) # number non-redunant columns in design matrix # QIC QIC = 2*(trace.R - quasi.R) [EDIT: original post was missing '*'] #QICu = (-2)*quasi.R + 2*px # Approximation assuming model structured correctly output = c(QIC, quasi.R, trace.R, px) names(output) = c('QIC', 'Quasi Lik', 'Trace', 'px') return(output) }

Here’s an example you can run in R.

library(geepack)

data(dietox)
dietox$Cu = as.factor(dietox$Cu)
mf = formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 = geeglm(mf, data = dietox, id = Pig, family = poisson, corstr = "ar1")
gee1
summary(gee1)

mf2 = formula(Weight ~ Cu * Time + I(Time^2) + I(Time^3))
gee2 = geeglm(mf2, data = dietox, id = Pig, family = poisson, corstr = "ar1")
summary(gee2)
anova(gee2)
anova(gee1, gee2)

mf3 = formula(Weight ~ Cu + Time + I(Time^2))
gee3 = geeglm(mf3, data = dietox, id = Pig, family = poisson, corstr = "ar1")
gee3.I = update(gee3, corstr = "independence")
gee3.Ex = update(gee3, corstr = "exchangeable")

sapply(list(gee1, gee2, gee3, gee3.I, gee3.Ex), QIC)

In the output of this model it suggests that model gee1 is the best model. I have some concerns that QIC will almost inevitably choose the most complex model. More testing with simulated data will be necessary.

               [,1]      [,2]      [,3]      [,4]      [,5]
QIC       -333199.7 -333188.0 -333187.5 -333181.8 -333153.6
Quasi Lik  166623.0  166622.7  166620.4  166622.2  166615.4
Trace          23.2      28.7      26.6      31.3      38.6
px            861.0     861.0     861.0     861.0     861.0

You will get warnings when running this model because it uses a Poisson distribution for continuous data. I will work on finding a better example in the future before I make this available as an R package.

21 thoughts on “GEE QIC update

  1. Pingback: R script to calculate QIC for Generalized Estimating Equation (GEE) Model Selection | Daniel Hocking, Ecologist

  2. This is great stuff and would be useful. May I ask however, if there is an error on line 26? Maybe a special character was filtered out when you posted the code?

    QIC = 2(trace.R – quasi.R)

    • Thanks for catching that Rob. It should be 2 times the difference:

      QIC = 2*(trace.R – quasi.R)

      I should really upload this to GitHub to give people easier access and avoid problems with the code and pre tags within WordPress. Hope this helps. Let me know if you get to play around with it.

      • Daniel, I expected that was the issue, but hadn’t had time to research the equation thoroughly. Thank you for the quick confirmation! It does work nicely.

        Are you planning to expand this routine? If so, here are some processing steps that I have used after calculating QIC.
        1. Call QIC with a list of models
        2. transform the matrix and convert to a data.frame (my preference)
        3. set the row.names to the model names
        4. sort the rows.

        These steps make the output similar in function to aictab in package AICcmodavg.

      • Rob – I’ve actually now made improvements that address all of your suggestions. I appreciate the feedback. I borrowed heavily from the AICcmodavg package. The R code is available now on github if you’re interested: https://github.com/djhocking/QIC.git

      • Rob,

        Thanks again for serving as an alpha tester for this code/package. I have now created an initial rough R package for the code. You can use the citation below and you can also get the R package tarball with the link below. I really haven’t put much info on the help pages but that will be something I work on in the future.

        Hocking, D. J. (2012). QICpack: Model selection for generalized estimating equations using QIC. R package version 0.9. https://github.com/djhocking/qicpack.git

  3. Daniel, thanks very much for this. Especially enjoy that the trace is printed, as that has been shown to be more useful than QIC in choosing among correlation structures. Can’t remember where I saw it on your blog (or maybe in one of your presentations?), but can you provide a reference for the negative binomial not being relevant in GEE? NB GEEs are not possible at this point in R, but I’ve compared Poisson and NB GEEs in SAS and the resulting estimates do vary and (based on QIC or the trace) seem to improve the fit… Apparently the common dispersion term in the calculation of the GEE doesn’t capture all of the extra variationcompared to the NB variance… At the least, it captures it differently in the the Poisson vs. the NB GEE. Again, thanks…

    • Adam – thanks for the comment. I can’t remember any specific reference for the lack of need for the negative binomial in GEE models. I think the general thought is that the dispersion term can take up any extra Poisson variation. I’m not surprised that a NB GEE would improve the fit if there was a lot of unaccounted for variation. As noted, the geepack package in R doesn’t have a NB option that I’m aware of. It certainly could be added. Maybe if I get the time, I’ll contact the package author and see if he would be interested in me adding the NB option and the QIC wrapper.

      I think in deciding whether to use a NB or Poisson GEE, one would want to think about what is being modeled and the ease of interpretation. Sometimes people get over concerned with model fit. It’s nice to have great fitting models but I think a useful model is more important than a fitting model (yes, having both is best). With respect to the saying, “All models are wrong, but some are useful,” I think careful consideration of model construction is important and the fit must be balanced with the explanation.

      Anyway, thanks for bringing that to my attention. I’ve never actually seen a NB GEE used, but it’s good to know that it can further improve the fit (and be implemented in SAS).

  4. Daniel, The library works great. Thanks. Thanks for the citation as well. In your future edits you might want to make sure “citation(QICpack)” provides the full reference. It definitely helps people like me who try to provide full references for all of the packages we use.

    • Thanks Rob. I have a lot of work to do on the documentation. I’ve never used github or created R packages or used LaTeX before, so I’m working out all the bugs. Hopefully, in the future I can get this all straightened out in addition to adding further functionality to the package (model averaging, table options, etc.). I will also have to spend some time testing in and hopefully add a function for QIC adjusted for small samples sizes (QICc). Best of luck and thanks for the help.

  5. Hi Daniel. Thanks for the QIC script for R. GEE seems to be the way to go for part of my analysis on Australian fur seal haulout behaviour (binomial response, “home” or “away”, with repeated measures, multiple foraging trips, per individual). I just wanted to let you know I have run your script and it seems to work fine for my data. I have run a gee, using geeglm from geepack, with an exchangeable correlation structure. One question, in your script it seems like you define the correlation structure as independent Should I be changing this to “exchangeable”? Thanks, Marcus

    • Hi Marcus – Thanks for your interest in my QIC code. I’m glad it seems to be working. I’ve only tried it will a limited number of examples. In the code, the correlation structure should be left as “independent” because QIC works by comparing the correlation structure of the model used to the same model with an independent correlation structure. Also, I have updated code (with some error checking) and an additional function that makes a table comparing QIC from multiple models. It’s based on the aiccmodavg package. The raw code is available along with an R package of both functions in my GitHub repository: https://github.com/djhocking/qicpack

  6. Hi Daniel,

    I have a GEE question which I would welcome any feedback you had on if you have time. We have been working with panel data from an experimental economic and have been advised to use a GEE approach due to auto-correlation between periods and players. More details on the experimental design can be found here:

    http://stats.stackexchange.com/questions/50374/how-to-deal-with-two-time-variables-in-panel-data-modelling-using-plm-package-in

    The GEE model we developed is…
    geeglm(Nash_decision~fishery+treatment+state_of_resource+period+Experience+Cumulative, family=binomial(link=”logit”), data=data1, id=playerno., corstr = “ar1″)

    We recoded player and session (playerno.) to indicate a different player playing in each session (i.e. 1a – 8a for players 1-8 in session 1 and then 1b-8b for players 1-8 in session 2) and included period (i.e. round) in the model to cater for the time effect. We assume that playerno. is the individual sample units in our dataset.

    We are particularly interested in whether fishery and treatment have any effect on players making Nash decisions. To cater for learning and income differences we also included a factor ‘experience” which catered for how many rounds a player played and “cumulative” income which was how much money they had earned.

    What we have found through testing however is that the QIC score (using your code) is the same no matter what correlation matrix we specify (ar1 or exchangeable or independence).

    geeglm(Nash_decision~fishery+treat+period+state_of_resource+Experience+Cumulative, family=binomial(link=”logit”),data=data1, id=playerno, corstr = “ar1”)
    gee2 = update(gee1, corstr = “independence”)
    gee3 = update(gee1, corstr = “exchangeable”)
    sapply(list(gee1, gee2, gee3), QIC)
    [,1] [,2] [,3]
    QIC 871.02 871.02 871.02
    Quasi Lik -434.02 -434.02 -434.02
    Trace 1.49 1.49 1.49
    px 1296.00 1296.00 1296.00

    Also there is no Auto-correlation (checked with ACF) in each case, but if we remove period auto-correlation arises in some lags. Is this suggesting that the correlation between rounds is not so strong, so including period as factor is enough to deal with that? Is this an issue that the QIC is the same across the same models with different correlation structures?

    Also do we need to consider the nesting factors (i.e. rounds nested within session)? We have read that you can ignore nested design through specifying the correlation structure correctly in GEE? Is this correct?

    Thanks!

    • Rafael – Sorry I missed your comments so this is a long delayed response. I also don’t have a good answer for you. This is very curious behavior with having the exact same QIC and QLik values for all models. Do the outputs of summary(gee1), summary(gee2), and summary(gee2) differ? Have you tried the updated code on my GitHub page?

      I think the proper correlation structure can help with the nested design but it might be worth going to a more flexible generalized linear mixed model (GLMM) analytical framework since the experimental design can be more explicitly parametrized.

  7. Pingback: 2013 in review | Daniel J. Hocking

  8. Hi Daniel, thanks very much for the code. I used it with gamma distribution and QIC behaved quite oddly, so I looked in McCullagh & Nelder (1989, p. 326) and there appears to be a parantheses problem in your function. The quasi-likelihood for gamma distribution should be computed as:

    Gamma = sum(-y/mu.R – log(mu.R))

    • @Pavel – Thanks for catching that. I added the gamma distribution but never actually tested it (oops). I’ll change it on the post and in the more up-to-date GitHub repository.

  9. Pingback: Dealing with ugly data: Generalized Estimating Equations (GEE) | WildlifeSNPits

  10. Pingback: Year in Review: 2014 | Daniel J. Hocking

Leave a comment