Blog Archives

dplyr is great…but

I have been loving Hadley Wickham’s new dplyr package for R. It provides a relatively small number of verbs to quickly and easily manipulate data. It is generally as fast as data.table, and unless you’re already very familiar with data.table, the syntax is much easier. There are a number of great introductions and tutorials, which I won’t attempt to recreate, but here are some links:

The only challenge in moving to dplyr has been dealing with its non-standard evaluation (see vignette(“nse”)) and the new tbl_df class that gets created. In theory, the dataframes created by dplyr should act both as data.frame objects and as the new tbl_df objects. There are some nice new features associated with tbl_df objects, but also some idiosyncrasies that can break existing code.

For example, I had a function:

stdCovs <- function(x, y, var.names) {
    for (i in 1:length(var.names)) {
        x[, var.names[i]] <- (x[, var.names[i]] - mean(y[, var.names[i]], na.rm = T)) / 
            sd(y[, var.names[i]], na.rm = T)
    }
    return(x)
}

that standardized a set of covariates in one dataframe (x) using the means and standard deviations from another (y). This was particularly useful for making predictions for new data using coefficients from a different dataset that had been standardized (y.standard). The code above worked when I was using data.frame objects. However, I have now started using dplyr for a variety of reasons, and the code no longer works. I believe this is due to how tbl_df objects handle single-bracket subsetting (e.g. y[ , var]). There is an issue on GitHub related to this.

In the end, it turned out that double bracketing to extract the column works: rather than the y[ , var] formulation, it should be y[[var]]. Not too hard, but the type of thing that wastes an hour when moving to a new package. Despite this challenge and similar hiccups, I highly recommend dplyr. Hopefully the simplicity of the verbs and the speed of computation make up for the few hours spent learning and adjusting code.
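To see why the double-bracket form is the safer choice, here is a minimal sketch with made-up data (the column names a and b and the values are hypothetical). On a plain data.frame both subsetting forms return a numeric vector, but on a tbl_df, y[ , "a"] returns a one-column table rather than a vector, so mean() no longer receives what it expects; y[["a"]] extracts the column as a vector in both cases.

```r
# Hypothetical example data: y holds the reference values used for
# standardization, x the new data to be standardized
y <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
x <- data.frame(a = c(2, 3), b = c(20, 30))

# On a plain data.frame both forms return a numeric vector
mean(y[, "a"])   # 2.5
mean(y[["a"]])   # 2.5

# The double-bracket extraction works for data.frame and tbl_df alike,
# which is why it fixes the stdCovs function
std <- (x[["a"]] - mean(y[["a"]], na.rm = TRUE)) / sd(y[["a"]], na.rm = TRUE)
```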

Here’s my final code:

stdCovs <- function(x, y, var.names){
  for(i in 1:length(var.names)){
    x[ , var.names[i]] <- (x[ , var.names[i]] - mean(y[[var.names[i]]], na.rm = T)) / 
        sd(y[[var.names[i]]], na.rm = T)
  }
  return(x)
}

One of the best features of dplyr is its use of magrittr for piping functions together with the %>% operator. It’s not shown above, but it’s worth learning. Previously, you would either create a bunch of useless intermediate objects or write a long string of nested functions that were hard (impossible at times) to decipher. The “then” operator (%>%) does away with this, as shown in this example from Revolution Analytics:

hourly_delay <- filter(
  summarise(
    group_by(
      filter(flights, !,
      date, hour),
    delay = mean(dep_delay),
    n = n()),
  n > 10)

Here’s the same code, but rather than nesting one function call inside the next, data is passed from one function to the next using the %>% operator:

hourly_delay <- flights %>% 
  filter(! %>% 
  group_by(date, hour) %>% 
  summarise(
    delay = mean(dep_delay), 
    n = n()) %>% 
  filter(n > 10)

Beautiful, right? Pipes are fantastic, and combined with dplyr’s simple verbs they make code a lot easier to write and more readable once you get the hang of it.

Paper Summary: Natural Disturbance and Logging Effects on Salamanders


Hocking, D.J., K.J. Babbitt, and M. Yamasaki. 2013. Comparison of Silvicultural and Natural Disturbance Effects on Terrestrial Salamanders in Northern Hardwood Forests. Biological Conservation 167:194-202. doi:

Unfortunately, this paper is behind a paywall. Please email me if you would like a copy for educational purposes.

We were interested in how red-backed salamanders respond to various logging practices compared with natural disturbance. Specifically, we compared the abundance of salamanders in the two years following a major ice storm with clearcuts, patch cuts, group cuts, single-tree selection harvests, and undisturbed forest patches in the White Mountains of New Hampshire (Northern Appalachian Mountains). The 100-year ice storm caused ~65% canopy loss in the affected areas. We know that clearcutting has detrimental effects on populations of woodland salamanders, but the impacts of less intense harvesting and natural disturbances are less well understood.

We used transects of coverboards from 80 m inside each forest patch extending to 80 m outside each patch into the surrounding, undisturbed forest. Repeated counts of salamanders under these coverboards allowed us to employ a Dail-Madsen open population model to estimate abundance in each treatment while accounting for imperfect detection. The results were quite clear, as demonstrated in this figure:

Abundance Plot by Treatment

There were slightly fewer salamanders in the ice-storm damaged sites compared with undisturbed reference sites. The single-tree selection sites were most similar to the ice-storm damage sites. The group cut, patch cut, and clearcut didn’t differ from each other and all had ~88% fewer salamanders compared with reference sites.

In addition to comparing natural and anthropogenic disturbances, we were interested in examining how salamanders respond along the edge of even-aged harvests. Wind, solar exposure, and similar factors are altered in the forest edge adjacent to harvested areas. This can result in reduced salamander abundance in the forest edges around clearcuts. Previous researchers have used nonparametric estimates of edge effects. A limitation of these methods is that the effects cannot be projected well across the landscape; they are also unable to account for imperfect detection. We developed a method to model edge effects as a logistic function while accounting for imperfect detection. As with the treatment effects, the results are quite clear: very few salamanders in the center of the even-aged harvests, a gradual increase in abundance near the forest edge, increasingly more salamanders in the forest moving away from the edge, and eventually a leveling off at carrying capacity. In this case, red-backed salamander abundance reached 95% of carrying capacity 34 m into the surrounding forest. Because the model is parametric, predictions can be projected across landscapes. The equation can be used in GIS, and land managers can predict the total number of salamanders that would be lost from a landscape under a variety of alternative timber harvest plans.
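To make the shape of that response concrete, here is a minimal sketch of a logistic edge-effect function of the kind described above. The parameter values (K, d0, s) are hypothetical placeholders, not the estimates from the paper; the point is only to show how such a function can be written and inverted to find the distance at which abundance reaches 95% of carrying capacity.

```r
# Logistic edge-effect sketch: abundance rises from near zero in the
# harvest interior toward carrying capacity K as distance d from the
# edge into the surrounding forest increases
edge_abundance <- function(d, K, d0, s) {
  K / (1 + exp(-(d - d0) / s))
}

K  <- 8   # carrying capacity (salamanders per plot); hypothetical
d0 <- 5   # distance (m) at which abundance reaches K/2; hypothetical
s  <- 10  # steepness of the increase; hypothetical

# Distance at which abundance reaches 95% of carrying capacity:
# solve K / (1 + exp(-(d - d0)/s)) = 0.95 * K for d
d95 <- d0 + s * log(0.95 / 0.05)
```

Because the function is parametric, the same expression can be evaluated at every cell of a distance-to-edge raster in a GIS to map predicted abundance across a landscape.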

Hopefully other researchers find this method useful and apply it for a variety of taxa. It could also be incorporated into ArcGIS or QGIS toolboxes/plugins as a tool for land managers. You can read our paper for more details if you’re interested. In addition to methodological details there is more information on environmental factors that affect detection and abundance of salamanders in this landscape.

Edge Effects

In Praise of Exploratory Statistics

If you haven’t seen it, you should definitely check out the latest post by Brian McGill at Dynamic Ecology. It’s a great post on the use of exploratory statistics. While it may be impossible to get an NSF grant with an analysis framed in terms of exploratory statistics, reviewers should definitely be more open to their appropriate use. It is wrong to force authors to frame exploratory analyses in a hypothetico-deductive framework after the fact. Brian suggests that there are three forms of scientific analysis:

  1. Hypothesis testing
  2. Prediction
  3. Exploratory

and that a single data set can be used for only one of them. My favorite quote is:

“If exploratory statistics weren’t treated like the crazy uncle nobody wants to talk about and everybody is embarrassed to admit being related to, science would be much better off.”

There’s lots of good stuff in his post, so I recommend spending 10 minutes of your day to read through it.

Conference Presentations

I recently gave a talk at the Ecological Society of America (ESA) annual meeting in Portland, OR and a poster presentation at the World Congress of Herpetology meeting in Vancouver, BC, Canada. Both presentations compared generalized linear mixed models (GLMM) and generalized estimating equations (GEE) for analyzing repeated count data. I advocate using GEE over the more common GLMM to analyze longitudinal count (or binomial) data when the specific subjects (sites as random effects) are not of special interest. The overall confidence intervals are much smaller in the GEE models, and the coefficient estimates are averaged over all subjects (sites). This means the interpretation of a coefficient is the log change in Y for each 1-unit change in X on average (averaged across subjects). See my two presentations below for more details.



R script to calculate QIC for Generalized Estimating Equation (GEE) Model Selection


Generalized Estimating Equations (GEE) can be used to analyze longitudinal count data; that is, repeated counts taken from the same subject or site. This is often referred to as repeated measures data, but longitudinal data often has more repeated observations. Longitudinal data arises from studies in virtually all branches of science. In psychology or medicine, repeated measurements are taken on the same patients over time. In sociology, schools or other socially distinct groups are observed over time. In my field, ecology, we frequently record data from the same plants or animals repeatedly over time. Furthermore, the repeated measures don’t have to be separated in time. A researcher could take multiple tissue samples from the same subject at a given time. I often repeatedly visit the same field sites (e.g. the same patch of forest) over time. If the data are discrete counts of things (e.g. number of red blood cells, number of acorns, number of frogs), the data will generally follow a Poisson distribution.

Longitudinal count data, following a Poisson distribution, can be analyzed with Generalized Linear Mixed Models (GLMM) or with GEE. I won’t get into the computational or philosophical differences between the conditional, subject-specific estimates associated with GLMM and the marginal, population-level estimates obtained by GEE in this post. However, if you decide that GEE is right for you (I have a paper in preparation comparing GLMM and GEE), you may also want to compare multiple GEE models. Unlike GLMM, GEE does not use a full likelihood estimate but rather relies on a quasi-likelihood function. Therefore, the popular AIC approach to model selection doesn’t apply to GEE models. Luckily, Pan (2001) developed an analogous statistic, QIC, for model comparison. Like AIC, it balances model fit against model complexity to pick the most parsimonious model.
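Concretely, the statistic Pan proposed (and the quantity computed by the script below) is

```
QIC = -2 * Q(beta-hat; I) + 2 * trace(Omega-hat(I)^-1 * V-hat(R))
```

where Q(beta-hat; I) is the quasi-likelihood evaluated at the estimated coefficients, Omega-hat(I) is the naive (model-based) variance estimate from a model fit with an independence working correlation, and V-hat(R) is the robust (sandwich) variance estimate from the fitted model. The simpler QICu replaces the trace term with 2p, where p is the number of regression parameters, and is an approximation that assumes the model is correctly specified.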

Unfortunately, there is currently no QIC function in R for GEE models. geepack is a popular R package for GEE analysis, so I wrote the short R script below to calculate Pan’s QIC statistic from the output of a GEE model run in geepack. It employs the Moore-Penrose generalized matrix inverse through the MASS package. I left in my original code using the identity matrix, but it is commented out so it doesn’t run. [Edit: April 10, 2012] The input for the QIC function needs to come from the geeglm function (as opposed to geese) within geepack.

I hope you find it useful. I’m still fairly new to R and this is one of my first custom functions, so let me know if you have problems using it or if there are places it can be improved. If you decide to use this for analysis in a publication, please let me know just for my own curiosity (and ego boost!).

######################################################################################
# QIC for GEE models
# Daniel J. Hocking
# 07 February 2012
# Refs:
#   Pan (2001)
#   Liang and Zeger (1986)
#   Zeger and Liang (1986)
#   Hardin and Hilbe (2003)
#   Dornmann et al 2007
#
# Poisson QIC for geeglm{geepack} output
# Ref: Pan (2001)

library(MASS)  # for ginv()

QIC.pois.geeglm <- function(model.R, model.indep){
  # Fitted and observed values for the quasi-likelihood
  mu.R <- model.R$fitted.values
  # alt: X <- model.matrix(model.R)
  #      names(model.R$coefficients) <- NULL
  #      beta.R <- model.R$coefficients
  #      mu.R <- exp(X %*% beta.R)
  y <- model.R$y

  # Quasi-likelihood for Poisson
  quasi.R <- sum((y * log(mu.R)) - mu.R)  # poisson()$dev.resids with scale and weights = 1

  # Trace term (penalty for model complexity):
  # Omega-hat(I) via the Moore-Penrose generalized inverse (ginv) in the MASS package
  AIinverse <- ginv(model.indep$geese$vbeta.naiv)
  # Alt: AIinverse <- solve(model.indep$geese$vbeta.naiv)
  Vr <- model.R$geese$vbeta
  trace.R <- sum(diag(AIinverse %*% Vr))
  px <- length(model.R$coefficients)  # number of non-redundant columns in the design matrix

  # QIC
  QIC <- (-2) * quasi.R + 2 * trace.R
  QICu <- (-2) * quasi.R + 2 * px  # approximation assuming the model is structured correctly
  output <- c(QIC, QICu, quasi.R, trace.R, px)
  names(output) <- c('QIC', 'QICu', 'Quasi Lik', 'Trace', 'px')
  return(output)
}
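For context, here is a hypothetical usage sketch. The simulated data, variable names, and parameter values are made up; it assumes geepack is installed and that the QIC.pois.geeglm function above has been sourced.

```r
library(geepack)

# Simulated repeated-count data: 10 sites visited 4 times each (hypothetical);
# site IDs are passed to geeglm via the id argument
set.seed(1)
dat <- data.frame(
  site  = rep(1:10, each = 4),
  cover = rep(runif(10), each = 4)
)
dat$count <- rpois(nrow(dat), lambda = exp(0.5 + dat$cover))

# Candidate model with an exchangeable working correlation
mod.exch <- geeglm(count ~ cover, id = site, data = dat,
                   family = poisson, corstr = "exchangeable")

# Same model with an independence working correlation, which supplies
# the naive variance for the trace term in QIC
mod.indep <- geeglm(count ~ cover, id = site, data = dat,
                    family = poisson, corstr = "independence")

QIC.pois.geeglm(mod.exch, mod.indep)  # lower QIC = more parsimonious model
```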
