Blog Archives

Plotting grouped data vs time with error bars in R

This is my first blog since joining R-bloggers. I’m quite excited to be part of this group and apologize if I bore any experienced R users with my basic blogs for learning R or offend programmers with my inefficient, sloppy coding. Hopefully writing for this most excellent community will help improve my R skills while helping other novice R users.

I have a dataset from my dissertation research were I repeatedly counted salamanders from some plots and removed salamanders from other plots. I was interested in plotting the captures over time (sensu Hairston 1987). As with all statistics and graphs in R, there are a variety of ways to create the same or similar output. One challenge I always have is dealing with dates. Another challenge is plotting error bars on plots. Now I had the challenge of plotting the average captures per night or month ± SE for two groups (reference and depletion plots – 5 of each on each night) vs. time. This is the type of thing that could be done in 5 minutes on graph paper by hand but took me a while in R and I’m still tweaking various plots. Below I explore a variety of ways to handle and plot this data. I hope it’s helpful for others. Chime in with comments if you have any suggestions or similar experiences.
> ####################################################################
> # Summary of Count and Removal Data
> # Dissertation Project 2011
> # October 25, 2011
> # Daniel J. Hocking
> ####################################################################
>
> Data <- read.table(‘/Users/Dan/…/AllCountsR.txt’, header = TRUE, na.strings = “NA”)
>
> str(Data)
‘data.frame':            910 obs. of  32 variables:
 $ dates            : Factor w/ 91 levels “04/06/09″,”04/13/11″,..: 12 14 15 16 17 18 19 21 22 23 …
You’ll notice that when importing a text file created in excel with the default date format, R treats the date variable as a Factor within the data frame. We need to convert it to a date form that R can recognize. Two such built-in functions are as.Date and as.POSIXct. The latter is a more common format and the one I choose to use (both are very similar but not fully interchangeable). To get the data in the POSIXct format in this case I use the strptime function as seen below. I also create a couple new columns of day, month, and year in case they become useful for aggregating or summarizing the data later.
>
> dates <-strptime(as.character(Data$dates), “%m/%d/%y”)  # Change date from excel storage to internal R format
>
> dim(Data)
[1] 910  32
> Data = Data[,2:32]               # remove      
> Data = data.frame(date = dates, Data)      #this is now the date in useful fashion
>
> Data$mo <- strftime(Data$date, “%m”)
> Data$mon <- strftime(Data$date, “%b”)
> Data$yr <- strftime(Data$date, “%Y”)
>
> monyr <- function(x)
+ {
+     x <- as.POSIXlt(x)
+     x$mday <- 1
+     as.POSIXct(x)
+ }
>
> Data$mo.yr <- monyr(Data$date)
>
> str(Data)
‘data.frame':            910 obs. of  36 variables:
 $ date             : POSIXct, format: “2008-05-17″ “2008-05-18″ …
$ mo               : chr  “05” “05” “05” “05” …
 $ mon              : chr  “May” “May” “May” “May” …
 $ yr               : chr  “2008” “2008” “2008” “2008” …
 $ mo.yr            : POSIXct, format: “2008-05-01 00:00:00″ “2008-05-01 00:00:00″ …
>
As you can see the date is now in the internal R date form of POSIXct (YYYY-MM-DD).
Now I use a custom function to summarize each night of counts and removals. I forgot offhand how to call to a custom function stored elsewhere to I lazily pasted it in my script. I found this nice little function online but I apologize to the author because I don’t remember were I found it.
> library(ggplot2)
> library(doBy)
>
> ## Summarizes data.
> ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
> ## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
> ##   measurevar: the name of a column that contains the variable to be summariezed
> ##   groupvars: a vector containing names of columns that contain grouping variables
> ##   na.rm: a boolean that indicates whether to ignore NA’s
> ##   conf.interval: the percent range of the confidence interval (default is 95%)
> summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
+     require(doBy)
+
+     # New version of length which can handle NA’s: if na.rm==T, don’t count them
+     length2 <- function (x, na.rm=FALSE) {
+         if (na.rm) sum(!is.na(x))
+         else       length(x)
+     }
+
+     # Collapse the data
+     formula <- as.formula(paste(measurevar, paste(groupvars, collapse=” + “), sep=” ~ “))
+     datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
+
+     # Rename columns
+     names(datac)[ names(datac) == paste(measurevar, ".mean",    sep="") ] <- measurevar
+     names(datac)[ names(datac) == paste(measurevar, ".sd",      sep="") ] <- “sd”
+     names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- “N”
+    
+     datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
+    
+     # Confidence interval multiplier for standard error
+     # Calculate t-statistic for confidence interval:
+     # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
+     ciMult <- qt(conf.interval/2 + .5, datac$N-1)
+     datac$ci <- datac$se * ciMult
+    
+     return(datac)
+ }
>
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> gCount <- summarySE(Count, measurevar=”count”, groupvars=c(“date”,”trt”))
Now I’m ready to plot. I’ll start with a line graph of the two treatment plot types.
> ### Line Plot ###
> # Ref: Quick-R
>
> # convert factor to numeric for convenience
> gCount$trtcode <- as.numeric(gCount$trt)
> ntrt <- max(gCount$trtcode)
>
> # ranges for x and y axes
> xrange <- range(gCount$date)
> yrange <- range(gCount$count)
>
> # Set up blank plot
> plot(gCount$date, gCount$count, type = “n”,
+      xlab = “Date”,
+      ylab = “Mean number of salamanders per night”)
>
> # add lines
> color <- seq(1:ntrt)
> line <- seq(1:ntrt)
> for (i in 1:ntrt){
+   Treatment <- subset(gCount, gCount$trtcode==i)
+   lines(Treatment$date, Treatment$count, col = color[i])#, lty = line[i])
+ }

As you can see this is not attractive but it does show that I generally caught more salamander in the reference (red line) plots. This makes it apparent that a line graph is probably a bit messy for this data and might even be a bit misleading because data was not collected continuously or at even intervals (weather and season dependent collection). So, I tried a plot of just the points using the scatterplot function from the “car” package.
> ### package car scatterplot by groups ###
> library(car)
>
> # Plot
> scatterplot(count ~ date + trt, data = gCount,
+             smooth = FALSE, grid = FALSE, reg.line = FALSE,
+    xlab=”Date”, ylab=”Mean number of salamanders per night”)

This was nice because it was very simple to code. It includes points from every night but I would still like to summarize it more. Before I get to that, I would like to try having breaks between the years. The lattice package should be useful for this.
> library(lattice)
>
> # Add year, month, and day to dataframe
> chardate <- as.character(gCount$date)
> splitdate <- strsplit(chardate, split = “-“)
> gCount$year <- as.numeric(unlist(lapply(splitdate, “[“, 1)))
> gCount$month <- as.numeric(unlist(lapply(splitdate, “[“, 2)))
> gCount$day <- as.numeric(unlist(lapply(splitdate, “[“, 3)))
>
>
> # Plot
> xyplot(count ~ trt + date | year,
+ data = gCount,
+ ylab=”Daily salamander captures”, xlab=”date”,
+ pch = seq(1:ntrt),
+ scales=list(x=list(alternating=c(1, 1, 1))),
+ between=list(y=1),
+ par.strip.text=list(cex=0.7),
+ par.settings=list(axis.text=list(cex=0.7)))

Obviously there is a problem with this. I am not getting proper overlaying of the two treatments. I tried adjusting the equation (e.g. count ~ month | year*trt), but nothing was that enticing and I decided to go back to other plotting functions. The lattice package is great for trellis plots and visualizing mixed effects models.
 I now decided to summarize the data by month rather than by day and add standard error bars. This goes back to using the base plot function.
> ### Line Plot ###
>
> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
> mCount <- summarySE(Count, measurevar=”count”, groupvars=c(“mo.yr”,”trt”))
> refmCount <- subset(mCount, mCount$trt == “Reference”)
> depmCount <- subset(mCount, mCount$trt == “Depletion”)
>
> daterange=c(as.POSIXct(min(mCount$mo.yr)),as.POSIXct(max(mCount$mo.yr)))
>
> # determine the lowest and highest months
> ylims <- c(0, max(mCount$count + mCount$se))
> r <- as.POSIXct(range(refmCount$mo.yr), “month”)
>
> plot(refmCount$mo.yr, refmCount$count, type = “n”, xaxt = “n”,
+      xlab = “Date”,
+      ylab = “Mean number of salamanders per night”,
+      xlim = c(r[1], r[2]),
+      ylim = ylims)
> axis.POSIXct(1, at = seq(r[1], r[2], by = “month”), format = “%b”)
> points(refmCount$mo.yr, refmCount$count, type = “p”, pch = 19)
> points(depmCount$mo.yr, depmCount$count, type = “p”, pch = 24)
> arrows(refmCount$mo.yr, refmCount$count+refmCount$se, refmCount$mo.yr, refmCount$count-refmCount$se, angle=90, code=3, length=0)
> arrows(depmCount$mo.yr, depmCount$count+depmCount$se, depmCount$mo.yr, depmCount$count-depmCount$se, angle=90, code=3, length=0)

Now that’s a much better visualization of the data and that’s the whole goal of a figure for publication. The only thing I might change would be I might plot by year with the labels of Month-Year (format = %b $Y). I might add a legend but with only two treatments I might just include the info in the figure description.
Although that is probably going to be my final product for my current purposes, I wanted to explore a few other graphing options for visualizing this data. The first is to use box plots. I use the add = TRUE option to add a second group after subsetting the data.
> ### Boxplot ###
>
> #    as.POSIXlt(date)$mon     #gives the months in numeric order mod 12 with January = 0 and December = 11
>
> refboxplot <- boxplot(count ~ date, data = Count, subset = trt == “Reference”,
+                       ylab = “Mean number of salamanders per night”,
+                       xlab = “Date”)   #show the graph and save the data
> depboxplot <- boxplot(count ~ date, data = Count, subset = trt == “Depletion”, col = 2, add = TRUE)
>

Clearly this is a mess and not useful. But you can imagine that with some work and summarizing by month or season it could be a useful and efficient way to present the data. Next I tried the popular package ggplot2.
> ### ggplot ###
>  
> library(ggplot2)
>
> ggplot(data = gCount, aes(x = date, y = count, group = trt)) +
+     #geom_point(aes(shape = factor(trt))) +
+     geom_point(aes(colour = factor(trt), shape = factor(trt)), size = 3) +
+     #geom_line() +
+     geom_errorbar(aes(ymin=count-se, ymax=count+se), width=.1) +
+     #geom_line() +
+    # scale_shape_manual(values=c(24,21)) + # explicitly have sham=fillable triangle, ACCX=fillable circle
+     #scale_fill_manual(values=c(“white”,”black”)) + # explicitly have sham=white, ACCX=black
+     xlab(“Date”) +
+     ylab(“Mean number of salamander captures per night”) +
+     scale_colour_hue(name=”Treatment”, # Legend label, use darker colors
+                      l=40) +                  # Use darker colors, lightness=40
+     theme_bw() + # make the theme black-and-white rather than grey (do this before font changes, or it overrides them)
+
+     opts(legend.position=c(.2, .9), # Position legend inside This must go after theme_bw
+           panel.grid.major = theme_blank(), # switch off major gridlines
+           panel.grid.minor = theme_blank(), # switch off minor gridlines
+          legend.title = theme_blank(), # switch off the legend title
+          legend.key = theme_blank()) # switch off the rectangle around symbols in the legend
This plot could work with some fine tuning, especially with the legend(s) but you get the idea. It wasn’t as easy for me as the plot function but ggplot is quite versatile and probably a good package to have in your back pocket for complicated graphing.
Next up was the gplots package for the plotCI function.
> library(gplots)
>
> plotCI(
+   x = refmCount$mo.yr,
+   y = refmCount$count,
+   uiw = refmCount$se, # error bar length (default is to put this much above and below point)
+   pch = 24, # symbol (plotting character) type: see help(pch); 24 = filled triangle pointing up
+   pt.bg = “white”, # fill colour for symbol
+   cex = 1.0, # symbol size multiplier
+   lty = 1, # error bar line type: see help(par) – not sure how to change plot lines
+   type = “p”, # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap = 0, # distance from symbol to error bar
+   sfrac = 0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xlab = “Year”, # x axis label
+   ylab = “Mean number of salamanders per night”,
+   las = 1, # axis labels horizontal (default is 0 for always parallel to axis)
+   font.lab = 1, # 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
+   xaxt = “n”) # Don’t print x-axis
> )
>   axis.POSIXct(1, at = seq(r[1], r[2], by = “year”), format = “%b %Y”) # label the x axis by month-years
>
> plotCI(
+   x=depmCount$mo.yr,
+   y = depmCount$count,
+   uiw=depmCount$se, # error bar length (default is to put this much above and below point)
+   pch=21, # symbol (plotting character) type: see help(pch); 21 = circle
+   pt.bg=”grey”, # fill colour for symbol
+   cex=1.0, # symbol size multiplier
+   lty=1, # line type: see help(par)
+   type=”p”, # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)
+   gap=0, # distance from symbol to error bar
+   sfrac=0.005, # width of error bar as proportion of x plotting region (default 0.01)
+   xaxt = “n”,
+   add=TRUE # ADD this plot to the previous one
+ )

Now this is a nice figure. I must say that I like this. It is very similar to the standard plot code and graph but it was a little easier to add the error bars.
So that’s it, what a fun weekend I had! Let me know what you think or if you have any suggestions. I’m new to this and love to learn new ways of coding things in R.

Model Validation: Interpreting Residual Plots

When conducting any statistical analysis it is important to evaluate how well the model fits the data and that the data meet the assumptions of the model. There are numerous ways to do this and a variety of statistical tests to evaluate deviations from model assumptions. However, there is little general acceptance of any of the statistical tests. Generally statisticians (which I am not but I do my best impression) examine various diagnostic plots after running their regression models. There are a number of good sources of information on how to do this. My recommendation is Fox and Weisberg’s An R Companion to Applied Regression (Chp 6). You can refer to Fox’s book, Applied Regression Analysis and Generalized Linear Models for the theory and details behind these plots but the corresponding R book is more of the “how to” guide. A very brief but good introduction to checking linear model assumptions can be found here.

The point of this post isn’t to go over the details or theory but rather discuss one of the challenges that I and others have had with interpreting these diagnostic plots. Without going into the differences between standardized, studentized, Pearson’s and other residuals, I will say that most of the model validation centers around the residuals (essentially the distance of the data points from the fitted regression line). Here is an example from Zuur and Colleagues’ excellent book, Mixed Effects Models and Extensions in Ecology with R:

So these residuals appear exhibit homogeneity, normality, and independence. Those are pretty clear, although I’m not sure if the variation in residuals associated with the predictor (independent) variable Month is a problem. This might be a problem with heterogeneity. Most books just show a few examples like this and then residuals with clear patterning, most often increasing residual values with increasing fitted values (i.e. large values in the response/dependent variable results in greater variation, which is often correct with a log transformation). A good example of this can be see in (d) below in fitted vs. residuals plots (like top left plot in figure above).

These are the type of idealized examples usually shown. I think it’s important to show these perfect examples of problems but I wish I could get expert opinions on more subtle, realistic examples. These figures are often challenging to interpret because the density of points also changes along the x-axis. I don’t have a good example of this but will add one in when I get one. Instead I will show some diagnostic plots that I’ve generated as part of a recent attempt to fit a Generalized Linear Mixed Model (GLMM) to problematic count data.

The assumption of normality (upper left) is probably sufficient. However, the plot of the fitted vs. residuals (upper right) seems to have more variation at mid-level values compared with the low or high fitted values. Is this patten enough to be problematic and suggest a poor model fit? Is it driven by greater numbers of points at mid-level fitted values? I’m not sure. The diagonal dense line of points is generated by the large number of zeros in the dataset. My model does seem to have some problem fitting the zeros. I have two random effects in my GLMM. The residuals across plots (5 independent sites/subjects on which the data was repeatedly measured – salamanders were counted on the same 5 plots repeatedly over 4 years) don’t show any pattern. However, there is heterogeneity in residuals among years (bottom right). This isn’t surprising given that I collected much more data over a greater range of conditions in some years. This is a problem for the model and this variation will need to be modeled better.

So I refit the model and came up with these plots (different plots for further discussion rather than direct comparison):

Here you can see considerable variation from normality for the overall model (upper left) but okay normality within plots (lower right). The upper right plot is an okay example of what I was talking about with changes in density making interpretation difficult. There are far more points at lower values and a sparsity of points are very high fitted values. The eye is often pulled in the direction of the few points on the right creating difficult in interpretation. To help with this I like to add a loess smoother or smoothing spline (solid line) and a horizontal line at zero (broken line). The smoothing line should be approximately straight and horizontal around zero. Basically it should overlay the horizontal zero line. Here’s the code to do it in R for a fitted linear mixed model (lme1):
plot(fitted(lme1), residuals(lme1),
  xlab = “Fitted Values”, ylab = “Residuals”)
  abline(h=0, lty=2)
  lines(smooth.spline(fitted(lme1), residuals(lme1)))

This also helps determine if the points are symmetrical around zero. I often also find it useful to plot the absolute value of the residuals with the fitted values. This helps visualize if there is a trend in direction (bias). It can also help to better see changes in spread of the residuals indicating heterogeneity. The bias can be detected with a sloping loess or smooth spline. In the lower left plot, you can see little evidence of bias but some evidence of heterogeneity (change in spread of points). Again, I an not sure if this is bad enough to invalidate the model but in combination with the deviation from normality I would reject the fit of this model.

In a mixed model it can be important to look at variation across the values of the random effects. In my case here is an example of fitted vs. residuals for each of the plots (random sites/subjects). I used the following code, which takes advantage of the lattice package in R.
# Check for residual pattern within groups and difference between groups     
xyplot(residuals(glmm1) ~ fitted(glmm1) | Count$plot, main = “glmm1 – full model by plot”,
  panel=function(x, y){
    panel.xyplot(x, y)
    panel.loess(x, y, span = 0.75)
    panel.lmline(x, y, lty = 2)  # Least squares broken line
  }
)

And here is another way to visualize a mixed model:

You can see that the variation in the two random effects (Plot and Year) is much better in this model but there are problems with normality and potentially heterogeneity. Since violations of normality are off less concern than the other assumptions, I wonder if this model is completely invalid or if I could make some inference from it. I don’t know and would welcome expert opinion.

Regardless, this model was fit using a poisson GLMM and the deviance divided by the residual degrees of freedom (df) was 5.13, which is much greater than 1, indicating overdispersion. Therefore, I tried to fit the regression using a negative binomial distribution:

# Using glmmPQL via MASS package

library(MASS)

#recommended to run model first as non-mixed to get a starting value for the theta estimate:

#negbin

glmNB1 <- glm.nb(count ~ cday +cday2 + cSoilT + cSoilT2 + cRainAmt24 + cRainAmt242 + RHpct + soak24 + windspeed, data = Count, na.action = na.omit)

summary(glmNB1)

#anova(glmNB1)

#plot(glmNB1)

# Now run full GLMM with initial theta starting point from glm

glmmPQLnb1 <- glmmPQL(count ~ cday +cday2 + cSoilT + cSoilT2 + cRainAmt24 + cRainAmt242 + RHpct + soak24 + windspeed, random = list(~1 | plot, ~1 | year), data = Count, family = negative.binomial(theta = 1.480, link = log), na.action = na.exclude)

Unfortunately, I got the following validation plots:

Clearly, this model doesn’t work for the data. It is quite surprising given the fit of the poisson and that the negative binomial is a more general distribution than the poisson and handles overdispersed count data well usually. I’m not sure what the problem is in this case.

Next I tried to run the model as if all observations were random:

<!– /* Font Definitions */ @font-face {font-family:Cambria; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:0; mso-generic-font-family:auto; mso-font-pitch:variable; mso-font-signature:3 0 0 0 1 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-parent:""; margin:0in; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-ascii-font-family:Cambria; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:Cambria; mso-fareast-theme-font:minor-latin; mso-hansi-font-family:Cambria; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi;} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in; mso-header-margin:.5in; mso-footer-margin:.5in; mso-paper-source:0;} div.Section1 {page:Section1;} — glmmObs1 <- lmer(count ~ cday +cday2 + cSoilT + cSoilT2 + cRainAmt24 + cRainAmt242 + RHpct + soak24 + windspeed + (1 | plot) + (1 | year) + (1 | obs), data = Count, family = poisson) Again I end up with more problematic validation/diagnostic plots:

So that’s about it for now. Hopefully this post helps some people with model validation and interpretation of fitted vs. residual plots. I would love to hear opinions regarding interpretation of residuals and when some pattern is too much and when it is acceptable. Let me know if you have examples of other more subtle residual plots.

Happy coding and may all your analyses run smoothly and provide clear interpretations!

Follow

Get every new post delivered to your Inbox.

Join 54 other followers