This is my first blog since joining R-bloggers. I’m quite excited to be part of this group and apologize if I bore any experienced R users with my basic blogs for learning R or offend programmers with my inefficient, sloppy coding. Hopefully writing for this most excellent community will help improve my R skills while helping other novice R users.

I have a dataset from my dissertation research were I repeatedly counted salamanders from some plots and removed salamanders from other plots. I was interested in plotting the captures over time (sensu Hairston 1987). As with all statistics and graphs in R, there are a variety of ways to create the same or similar output. One challenge I always have is dealing with dates. Another challenge is plotting error bars on plots. Now I had the challenge of plotting the average captures per night or month ± SE for two groups (reference and depletion plots – 5 of each on each night) vs. time. This is the type of thing that could be done in 5 minutes on graph paper by hand but took me a while in R and I’m still tweaking various plots. Below I explore a variety of ways to handle and plot this data. I hope it’s helpful for others. Chime in with comments if you have any suggestions or similar experiences.

> ####################################################################

> # Summary of Count and Removal Data

> # Dissertation Project 2011

> # October 25, 2011

> # Daniel J. Hocking

> ####################################################################

>

> Data <- read.table(‘/Users/Dan/…/AllCountsR.txt’, header = TRUE, na.strings = “NA”)

>

> str(Data)

‘data.frame’: 910 obs. of 32 variables:

$ dates : Factor w/ 91 levels “04/06/09″,”04/13/11”,..: 12 14 15 16 17 18 19 21 22 23 …

You’ll notice that when importing a text file created in excel with the default date format, R treats the date variable as a Factor within the data frame. We need to convert it to a date form that R can recognize. Two such built-in functions are as.Date and as.POSIXct. The latter is a more common format and the one I choose to use (both are very similar but not fully interchangeable). To get the data in the POSIXct format in this case I use the strptime function as seen below. I also create a couple new columns of day, month, and year in case they become useful for aggregating or summarizing the data later.

>

> dates <-strptime(as.character(Data$dates), “%m/%d/%y”) # Change date from excel storage to internal R format

>

> dim(Data)

[1] 910 32

> Data = Data[,2:32] # remove

> Data = data.frame(date = dates, Data) #this is now the date in useful fashion

>

> Data$mo <- strftime(Data$date, “%m”)

> Data$mon <- strftime(Data$date, “%b”)

> Data$yr <- strftime(Data$date, “%Y”)

>

> monyr <- function(x)

+ {

+ x <- as.POSIXlt(x)

+ x$mday <- 1

+ as.POSIXct(x)

+ }

>

> Data$mo.yr <- monyr(Data$date)

>

> str(Data)

‘data.frame’: 910 obs. of 36 variables:

$ date : POSIXct, format: “2008-05-17” “2008-05-18” …

$ mo : chr “05” “05” “05” “05” …

$ mon : chr “May” “May” “May” “May” …

$ yr : chr “2008” “2008” “2008” “2008” …

$ mo.yr : POSIXct, format: “2008-05-01 00:00:00” “2008-05-01 00:00:00” …

>

As you can see the date is now in the internal R date form of POSIXct (YYYY-MM-DD).

Now I use a custom function to summarize each night of counts and removals. I forgot offhand how to call to a custom function stored elsewhere to I lazily pasted it in my script. I found this nice little function online but I apologize to the author because I don’t remember were I found it.

> library(ggplot2)

> library(doBy)

>

> ## Summarizes data.

> ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).

> ## If there are within-subject variables, calculate adjusted values using method from Morey (2008).

> ## measurevar: the name of a column that contains the variable to be summariezed

> ## groupvars: a vector containing names of columns that contain grouping variables

> ## na.rm: a boolean that indicates whether to ignore NA’s

> ## conf.interval: the percent range of the confidence interval (default is 95%)

> summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {

+ require(doBy)

+

+ # New version of length which can handle NA’s: if na.rm==T, don’t count them

+ length2 <- function (x, na.rm=FALSE) {

+ if (na.rm) sum(!is.na(x))

+ else length(x)

+ }

+

+ # Collapse the data

+ formula <- as.formula(paste(measurevar, paste(groupvars, collapse=” + “), sep=” ~ “))

+ datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

+

+ # Rename columns

+ names(datac)[ names(datac) == paste(measurevar, “.mean”, sep=””) ] <- measurevar

+ names(datac)[ names(datac) == paste(measurevar, “.sd”, sep=””) ] <- “sd”

+ names(datac)[ names(datac) == paste(measurevar, “.length2″, sep=””) ] <- “N”

+

+ datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean

+

+ # Confidence interval multiplier for standard error

+ # Calculate t-statistic for confidence interval:

+ # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1

+ ciMult <- qt(conf.interval/2 + .5, datac$N-1)

+ datac$ci <- datac$se * ciMult

+

+ return(datac)

+ }

>

> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval

> gCount <- summarySE(Count, measurevar=”count”, groupvars=c(“date”,”trt”))

>

Now I’m ready to plot. I’ll start with a line graph of the two treatment plot types.

> ### Line Plot ###

> # Ref: Quick-R

>

> # convert factor to numeric for convenience

> gCount$trtcode <- as.numeric(gCount$trt)

> ntrt <- max(gCount$trtcode)

>

> # ranges for x and y axes

> xrange <- range(gCount$date)

> yrange <- range(gCount$count)

>

> # Set up blank plot

> plot(gCount$date, gCount$count, type = “n”,

+ xlab = “Date”,

+ ylab = “Mean number of salamanders per night”)

>

> # add lines

> color <- seq(1:ntrt)

> line <- seq(1:ntrt)

> for (i in 1:ntrt){

+ Treatment <- subset(gCount, gCount$trtcode==i)

+ lines(Treatment$date, Treatment$count, col = color[i])#, lty = line[i])

+ }

As you can see this is not attractive but it does show that I generally caught more salamander in the reference (red line) plots. This makes it apparent that a line graph is probably a bit messy for this data and might even be a bit misleading because data was not collected continuously or at even intervals (weather and season dependent collection). So, I tried a plot of just the points using the scatterplot function from the “car” package.

> ### package car scatterplot by groups ###

> library(car)

>

> # Plot

> scatterplot(count ~ date + trt, data = gCount,

+ smooth = FALSE, grid = FALSE, reg.line = FALSE,

+ xlab=”Date”, ylab=”Mean number of salamanders per night”)

>

This was nice because it was very simple to code. It includes points from every night but I would still like to summarize it more. Before I get to that, I would like to try having breaks between the years. The lattice package should be useful for this.

> library(lattice)

>

> # Add year, month, and day to dataframe

> chardate <- as.character(gCount$date)

> splitdate <- strsplit(chardate, split = “-“)

> gCount$year <- as.numeric(unlist(lapply(splitdate, “[“, 1)))

> gCount$month <- as.numeric(unlist(lapply(splitdate, “[“, 2)))

> gCount$day <- as.numeric(unlist(lapply(splitdate, “[“, 3)))

>

>

> # Plot

> xyplot(count ~ trt + date | year,

+ data = gCount,

+ ylab=”Daily salamander captures”, xlab=”date”,

+ pch = seq(1:ntrt),

+ scales=list(x=list(alternating=c(1, 1, 1))),

+ between=list(y=1),

+ par.strip.text=list(cex=0.7),

+ par.settings=list(axis.text=list(cex=0.7)))

>

Obviously there is a problem with this. I am not getting proper overlaying of the two treatments. I tried adjusting the equation (e.g. count ~ month | year*trt), but nothing was that enticing and I decided to go back to other plotting functions. The lattice package is great for trellis plots and visualizing mixed effects models.

I now decided to summarize the data by month rather than by day and add standard error bars. This goes back to using the base plot function.

> ### Line Plot ###

>

> # summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval

> mCount <- summarySE(Count, measurevar=”count”, groupvars=c(“mo.yr”,”trt”))

> refmCount <- subset(mCount, mCount$trt == “Reference”)

> depmCount <- subset(mCount, mCount$trt == “Depletion”)

>

> daterange=c(as.POSIXct(min(mCount$mo.yr)),as.POSIXct(max(mCount$mo.yr)))

>

> # determine the lowest and highest months

> ylims <- c(0, max(mCount$count + mCount$se))

> r <- as.POSIXct(range(refmCount$mo.yr), “month”)

>

> plot(refmCount$mo.yr, refmCount$count, type = “n”, xaxt = “n”,

+ xlab = “Date”,

+ ylab = “Mean number of salamanders per night”,

+ xlim = c(r[1], r[2]),

+ ylim = ylims)

> axis.POSIXct(1, at = seq(r[1], r[2], by = “month”), format = “%b”)

> points(refmCount$mo.yr, refmCount$count, type = “p”, pch = 19)

> points(depmCount$mo.yr, depmCount$count, type = “p”, pch = 24)

> arrows(refmCount$mo.yr, refmCount$count+refmCount$se, refmCount$mo.yr, refmCount$count-refmCount$se, angle=90, code=3, length=0)

> arrows(depmCount$mo.yr, depmCount$count+depmCount$se, depmCount$mo.yr, depmCount$count-depmCount$se, angle=90, code=3, length=0)

>

Now that’s a much better visualization of the data and that’s the whole goal of a figure for publication. The only thing I might change would be I might plot by year with the labels of Month-Year (format = %b $Y). I might add a legend but with only two treatments I might just include the info in the figure description.

Although that is probably going to be my final product for my current purposes, I wanted to explore a few other graphing options for visualizing this data. The first is to use box plots. I use the add = TRUE option to add a second group after subsetting the data.

> ### Boxplot ###

>

> # as.POSIXlt(date)$mon #gives the months in numeric order mod 12 with January = 0 and December = 11

>

> refboxplot <- boxplot(count ~ date, data = Count, subset = trt == “Reference”,

+ ylab = “Mean number of salamanders per night”,

+ xlab = “Date”) #show the graph and save the data

> depboxplot <- boxplot(count ~ date, data = Count, subset = trt == “Depletion”, col = 2, add = TRUE)

>

Clearly this is a mess and not useful. But you can imagine that with some work and summarizing by month or season it could be a useful and efficient way to present the data. Next I tried the popular package ggplot2.

> ### ggplot ###

>

> library(ggplot2)

>

> ggplot(data = gCount, aes(x = date, y = count, group = trt)) +

+ #geom_point(aes(shape = factor(trt))) +

+ geom_point(aes(colour = factor(trt), shape = factor(trt)), size = 3) +

+ #geom_line() +

+ geom_errorbar(aes(ymin=count-se, ymax=count+se), width=.1) +

+ #geom_line() +

+ # scale_shape_manual(values=c(24,21)) + # explicitly have sham=fillable triangle, ACCX=fillable circle

+ #scale_fill_manual(values=c(“white”,”black”)) + # explicitly have sham=white, ACCX=black

+ xlab(“Date”) +

+ ylab(“Mean number of salamander captures per night”) +

+ scale_colour_hue(name=”Treatment”, # Legend label, use darker colors

+ l=40) + # Use darker colors, lightness=40

+ theme_bw() + # make the theme black-and-white rather than grey (do this before font changes, or it overrides them)

+

+ opts(legend.position=c(.2, .9), # Position legend inside This must go after theme_bw

+ panel.grid.major = theme_blank(), # switch off major gridlines

+ panel.grid.minor = theme_blank(), # switch off minor gridlines

+ legend.title = theme_blank(), # switch off the legend title

+ legend.key = theme_blank()) # switch off the rectangle around symbols in the legend

>

This plot could work with some fine tuning, especially with the legend(s) but you get the idea. It wasn’t as easy for me as the plot function but ggplot is quite versatile and probably a good package to have in your back pocket for complicated graphing.

Next up was the gplots package for the plotCI function.

> library(gplots)

>

> plotCI(

+ x = refmCount$mo.yr,

+ y = refmCount$count,

+ uiw = refmCount$se, # error bar length (default is to put this much above and below point)

+ pch = 24, # symbol (plotting character) type: see help(pch); 24 = filled triangle pointing up

+ pt.bg = “white”, # fill colour for symbol

+ cex = 1.0, # symbol size multiplier

+ lty = 1, # error bar line type: see help(par) – not sure how to change plot lines

+ type = “p”, # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)

+ gap = 0, # distance from symbol to error bar

+ sfrac = 0.005, # width of error bar as proportion of x plotting region (default 0.01)

+ xlab = “Year”, # x axis label

+ ylab = “Mean number of salamanders per night”,

+ las = 1, # axis labels horizontal (default is 0 for always parallel to axis)

+ font.lab = 1, # 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol

+ xaxt = “n”) # Don’t print x-axis

> )

> axis.POSIXct(1, at = seq(r[1], r[2], by = “year”), format = “%b %Y”) # label the x axis by month-years

>

> plotCI(

+ x=depmCount$mo.yr,

+ y = depmCount$count,

+ uiw=depmCount$se, # error bar length (default is to put this much above and below point)

+ pch=21, # symbol (plotting character) type: see help(pch); 21 = circle

+ pt.bg=”grey”, # fill colour for symbol

+ cex=1.0, # symbol size multiplier

+ lty=1, # line type: see help(par)

+ type=”p”, # p=points, l=lines, b=both, o=overplotted points/lines, etc.; see help(plot.default)

+ gap=0, # distance from symbol to error bar

+ sfrac=0.005, # width of error bar as proportion of x plotting region (default 0.01)

+ xaxt = “n”,

+ add=TRUE # ADD this plot to the previous one

+ )

>

Now this is a nice figure. I must say that I like this. It is very similar to the standard plot code and graph but it was a little easier to add the error bars.

So that’s it, what a fun weekend I had! Let me know what you think or if you have any suggestions. I’m new to this and love to learn new ways of coding things in R.

Thanks for posting this. I like your final graph. I have similar kinds of data which I'd like to plot like this, so will find this useful.In terms of the post, I could offer a couple of suggestions. Firstly it's nicer to have an example that people can try themselves – either provide the dataset to download, or use one already freely available online/ through an R package. Secondly I think it would be better if you just showed how you got the final graph to make the post shorter – although it is quite nice to see how you arrived at the final graph too.

Thanks for the interest and advice Tom. Good luck with your research.

Good post. I have been making a lot of graphs like this for the project I am currently working on. I also love summarySE and use it a lot. One option you didn't mention is xYplot in the package Hmisc. It's nice because it has all the functionality of lattice (grouping variables, panel conditioning etc) but lets you add error bars. It's not necessarily better than the solutions you showed though. I have an example here: http://anthony.darrouzet-nardi.net/scienceblog/?p=406

Thanks Anthony. That's the great thing about R, there are many solutions, some better, some worse, some easier, some harder, and some just different. That's why I like being part of the R-Bloggers community. I am new to all this an it really helps increase my R toolbox. I have heard some praise for the Hmisc package but haven't ever really tried it. I like the sound of xYplot, I'll have to give it a try. I have lots of longitudinal data so it could be really handy.

Here's a message that I got via email that I thought was good advice and worth sharing:R is excellent for such an exporative analysis. Good that you had fun doing it. Some suggestions: – if you plot error bars (first) and means (second) independently (using functions segments() and points()) you could avoid overplotting of mean values by error bars of the other group (e.g. June 2011)- you could also shift depletion data points slightly to left and reference to right, such that error bars do not overlap But most important:- think about which conclusion you would like the observer to draw from your data/plot?- do you need to leave huge gaps without data points? — imo only if you want to show a (linear) trend over time that would be skewed if you don't- are you interested in absolute numbers or in the difference towards a reference value? — if the latter than plot the differences per time point instead of both original values- are you interested in differences between years? — hopefully not, as you sampled not within the same time period each year Basically: Think about what is your hypothesis and which value derived from your data illustrates the answer to this hypothesis best. This is what you should put in the focus of your plot.