High Resolution Figures in R

LogisticAs I was recently preparing a manuscript for PLOS ONE, I realized the default resolution of R and RStudio images are insufficient for publication. PLOS ONE requires 300 ppi images in TIFF or EPS (encapsulated postscript) format. In R plots are exported at 72 ppi by default. I love RStudio but was disappointed to find that there was no options for exporting figures at high resolution.

PLOS ONE has extensive instructions for scaling, compressing, and converting image files to meet their standards. Unfortunately, there is no good way to go from low resolution to high resolution (i.e. rescaling in Photoshop) as my friend Liam Revell, and phytools author, pointed out with entertaining illustration from PhD comics (upper right panel). The point is if the original image isn’t created at a high enough resolution, then using Photoshop to artificially increase the resolution has problems of graininess because Photoshop can only guess how to interpolate between the points. This might not be a big problem with simple plots created in R because interpolation between points in a line shouldn’t be difficult, particularly when starting with a PDF.

Even if scaling up from a low resolution PDF would work, it would be better to have a direct solution in R. It took some time to figure out but here are some trials and the ultimate solution I came up with:

x <- 1:100
y <- 1:100
plot(x, y) # Make plot
tiff("Plot1.tiff")
dev.off()

Nothing happens in this case. However, by setting up the tiff file first, then making the plot, the resulting TIFF file is saved to your working directory and is 924 KB, 72 ppi, 480 x 480 pixels.

To increase the resolution I tried the following:

tiff("Plot2.tif", res = 300)
plot(x, y) # Make plot
dev.off()

but in RStudio the plot could not be printed and hence not saved because it was too large for the print area. Therefore, I had to open up R directly and run the code. Interestingly, a blank TIFF file was created of the same size as Plot1.tiff. This is where I got hung up for a while. I eventually found that R can’t figure out the other image parameters when resolution is changes or because the default is too big to print, so they have to be specified directly as such:

tiff("Plot3.tiff", width = 4, height = 4, units = 'in', res = 300)
plot(x, y) # Make plot
dev.off()

This creates a TIFF file that is 5,800 KB, 300 ppi, 4 inches by 4 inches. Surprisingly, on a Mac it still indicates that it’s only 72 ppi when viewed in Preview. The larger size indicates that it is actually 300 ppi. I ran the same code but specifically specified res = 72 and the file was only 334 KB, suggesting that Preview is incorrect and the file is really 300 ppi. I played with various compressions but lzw and none were the same while rle resulted in a larger file (less compression). That seems odd again.

Finally, I tried using the bitmap function to create a TIFF:

bitmap("Plot7.tiff", height = 4, width = 4, units = 'in', type="tifflzw", res=300)
plot(x, y)
dev.off()
par(mfrow = c(1,1))

Interestingly, this file is only 9 KB but is listed as 300 dpi, 1200 x 1200 pixels. I’m really not sure why these functions don’t seem to be working as smoothly as expected but hopefully this can help get high resolution images directly from R for publication. I plan to use the bitmap function in the future to create high resolution TIFF files for publication. This is what is desired by outlets such as PLOS ONE. It’s easier than dealing with postscript. I also don’t know if EPS files from R or RStudio are created with LaTeX. I know that can be a problem for PLOS ONE.

 EDIT: Just found a nice blog post with recommendations on device 
outputs on Revolutions here

Below is all the code that includes comparison of sizes with PDF, PNG, JPEG, and EPS files as well.

plot(x, y) # Make plot
tiff("Plot1.tiff")
dev.off()

tiff("Plot2.tiff", res = 300)
plot(x, y) # Make plot
dev.off()

tiff("Plot3.tiff", width = 4, height = 4, units = 'in', res = 300)
plot(x, y) # Make plot
dev.off()

tiff("Plot3.72.tiff", width = 4, height = 4, units = 'in', res = 72)
plot(x, y) # Make plot
dev.off()

tiff("Plot4.tiff", width = 4, height = 4, units = 'in', res = 300, compression = 'lzw')
plot(x, y) # Make plot
dev.off()

tiff("Plot5.tiff", width = 4, height = 4, units = 'in', res = 300, compression = 'none')
plot(x, y) # Make plot
dev.off()

tiff("Plot6.tiff", width = 4, height = 4, units = 'in', res = 300, compression = 'rle')
 # Make plot
dev.off()

bitmap("Plot7.tiff", height = 4, width = 4, units = 'in', type="tifflzw", res=300)
plot(x, y)
dev.off()
par(mfrow = c(1,1))

bitmap("Plot8.tiff", height = 480, width = 480, type="tifflzw", res=300)
plot(x, y)
dev.off()
par(mfrow = c(1,1))

jpeg("Plot3.jpeg", width = 4, height = 4, units = 'in', res = 300)
plot(x, y) # Make plot
dev.off()

tiff("Plot4b.tiff", width = 4, height = 4, pointsize = 1/300, units = 'in', res = 300)
plot(x, y) # Make plot
dev.off()

png("Plot3.png", width = 4, height = 4, units = 'in', res = 300)
plot(x, y) # Make plot
dev.off()

pdf("Plot3.pdf", width = 4, height = 4)
plot(x, y) # Make plot
dev.off()

postscript("Plot3.eps", width = 480, height = 480)
plot(x, y) # Make plot
dev.off()

Building R packages: missing path to pdflatex

Recently whiling trying to build an R package for generalized estimating equation model selection (QICpack on github), I was getting an error related to latex creating the PDF package manuals. It seems like this is a relatively common problem on some versions of Mac OS X, but I did not find it easy to find an answer so I thought I’d describe my ultimate solution.

The specific error I was getting was:

error in texi2dvi(file = file, pdf = true, clean = clean, quiet = quiet, : 
pdflatex is not available

I was getting this error regardless of whether building the package using R CMD packagename in the Terminal or using build in R or Rstudio via the devtools package.

I tried reinstalling MacTex and restarting my computer to no avail. It turned out that it was a problem with not having a path to the Tex installation. You can check using the following code in the Terminal or in R:

Sys.which("pdflatex")
Sys.getenv("PATH")

In my case, I got something like:

Sys.which(“pdflatex”)

 ""

Sys.getenv(“PATH”)

"/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"

I don’t know much about setting paths, profiles, and environments, but this indicated that there is no path to latex and so the build function can’t find pdflatex needed to convert the .Rd files (LaTeX code of manuals in R packages) to PDF files for the R package. I was able to fix the problem by adding a path to the latex binaries using:

Sys.setenv(PATH=paste(Sys.getenv("PATH"),"/usr/texbin",sep=":"))

Now when I run Sys.getenv("PATH") I get

"/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/texbin"

Now R is able to find the LaTeX binaries and when I run Sys.which("pdflatex") I get

              pdflatex 
"/usr/texbin/pdflatex"

That should fix the problem and now the build function can turn the .Rd files into PDF documents of the R functions.

GEE QIC update

Here is improved code for calculating QIC from geeglm in geepack in R (original post). Let me know how it works. I haven’t tested it much, but is seems that QIC may select overparameterized models. In the code below, I had to replace <- with = because wordpress didn’t except <- within code or pre tags. It should still work just fine.

Here is a quick example of how to run this function. First, highlight and run the code below in R. That will save the function in your workspace. Then run your gee model using geeglm in geepack (package available from CRAN). Next, run QIC(your_gee_model) and you get the QIC. You can then repeat this with alternative a priori models. Below the function is an example using data available as part of geepack.

    
############################################################
# QIC for GEE models
# Daniel J. Hocking
############################################################
QIC = function(model.R) {
      library(MASS)
      model.indep = update(model.R, corstr = "independence")
      # Quasilikelihood
      mu.R = model.R$fitted.values
      y = model.R$y
      type = family(model.R)$family
      quasi.R = switch(type,
                    poisson = sum((y*log(mu.R)) - mu.R),
                    gaussian = sum(((y - mu.R)^2)/-2),
                    binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
                    Gamma = sum(-y/(mu.R - log(mu.R))),
                    stop("Error: distribution not recognized"))
  # Trace Term (penalty for model complexity)
  omegaI = ginv(model.indep$geese$vbeta.naiv) # Omega-hat(I) via Moore-Penrose generalized inverse of a matrix in MASS package
  #AIinverse = solve(model.indep$geese$vbeta.naiv) # solve via indenity
  Vr = model.R$geese$vbeta
  trace.R = sum(diag(omegaI %*% Vr))
  px = length(mu.R) # number non-redunant columns in design matrix
  # QIC
  QIC = 2*(trace.R - quasi.R) [EDIT: original post was missing '*']
  #QICu = (-2)*quasi.R + 2*px    # Approximation assuming model structured correctly
  output = c(QIC, quasi.R, trace.R, px)
  names(output) = c('QIC', 'Quasi Lik', 'Trace', 'px')
  return(output)
}

Here’s an example you can run in R.

library(geepack)

data(dietox)
dietox$Cu = as.factor(dietox$Cu)
mf = formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 = geeglm(mf, data = dietox, id = Pig, family = poisson, corstr = "ar1")
gee1
summary(gee1)

mf2 = formula(Weight ~ Cu * Time + I(Time^2) + I(Time^3))
gee2 = geeglm(mf2, data = dietox, id = Pig, family = poisson, corstr = "ar1")
summary(gee2)
anova(gee2)
anova(gee1, gee2)

mf3 = formula(Weight ~ Cu + Time + I(Time^2))
gee3 = geeglm(mf3, data = dietox, id = Pig, family = poisson, corstr = "ar1")
gee3.I = update(gee3, corstr = "independence")
gee3.Ex = update(gee3, corstr = "exchangeable")

sapply(list(gee1, gee2, gee3, gee3.I, gee3.Ex), QIC)

In the output of this model it suggests that model gee1 is the best model. I have some concerns that QIC will almost inevitably choose the most complex model. More testing with simulated data will be necessary.

               [,1]      [,2]      [,3]      [,4]      [,5]
QIC       -333199.7 -333188.0 -333187.5 -333181.8 -333153.6
Quasi Lik  166623.0  166622.7  166620.4  166622.2  166615.4
Trace          23.2      28.7      26.6      31.3      38.6
px            861.0     861.0     861.0     861.0     861.0

You will get warnings when running this model because it uses a Poisson distribution for continuous data. I will work on finding a better example in the future before I make this available as an R package.

Plotting 95% Confidence Bands in R

I am comparing estimates from subject-specific GLMMs and population-average GEE models as part of a publication I am working on. As part of this, I want to visualize predictions of each type of model including 95% confidence bands.

First I had to make a new data set for prediction. I could have compared fitted values with confidence intervals but I am specifically interested in comparing predictions for particular variables while holding others constant. For example, soil temperature is especially important for salamanders, so I am interested in the predicted effects of soil temperature from the different models. I used the expand.grid and model.matrix functions in R to generate a new data set where soil temperature varied from 0 to 30 C. The other variables were held constant at their mean levels during the study. Because of the nature of the contrast argument in the model.matrix function, I had to include more than one level of the factor “season”. I then removed all season except spring. In effect I am asking, what is the effect of soil temperature on salamander activity during the spring when the other conditions are constant (e.g. windspeed = 1.0 m/s, rain in past 24 hours =  This code is based on code from Ben Bolker via http://glmm.wikidot.com

# Compare Effects of SoilT with 95% CIs
formula(glmm1)
newdat.soil <- expand.grid(
SoilT = seq(0, 30, 1),
RainAmt24 = mean(RainAmt24),
RH = mean(RH),
windspeed = mean(windspeed),
season = c("spring", "summer", "fall"),
droughtdays = mean(droughtdays),
count = 0
)
newdat.soil$SoilT2 <- newdat.soil$SoilT^2

# Spring
newdat.soil.spring <- newdat.soil[newdat.soil$season == 'spring', ]

mm = model.matrix(terms(glmm1), newdat.soil)

Next I calculated the 95% confidence intervals for both the GLMM and GEE models. For the GLMM the plo and phi are the low and high confidence intervals for the fixed effects assuming zero effect of the random sites. tlo and thi account for the uncertainty in the random effects.

newdat.soil$count = mm %*% fixef(glmm1)
pvar1 <- diag(mm %*% tcrossprod(vcov(glmm1),mm))
tvar1 <- pvar1+VarCorr(glmm1)$plot[1]
newdat.soil <- data.frame(
newdat.soil
, plo = newdat.soil$count-2*sqrt(pvar1)
, phi = newdat.soil$count+2*sqrt(pvar1)
, tlo = newdat.soil$count-2*sqrt(tvar1)
, thi = newdat.soil$count+2*sqrt(tvar1)
)
mm.geeEX = model.matrix(terms(geeEX), newdat.soil)
newdat.soil$count.gee = mm.geeEX %*% coef(geeEX)
tvar1.gee <- diag(mm.geeEX %*% tcrossprod(geeEX$geese$vbeta, mm.geeEX))
newdat.soil <- data.frame(
newdat.soil
, tlo.gee = newdat.soil$count-2*sqrt(tvar1.gee)
, thi.gee = newdat.soil$count+2*sqrt(tvar1.gee)
)

The standard error of the fixed effects are larger in the GEE model than in the GLMM, but when the variation associated with the random effects are accounted for, the uncertainty (95% CI) around the estimates is greater in the GLMM. This is especially evident when the estimated values are large since the random effects are exponential on the original scale. This can be seen in the below plots

Although this plot does the job, it isn’t an efficient use of space, nor is it easy to compare exactly where the different lines fall. It would be nice to plot everything on one set of axes. The only trouble is that all the lines could be difficult to see just using solid and dashed/dotted lines. To help with this, I combine the plots but added color and shading using the polygon function. The code and plot are below

plot(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
xlab = "Soil temperature (C)",
ylab = 'Predicted salamander observations',
type = 'l',
ylim = c(0, 25))
polygon(c(newdat.soil.spring$SoilT
, rev(newdat.soil.spring$SoilT))
, c(exp(newdat.soil.spring$thi.gee)
, rev(exp(newdat.soil.spring$tlo.gee)))
, col = 'grey'
, border = NA)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi.gee),
type = 'l',
lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo.gee),
type = 'l',
lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count.gee),
type = 'l',
lty = 1,
col = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$count),
col = 1)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$thi),
type = 'l',
lty = 2)
lines(newdat.soil.spring$SoilT, exp(newdat.soil.spring$tlo),
type = 'l',
lty = 2)

GLMM vs GEE plot with 95% confidence intervals

Now you can directly compare the results of the GLMM and GEE models. The predicted values (population-averaged) for the GEE is represented by the red line, while the average (random effects = 0, just fixed effects) from the GLMM are represented by the solid black line. The dashed lines represent the 95% confidence intervals for the GLMM and the shaded area is the 95% confidence envelope for the GEE model. As you can see, the GEE has much higher confidence in it’s prediction of soil temperature effects on salamander surface activity than the GLMM model. This would not be apparent without visualizing the predictions with confidence intervals because the standard errors of the fixed effects were lower in the GLMM than in the GEE. This is because the SEs in the GEE include the site-level (random effect) variation while the GLMM SEs of the covariates do not include this variation and are interpreted as the effect of a change of 1 X on Y at a given site.