Category Archives: R

Lags and Moving Means in dplyr

I’ve been a big fan of dplyr from the start. While learning any new package can be frustrating, overall dplyr is fast and consistent. The non-standard evaluation (see vignette("nse") for details) is both a blessing and a curse, but overall dplyr is making life with big(ish) data much easier. The rapid development makes it a challenge to keep up with new functions and capabilities, but it also makes dplyr more useful every day.

My favorite use of the week has been creating lagged terms and moving means. For example, I often need new variables like yesterday’s air temperature, the average temperature over the past 5 days, or the total precipitation in the preceding 30 days. I had been using the slide function from the DataCombine package, but it was somewhat slow when dealing with ~10 million rows of data. Also, while slide was fine for creating a lag, it didn’t create a moving mean or moving sum. Here’s an example using the slide function:


library(dplyr)
library(zoo)
library(DataCombine)

df = expand.grid(site = factor(seq(10)),
                 year = 2000:2004,
                 day = 1:50)
# use a Poisson so the moving means of temperature are easy to check by hand
df$temp = rpois(dim(df)[1], 5) 
# assume it rains on 1/3 of days, averaging 5 mm each time but highly variable
df$precip = rbinom(dim(df)[1], 1, 1/3) * rlnorm(dim(df)[1], log(5), 1)

# Order by group and date
df = df[order(df$site, df$year, df$day), ]
# create a 1-day lag of temp within each site-year group
df.slide = slide(df, Var = "temp", GroupVar = c("site", "year"),
                 slideBy = -1, NewVar = "temp.lag1")
head(df.slide, 75)
 

The slide function does give a nice reminder:

Remember to order test by site and the time variable before running.

However, I do like the consistency and speed of dplyr since I’m using it for so much. It’s also easy to do multiple things in one pipe series, such as creating a lag and a moving mean. Here’s how I’m doing it in dplyr:

# moving mean for that day and previous days (e.g. a width of 5 gives the mean of that day and the 4 previous days)
df2 = df %>%
  group_by(site, year) %>%
  arrange(site, year, day) %>%
  mutate(temp.5 = rollmean(x = temp, 5, align = "right", fill = NA))
head(df2, 75)

# moving mean for the previous days not including the current day (e.g. 5 represents the mean of the 5 previous days)
df2 = df2 %>%
  mutate(temp.lag1 = lag(temp, n = 1)) %>%
  mutate(temp.5.previous = rollapply(data = temp.lag1, 
                            width = 5, 
                            FUN = mean, 
                            align = "right", 
                            fill = NA, 
                            na.rm = T))
head(df2, 75)

This easily gives the option to create a moving mean with different starting points (that day or the previous day) using the lag term created in the same pipeline. It is also easy to use dplyr and zoo (here the rollsum function) to create a 30-day rolling sum of precipitation as an indicator of drought status.

df2 = df2 %>%
  mutate(precip.30 = rollsum(x = precip, 30, align = "right", fill = NA))
head(df2, 75)
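Since speed with larger datasets was my original reason for switching, here’s a rough timing sketch. It’s hypothetical (it assumes the microbenchmark package is installed, and exact timings will vary with your data and machine), but it compares slide to dplyr’s lag on a bigger made-up dataset:

library(microbenchmark)

# a larger made-up dataset, already ordered by group and day
big <- data.frame(site = rep(1:100, each = 10000),
                  day = rep(1:10000, times = 100),
                  temp = rpois(1e6, 5))

# time a simple 1-day lag done both ways
microbenchmark(
  DataCombine = slide(big, Var = "temp", GroupVar = "site", slideBy = -1),
  dplyr = big %>% group_by(site) %>% mutate(temp.lag1 = lag(temp, n = 1)),
  times = 5
)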

The nice thing about dplyr and magrittr (where it gets the %>% operator) is that all of this can be easily combined into one pipeline, as such (this version creates the lag, a 2-day moving mean of the lagged temperature, and the 30-day precipitation sum, matching the output below):

df2 = df %>%
  group_by(site, year) %>%
  arrange(site, year, day) %>%
  mutate(temp.lag1 = lag(temp, n = 1),
         temp.2 = rollapply(data = temp.lag1, 
                            width = 2, 
                            FUN = mean, 
                            align = "right", 
                            fill = NA, 
                            na.rm = T),
         precip.30 = rollsum(x = precip, 30, align = "right", fill = NA))
head(df2, 75)

Output

Source: local data frame [75 x 8]
Groups: site, year

   site year day temp     precip temp.lag1 temp.2 precip.30
1     1 2000   1    7   39.41016        NA     NA        NA
2     1 2000   2    9    0.00000         7    7.0        NA
3     1 2000   3    6  132.17521         9    8.0        NA
4     1 2000   4    4    0.00000         6    7.5        NA
5     1 2000   5    5  619.73564         4    5.0        NA
6     1 2000   6    7    0.00000         5    4.5        NA
7     1 2000   7    2  122.49375         7    6.0        NA
8     1 2000   8    3    0.00000         2    4.5        NA
9     1 2000   9    6    0.00000         3    2.5        NA
10    1 2000  10    5  365.61625         6    4.5        NA
11    1 2000  11    6    0.00000         5    5.5        NA
12    1 2000  12    5    0.00000         6    5.5        NA
13    1 2000  13    7    0.00000         5    5.5        NA
14    1 2000  14    4    0.00000         7    6.0        NA
15    1 2000  15    8    0.00000         4    5.5        NA
16    1 2000  16    7    0.00000         8    6.0        NA
17    1 2000  17    5  130.09170         7    7.5        NA
18    1 2000  18    6    0.00000         5    6.0        NA
19    1 2000  19    7  305.02369         6    5.5        NA
20    1 2000  20    4    0.00000         7    6.5        NA
21    1 2000  21    7    0.00000         4    5.5        NA
22    1 2000  22    4  459.30043         7    5.5        NA
23    1 2000  23    3  117.71606         4    5.5        NA
24    1 2000  24    5  328.37299         3    3.5        NA
25    1 2000  25    3    0.00000         5    4.0        NA
26    1 2000  26    6    0.00000         3    4.0        NA
27    1 2000  27    3 1279.23160         6    4.5        NA
28    1 2000  28    4    0.00000         3    4.5        NA
29    1 2000  29    1   88.35674         4    3.5        NA
30    1 2000  30    5    0.00000         1    2.5  3987.524
31    1 2000  31    4    0.00000         5    3.0  3948.114
32    1 2000  32    4    0.00000         4    4.5  3948.114
33    1 2000  33    3    0.00000         4    4.0  3815.939
34    1 2000  34    3    0.00000         3    3.5  3815.939
35    1 2000  35    6   63.27625         3    3.0  3259.479
36    1 2000  36    6  119.32251         6    4.5  3378.802
37    1 2000  37    9    0.00000         6    6.0  3256.308
38    1 2000  38    4 3047.85230         9    7.5  6304.161
39    1 2000  39    5    0.00000         4    6.5  6304.161
40    1 2000  40    5    0.00000         5    4.5  5938.544
41    1 2000  41    4    0.00000         5    5.0  5938.544
42    1 2000  42    7  511.26682         4    4.5  6449.811
43    1 2000  43    7    0.00000         7    5.5  6449.811
44    1 2000  44    5   48.02796         7    7.0  6497.839
45    1 2000  45    4    0.00000         5    6.0  6497.839
46    1 2000  46    1    0.00000         4    4.5  6497.839
47    1 2000  47    6    0.00000         1    2.5  6367.747
48    1 2000  48    4   56.90025         6    3.5  6424.648
49    1 2000  49    6    0.00000         4    5.0  6119.624
50    1 2000  50    7    0.00000         6    5.0  6119.624
51    1 2001   1    6    0.00000        NA     NA        NA
52    1 2001   2    6    0.00000         6    6.0        NA
53    1 2001   3    7  302.76758         6    6.0        NA
54    1 2001   4    3   52.95309         7    6.5        NA
55    1 2001   5    0  261.93515         3    5.0        NA
56    1 2001   6    6    0.00000         0    1.5        NA
57    1 2001   7    8    0.00000         6    3.0        NA
58    1 2001   8    5    0.00000         8    7.0        NA
59    1 2001   9    5   58.45233         5    6.5        NA
60    1 2001  10    4  267.24615         5    5.0        NA
61    1 2001  11    6  128.48269         4    4.5        NA
62    1 2001  12    4  330.67765         6    5.0        NA
63    1 2001  13    7  966.20692         4    5.0        NA
64    1 2001  14    8    0.00000         7    5.5        NA
65    1 2001  15    4    0.00000         8    7.5        NA
66    1 2001  16    4    0.00000         4    6.0        NA
67    1 2001  17    5    0.00000         4    4.0        NA
68    1 2001  18    3   86.79884         5    4.5        NA
69    1 2001  19    3  797.27860         3    4.0        NA
70    1 2001  20    6    0.00000         3    3.0        NA
71    1 2001  21    2    0.00000         6    4.5        NA
72    1 2001  22    3    0.00000         2    4.0        NA
73    1 2001  23    7    0.00000         3    2.5        NA
74    1 2001  24    4    0.00000         7    5.0        NA
75    1 2001  25    6    0.00000         4    5.5        NA

dplyr is great…but

I have been loving Hadley Wickham’s new dplyr package for R. It creates a relatively small number of verbs to quickly and easily manipulate data. It is generally as fast as data.table, but unless you’re already very familiar with data.table, the syntax is much easier. There are a number of great introductions and tutorials, which I won’t attempt to recreate, but here are some links:

http://rpubs.com/justmarkham/dplyr-tutorial

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

http://blog.rstudio.org/2014/01/17/introducing-dplyr/

http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/

The only challenge with going to dplyr has been dealing with the non-standard evaluation (see vignette("nse")) and the new tbl_df class that gets created. In theory, the dataframes created by dplyr should act both as data.frame objects and as the new tbl_df objects. There are some nice new features associated with tbl_df objects but also some idiosyncrasies that can break existing code.

For example, I had a function:

# standardize the covariates in x using the means and SDs from y
stdCovs <- function (x, y, var.names) {
    for (i in 1:length(var.names)) {
        x[, var.names[i]] <- (x[, var.names[i]] - mean(y[, var.names[i]], na.rm = T)) / 
            sd(y[, var.names[i]], na.rm = T)
    }
    return(x)
}

that used the means and standard deviations of the covariates in one dataframe (y) to standardize the same covariates in another dataframe (x). This was particularly useful for making predictions for new data using coefficients from a different dataset that had been standardized (y.standard). The code above worked when I was using data.frame objects. However, now that I have started using dplyr for a variety of reasons, the code no longer works. I believe this is due to how tbl_df objects handle single-bracket subsetting (e.g. y[ , var]). There is an issue on GitHub related to this.

In the end, it turned out that double bracketing to extract the column works: rather than the y[ , var] formulation, it should be y[[var]]. Not too hard, but the type of thing that wastes an hour when moving to a new package. Despite this and similar challenges, I highly recommend dplyr. Hopefully the simplicity of the verbs and the speed of computation will make up for the few hours spent learning and adjusting code.
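To see the difference, here’s a minimal sketch (using a made-up one-column dataframe) of how the two subsetting forms behave:

library(dplyr)

d.plain <- data.frame(a = 1:3)
d.tbl <- tbl_df(d.plain)  # convert to dplyr's tbl_df class

d.plain[, "a"]       # a plain data.frame drops to a numeric vector: 1 2 3
d.tbl[, "a"]         # a tbl_df stays a one-column tbl_df, not a vector
mean(d.tbl[, "a"])   # NA with a warning, because mean() expects a vector
mean(d.tbl[["a"]])   # 2, as expected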

Here’s my final code:

stdCovs <- function(x, y, var.names){
  for(i in 1:length(var.names)){
    x[ , var.names[i]] <- (x[ , var.names[i]] - mean(y[[var.names[i]]], na.rm = T)) / 
        sd(y[[var.names[i]]], na.rm = T)
  }
  return(x)
}
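As a quick, hypothetical usage example (the dataframes and covariate names here are made up), this standardizes new data on the scale of the original fitting data:

library(dplyr)

# made-up fitting data and new data for prediction
fit.data <- tbl_df(data.frame(elev = rnorm(100, 500, 100),
                              slope = runif(100, 0, 30)))
new.data <- tbl_df(data.frame(elev = rnorm(20, 520, 90),
                              slope = runif(20, 0, 25)))

# put the new data on the fitting data's scale
new.std <- stdCovs(x = new.data, y = fit.data, var.names = c("elev", "slope"))
head(new.std)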

One of the best features of dplyr is its use of magrittr for piping functions together with the %>% operator. It’s not shown above, but it’s worth learning. Previously, you would either create a bunch of useless intermediate objects or a long string of nested functions that were hard (impossible at times) to decipher. The “then” operator (%>%) does away with this, as shown here by Revolution Analytics:

hourly_delay <- filter( 
  summarise(
    group_by( 
      filter(
        flights, 
        !is.na(dep_delay)
      ), 
      date, hour
    ), 
    delay = mean(dep_delay), 
    n = n()
  ), 
  n > 10 
)

Here’s the same code, but rather than nesting one function call inside the next, data is passed from one function to the next using the %>% operator:

hourly_delay <- flights %>% 
 filter(!is.na(dep_delay)) %>% 
 group_by(date, hour) %>% 
 summarise( 
   delay = mean(dep_delay), 
   n = n() ) %>% 
 filter(n > 10)

Beautiful, right? Pipes are fantastic, and combined with dplyr’s simple verbs they make code a lot easier to write and read once you get the hang of it.

Blue Jay and Scrub Jay: Using rvertnet to check the distributions in R

Daniel Hocking:

This is a great post about a great package. Unlike some other broad vertebrate databases, VertNet has good QA/QC. It’s great being able to access it directly in R, where the data manipulation, analysis, and even plotting will occur. Thank you to the rOpenSci team for developing the rvertnet package and to Vijay Barve for the clear, simple tutorial.

Originally posted on Vijay Barve:

As part of my Google Summer of Code, I am also working on another package for R called rvertnet. This package is an R wrapper for the VertNet websites. VertNet is a distributed database network for vertebrates consisting of FishNet2, MaNIS, HerpNET, and ORNIS. Of those, currently FishNet2, HerpNET, and ORNIS have v2 portals serving data. rvertnet now has functions to access these data and import them into R data frames.

Some of my lab mates had difficulty downloading data for Scrub Jay (Aphelocoma spp.) due to the large number of records (220k+) on ORNIS, so I decided to try the rvertnet package, which was still in development at that time. The package really helped and got the results quickly. While exploring those data, I came up with this case study.

So here to get data for Blue Jay (Cyanocitta cristata) which is distributed in…


Teaching Scientific Computing: Peer Review

This post is going out on a bit of a limb because I am not familiar with the pedagogical literature on teaching scientific computing. As such, I can only speak from my very limited experience. I’ve taken a couple of short courses on scientific computing, but the only formal full-semester course I’ve taken was Introduction to C Programming for Engineers, 15 years ago. In that course, the instructor spent 50 minutes, 3 days a week, writing code on the chalkboard in front of us, and we were expected to learn. Homework was to write increasingly large programs throughout the semester. If they didn’t work we got a 0%, if they produced the wrong output we got a 50%, and if they worked properly we got a 100%. Obviously it was a terrible course (although a number of my statistics courses that involved programming were not very different, so this might be more common than I’d like to believe).

Beyond those conspicuous instructional problems, I’ve been thinking that scientific programming courses could learn from pedagogy in the humanities. The University of New Hampshire requires undergraduates to take a number of writing intensive courses. To qualify as writing intensive, a course must meet 3 criteria:

  1. Students in the course should do substantial writing that enhances learning and demonstrates knowledge of the subject or the discipline. Writing should be an integral part of the course and should account for a significant part (approximately 50% or more) of the final grade.
  2. Writing should be assigned in such a manner as to require students to write regularly throughout the course. Major assignments should integrate the process of writing (prewriting, drafting, revision, editing). Students should be able to receive constructive feedback of some kind (peer response, workshop, professor, TA, etc.) during the drafting/revising process to help improve their writing.
  3. The course should include both formal (graded) and informal (heuristic) writing. There should be papers written outside of class which are handed in for formal evaluation, as well as informal assignments designed to promote learning, such as invention activities, in-class essays, reaction papers, journals, reading summaries, or other appropriate exercises.

I think these criteria could be applied, or at least adapted, to scientific computing courses. The 1st one is easy. The 2nd and 3rd are where I think computing courses could really improve. From what I’ve seen, there is often not a lot of time spent on informal feedback from instructors and peers to help with revision. In programming, especially with flexible languages like R, there are often many solutions to the same problem. Useful assignments could be to critique the programs of peers, find ways to improve code efficiency, and provide alternative solutions to sections of code. This could include critiques of the commenting and README files.

In introductory courses there is often an emphasis on covering content. Some people will balk at the idea of spending time learning alternatives to simple options when there is clearly 1 best solution and so much material to cover just to get students writing even simple scripts. However, in my opinion it’s better to learn a few things well than many things superficially. By evaluating, revising, and developing alternatives to code written by peers, students will learn to program better. There is a reason that informal assessment, peer review, and revision are required parts of writing intensive courses. The same reasons apply to scientific computing courses. Just as review and revision make us better writers, they will make us better programmers.

RStudio Presentations

After finally getting around to posting about knitr and Markdown just yesterday, RStudio comes out with an update that makes it even easier: http://www.rstudio.com/ide/docs/presentations/overview

P.S. I Love RStudio
