Lags and Moving Means in dplyr


I’ve been a big fan of dplyr from the start. While learning any new package can be frustrating, overall dplyr is fast and consistent. The Nonstandard Evaluation (see vignette(“nse”) for details) is both a blessing and a curse, overall dply is making life with big(ish) data much easier. The rapid development makes it a challenge to keep up with new functions and capabilities, but it also makes dplyr more useful every day.

My favorite use of the week has been to create lagged terms and moving means. For example, I often need new variables like yesterday’s air temperature or the average temperature over the past 5 days or even the total amount of precipitation in the preceding 30 days. I had been using the slide function within the DataCombine package, but it was somewhat slow when dealing ~10 million rows of data. Also, the slide function was fine for creating a lag but it didn’t create a moving mean or moving sum. Here’s an example using the slide function


library(dplyr)
library(zoo)
library(DataCombine)

df = expand.grid(site = factor(seq(10)),
                    year = 2000:2004,
                    day = 1:50)
# use Poisson to make math easy to check moving means of temperature
df$temp = rpois(dim(df)[1], 5) 
# Assume rains 33% of the days and averages 5 mm each time but highly variable
df$precip = rbinom(dim(df)[1], 1, 1/3) * rlnorm(dim(df)[1], log(5), 1)

# Order by group and date
df = df[order(df$site, df$year, df$day), ]
df.slide = slide(df, Var = "temp", GroupVar = c("site", "year"), slideBy = -1, NewVar='temp.lag1')
head(df.slide, 75)
 

The slide function does give the nice reminder to

Remember to order test by site and the time variable before running.

However, I do like the consistency and speed of dplyr since I’m using it for so much. It’s also easy to do multiple things in one pipe series, such as creating a lag and a moving mean. Here’s how I’m doing it in dplyr:

# moving mean for that day and previous days (e.g. 5 represents the mean of that day and the for previous days)
df2 = df %>%
  group_by(site, year) %>%
  arrange(site, year, day) %>%
  mutate(temp.5 = rollmean(x = temp, 5, align = "right", fill = NA))
head(df2, 75)

# moving mean for the previous days not including the current day (e.g. 5 represents the mean of the 5 previous days)
df2 = df2 %>%
  mutate(temp.lag1 = lag(temp, n = 1)) %>%
  mutate(temp.5.previous = rollapply(data = temp.lag1, 
                            width = 5, 
                            FUN = mean, 
                            align = "right", 
                            fill = NA, 
                            na.rm = T))
head(df2, 75)

This easily gives the option to create a moving mean based on different starting points (that day or the previous day) using the lag term that’s created in the same pipeline. It is also easy to use dplyr and zoo (for the rollapply function) to create a 30-day rolling sum of precipitation as an indication of drought status.

df2 = df2 %>%
mutate(precip.30 = rollsum(x = precip, 30, align = "right", fill = NA))
head(df2, 75)
/

The nice thing about dplyr and magittr (where it gets the %>% function) is that all of this can be easily combined into one pipeline, as such

df2 = df %>%
  group_by(site, year) %>%
  arrange(site, year, day) %>%
  mutate(temp.5 = rollsum(x = temp, 5, align = "right", fill = NA),
         temp.5.previous = rollapply(data = temp.lag1, 
                                     width = 2, 
                                     FUN = mean, 
                                     align = "right", 
                                     fill = NA, 
                                     na.rm = T)
         precip.30 = rollsum(x = precip, 30, align = "right", fill = NA))
head(df2, 75)

Output

Source: local data frame [75 x 8]
Groups: site, year

   site year day temp     precip temp.lag1 temp.2 precip.30
1     1 2000   1    7   39.41016        NA     NA        NA
2     1 2000   2    9    0.00000         7    7.0        NA
3     1 2000   3    6  132.17521         9    8.0        NA
4     1 2000   4    4    0.00000         6    7.5        NA
5     1 2000   5    5  619.73564         4    5.0        NA
6     1 2000   6    7    0.00000         5    4.5        NA
7     1 2000   7    2  122.49375         7    6.0        NA
8     1 2000   8    3    0.00000         2    4.5        NA
9     1 2000   9    6    0.00000         3    2.5        NA
10    1 2000  10    5  365.61625         6    4.5        NA
11    1 2000  11    6    0.00000         5    5.5        NA
12    1 2000  12    5    0.00000         6    5.5        NA
13    1 2000  13    7    0.00000         5    5.5        NA
14    1 2000  14    4    0.00000         7    6.0        NA
15    1 2000  15    8    0.00000         4    5.5        NA
16    1 2000  16    7    0.00000         8    6.0        NA
17    1 2000  17    5  130.09170         7    7.5        NA
18    1 2000  18    6    0.00000         5    6.0        NA
19    1 2000  19    7  305.02369         6    5.5        NA
20    1 2000  20    4    0.00000         7    6.5        NA
21    1 2000  21    7    0.00000         4    5.5        NA
22    1 2000  22    4  459.30043         7    5.5        NA
23    1 2000  23    3  117.71606         4    5.5        NA
24    1 2000  24    5  328.37299         3    3.5        NA
25    1 2000  25    3    0.00000         5    4.0        NA
26    1 2000  26    6    0.00000         3    4.0        NA
27    1 2000  27    3 1279.23160         6    4.5        NA
28    1 2000  28    4    0.00000         3    4.5        NA
29    1 2000  29    1   88.35674         4    3.5        NA
30    1 2000  30    5    0.00000         1    2.5  3987.524
31    1 2000  31    4    0.00000         5    3.0  3948.114
32    1 2000  32    4    0.00000         4    4.5  3948.114
33    1 2000  33    3    0.00000         4    4.0  3815.939
34    1 2000  34    3    0.00000         3    3.5  3815.939
35    1 2000  35    6   63.27625         3    3.0  3259.479
36    1 2000  36    6  119.32251         6    4.5  3378.802
37    1 2000  37    9    0.00000         6    6.0  3256.308
38    1 2000  38    4 3047.85230         9    7.5  6304.161
39    1 2000  39    5    0.00000         4    6.5  6304.161
40    1 2000  40    5    0.00000         5    4.5  5938.544
41    1 2000  41    4    0.00000         5    5.0  5938.544
42    1 2000  42    7  511.26682         4    4.5  6449.811
43    1 2000  43    7    0.00000         7    5.5  6449.811
44    1 2000  44    5   48.02796         7    7.0  6497.839
45    1 2000  45    4    0.00000         5    6.0  6497.839
46    1 2000  46    1    0.00000         4    4.5  6497.839
47    1 2000  47    6    0.00000         1    2.5  6367.747
48    1 2000  48    4   56.90025         6    3.5  6424.648
49    1 2000  49    6    0.00000         4    5.0  6119.624
50    1 2000  50    7    0.00000         6    5.0  6119.624
51    1 2001   1    6    0.00000        NA     NA        NA
52    1 2001   2    6    0.00000         6    6.0        NA
53    1 2001   3    7  302.76758         6    6.0        NA
54    1 2001   4    3   52.95309         7    6.5        NA
55    1 2001   5    0  261.93515         3    5.0        NA
56    1 2001   6    6    0.00000         0    1.5        NA
57    1 2001   7    8    0.00000         6    3.0        NA
58    1 2001   8    5    0.00000         8    7.0        NA
59    1 2001   9    5   58.45233         5    6.5        NA
60    1 2001  10    4  267.24615         5    5.0        NA
61    1 2001  11    6  128.48269         4    4.5        NA
62    1 2001  12    4  330.67765         6    5.0        NA
63    1 2001  13    7  966.20692         4    5.0        NA
64    1 2001  14    8    0.00000         7    5.5        NA
65    1 2001  15    4    0.00000         8    7.5        NA
66    1 2001  16    4    0.00000         4    6.0        NA
67    1 2001  17    5    0.00000         4    4.0        NA
68    1 2001  18    3   86.79884         5    4.5        NA
69    1 2001  19    3  797.27860         3    4.0        NA
70    1 2001  20    6    0.00000         3    3.0        NA
71    1 2001  21    2    0.00000         6    4.5        NA
72    1 2001  22    3    0.00000         2    4.0        NA
73    1 2001  23    7    0.00000         3    2.5        NA
74    1 2001  24    4    0.00000         7    5.0        NA
75    1 2001  25    6    0.00000         4    5.5        NA
Advertisements

5 thoughts on “Lags and Moving Means in dplyr

  1. Am I correct to believe that all methods shown require 1 value per time period (eg day) and that no time periods be missing? If so, it does not help me.

    Look how the problem was solved by Mike Gahan if dates are irregularly spaced.
    http://r.789695.n4.nabble.com/I-have-been-agnozing-over-how-to-do-a-running-cummulative-sum-over-a-particular-date-range-td4693953.html

    Also addressed here
    http://stackoverflow.com/questions/24397299/rolling-sum-by-another-variable-in-r/24400600#24400600

    I am worried when rolling or rollapply or any function relies processing a blocks of data based on the last x number of lines. I believe lag functions work that way. My problem is that is makes no allowance for the imperfect world in which we live.

    • Thanks for the links. I’ll have to think about a dplyr way to do it rather than using data.table for consistency. It might take more code. I hadn’t really thought about irregularly spaced data because in the cases I’m using it for I have decades of daily weather data from many thousands of sites. If a day is missing it still gets a row, just with an NA.

  2. Pingback: Year in Review: 2014 | Daniel J. Hocking

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s