I’ve been a big fan of dplyr from the start. While learning any new package can be frustrating, overall dplyr is fast and consistent. The Nonstandard Evaluation (see vignette(“nse”) for details) is both a blessing and a curse, overall dply is making life with big(ish) data much easier. The rapid development makes it a challenge to keep up with new functions and capabilities, but it also makes dplyr more useful every day.
My favorite use of the week has been to create lagged terms and moving means. For example, I often need new variables like yesterday’s air temperature or the average temperature over the past 5 days or even the total amount of precipitation in the preceding 30 days. I had been using the slide function within the DataCombine package, but it was somewhat slow when dealing ~10 million rows of data. Also, the slide function was fine for creating a lag but it didn’t create a moving mean or moving sum. Here’s an example using the slide function
library(dplyr)
library(zoo)
library(DataCombine)
df = expand.grid(site = factor(seq(10)),
year = 2000:2004,
day = 1:50)
# use Poisson to make math easy to check moving means of temperature
df$temp = rpois(dim(df)[1], 5)
# Assume rains 33% of the days and averages 5 mm each time but highly variable
df$precip = rbinom(dim(df)[1], 1, 1/3) * rlnorm(dim(df)[1], log(5), 1)
# Order by group and date
df = df[order(df$site, df$year, df$day), ]
df.slide = slide(df, Var = "temp", GroupVar = c("site", "year"), slideBy = -1, NewVar='temp.lag1')
head(df.slide, 75)
The slide function does give the nice reminder to
Remember to order test by site and the time variable before running.
However, I do like the consistency and speed of dplyr since I’m using it for so much. It’s also easy to do multiple things in one pipe series, such as creating a lag and a moving mean. Here’s how I’m doing it in dplyr:
# moving mean for that day and previous days (e.g. 5 represents the mean of that day and the for previous days) df2 = df %>% group_by(site, year) %>% arrange(site, year, day) %>% mutate(temp.5 = rollmean(x = temp, 5, align = "right", fill = NA)) head(df2, 75) # moving mean for the previous days not including the current day (e.g. 5 represents the mean of the 5 previous days) df2 = df2 %>% mutate(temp.lag1 = lag(temp, n = 1)) %>% mutate(temp.5.previous = rollapply(data = temp.lag1, width = 5, FUN = mean, align = "right", fill = NA, na.rm = T)) head(df2, 75)
This easily gives the option to create a moving mean based on different starting points (that day or the previous day) using the lag term that’s created in the same pipeline. It is also easy to use dplyr and zoo (for the rollapply function) to create a 30-day rolling sum of precipitation as an indication of drought status.
df2 = df2 %>% mutate(precip.30 = rollsum(x = precip, 30, align = "right", fill = NA)) head(df2, 75) /
The nice thing about dplyr and magittr (where it gets the %>% function) is that all of this can be easily combined into one pipeline, as such
df2 = df %>% group_by(site, year) %>% arrange(site, year, day) %>% mutate(temp.5 = rollsum(x = temp, 5, align = "right", fill = NA), temp.5.previous = rollapply(data = temp.lag1, width = 2, FUN = mean, align = "right", fill = NA, na.rm = T) precip.30 = rollsum(x = precip, 30, align = "right", fill = NA)) head(df2, 75)
Output
Source: local data frame [75 x 8] Groups: site, year site year day temp precip temp.lag1 temp.2 precip.30 1 1 2000 1 7 39.41016 NA NA NA 2 1 2000 2 9 0.00000 7 7.0 NA 3 1 2000 3 6 132.17521 9 8.0 NA 4 1 2000 4 4 0.00000 6 7.5 NA 5 1 2000 5 5 619.73564 4 5.0 NA 6 1 2000 6 7 0.00000 5 4.5 NA 7 1 2000 7 2 122.49375 7 6.0 NA 8 1 2000 8 3 0.00000 2 4.5 NA 9 1 2000 9 6 0.00000 3 2.5 NA 10 1 2000 10 5 365.61625 6 4.5 NA 11 1 2000 11 6 0.00000 5 5.5 NA 12 1 2000 12 5 0.00000 6 5.5 NA 13 1 2000 13 7 0.00000 5 5.5 NA 14 1 2000 14 4 0.00000 7 6.0 NA 15 1 2000 15 8 0.00000 4 5.5 NA 16 1 2000 16 7 0.00000 8 6.0 NA 17 1 2000 17 5 130.09170 7 7.5 NA 18 1 2000 18 6 0.00000 5 6.0 NA 19 1 2000 19 7 305.02369 6 5.5 NA 20 1 2000 20 4 0.00000 7 6.5 NA 21 1 2000 21 7 0.00000 4 5.5 NA 22 1 2000 22 4 459.30043 7 5.5 NA 23 1 2000 23 3 117.71606 4 5.5 NA 24 1 2000 24 5 328.37299 3 3.5 NA 25 1 2000 25 3 0.00000 5 4.0 NA 26 1 2000 26 6 0.00000 3 4.0 NA 27 1 2000 27 3 1279.23160 6 4.5 NA 28 1 2000 28 4 0.00000 3 4.5 NA 29 1 2000 29 1 88.35674 4 3.5 NA 30 1 2000 30 5 0.00000 1 2.5 3987.524 31 1 2000 31 4 0.00000 5 3.0 3948.114 32 1 2000 32 4 0.00000 4 4.5 3948.114 33 1 2000 33 3 0.00000 4 4.0 3815.939 34 1 2000 34 3 0.00000 3 3.5 3815.939 35 1 2000 35 6 63.27625 3 3.0 3259.479 36 1 2000 36 6 119.32251 6 4.5 3378.802 37 1 2000 37 9 0.00000 6 6.0 3256.308 38 1 2000 38 4 3047.85230 9 7.5 6304.161 39 1 2000 39 5 0.00000 4 6.5 6304.161 40 1 2000 40 5 0.00000 5 4.5 5938.544 41 1 2000 41 4 0.00000 5 5.0 5938.544 42 1 2000 42 7 511.26682 4 4.5 6449.811 43 1 2000 43 7 0.00000 7 5.5 6449.811 44 1 2000 44 5 48.02796 7 7.0 6497.839 45 1 2000 45 4 0.00000 5 6.0 6497.839 46 1 2000 46 1 0.00000 4 4.5 6497.839 47 1 2000 47 6 0.00000 1 2.5 6367.747 48 1 2000 48 4 56.90025 6 3.5 6424.648 49 1 2000 49 6 0.00000 4 5.0 6119.624 50 1 2000 50 7 0.00000 6 5.0 6119.624 51 1 2001 1 6 0.00000 NA NA NA 52 1 2001 2 6 0.00000 6 6.0 NA 53 1 2001 3 7 302.76758 6 6.0 NA 54 1 2001 4 3 52.95309 7 6.5 NA 55 1 2001 5 0 261.93515 3 5.0 NA 56 1 2001 6 6 0.00000 0 1.5 NA 57 1 2001 7 8 0.00000 6 3.0 NA 58 1 2001 8 5 0.00000 8 7.0 NA 59 1 2001 9 5 58.45233 5 6.5 NA 60 1 2001 10 4 267.24615 5 5.0 NA 61 1 2001 11 6 128.48269 4 4.5 NA 62 1 2001 12 4 330.67765 6 5.0 NA 63 1 2001 13 7 966.20692 4 5.0 NA 64 1 2001 14 8 0.00000 7 5.5 NA 65 1 2001 15 4 0.00000 8 7.5 NA 66 1 2001 16 4 0.00000 4 6.0 NA 67 1 2001 17 5 0.00000 4 4.0 NA 68 1 2001 18 3 86.79884 5 4.5 NA 69 1 2001 19 3 797.27860 3 4.0 NA 70 1 2001 20 6 0.00000 3 3.0 NA 71 1 2001 21 2 0.00000 6 4.5 NA 72 1 2001 22 3 0.00000 2 4.0 NA 73 1 2001 23 7 0.00000 3 2.5 NA 74 1 2001 24 4 0.00000 7 5.0 NA 75 1 2001 25 6 0.00000 4 5.5 NA
There’s a %>% missing between arrange(site, year, day) and mutate(temp.5 = rollmean(x = temp….
Thanks for catching that. It’s fixed now.
Am I correct to believe that all methods shown require 1 value per time period (eg day) and that no time periods be missing? If so, it does not help me.
Look how the problem was solved by Mike Gahan if dates are irregularly spaced.
http://r.789695.n4.nabble.com/I-have-been-agnozing-over-how-to-do-a-running-cummulative-sum-over-a-particular-date-range-td4693953.html
Also addressed here
http://stackoverflow.com/questions/24397299/rolling-sum-by-another-variable-in-r/24400600#24400600
I am worried when rolling or rollapply or any function relies processing a blocks of data based on the last x number of lines. I believe lag functions work that way. My problem is that is makes no allowance for the imperfect world in which we live.
Thanks for the links. I’ll have to think about a dplyr way to do it rather than using data.table for consistency. It might take more code. I hadn’t really thought about irregularly spaced data because in the cases I’m using it for I have decades of daily weather data from many thousands of sites. If a day is missing it still gets a row, just with an NA.
Pingback: Year in Review: 2014 | Daniel J. Hocking