dplyr is great…but

I have been loving Hadley Wickham’s new dplyr package for R. It creates a relatively small number of verbs to quickly and easily manipulate data. It is generally as fast as data.table but unless you’re already very familiar with data.table the syntax is much easier. There are a number of great introductions and tutorials, which I won’t attempt to recreate but here are some links:





The only challenge with going to dplyr has been dealing with the non-standard evaluation (see vignette(“nse”) and the new tbl_df class that gets created. In theory the resulting dataframes created from dplyr should act both as a data.frame object and as a new tbl_df object. There are some nice new features associated with tbl_df objects but also some idiosyncrasies that can break existing code.

For example, I had a function:

stdCovs <- function (x, y, var.names) {
    for (i in 1:length(var.names)) {
        x[, var.names[i]] <- (x[, var.names[i]] - mean(y[, var.names[i]], na.rm = T)) / 
            sd(y[, var.names[i]], na.rm = T)

that took the standardized covariates from one dataframe and applied them to standardize another dataframe for a particular set of covariates. This was particularly useful for doing predictions for new data using coefficients from a different dataset that had been standardized (y.standard). The code above worked when I was using data.frame objects. However, I have now started using dplyr for a variety of reasons but the code no longer works. I believe this is due to how tbl_df objects handle bracketing (e.g. y[ , var]). There is an issue on GitHub related to this.

In the end it turned out that double bracketing to extract the column works: Rather than the y[ , var] formulation it should be y[[var]]. Not too hard but the type of thing that wastes an hour when going to a new package. Despite this challenge and similar things, I highly recommend dplyr. Now hopefully the simplicity of the verbs and speed of computation can make up for the few hours spent learning and adjusting code.

Here’s my final code:

stdCovs <- function(x, y, var.names){
  for(i in 1:length(var.names)){
    x[ , var.names[i]] <- (x[ , var.names[i]] - mean(y[[var.names[i]]], na.rm = T)) / 
        sd(y[[var.names[i]]], na.rm = T)

One of the best features of dplyr is it’s use of magrittr for piping functions together with the %>% symbol. It’s not shown above but it’s worth learning. Previously, you would either have a bunch of useless intermediate objects created or a long string of nested functions that were hard (impossible at times) to decipher. The “then” command (%>%) does away with this as shown here by Revolution Analytics:

hourly_delay <- filter( 
      date, hour
    delay = mean(dep_delay), 
    n = n()
  n > 10 

Here’s the same code, but rather than nesting one function call inside the next, data is passed from one function to the next using the %>% operator:

hourly_delay <- flights %>% 
 filter(!is.na(dep_delay)) %>% 
 group_by(date, hour) %>% 
   delay = mean(dep_delay), 
   n = n() ) %>% 
 filter(n > 10)

Beautiful, right. Pipes are fantastic and combined with dplyr’s simple verbs make coding a lot easier and more readable once you get the hang of it.


One thought on “dplyr is great…but

  1. I am not a big fan of pipes – I don’t feel that they make code more readable.

    data.table is great, once you managed the steep learning curve.

    With data.table it would be something like (untested):

    list(delay = mean(dep_delay), n = length(dep_delay)),
    by = list(date, hour)][n>10]

    Maybe you notice that piping comes for free (e.g. the last filter, n > 10).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s