multiple lags with tidy evaluation

bangbang tidyeval dplyr purrr rlang

multiple lags

This came up on twitter during rstudio::conf as a question from Simon Jackson. That’s a nice kind of question because it comes with example code, so I had a go at it, but I had not taken the time to promote it to a blog post. Let’s fix that.

Simon’s code uses mutate_at together with the lazyeval-style funs_ to build functions from strings.

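library(dplyr)  # mutate, lag, data_frame
library(purrr)  # map, set_names
library(rlang)  # quo, enquo, quo_text

d <- data_frame(x = seq_len(100))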
lags <- seq(10)
lag_names <- paste("lag", formatC(lags, width = nchar(max(lags)), flag = "0"), 
  sep = "_")
lag_functions <- setNames(paste("dplyr::lag(., ", lags, ")"), lag_names)
lag_names
##  [1] "lag_01" "lag_02" "lag_03" "lag_04" "lag_05" "lag_06" "lag_07"
##  [8] "lag_08" "lag_09" "lag_10"
lag_functions
##                lag_01                lag_02                lag_03 
##  "dplyr::lag(.,  1 )"  "dplyr::lag(.,  2 )"  "dplyr::lag(.,  3 )" 
##                lag_04                lag_05                lag_06 
##  "dplyr::lag(.,  4 )"  "dplyr::lag(.,  5 )"  "dplyr::lag(.,  6 )" 
##                lag_07                lag_08                lag_09 
##  "dplyr::lag(.,  7 )"  "dplyr::lag(.,  8 )"  "dplyr::lag(.,  9 )" 
##                lag_10 
## "dplyr::lag(.,  10 )"

This is perfectly valid, but it has an eval/parse feel, and funs_ is not in line with current tidy evaluation.

jetlag

My first take was to create a function, called jetlag, that would do the whole operation using tidy evaluation. The name probably reflected my physical state at the time, as it turns out that flying from Paris to San Diego with a 6-hour connection in New York gives you jetlag.

jetlag <- function(data, variable, n=10){
  variable <- enquo(variable)
  
  indices <- seq_len(n)
  quosures <- map( indices, ~quo(lag(!!variable, !!.x)) ) %>% 
    set_names(sprintf("lag_%02d", indices))
  
  mutate( data, !!!quosures )
  
}
jetlag(d, x, 3)
## # A tibble: 100 x 4
##        x lag_01 lag_02 lag_03
##    <int>  <int>  <int>  <int>
##  1     1     NA     NA     NA
##  2     2      1     NA     NA
##  3     3      2      1     NA
##  4     4      3      2      1
##  5     5      4      3      2
##  6     6      5      4      3
##  7     7      6      5      4
##  8     8      7      6      5
##  9     9      8      7      6
## 10    10      9      8      7
## # ... with 90 more rows

Let’s break it down in steps. The function takes 3 parameters:

  • data a data frame we want to process
  • variable a symbol that corresponds to a column name in data
  • n the number of lags to create

The first thing it does is to pass variable to enquo. This allows variable to be passed by expression rather than by value. This is why when we call jetlag we don’t have to pass x as a string, but just as a symbol.
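To see what enquo captures, here is a minimal sketch. The helper name show_captured is hypothetical and only exists for this illustration; as_label is the rlang function that turns a captured expression into a string.

```r
library(rlang)

# Hypothetical helper, for illustration only: it captures its argument
# with enquo() and returns the captured expression as a string,
# without ever evaluating it.
show_captured <- function(variable) {
  variable <- enquo(variable)
  as_label(variable)
}

# `x` does not need to exist here, because it is never evaluated
show_captured(x)
```

The call returns the string "x", which shows that the function received the expression the caller typed rather than a value.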

Next, the function makes a list of quosures by iterating with map over the seq_len(n) sequence.

# outside of a function, quo() stands in for enquo();
# here it captures the symbol `variable` itself
variable <- quo(variable)
indices <- seq_len(3)
quosures <- map( indices, ~quo(lag(!!variable, !!.x)) ) %>% 
  set_names(sprintf("lag_%02d", indices))

The first !! unquotes variable, i.e. replaces it by the expression it holds (x when the function is called as jetlag(d, x, 3)); the second !! injects the value of the placeholder .x.
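The two injections can be tried on a single quosure. This is a small sketch, not part of the original code: the object n is an assumption standing in for the placeholder .x, and quo(x) stands in for what enquo would capture in jetlag(d, x, 3).

```r
library(rlang)

variable <- quo(x)   # stands in for what enquo() captures in jetlag(d, x, 3)
n <- 2L              # stands in for the placeholder .x

# Both !! are resolved at the time the quosure is created
q <- quo(lag(!!variable, !!n))

# quo_squash() flattens the nested quosure so the result reads as plain code
quo_squash(q)
```

The squashed expression is lag(x, 2L): the symbol and the number have both been substituted before mutate ever sees the expression.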

Piping into set_names names the elements of the list; mutate then uses these names as column names.

quosures
## $lag_01
## <quosure>
##   expr: ^lag(^variable, 1L)
##   env:  0x7fbfb7001ca0
## 
## $lag_02
## <quosure>
##   expr: ^lag(^variable, 2L)
##   env:  0x7fbfb426b788
## 
## $lag_03
## <quosure>
##   expr: ^lag(^variable, 3L)
##   env:  0x7fbfb8420238
names(quosures)
## [1] "lag_01" "lag_02" "lag_03"

Finally, the !!! injects the expressions into mutate. In essence, what jetlag does is programmatically build something similar to this repetitive expression:

mutate( d, 
  lag_01 = lag(x, 1), 
  lag_02 = lag(x, 2), 
  lag_03 = lag(x, 3)
)
## # A tibble: 100 x 4
##        x lag_01 lag_02 lag_03
##    <int>  <int>  <int>  <int>
##  1     1     NA     NA     NA
##  2     2      1     NA     NA
##  3     3      2      1     NA
##  4     4      3      2      1
##  5     5      4      3      2
##  6     6      5      4      3
##  7     7      6      5      4
##  8     8      7      6      5
##  9     9      8      7      6
## 10    10      9      8      7
## # ... with 90 more rows

but hopefully with a nicer (or at least shorter) syntax:

jetlag(d, x, 3)
## # A tibble: 100 x 4
##        x lag_01 lag_02 lag_03
##    <int>  <int>  <int>  <int>
##  1     1     NA     NA     NA
##  2     2      1     NA     NA
##  3     3      2      1     NA
##  4     4      3      2      1
##  5     5      4      3      2
##  6     6      5      4      3
##  7     7      6      5      4
##  8     8      7      6      5
##  9     9      8      7      6
## 10    10      9      8      7
## # ... with 90 more rows
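To check that splicing with !!! and writing the arguments by hand give the same result, here is a minimal sketch on a small tibble. The names d_small, lag_exprs, spliced and by_hand are made up for this illustration.

```r
library(dplyr)

d_small <- tibble(x = 1:5)

# A named list of quosures, similar to what jetlag() builds internally
lag_exprs <- list(
  lag_01 = quo(lag(x, 1)),
  lag_02 = quo(lag(x, 2))
)

# !!! splices each element of the list in as a named argument of mutate()
spliced <- mutate(d_small, !!!lag_exprs)

# The handwritten equivalent
by_hand <- mutate(d_small, lag_01 = lag(x, 1), lag_02 = lag(x, 2))

all.equal(spliced, by_hand)
```

The names of the list become the names of the new columns, which is why the set_names step matters.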

lags

As is often the case, I immediately posted this on twitter.

Only to realise a few minutes later that this is not as reusable/composable as it could be. In fact, the interesting part of this function is the creation of the list of quosures, not so much the actual call to mutate. That’s an example where doing less makes something more useful.

So my second take was to make it a smaller function, called lags, that would just make the quosures. Naming functions is something I often struggle with, so when a name appears as obvious as just using the plural, I know I’m on the right track.

This was also an opportunity to capture the name of the variable the lags operate on, and use it as a component of the names of the columns to create.

lags <- function(var, n=10){
  var <- enquo(var)
  
  indices <- seq_len(n)
  map( indices, ~quo(lag(!!var, !!.x)) ) %>% 
    set_names(sprintf("lag_%s_%02d", quo_text(var), indices))
  
}

lags only needs a variable name and a number of lags; unlike the previous attempt, it does not need the data. This makes it easier to understand what is going on:

lags( xyz, 4 )
## $lag_xyz_01
## <quosure>
##   expr: ^lag(^xyz, 1L)
##   env:  0x7fbfb8c6ff10
## 
## $lag_xyz_02
## <quosure>
##   expr: ^lag(^xyz, 2L)
##   env:  0x7fbfb9878388
## 
## $lag_xyz_03
## <quosure>
##   expr: ^lag(^xyz, 3L)
##   env:  0x7fbfb9878190
## 
## $lag_xyz_04
## <quosure>
##   expr: ^lag(^xyz, 4L)
##   env:  0x7fbfb987aca0
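The names in this output come from combining quo_text and sprintf. As a small sketch, the naming step can be run on its own; var is built with quo here only because we are outside of a function.

```r
library(rlang)

var <- quo(xyz)

# quo_text() turns the captured expression into a string
quo_text(var)

# sprintf() is vectorised over the indices, and %02d pads them with a zero
sprintf("lag_%s_%02d", quo_text(var), 1:4)
```

The second call gives "lag_xyz_01" to "lag_xyz_04", matching the names of the list above.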

lags creates the quosures and then we can unquote splice them “manually”.

mutate( d, !!!lags(x, 3) )
## # A tibble: 100 x 4
##        x lag_x_01 lag_x_02 lag_x_03
##    <int>    <int>    <int>    <int>
##  1     1       NA       NA       NA
##  2     2        1       NA       NA
##  3     3        2        1       NA
##  4     4        3        2        1
##  5     5        4        3        2
##  6     6        5        4        3
##  7     7        6        5        4
##  8     8        7        6        5
##  9     9        8        7        6
## 10    10        9        8        7
## # ... with 90 more rows

This also makes it slightly easier to use it on multiple variables at once.

d <- data_frame( x = 1:10, y = letters[1:10])
d %>% 
  mutate( !!!lags(x, 3), !!!lags(y,3) )
## # A tibble: 10 x 8
##        x y     lag_x_01 lag_x_02 lag_x_03 lag_y_01 lag_y_02 lag_y_03
##    <int> <chr>    <int>    <int>    <int> <chr>    <chr>    <chr>   
##  1     1 a           NA       NA       NA <NA>     <NA>     <NA>    
##  2     2 b            1       NA       NA a        <NA>     <NA>    
##  3     3 c            2        1       NA b        a        <NA>    
##  4     4 d            3        2        1 c        b        a       
##  5     5 e            4        3        2 d        c        b       
##  6     6 f            5        4        3 e        d        c       
##  7     7 g            6        5        4 f        e        d       
##  8     8 h            7        6        5 g        f        e       
##  9     9 i            8        7        6 h        g        f       
## 10    10 j            9        8        7 i        h        g
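If the columns to lag are only known as strings, the same idea can be pushed one step further. This is a sketch under assumptions rather than something from the original exchange: the names d2, columns and all_lags are hypothetical, and lags is repeated from above so that the block runs on its own.

```r
library(dplyr)
library(purrr)
library(rlang)

# Same definition as above, repeated to keep the example self-contained
lags <- function(var, n = 10) {
  var <- enquo(var)
  indices <- seq_len(n)
  map(indices, ~ quo(lag(!!var, !!.x))) %>%
    set_names(sprintf("lag_%s_%02d", quo_text(var), indices))
}

d2 <- tibble(x = 1:10, y = letters[1:10])

# Hypothetical input: the columns to lag, given as strings
columns <- c("x", "y")

# syms() turns the strings into symbols, !! passes each symbol to lags(),
# and do.call(c, ...) concatenates the per-column lists of quosures
all_lags <- do.call(c, map(syms(columns), ~ lags(!!.x, 2)))

result <- mutate(d2, !!!all_lags)
names(result)
```

Because lags only returns quosures, combining several calls is ordinary list manipulation, and mutate is still called just once.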

Back story

Taking the time to write this up as a blog post now was inspired by Mara’s promotion of one of my early blog posts about tidy evaluation, written when dplyr 0.7 was released.

Tidy evaluation is one of those things that feel natural once you are more familiar with the concepts. I hope this example helps. I expect to be posting more of these as I come up with use cases.

Shoutout to rtweet

I remembered I tweeted about this during rstudio::conf, which is many tweets ago, and rtweet made it pretty easy to travel back in time.

library(rtweet)
get_timeline("romain_francois", n = 2000 ) %>% 
  filter( stringr::str_detect( text, "lags") ) %>% 
  select( status_id, text )
## # A tibble: 7 x 2
##   status_id          text                                                 
##   <chr>              <chr>                                                
## 1 959813571646377984 "@8bitscollider @drsimonj Something like this ?\n\nt…
## 2 959812374730498048 "@drsimonj Having `lags` just generate the quosures …
## 3 959811892922347520 "#tidyeval ‼️ multiple lags of the same variable. pi… 
## 4 905856726582398976 RT @rensa_co: I got #rstats ggflags working with rou…
## 5 905537589175869441 "#funwithflags \nemo::jis %&gt;% \n   filter( catego…
## 6 896029598496022528 "emoji flags and clocks rounded at the closest 30 mi…
## 7 870036927352799232 RT @rstudio: RStudio is excited to announce the avai…