multiple lags with tidy evaluation
multiple lags
This came up on twitter during rstudio::conf as a question from Simon Jackson. That’s a nice kind of question because it comes with example code, so I had a go at it, but I had not taken the time to promote it to a blog post. Let’s fix that.
#rstats peeps...
Is there a better way to create multiple lags/leads? Code at https://t.co/xUkJ8cBCpq
Approach must be easily distributed (e.g., working on spark data frame with sparklyr, etc), hence my use of #dplyr pic.twitter.com/xAWYRpH0Eh
— Simon Jackson (@drsimonj) February 3, 2018
Simon’s code uses the combination of mutate_at and the lazyeval-style funs_ to make functions from strings.
library(tidyverse)
library(rlang)

d <- tibble(x = seq_len(100))
lags <- seq(10)
lag_names <- paste("lag", formatC(lags, width = nchar(max(lags)), flag = "0"),
  sep = "_")
lag_functions <- setNames(paste("dplyr::lag(., ", lags, ")"), lag_names)
lag_names
## [1] "lag_01" "lag_02" "lag_03" "lag_04" "lag_05" "lag_06" "lag_07"
## [8] "lag_08" "lag_09" "lag_10"
lag_functions
## lag_01 lag_02 lag_03
## "dplyr::lag(., 1 )" "dplyr::lag(., 2 )" "dplyr::lag(., 3 )"
## lag_04 lag_05 lag_06
## "dplyr::lag(., 4 )" "dplyr::lag(., 5 )" "dplyr::lag(., 6 )"
## lag_07 lag_08 lag_09
## "dplyr::lag(., 7 )" "dplyr::lag(., 8 )" "dplyr::lag(., 9 )"
## lag_10
## "dplyr::lag(., 10 )"
This is perfectly valid, but it has an eval/parse feel, and funs_ is not in line with current tidy evaluation.
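To make the "eval/parse feel" concrete, here is a small sketch of the difference between code held in a string and code held in a quosure. The vector x and the variable names are made up for the illustration, and this is not Simon's code, just the general idea.

```r
library(rlang)

# Made-up input for the illustration.
x <- c(10, 20, 30, 40)

# String approach: the code is plain text until parse() turns it into an
# expression, so a syntax error only shows up when that line runs.
code_string <- "dplyr::lag(x, 1)"
from_string <- eval(parse(text = code_string))

# Quoting approach: quo() captures the expression as code, together with
# the environment it was written in.
code_quosure <- quo(dplyr::lag(x, 1))
from_quosure <- eval_tidy(code_quosure)

# Both routes compute the same lagged vector.
identical(from_string, from_quosure)
```

Both give the same result here; the point is only that the second form never goes through text.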
jetlag
My first take was to create a function called jetlag that does the whole operation using tidy evaluation. The name probably reflected my physical state at the time: it turns out that flying from Paris to San Diego with a 6-hour connection in New York makes for jet lag.
jetlag <- function(data, variable, n=10){
  variable <- enquo(variable)
  indices <- seq_len(n)
  quosures <- map( indices, ~quo(lag(!!variable, !!.x)) ) %>%
    set_names(sprintf("lag_%02d", indices))
  mutate( data, !!!quosures )
}
jetlag(d, x, 3)
## # A tibble: 100 x 4
## x lag_01 lag_02 lag_03
## <int> <int> <int> <int>
## 1 1 NA NA NA
## 2 2 1 NA NA
## 3 3 2 1 NA
## 4 4 3 2 1
## 5 5 4 3 2
## 6 6 5 4 3
## 7 7 6 5 4
## 8 8 7 6 5
## 9 9 8 7 6
## 10 10 9 8 7
## # ... with 90 more rows
Let’s break it down in steps. The function takes 3 parameters:
data: a data frame we want to process
variable: a symbol that corresponds to a column name in data
n: the number of lags to create
The first thing it does is to pass variable to enquo. This allows variable to be passed by expression rather than by value. This is why when we call jetlag we don’t have to pass x as a string, but just as a symbol.
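A tiny sketch of what enquo captures may help here. The helper show_capture is hypothetical and only exists for this illustration; note that no object called x needs to exist, because the argument is never evaluated.

```r
library(rlang)

# Hypothetical helper, for illustration only: it captures its argument
# with enquo() and returns the quosure instead of evaluating it.
show_capture <- function(variable) {
  enquo(variable)
}

q <- show_capture(x)

# The quosure holds the symbol that was typed, not a value.
captured_expr <- quo_get_expr(q)
captured_label <- as_label(q)
```

Here captured_expr is the symbol x and captured_label is its text form, which is what makes it possible to build new expressions around it.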
Next, the function makes a list of quosures by iterating with map over the seq_len(n) sequence.
# this is the same as `enquo`, but outside of a function
variable <- quo(x)
indices <- seq_len(3)
quosures <- map( indices, ~quo(lag(!!variable, !!.x)) ) %>%
  set_names(sprintf("lag_%02d", indices))
The first !! unquotes variable, i.e. replaces variable by x; the second !! injects the value of the placeholder .x. Piping into set_names gives names to the list, which mutate can then use as column names.
quosures
## $lag_01
## <quosure>
## expr: ^lag(^x, 1L)
## env: 0x7fbfb7001ca0
##
## $lag_02
## <quosure>
## expr: ^lag(^x, 2L)
## env: 0x7fbfb426b788
##
## $lag_03
## <quosure>
## expr: ^lag(^x, 3L)
## env: 0x7fbfb8420238
names(quosures)
## [1] "lag_01" "lag_02" "lag_03"
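The two unquoting steps can also be looked at in isolation. This is a minimal sketch with a made-up offset variable standing in for the .x placeholder; quo_squash is used only to flatten the result so it is easy to read.

```r
library(rlang)

variable <- quo(x)
offset <- 2L   # stands in for the .x placeholder of the map() call

# Without !!, the names are kept exactly as written.
plain <- quo(lag(variable, offset))

# With !!, the captured expression and the value are inserted in place.
injected <- quo(lag(!!variable, !!offset))

# Flatten both to ordinary calls and turn them into text for comparison.
plain_text <- deparse(quo_squash(plain))
injected_text <- deparse(quo_squash(injected))
```

Printing plain_text and injected_text side by side shows which parts of the call were substituted.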
Finally, the !!! injects the expressions in mutate. In essence, what jetlag is doing is programmatically building something similar to this repetitive expression:
mutate( d,
  lag_01 = lag(x, 1),
  lag_02 = lag(x, 2),
  lag_03 = lag(x, 3)
)
## # A tibble: 100 x 4
## x lag_01 lag_02 lag_03
## <int> <int> <int> <int>
## 1 1 NA NA NA
## 2 2 1 NA NA
## 3 3 2 1 NA
## 4 4 3 2 1
## 5 5 4 3 2
## 6 6 5 4 3
## 7 7 6 5 4
## 8 8 7 6 5
## 9 9 8 7 6
## 10 10 9 8 7
## # ... with 90 more rows
but hopefully with nicer (or at least shorter) syntax:
jetlag(d, x, 3)
## # A tibble: 100 x 4
## x lag_01 lag_02 lag_03
## <int> <int> <int> <int>
## 1 1 NA NA NA
## 2 2 1 NA NA
## 3 3 2 1 NA
## 4 4 3 2 1
## 5 5 4 3 2
## 6 6 5 4 3
## 7 7 6 5 4
## 8 8 7 6 5
## 9 9 8 7 6
## 10 10 9 8 7
## # ... with 90 more rows
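If you want to convince yourself that the generated version and the hand-written one really agree, a quick check along these lines should do. It repeats the jetlag definition so that it runs on its own; the names generated, by_hand and same are just for this sketch.

```r
library(dplyr)
library(purrr)
library(rlang)

# Same definition as above, repeated so this snippet is self-contained.
jetlag <- function(data, variable, n = 10) {
  variable <- enquo(variable)
  indices <- seq_len(n)
  quosures <- map(indices, ~ quo(lag(!!variable, !!.x))) %>%
    set_names(sprintf("lag_%02d", indices))
  mutate(data, !!!quosures)
}

d <- tibble(x = seq_len(100))

generated <- jetlag(d, x, 3)
by_hand <- mutate(d,
  lag_01 = lag(x, 1),
  lag_02 = lag(x, 2),
  lag_03 = lag(x, 3)
)

# Compare as plain data frames to sidestep tibble-specific attributes.
same <- isTRUE(all.equal(as.data.frame(generated), as.data.frame(by_hand)))
```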
lags
As is often the case, I immediately posted this on twitter.
#tidyeval ‼️ multiple lags of the same variable. ping @drsimonj
lags <- function(var, n=10){
  var <- enquo(var)
  indices <- seq_len(n)
  map( indices, ~quo(lag(!!var, !!.x)) ) %>%
    set_names(sprintf("lag_%s_%02d", quo_text(var), indices))
}
https://t.co/RuitnEDIHY pic.twitter.com/XU6QYJrvxe
— Romain François 🦄 (@romain_francois) February 3, 2018
Only to realise a few minutes later that this was not as reusable/composable as it could be. In fact, the interesting part of this function is the creation of the list of quosures, not so much the actual call to mutate. That’s an example where doing less makes something more useful.
So my second take was to make it a smaller function, called lags, that would just make the quosures. Naming functions is something I often struggle with, so when it appears as obvious as just using the plural, I know I’m on the right track.
This was also an opportunity to capture the name of the variable the lags operate on, and use it as a component of the names of the columns to create.
lags <- function(var, n=10){
  var <- enquo(var)
  indices <- seq_len(n)
  map( indices, ~quo(lag(!!var, !!.x)) ) %>%
    set_names(sprintf("lag_%s_%02d", quo_text(var), indices))
}
lags only needs a name and a number of columns; unlike the previous attempt, it does not need data. This makes it easier to understand what is going on:
lags( xyz, 4 )
## $lag_xyz_01
## <quosure>
## expr: ^lag(^xyz, 1L)
## env: 0x7fbfb8c6ff10
##
## $lag_xyz_02
## <quosure>
## expr: ^lag(^xyz, 2L)
## env: 0x7fbfb9878388
##
## $lag_xyz_03
## <quosure>
## expr: ^lag(^xyz, 3L)
## env: 0x7fbfb9878190
##
## $lag_xyz_04
## <quosure>
## expr: ^lag(^xyz, 4L)
## env: 0x7fbfb987aca0
lags creates the quosures and then we can unquote-splice them “manually”.
mutate( d, !!!lags(x, 3) )
## # A tibble: 100 x 4
## x lag_x_01 lag_x_02 lag_x_03
## <int> <int> <int> <int>
## 1 1 NA NA NA
## 2 2 1 NA NA
## 3 3 2 1 NA
## 4 4 3 2 1
## 5 5 4 3 2
## 6 6 5 4 3
## 7 7 6 5 4
## 8 8 7 6 5
## 9 9 8 7 6
## 10 10 9 8 7
## # ... with 90 more rows
This also makes it slightly easier to use it on multiple variables at once.
d <- tibble( x = 1:10, y = letters[1:10])
d %>%
  mutate( !!!lags(x, 3), !!!lags(y,3) )
## # A tibble: 10 x 8
## x y lag_x_01 lag_x_02 lag_x_03 lag_y_01 lag_y_02 lag_y_03
## <int> <chr> <int> <int> <int> <chr> <chr> <chr>
## 1 1 a NA NA NA <NA> <NA> <NA>
## 2 2 b 1 NA NA a <NA> <NA>
## 3 3 c 2 1 NA b a <NA>
## 4 4 d 3 2 1 c b a
## 5 5 e 4 3 2 d c b
## 6 6 f 5 4 3 e d c
## 7 7 g 6 5 4 f e d
## 8 8 h 7 6 5 g f e
## 9 9 i 8 7 6 h g f
## 10 10 j 9 8 7 i h g
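Simon's question was about lags and leads, and the same construction should carry over. Here is a sketch of a hypothetical leads helper, which is not part of what I tweeted at the time: it is simply lags with dplyr's lead swapped in and a different column prefix.

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical counterpart to lags(), not from the original tweets:
# same construction, with lead() instead of lag().
leads <- function(var, n = 10) {
  var <- enquo(var)
  indices <- seq_len(n)
  map(indices, ~ quo(lead(!!var, !!.x))) %>%
    set_names(sprintf("lead_%s_%02d", quo_text(var), indices))
}

d <- tibble(x = 1:10)

# Splice the generated quosures into mutate(), as with lags().
res <- mutate(d, !!!leads(x, 2))
```

Because both helpers only return named quosures, they can be spliced into the same mutate call side by side.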
Back story
Taking the time to write this up as a blog post now was inspired by Mara’s promotion of one of my early blog posts about tidy evaluation, written when dplyr 0.7 was released.
😮How'd I missed this tidy eval gem?!
"filter with context" ✏️ @romain_francois https://t.co/tgWw9Fn8QL #rstats #tidyeval #tidyverse pic.twitter.com/b957n234kz
— Mara Averick (@dataandme) February 28, 2018
Tidy evaluation is one of those things that feel natural once you are more familiar with the concepts. I hope this example helps. I expect to be posting more of these as I come up with use cases.
Thanks. I’ll post more of these, or at least convert things i have in gists, … to proper blog posts.
— Romain François 🦄 (@romain_francois) March 1, 2018
Shoutout to rtweet
I remembered I tweeted about this during rstudio::conf, which is many tweets ago, and rtweet made it pretty easy to travel back in time.
library(rtweet)
get_timeline("romain_francois", n = 2000 ) %>%
  filter( stringr::str_detect( text, "lags") ) %>%
  select( status_id, text )
## # A tibble: 7 x 2
## status_id text
## <chr> <chr>
## 1 959813571646377984 "@8bitscollider @drsimonj Something like this ?\n\nt…
## 2 959812374730498048 "@drsimonj Having `lags` just generate the quosures …
## 3 959811892922347520 "#tidyeval ‼️ multiple lags of the same variable. pi…
## 4 905856726582398976 RT @rensa_co: I got #rstats ggflags working with rou…
## 5 905537589175869441 "#funwithflags \nemo::jis %>% \n filter( catego…
## 6 896029598496022528 "emoji flags and clocks rounded at the closest 30 mi…
## 7 870036927352799232 RT @rstudio: RStudio is excited to announce the avai…