Roll Apply and Alignment Issues in R

Rolling Issue

One issue with back testing time-series data in R is aligning functions and indicator results over a previous window with the data itself.In this post, I will identify a few potential issues and offer a few solutions using functions like rollapply(), rollmean(), rollsd()

Issues with rollapply() and Alignment

Using rollapply on time-series data will not fill the first values of the window that are NA - the function only returns from the start of the look back window to the end of the dataset.

For example, let my.data = rnorm(100). Use the rollapply function from the zoo package. The width argument is set to 20 - this means that “[i]n the simplest case this is an integer specifying the window width (in numbers of observations) which is aligned to the original sample according to the align argument.” R Documentation on rollapply. So at each observation along my.data it will compute the function argument (in this case the mean() function) over the prior 20 periods.

The function is rollapply(my.data, 20, mean).

set.seed(123)
my.data <- rnorm(100)
my.roll.apply <- rollapply(my.data, 20, mean)
length(my.roll.apply)
#[1] 81
length(my.data)
#[1] 100

rollapply() Defaults to `center` Alignment

The first issue is the alignment of the rollapply results. my.data has 100 observations but my.roll.apply only has 81 observations.

This is to be expected - the function can’t loopback 20 periods until the 20th period occurs! But if you were to bind my.data and my.roll.apply R would throw an error since the first observation of my.roll.apply doesn’t occur until the 20th observation of my.data.

cbind.data.frame(my.data, my.roll.apply)
# Error in data.frame(..., check.names = FALSE) :
#   arguments imply differing number of rows: 100, 81

A simple way to fix this issue is to change the align argument to right and the fill argument to either 0 or NA. The align argument specifies “whether the index of the result should be left- or right-aligned or centered (default) compared to the rolling window of observations.” R Documentation on rollapply. The fill argument compliments the align argument by adding in a value of your choosing. This is generally NA, NULL or 0.

my.roll.apply.2 <- rollapply(my.data, 20, mean, fill = 0, align = "right")

round(my.roll.apply.2, 2)
#  [1]  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
# [14]  0.00  0.00  0.00  0.00  0.00  0.00 -0.10 -0.18 -0.11 -0.04 -0.09 -0.15 -0.15
# [27] -0.12 -0.05  0.08  0.07  0.12  0.03  0.15  0.17  0.14  0.13  0.14  0.06  0.12

Now the length of my.roll.apply.2 is equal to the length of my.data. It’s also important to note that it didn’t modify how the calculation was made - the last 81 observations of my.roll.apply.2 are equal to all of the observations of my.roll.apply.

length(my.roll.apply.2) == length(my.data)
# [1] TRUE

all.equal(my.roll.apply.2[20:100],my.roll.apply)
# [1] TRUE

Note that R offers other functions to solve the problem presented above. From the zoo package you can use the rollapplyr() wrapper which has align = 'right' as the default. You can also use the rollmean() function and set align = 'right' and fill = 0.

QuantTools also offers several roll functions. For the mean function you would use the sma() function and set n = 20. Note that sma() doesn’t have a fill = argument so you will have to manually do that for each data set.

my.sma <- sma(my.data, 20)
# - manually replacing NA values with 0
my.sma[is.na(my.sma)] = 0
# - check if my.sma is equal to my.roll.apply.2
all.equal(my.sma, my.roll.apply.2)
# [1] TRUE

Other functions can be calculated on a rolling basis. These include rollsum(), rollmedian() and rollmax()/rollmin from the zoo package. QuantTools has the same functions but also offers roll_lm() for rolling linear regression, roll_sd() for rolling standard deviation, roll_correlation() for correlations and roll_range() to find the minimum and maximum values over n past values. See the QuantTools Manual for more information.

rollapply() inclusion of current observation

Another issue to be at least cognizant of is that rollapply() includes the current observation in the calculation window. So at the 20th observation of my.data, it is including the 20th observation when calculating the FUN = mean. As noted above, that is why my.roll.apply is 81 observations long, rather than 80 observations long.

The issue may be important in time-series analysis since it assumes the inclusion of the current value in the look back calculation. For instance, at the 20th row of my.data, rollapply(my.data, 20, mean) will include from the 1st to the 20th value of my.data.

This might be a problem if you are trying to base an indicator on the last 20 values and not include the current observation.

For example, assume that you are looking at the average over the last 20 periods of 1-minute data evaluated at the open. Some sample data:

my.data.2
#                  time open high  low close volume
#1: 2020-09-30 09:31:00 2.35 2.40 2.11  2.11    124
#2: 2020-09-30 09:32:00 2.08 2.14 2.03  2.04    225
#3: 2020-09-30 09:33:00 2.00 2.00 1.88  1.96    286
#4: 2020-09-30 09:34:00 1.90 2.00 1.82  1.94    550
#5: 2020-09-30 09:35:00 1.91 2.01 1.89  1.97    440
#---                                                
#1162: 2020-10-02 15:56:00 1.70 1.80 1.70  1.77     40
#1163: 2020-10-02 15:57:00 1.85 1.86 1.71  1.77     26
#1164: 2020-10-02 15:58:00 1.85 1.91 1.78  1.86    182
#1165: 2020-10-02 15:59:00 1.83 1.86 1.80  1.81    288
#1166: 2020-10-02 16:00:00 1.80 1.82 1.75  1.78     97

To illustrate, I will add a column with rollapply() calculating the mean over the last 20 periods.

my.data.2.mean <- my.data.2 %>%
                    mutate("roll.mean"= rollapply(data = open,
                                                  width = 20,
                                                  FUN = function(x) mean(x),
                                                  align = "right", fill = NA)
                    )

Using this setup, the roll.mean column is calculated from the 20th open. In reality, you would not receive this information until the 20th open, and then you could act on it (such as entering long or short) immediately after it. Most likely, the open value would not be 1.82 by the time you acted - it would either be more or less than 1.82. This delay may be acceptable in back testing - you could simply assume slippage (“slippage is the difference between where the computer signaled the entry and exit for a trade and where actual clients, with actual money, entered and exited the market using the computer’s signals.” Wikipedia - Slippage(finance)).

You could also use the lag() function and add another column showing the lagged value or change the width = argument to a list reflecting the offset. For a 20 period look back, you would set width = list(-1:-20).

my.data.2.mean <- my.data.2 %>%
                    mutate("roll.mean"= rollapply(data = open,
                                                  width = 20,
                                                  FUN = function(x) mean(x),
                                                  align = "right", fill = NA),
                            "roll.mean.lag" = lag(roll.mean,1,NA),
                            "roll.mean.width" = rollapply(data = open,
                                                          width = list(-1:-20),
                                                          FUN = function(x) mean(x),
                                                          align = "right", fill = NA)
                    )

Whatever solution you choose, make sure your data lines up!!

Rolling Issue

Issues with rollapply() and Alignment

rollapply() Defaults to center Alignment

rollapply() inclusion of current observation

Notes & Research

rollapply() Defaults to `center` Alignment