Avoid loops in R! Really?

It must have been around the year 2000 when I wrote my first snippet of S-PLUS/R code. One thing I learned back then:

Loops are slow. Replace them with

  1. vectorized calculations or
  2. if vectorization is not possible, use sapply() et al.
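
In code, that old advice looked roughly like this (a toy illustration, not part of the benchmarks below; the square root only serves as a stand-in):

x <- c(1, 4, 9, 16)

# What we were told to avoid: an explicit loop
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- sqrt(x[i])
}

# 1. Vectorized calculation
sqrt(x)          # 1 2 3 4

# 2. sapply() where no vectorized function exists
sapply(x, sqrt)  # 1 2 3 4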

Since then, the R core team and the community have invested tons of time into improving R and making it faster. There are also tools like Rcpp and parallel computing to speed up loops.
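
To give a flavor of these tools (a minimal sketch, not part of the benchmarks below; the function name sqrt_cpp is made up), a loop can be moved to C++ via Rcpp, and independent, expensive iterations can be spread over several cores with the parallel package:

library(Rcpp)

# Rewrite the loop in C++ and compile it from R
cppFunction("
NumericVector sqrt_cpp(NumericVector x) {
  int n = x.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) out[i] = std::sqrt(x[i]);
  return out;
}
")

sqrt_cpp(c(1, 4, 9, 16))  # 1 2 3 4

# For expensive, independent iterations: parallel::mclapply() (forking, not
# available on Windows) or parallel::parLapply() with a cluster.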

But relatively few R users seem to know that loops are not that slow anymore. We want to demonstrate this using two examples.

Example 1: sqrt()

We use three ways to calculate the square root of a vector of random numbers:

  1. Vectorized calculation. This will be the way to go because it is internally optimized in C.
  2. A loop. This must be super slow for large vectors.
  3. vapply() (as a safe alternative to sapply()).

The three approaches are then compared via bench::mark() regarding their speed for different vector lengths n. The results are compared first by absolute median times and second (using an independent run) on a relative scale, where 1 represents the vectorized approach.

library(tidyverse)
library(bench)

# Calculate square root for each element in loop
sqrt_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- sqrt(x[i])
  }
  out
}

# Example
sqrt_loop(1:4) # 1.000000 1.414214 1.732051 2.000000

# Compare its performance with two alternatives
sqrt_benchmark <- function(n) {
  x <- rexp(n)
  mark(
    vectorized = sqrt(x),
    loop = sqrt_loop(x),
    vapply = vapply(x, sqrt, FUN.VALUE = 0.0),
    # relative = TRUE
  )
}
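
One way to obtain the relative numbers of the second run, consistent with the commented-out argument above, is to pass relative = TRUE to mark(); all summaries are then rescaled to the fastest expression, which is the vectorized one here. A sketch (the helper name sqrt_benchmark_relative is made up):

# Same benchmark, but with all summaries relative to the fastest expression
sqrt_benchmark_relative <- function(n) {
  x <- rexp(n)
  mark(
    vectorized = sqrt(x),
    loop = sqrt_loop(x),
    vapply = vapply(x, sqrt, FUN.VALUE = 0.0),
    relative = TRUE
  )
}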

# Combine results of multiple benchmarks and plot results
multiple_benchmarks <- function(one_bench, N) {
  res <- vector("list", length(N))
  for (i in seq_along(N)) {
    res[[i]] <- one_bench(N[i]) %>% 
      mutate(n = N[i], expression = names(expression))
  }
  
  ggplot(bind_rows(res), aes(n, median, color = expression)) +
    geom_point(size = 3) +
    geom_line(size = 1) +
    scale_x_log10() +
    ggtitle(deparse1(substitute(one_bench))) +
    theme(legend.position = c(0.8, 0.15))
}

# Apply simulation
multiple_benchmarks(sqrt_benchmark, N = 10^seq(3, 6, 0.25))

Absolute timings

Absolute median times on the “sqrt()” task

Relative timings (using a second run)

Relative median times of a separate run on the “sqrt()” task

We see:

  • Run times increase quite linearly with vector size.
  • Vectorization is more than ten times faster than the naive loop.
  • Most strikingly, vapply() is much slower than the naive loop. Would you have thought this?

Example 2: paste()

For the second example, we use a less simple function, namely

paste("Number", prettyNum(x, digits = 5))

What will our three approaches (vectorized, naive loop, vapply) show on this task?

pretty_paste <- function(x) {
  paste("Number", prettyNum(x, digits = 5))
}

# Example
pretty_paste(pi) # "Number 3.1416"

# Again, call pretty_paste() for each element in a loop
paste_loop <- function(x) {
  out <- character(length(x))
  for (i in seq_along(x)) {
    out[i] <- pretty_paste(x[i])
  }
  out
}

# Compare its performance with two alternatives
paste_benchmark <- function(n) {
  x <- rexp(n)
  mark(
    vectorized = pretty_paste(x),
    loop = paste_loop(x),
    vapply = vapply(x, pretty_paste, FUN.VALUE = ""),
    # relative = TRUE
  )
}

multiple_benchmarks(paste_benchmark, N = 10^seq(3, 5, 0.25))

Absolute timings

Absolute median times on the “paste()” task

Relative timings (using a second run)

Relative median times of a separate run on the “paste()” task

We see:

  • In contrast to the first example, vapply() is now as fast as the naive loop.
  • The time advantage of the vectorized approach is much less impressive: in terms of median time, the loop takes only about 50% longer.

Conclusion

  1. Vectorization is fast and easy to read. If available, use this. No surprise.
  2. If you use vapply()/sapply()/lapply(), do it for style, not for speed. In some cases, the loop will even be faster. And, depending on the situation and the audience, a loop might actually be easier to read.

The code can be found on GitHub.

The runs were made on a Windows 11 system with a four-core Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz.


Comments

  1. Psyoskeptic

    Loops were never much slower than apply-family functions, and there are cases where they are faster. The point of preferring *apply functions over loops is not that loops are slower than them. It's that one often wants to make multiple manipulations in sequence. Using loops, it's common to just put everything inside the loop. With *apply, however, you are more likely to vectorize the commands that can be vectorized and to *apply only those that can't. This results in a substantial overall speedup and better readability. You can do that with loops too; the syntax just doesn't lend itself to it.
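
    For illustration, a minimal sketch of that pattern (the per-group model fit stands in for a step that cannot be vectorized):

    # Only the per-group fit needs lapply(); everything around it stays vectorized.
    fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
    slopes <- vapply(fits, function(fit) coef(fit)[["wt"]], FUN.VALUE = 0.0)
    slopes / max(abs(slopes))  # vectorized post-processing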

    1. Michael Mayer

      Fully agree with your summary, thanks. However, I keep hearing too often “why are you using a slow loop here when you can go for a fast sapply”. This is the reason for writing the post.

  2. Carlos Q.

    You haven’t tried much larger and more intensive functions, including read operations and processing of the data read. I don’t think two functions as basic as sqrt() or paste() are useful for drawing general conclusions, but it’s a good analysis to use as a starting point.

    1. Michael Mayer

      Thanks for your comment, with which I disagree: with more complex functions, you will see the same as with paste(): vapply() and the loop will be similarly fast (and there won't be a vectorized version to benchmark against).

  3. […] Michael Mayer notes that loops in R aren’t actually all that bad: […]

  4. Paul van Oppen

    Thanks Michael, great post!
    I have been using apply and map functions for many years now and appreciate how they allow you to think conceptually about a problem and not get distracted by the details of loops: initialization and stop conditions.
    After a (substantial) learning curve I now use map, map2, pmap, walk etc. when writing code for e.g. modelling and creating graphics with ggplot for complex data objects.

    1. Michael Mayer

      Same here, thanks Paul, for your comment.

  5. Jebyrnes

    For more novice coders, the big difference, I believe, is the preallocation of an object of the correct size to return. Many novice coders don't do this or forget to, and as such, it slows down the code tremendously (although I'd be curious to see you try this). The apply functions do this without one having to think about it, so more novice coders will see an immediate speed improvement.
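
    For illustration, a minimal sketch of the difference described above (the function names are made up):

    # Growing the result versus preallocating it; both return the same vector.
    grow_loop <- function(x) {
      out <- c()                     # no preallocation: out is copied and grown each iteration
      for (i in seq_along(x)) {
        out <- c(out, sqrt(x[i]))
      }
      out
    }

    prealloc_loop <- function(x) {
      out <- numeric(length(x))      # allocated once, filled in place
      for (i in seq_along(x)) {
        out[i] <- sqrt(x[i])
      }
      out
    }

    # bench::mark(grow = grow_loop(x), prealloc = prealloc_loop(x)) should show
    # the growing version falling further behind as length(x) increases.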

    1. Michael Mayer

      Fully agree. I almost always prefer an *apply over a loop, basically for the same reasons as you explain. It is for style, not for performance.

  6. […] in some cases, running a for loop might even be faster than using an apply() function. Check out this blog post by Michael Mayer for a great comparison of different […]

  7. […] in some cases, running a for loop might even be faster than using an apply() function. Check out this blog post by Michael Mayer for a great comparison of different […]

  8. Troy

    You say, “Since then, the R core team and the community has invested tons of time to improve R and also to make it faster”, but can you provide any specifics of how the R core team has achieved that or is even trying to?

    1. Michael Mayer

      I obviously do not know how many hours the core team really invested into this, so I will just provide an example from the release notes of R 3.4:
      “The JIT (‘Just In Time’) byte-code compiler is now enabled by default at its level 3. This means functions will be compiled on first or second use and top-level loops will be compiled and then run. (Thanks to Tomas Kalibera for extensive work to make this possible.)”.
      Since it seems to be a topic that interests you, maybe you can try installing different R versions from the last 20 years and see how and where the speed-ups happened. I remember that someone did something similar, but I could not find the link.
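
      For illustration, one rough way to see the compiler's effect (a sketch with made-up names; it compares a freshly defined, interpreted copy of the loop against an explicitly byte-compiled one):

      library(compiler)

      sqrt_loop_fresh <- function(x) {   # fresh definition, not yet byte-compiled
        out <- numeric(length(x))
        for (i in seq_along(x)) out[i] <- sqrt(x[i])
        out
      }

      x <- rexp(1e6)

      old <- enableJIT(0)                # switch the JIT off; returns the previous level
      t_interpreted <- system.time(sqrt_loop_fresh(x))

      sqrt_loop_compiled <- cmpfun(sqrt_loop_fresh)   # explicit byte-compilation
      t_compiled <- system.time(sqrt_loop_compiled(x))

      enableJIT(old)                     # restore the previous JIT setting
      t_interpreted
      t_compiled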

  9. Simon

    Hi Michael, thanks for your benchmark. Could you share the R kernel version used for this test?
    Thanks,
    Simon

    1. Michael Mayer

      Hello Simon. I think it was R 4.1.2 (with 90% certainty).

      1. Luke

        Hi Michael, what's the confidence interval on your subjective uncertainty estimate?
        Thanks, Luke
