Optimize R foreach loops for parallelism: avoid the .combine argument

Cute boat in a glass bottle. Art generated by FLUX.1 [dev] + cozy-book-800 LoRA adapter.

The benchmarking script used in this post is available as a GitHub gist.

This post recommends avoiding the use of foreach(.combine = "rbind") in parallel loops in R. Instead, manually combine the results after the foreach() call using functions like data.table::rbindlist() or dplyr::bind_rows().

Performance issues with rbind()

The do.call(rbind, list) pattern in R is an expressive, functional way to bind data frames from a list into a single data frame. However, it can take a serious performance hit as the list grows, because each incremental rbind() re-copies the accumulated result. The convenient .combine argument in foreach() has a similar performance trap: with .combine = "rbind", a parallel foreach() call can degrade significantly as the number of iterations increases.
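The copying cost is easy to reproduce outside of foreach(). The sketch below (list size and timings are illustrative, not part of the benchmark) compares do.call(rbind, ...) against data.table::rbindlist() on a list of small data frames:

```r
# Illustrative sketch: repeated rbind() copies vs. a single rbindlist() pass.
# Requires the data.table package; exact timings depend on your machine.
lst <- lapply(seq_len(10000), function(i) data.frame(id = i, x = rnorm(1)))

t_rbind     <- system.time(df1 <- do.call(rbind, lst))["elapsed"]
t_rbindlist <- system.time(df2 <- as.data.frame(data.table::rbindlist(lst)))["elapsed"]

c(do_call_rbind = t_rbind, rbindlist = t_rbindlist)
stopifnot(identical(df1$id, df2$id))  # both approaches produce the same content
```

On most machines the rbindlist() timing is a small fraction of the do.call(rbind, ...) timing, and the gap widens as the list grows.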

Benchmarking setup

To quantify this issue, I created a simple benchmark comparing two approaches:

  • Method 1: foreach(.combine = "rbind")
  • Method 2: foreach() without .combine, followed by data.table::rbindlist()

First, I created a CPU-intensive function that performs singular value decomposition (SVD) on 300 randomly generated 10x10 positive semidefinite matrices. The function returns a 300x7 data frame summarizing the results:

anysvd <- function(id, dim = 10, nrep = 300) {
  results <- vector("list", nrep)

  for (j in 1:nrep) {
    X <- matrix(rnorm(dim^2), dim, dim)
    A <- crossprod(X)

    s <- svd(A)

    results[[j]] <- data.frame(
      id = id,
      sub_id = j,
      sv1 = s$d[1],
      sv2 = s$d[2],
      sv3 = s$d[3],
      matrix_norm = norm(A, type = "F"),
      matrix_trace = sum(diag(A))
    )
  }

  as.data.frame(data.table::rbindlist(results))
}

I ran the benchmarks across different numbers of iterations (1000, 10000, 50000, 100000) using the %dofuture% operator on a 32-worker setup (7950X3D):

library(doFuture)

plan(multisession, workers = 32)

nsim_grid <- c(1000, 10000, 50000, 100000)

df_bench <- data.frame(
  nsim = integer(),
  method = character(),
  time = numeric()
)

for (nsim in nsim_grid) {
  message("Running benchmark with nsim=", nsim)

  # Method 1: foreach(.combine = "rbind")
  set.seed(42)
  tictoc::tic.clearlog()
  tictoc::tic(paste0("method1_nsim", nsim))
  df_rbind <- foreach(
    i = 1:nsim,
    .combine = "rbind",
    .options.future = list(seed = TRUE)
  ) %dofuture% {
    anysvd(i)
  }
  tictoc::toc(log = TRUE, quiet = TRUE)

  # Method 2: foreach then rbindlist
  set.seed(42)
  tictoc::tic(paste0("method2_nsim", nsim))
  lst_rbindlist <- foreach(
    i = 1:nsim,
    .options.future = list(seed = TRUE)
  ) %dofuture% {
    anysvd(i)
  }
  df_rbindlist <- as.data.frame(data.table::rbindlist(lst_rbindlist))
  tictoc::toc(log = TRUE, quiet = TRUE)

  if (!identical(df_rbind, df_rbindlist)) warning("Discrepant results for nsim=", nsim)

  lst_log <- tictoc::tic.log(format = FALSE)
  df_bench <- rbind(
    df_bench,
    data.frame(
      nsim = nsim,
      method = "foreach(.combine=\"rbind\")",
      time = lst_log[[1]]$toc - lst_log[[1]]$tic
    ),
    data.frame(
      nsim = nsim,
      method = "rbindlist",
      time = lst_log[[2]]$toc - lst_log[[2]]$tic
    ),
    make.row.names = FALSE
  )

  tictoc::tic.clearlog()
}

Benchmark results

The results in the figure show that both methods perform similarly at smaller scales (1000 iterations), but the .combine = "rbind" method quickly becomes inefficient as the number of iterations increases, while the manual combine method scales roughly linearly.

Comparison of combine methods performance.

The table below shows the raw timing numbers. At least for this particular benchmark, at higher iteration counts (10000+), combining results with .combine dramatically increases execution time. At 100000 iterations, using .combine = "rbind" roughly doubles the total execution time compared to combining with rbindlist() afterwards; CPU monitoring showed that about half of that time was spent with a single core pegged at 100%, serially combining results.

nsim     Method                      Time (seconds)
1000     foreach(.combine="rbind")             8.52
1000     rbindlist                             8.70
10000    foreach(.combine="rbind")            54.56
10000    rbindlist                            47.77
50000    foreach(.combine="rbind")           353.61
50000    rbindlist                           225.17
100000   foreach(.combine="rbind")           927.05
100000   rbindlist                           450.90

Conclusion

Based on this benchmark, you should avoid .combine = "rbind" in parallel foreach() loops. Instead, aggregate the results after parallel execution using more efficient alternatives such as data.table::rbindlist() or dplyr::bind_rows(). The other combine methods supported by .combine (for example, "c" or "cbind") deserve similar benchmarking before relying on them at scale.
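For completeness, here is a minimal sketch of the post-hoc combine using dplyr::bind_rows(), with lapply() standing in for the list that a bare foreach() call (no .combine) returns:

```r
# Sketch: combine a list of data frames in one pass with dplyr::bind_rows().
# Requires the dplyr package; lapply() stands in for a foreach() result list.
lst <- lapply(seq_len(1000), function(i) data.frame(id = i, x = rnorm(3)))
df  <- dplyr::bind_rows(lst)
nrow(df)  # 3000 rows: 1000 data frames x 3 rows each
```

Like rbindlist(), bind_rows() allocates the output once instead of repeatedly copying a growing accumulator, which is where the savings come from.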