diff --git a/_pkgdown.yml b/_pkgdown.yml index 5c7cd2b4..57a1d2d6 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -65,6 +65,7 @@ articles: contents: - arbitraryhazard - modestWLRTVignette + - parallel - pvalue_maxcomboVignette - simtrialroutines - workflow diff --git a/vignettes/parallel.Rmd b/vignettes/parallel.Rmd new file mode 100644 index 00000000..7b2e9dbe --- /dev/null +++ b/vignettes/parallel.Rmd @@ -0,0 +1,278 @@ +--- +title: "Simulating time-to-event trials in parallel" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Simulating time-to-event trials in parallel} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +## Overview + +This vignette demonstrates the ability to implement `sim_fixed_n()` using user-defined backends to parallelize simulations. We will consider the backends supported by the [future](https://future.futureverse.org/) framework. + +The backends supported by the future package include: + +* `sequential` - default and non-parallel backend. +* `multisession` - uses multiple background R sessions on a single machine. +* `multicore` - uses multiple forked R processes on a single non-Windows machine outside of RStudio. +* `cluster` - supports external R sessions across multiple machines. + +You can also choose other backend types supported by additional future extension packages, such as the HPC job scheduler backends from future.batchtools. + +The function `sim_fixed_n()` provides a simulation workflow for a two-arm trial with a single endpoint. +We can vary the parameters of the trial using different functions outlined in the documentation. +This function now provides users the opportunity to implement their simulations using the previously described parallel backends to accelerate the computation. + +## Background + +Without specifying a backend, `sim_fixed_n()` will execute sequentially. +The sequential execution will run all `n_sim` iterations within the same process or session of R. +In order to execute in parallel, we must define the environment prior to calling the function. +Setting your seed prior to calling the function will ensure that the results are reproducible. + +Suppose you want to investigate the duration of a trial under two possible enrollments strategies. +Both enrollments are piecewise, but have varying durations and rates. + +```{r dependencies, message=FALSE, warning=FALSE} +library(simtrial) +library(tibble) +library(future) +library(doFuture) +``` + +```{r set-plan-sequential, echo=FALSE} +plan("sequential") # Ensure that the backend is sequential +``` + +```{r enrollments, fig.height=4, fig.width=6, fig.align="center"} +set.seed(1) + +n <- 5000 +enroll_rate1 <- tibble(rate = c(5, 20, 10), duration = c(100, 150, 150)) +enroll_rate2 <- tibble(rate = c(10, 15, 30), duration = c(150, 175, 75)) +x1 <- rpw_enroll(n = n, enroll_rate = enroll_rate1) +x2 <- rpw_enroll(n = n, enroll_rate = enroll_rate2) + +plot( + x1, 1:n, + type = "l", + col = "blue", + xlim = c(0, max(x1, x2)), + main = "Piecewise enrollments", + xlab = "Time", + ylab = "Enrollment" +) +lines(x2, 1:n, col = "orange") +legend( + 250, 1500, + legend = c("Enrollment 1", "Enrollment 2"), + col = c("blue", "orange"), + lty = c(1, 1) +) +``` + +We see that *Enrollment 2* enrolls individuals more quickly than *Enrollment 1* at onset. +Later, *Enrollment 1* will outpace *Enrollment 2* before eventually being overtaken again. +As such, we want to consider how the duration of the study changes under these enrollments. + +## The sequential run + +Naively, we can execute these simulations sequentially. +We set the target of a total enrollment of 3000 individuals with the trial ending after observing 700 events. +We use `timing_type = 2` to return the correct trial duration. + +```{r confirm-sequential} +set.seed(1) + +n_sim <- 200 + +start_sequential <- proc.time() + +seq_result1 <- sim_fixed_n( + n_sim = n_sim, + sample_size = 3000, + target_event = 700, + enroll_rate = enroll_rate1, + timing_type = 2 # Time until targeted event count achieved +) + +seq_result2 <- sim_fixed_n( + n_sim = n_sim, + sample_size = 3000, + target_event = 700, + enroll_rate = enroll_rate2, + timing_type = 2 # Time until targeted event count achieved +) + +duration_sequential <- proc.time() - start_sequential +``` + +A message automatically appears in the console that indicates what backend is being used for processing. + +The calls to `proc.time()` allow us to evaluate the computation time of these procedures. +This function provides three outputs, we will focus on user and elapsed time. +User time represents the CPU time spent evaluating the function and elapsed time represents the "wall clock" time spent by the end user waiting for the results. + +```{r sequential-time} +print(duration_sequential) +``` + +We can see that the CPU time is `r sprintf("%.2f", duration_sequential[[1]])` and the elapsed time is `r sprintf("%.2f", duration_sequential[[3]])` seconds. +These provide our baseline for the computation time. + +As you may have anticipated, we see that for a lower number of events, enrollment 2 has a shorter average duration of +`r sprintf("%.1f", mean(seq_result2$duration))` +over that of enrollment 1, which is +`r sprintf("%.1f", mean(seq_result1$duration))`. + +We also see that there is a distinction between the duration of the study under the proposed enrollment strategies. + +```{r sequential-display-results, eval=FALSE, echo=FALSE} +seq_result1 %>% + head(5) %>% + kable(digits = 2) +seq_result2 %>% + head(5) %>% + kable(digits = 2) +``` + +## Setting up a parallel backend + +If we instead, wanted to run more simulations for each enrollment, we can expect the time to run our simulations to increase. +As we vary and increase the number of parameter inputs that we consider, we expect the simulation process to continue to increase in duration. +To help combat the growing computational burden, we can run these simulations in parallel using the `multisession` backend available to us in `plan()`. + +We can adjust the default number of cores with the function `parallelly::availableCores()`. +The multisession backend will automatically use all available cores by default. +To initialize our backend, we change our plan. + +```{r multisession} +plan(multisession, workers = availableCores()) +``` + +## Execution in parallel + +Once we have configured the backend details, we can execute the same code as before to automatically distribute the `n_sim` simulations across the available cores. + +```{r confirm-multisession} +set.seed(1) + +start_sequential <- proc.time() + +seq_result1m <- sim_fixed_n( + n_sim = n_sim, + sample_size = 3000, + target_event = 700, + enroll_rate = enroll_rate1, + timing_type = 2 # Time until targeted event count achieved +) + +seq_result2m <- sim_fixed_n( + n_sim = n_sim, + sample_size = 3000, + target_event = 700, + enroll_rate = enroll_rate2, + timing_type = 2 # Time until targeted event count achieved +) + +duration_sequential <- proc.time() - start_sequential +``` + +```{r time-parallel} +print(duration_sequential) +``` + +We can see that the CPU time is `r sprintf("%.2f", duration_sequential[[1]])` and the elapsed time is `r sprintf("%.2f", duration_sequential[[3]])` seconds. +The user time here appears to be drastically reduced because of how R keeps track of time; the time used by the parent process and not the children processes are reported for the user time. +Therefore, we compare the elapsed time to see the real-world impact of the parallelization. + +To change the implementation back to a sequential backend, we simply use what is below. + +```{r plan-sequential} +plan(sequential) +``` + +We can also verify that the simulation results are identical because of setting a seed and that the backend type will not affect the results. Below, it is clear that the results from our sequential and multisession backends match completely. + +```{r compare-results} +sum(seq_result1 != seq_result1m) +sum(seq_result2 != seq_result2m) +``` + +*Note:* A parallel implementation may not always be faster than a serial implementation. +If there is substantial overhead associated with executing in parallel, sequential evaluation may be faster. +For a low number of simulations or available cores, it may be preferable to continue computation in serial rather than parallel. +We leave it to the end user to determine this difference based on the resources available to them. + +## A nested parallel example + +We provide an additional example using a nested parallel structure for users with more extensive resources, such as high-performance computing clusters, available to them. +Because these resources are not commonly available, we will not execute the below code herein. +Consider that you have two accessible nodes, each with three cores (shown in the diagram below). + +```{r schema, echo=FALSE, fig.cap="Available resource schematic.", fig.align="center", out.width="90%"} +knitr::include_graphics("schema.png") +``` + +Ideally, all available resources will be used when executing the simulations. +To do this, we need to correctly define our backend using `plan()` and run the same code as previously. +The different structures, or topologies, for a backend can be changed with a more in depth explanation given in the [future topologies vignette](https://future.futureverse.org/articles/future-3-topologies.html). +Our below example follows closely their example. + +In our below snippet, we consider the two nodes named `n1` and `n2` and create a function to select the number of cores to use on those named nodes. +While trivial here, a courteous user of shared machines would specify fewer than all available cores and can do such using a modification of the below code. +We then implement our backend using a list that follows the hierarchy of the available resources. + +```{r nested-topology, eval=FALSE} +nodes <- c("n1", "n2") +customCores <- function() { + switch(Sys.info()[["nodename"]], + "n1" = 3L, # Modify here for number of cores on node1 + "n2" = 3L, # Modify here for number of cores on node2 + ## Default: + availableCores() + ) +} +plan(list( + tweak(cluster, workers = nodes), + tweak(multisession, workers = customCores) +)) +``` + +The function `tweak()` is necessary to override the inherent protection of nested parallelism, meant to help avoid overloading one's resources by errantly starting too many processes. +Because of the need to tweak backends, the message echoed to the console for nested backends reflects the highest level of the nested hierarchy. + +With the backend in place, we then can run the identical code from before using all available resources and return the same results as before. + +```{r confirm-cluster, eval=FALSE} +set.seed(1) + +enroll_rates <- list(enroll_rate1, enroll_rate2) + +seq_resultc <- foreach::foreach( + i = 1:2, + .combine = "list", + .options.future = list(seed = TRUE) +) %dofuture% { + sim_fixed_n( + n_sim = n_sim, + sample_size = 3000, + target_event = 700, + enroll_rate = enroll_rates[[i]], + timing_type = 2 # Time until targeted event count achieved + ) +} +``` + +Then, we reset the `plan` to sequential to avoid accidentally continuing to execute later calls within these resources. + +```{r plan-sequential2, eval=FALSE} +plan(sequential) +``` diff --git a/vignettes/schema.png b/vignettes/schema.png new file mode 100644 index 00000000..0ff12a99 Binary files /dev/null and b/vignettes/schema.png differ