class: center, middle, title-slide # Optimizing your code in R ### Mi Jun Keng
mijun.keng@ndph.ox.ac.uk
### Health Economics Research Centre (HERC), University of Oxford ### 9th October 2020 --- # Why optimise? Many stochastic elements in health economics analysis such as bootstrapping, probabilistic sensitivity analyses, Monte-Carlo simulations etc. - Repetitive / messy code - Time consuming to run - Memory issues -- Optimising your code may resolve these problems. Also makes it easier to correct error, and makes your code neater and more readable. --- Consider a dummy example where we want to estimate the average cost and standard error around this cost of treatment from a sample of 100 patients. ```r ## Create dummy dataset n_sample <- 100 df <- data.frame(id = c(1:n_sample), cost = rnorm(n_sample, 50, 3)) ``` <table style='width:50%;'> <caption>First 5 patients in dummy dataset</caption> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> cost </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 48.68526 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 50.33959 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 49.67819 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 49.41089 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 42.49260 </td> </tr> </tbody> </table> --- To obtain the standard error, we perform bootstrapping with 500 re-samples. ```r ## Create empty vector to store mean cost in each bootstrap sample boot_mean <- numeric() ## Sample without replacement and calculate mean for each bootstrap sample boot_mean[1] <- mean(sample(df$cost, n_sample, replace = FALSE)) boot_mean[2] <- mean(sample(df$cost, n_sample, replace = FALSE)) ... ## Keep repeating this code 500 times ... boot_mean[500] <- mean(sample(df$cost, n_sample, replace = FALSE)) ``` -- <br /> <center> But wait... </center> -- <br /> <center> After all that copying-and-pasting... </center> -- <br /> <center> You now realized you should have sampled with replacement! </center> --- class: middle <center><img src="https://media.giphy.com/media/NQRRqqkImJ3Da/giphy.gif" height="400"></center> --- # Using functions and loops Wrapping the task you want to run repetitively in a function ```r calculate_boot_mean <- function(data, n, replacement){ mean(sample(data, n, replace = replacement)) } ``` Using a `for` loop to perform task iteratively ```r boot_mean <- numeric() for(i in c(1:500)){ boot_mean[i] <- calculate_boot_mean( data = df$cost, n = n_sample, replacement = TRUE) } ``` -- ✓ Easier to debug and correct error -- __? What is R actually doing in each iteration of the `for` loop?__ --- ## `for` loops ```r boot_mean <- numeric() for(i in c(1:500)){ boot_mean[i] <- calculate_boot_mean( data = df$cost, n = n_sample, replacement = TRUE) } ``` - Re-sizing vector `boot_mean` and re-allocating memory for it - Temporary variable `i` created and gets updated in the global environment <!-- For more complex code, you may have several temporary variables being created within the loop --> -- ```r *boot_mean <- numeric(500) for(i in c(1:500)){ boot_mean[i] <- calculate_boot_mean( data = df$cost, n = n_sample, replacement = TRUE) } ``` ✓ Pre-allocate a vector that fits all values to avoid re-sizing/re-allocating memory <!-- But subject to mispecification of vector size --> --- # Functional programming - A functional is a function that takes a function as an input, applies the function iteratively to each element in a list and returns the results - This can be achieved using functions from the `apply()` family<sup>†</sup> - Runs `for` loop internally in C ```r boot_mean <- lapply(c(1:500), function(i){ calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE) }) ``` .footnote[ [†] Read more about the `apply()` family of functions in [Carlo Fanara | Datacamp tutorial on the R Apply Family](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family). Functional programming can also be implemented using the `map()` family of functions in [`purrr`](https://purrr.tidyverse.org/). More details in [Garrett Grolemund & Hadley Wickham | R for Data Science - Chapter 21 Iteration](https://r4ds.had.co.nz/iteration.html) ] -- ✓ No need to pre-define vector for storing results ✓ Variables in working environment unaffected ✓ Makes parallelisation easy --- __? Is it faster?__ -- ```r boot_mean <- numeric(500) for(i in c(1:500)){ boot_mean[i] <- calculate_boot_mean( data = df$cost, n = n_sample, replacement = TRUE) } ``` Median time taken = 10.16ms ```r boot_mean <- lapply(c(1:500), function(i){ calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE) }) ``` Median time taken = 8.54ms -- ___Sometimes___ --- # Vectorisation - Many operations in R are vectorised - Instead of working on each element of the vector individually (as in the case of loops), it works on the entire vector. ```r vector1 <- c(1:4); vector2 <- c(6:9) ## Element-wise computation sum <- numeric(length(vector1)) for(i in seq_along(vector1)) { sum[i] <- vector1[i] + vector2[i] } ## Vectorised computation vector1 + vector2 ``` .footnote[ 1. [Hadley Wickham | Advanced R - Chapter 24.5 Vectorise](https://adv-r.hadley.nz/perf-improve.html#vectorise) 2. [Patrick Burns | The R Inferno - Chapter 3 Failing to Vectorize](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) ] --- Suppose we want to test in our dummy cost dataset if the cost of treatment is more than £50. ```r ## Element-wise computation lapply(seq_along(df$cost), function(i){df$cost[i] > 50}) ``` Median time taken = 141.59µs ```r ## Vectorised computation df$cost > 50 ``` Median time taken = 1.51µs --- # Parallelisation - Running several processes simultaneously on multiple processors/cores of the computer - Many different packages to achieve this. Some examples: + [parallel](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf)::mclapply() + [future](https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html) + [furrr](https://davisvaughan.github.io/furrr/)::future_map() + [foreach](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html) & [doParallel](https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf) + [snowfall](https://cran.r-project.org/web/packages/snowfall/vignettes/snowfall.pdf) .footnote[ Some useful resources on parallelising 1. [Max Gordan | How-to go parallel in R – basics + tips](http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/) 2. [CRAN Task View: High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html) ] --- # Parallelising your code ```r boot_mean <- furrr::future_map(c(1:500), function(i){ calculate_boot_mean( data = df$cost, n = n_sample, replacement = TRUE) }) ``` Median time taken = 23.65ms (previously 8.54ms) ```r furrr::future_map(seq_along(df$cost), function(i){df$cost[i] > 50}) ``` Median time taken = 12.01ms (previously 0.14159ms) -- __? Parallelising here is actually _SLOWER_?__ --- ### One of the caveats of parallelisation: overhead time<sup>†</sup> - Each parallel process has its own memory space and data/package needs to be loaded across them all - A "start-up" time is required before actual computation occurs - Need to weigh up the cost-benefit of using parallelisation + Usually cost > benefit for smaller task .footnote[ [†] Read more about this in [Imre Gera | Parallelization caveats in R #1: performance issues](https://towardsdatascience.com/parallelization-caveats-in-r-1-the-basics-multiprocessing-and-multithreading-performance-eb584b7e850e) ] --- # Other ways to optimise your code ### Break down the task #### Make your function do as little as possible ```r df_cost <- df$cost boot_mean <- lapply(c(1:500), function(i){ calculate_boot_mean( data = df_cost, n = n_sample, replacement = TRUE) }) ``` Median time taken = 8.1ms (previously 8.54ms) - Dimension reduction --- ### Break down the task #### Do pre-calculations e.g. Very simple simulation model (finite combination of states) predicting the state patient is in, and performing costs and QALYs calculation in each cycle - Pre-calculate costs/QALYs for all possible combinations of states - Apply pre-calculations after simulation -- ### Using a faster language Sometimes R is just slow. But R has interfaces to other faster languages via packages. - C++: [Rcpp](http://www.rcpp.org/) [probably the most popular] - Python: rPython - Java: rJava --- # Case study with a markov model ### Acknowledgements <img src="https://avatars0.githubusercontent.com/u/57440859?s=200&v=4" height="150" align="right" /> Materials used in this case study is based on the work conducted as part of the 2019 R for Health Economics Hackathon. The team behind this work comprise of: - Howard Thom, University of Bristol - Iryna Schlackow, University of Oxford - Mi Jun Keng, University of Oxford - Robert Ashton, Imperial College London - Sam Abbott, London School of Hygiene and Tropical Medicine - Amy Chang, University of Sheffield - Han Fu, Imperial College London - Houra Haghpanahan, University of Glasgow - Zoe Turner, Nottinghamshire Healthcare NHS Foundation Trust The R code for the original work is accessible from [GitHub](https://github.com/HealthEconomicsHackathon/hermes6). .footnote[ For an implementation of this work, see Sam Abbott's package [`SpeedyMarkov`](https://github.com/seabbs/SpeedyMarkov) (presentation on 12th October 0910-0935) ] --- Markov model with 10 states (time-homogeneous; all transitions permitted) <img src="image/markov-model.png" height="400" /> - `n.treatments = 2` | no. of treatment arms - `n.samples = 25000` | no. of PSA samples - `n.cyles = 100` | number of cycles to run --- Extract of code for Markov model (minimal, non-executable) ```r # Loop over the treatment options *for(i.treatment in 1:n.treatments){ # Extract transition matrix for treatment transition.matrices_tr <- transition.matrices[i.treatment,,,] # Loop over the PSA samples * for(i.sample in 1:n.samples){ transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,] # Loop over the cycles * for(i.cycle in 2:n.cycles){ # Markov update cohort.vectors[i.treatment, i.sample,i.cycle,]<- cohort.vectors[i.treatment, i.sample,i.cycle-1,] %*% transition.matrices_tr_sample } cohort.vectors_tr_sample <- cohort.vectors[i.treatment,i.sample,,] } } ``` --- class: center, middle <table> <thead> <tr> <th style="text-align:left;"> Tx </th> <th style="text-align:left;"> PSA sample </th> <th style="text-align:left;"> Cycle </th> <th style="text-align:right;"> Median time across 20 iterations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 16.1s </td> </tr> </tbody> </table> --- ### `lapply()` over treatments ```r # Functional loop over treatment options *lapply(c(1:n.treatments), function(i.treatment){ transition.matrices_tr <- transition.matrices[i.treatment,,,] ## Dimension reduction cohort.vectors_tr <- cohort.vectors[i.treatment,,,] for(i.sample in 1:n.samples){ transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,] ## Dimension reduction cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,] for(i.cycle in 2:n.cycles){ cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample } } }) ``` --- class: center, middle <table> <thead> <tr> <th style="text-align:left;"> Tx </th> <th style="text-align:left;"> PSA sample </th> <th style="text-align:left;"> Cycle </th> <th style="text-align:right;"> Median time across 20 iterations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 16.1s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 11.3s </td> </tr> </tbody> </table> --- ### `lapply()` over PSA samples ```r for(i.treatment in c(1:n.treatments)){ transition.matrices_tr <- transition.matrices[i.treatment,,,] cohort.vectors_tr <- cohort.vectors[i.treatment,,,] # Functional loop over PSA samples * lapply(c(1:n.samples), function(i.sample){ transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,] cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,] for(i.cycle in 2:n.cycles){ cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample } }) } ``` --- class: center, middle <table> <thead> <tr> <th style="text-align:left;"> Tx </th> <th style="text-align:left;"> PSA sample </th> <th style="text-align:left;"> Cycle </th> <th style="text-align:right;"> Median time across 20 iterations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 16.1s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 11.3s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.3s </td> </tr> </tbody> </table> --- ### Parallelise over PSA samples ```r for(i.treatment in c(1:n.treatments)){ transition.matrices_tr <- transition.matrices[i.treatment,,,] cohort.vectors_tr <- cohort.vectors[i.treatment,,,] # Parallelised over PSA samples * furrr::future_map(c(1:n.samples), function(i.sample){ transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,] cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,] for(i.cycle in 2:n.cycles){ cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample } }) } ``` --- class: center, middle <table> <thead> <tr> <th style="text-align:left;"> Tx </th> <th style="text-align:left;"> PSA sample </th> <th style="text-align:left;"> Cycle </th> <th style="text-align:right;"> Median time across 20 iterations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 16.1s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 11.3s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.3s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: red !important;" >parallel</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.8s </td> </tr> </tbody> </table> --- ### `lapply()` over treatments & vectorised operation over samples ```r # Functional loop over the treatment options *lapply(c(1:n.treatments), function(i.treatment){ transition.matrices_tr <- transition.matrices[i.treatment,,,] cohort.vectors_tr <- cohort.vectors[i.treatment,,,] for(i.cycle in 2:n.cycles){ ## Reformulated such that vectorised operation over PSA samples * for(i.state in 1:n.states){ * cohort.vectors_tr[ ,i.cycle,i.state]<- * rowSums(cohort.vectors_tr[ ,i.cycle-1, ] * * transition.matrices_tr[,,i.state]) * } } }) ``` --- class: center, middle <table> <thead> <tr> <th style="text-align:left;"> Tx </th> <th style="text-align:left;"> PSA sample </th> <th style="text-align:left;"> Cycle </th> <th style="text-align:right;"> Median time across 20 iterations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 16.06s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 11.28s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.27s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: red !important;" >parallel</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.79s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: red !important;" >vectorised</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 9.21s </td> </tr> </tbody> </table> --- ### `lapply()` over treatments & `rcpp_loop()` over cycles ```r # Functional loop over the treatment options *lapply(c(1:n.treatments), function(i.treatment){ transition.matrices_tr <- transition.matrices[i.treatment,,,] cohort.vectors_tr <- cohort.vectors[i.treatment,,,] for(i.sample in 1:n.samples){ transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,] cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,] # Loop over the cycles using Rcpp * cohort.vectors_tr_sample <- * rcpp_loop(mat_in = cohort.vectors_tr_sample, * transition = transition.matrices_tr_sample, * n = n.cycles) * } }) ``` --- class: center, middle <table> <thead> <tr> <th style="text-align:left;"> Tx </th> <th style="text-align:left;"> PSA sample </th> <th style="text-align:left;"> Cycle </th> <th style="text-align:right;"> Median time across 20 iterations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 16.06s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 11.28s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.27s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: red !important;" >parallel</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 14.79s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: red !important;" >vectorised</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:right;"> 9.21s </td> </tr> <tr> <td style="text-align:left;"> <span style=" color: blue !important;" >lapply</span> </td> <td style="text-align:left;"> <span style=" color: black !important;" >for loop</span> </td> <td style="text-align:left;"> <span style=" color: red !important;" >rcpp</span> </td> <td style="text-align:right;"> 5.85s </td> </tr> </tbody> </table> --- class: center ### How to optimise? Functionals | Vectorising | Parallelising etc. -- ## What should you optimise? Always aim for readability | Profile your code to find bottlenecks<sup>†</sup> .footnote[ [†] One way to easily profile your code is with [`profvis`](https://rstudio.github.io/profvis/). More about code profiling in [Hadley Wickham | Advanced R - Chapter 23 Measuring performance](https://adv-r.hadley.nz/perf-measure.html) ] -- # Should you even optimise at all? First make it work correctly. Then, perhaps. -- <blockquote><i> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. — Donald Knuth </i></blockquote> --- # References <p><cite>Gillespie, C. and R. Lovelace (2017). <em>Efficient R Programming : A Practical Guide to Smarter Programming</em>. Eng. First. Sebastopol, CA: O'Reilly. ISBN: 978-1-4919-5078-4.</cite></p> <p><cite>Hadley Wickham and Garrett Grolemund (2016). <em>R for Data Science</em>. Eng. 1st. O'Reilly Media, Inc..</cite></p> <p><cite>Wickham, H. (2019). <em>Advanced R</em>. 2nd edition.. Chapman & Hall/CRC the r Series. Boca Raton. ISBN: 978-1-351-20129-2.</cite></p>