Optimizing your code in R

# Optimizing your code in R
### Mi Jun Keng <a href="mailto:mijun.keng@ndph.ox.ac.uk" class="email">mijun.keng@ndph.ox.ac.uk</a>
### Health Economics Research Centre (HERC), University of Oxford
### 9th October 2020

---

# Why optimise?

Many stochastic elements in health economics analysis such as bootstrapping, probabilistic sensitivity analyses, Monte-Carlo simulations etc.

- Repetitive / messy code
- Time consuming to run
- Memory issues

Optimising your code may resolve these problems. Also makes it easier to correct error, and makes your code neater and more readable.

---

Consider a dummy example where we want to estimate the average cost and standard error around this cost of treatment from a sample of 100 patients.

```r
## Create dummy dataset
n_sample <- 100

df <- data.frame(id = c(1:n_sample), 
                 cost = rnorm(n_sample, 50, 3))
```

<table style='width:50%;'>
<caption>First 5 patients in dummy dataset</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> id </th>
   <th style="text-align:right;"> cost </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 48.68526 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 50.33959 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 49.67819 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 49.41089 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 42.49260 </td>
  </tr>
</tbody>
</table>

---

To obtain the standard error, we perform bootstrapping with 500 re-samples.

```r
## Create empty vector to store mean cost in each bootstrap sample
boot_mean <- numeric()

## Sample without replacement and calculate mean for each bootstrap sample 
boot_mean[1] <- mean(sample(df$cost, n_sample, replace = FALSE)) 
boot_mean[2] <- mean(sample(df$cost, n_sample, replace = FALSE)) 
... 
## Keep repeating this code 500 times
...
boot_mean[500] <- mean(sample(df$cost, n_sample, replace = FALSE)) 
```

--
<br />
<center> But wait... </center>
--
<br />
<center> After all that copying-and-pasting... </center>
--
<br />
<center> You now realized you should have sampled with replacement! </center>

---
class: middle

---

# Using functions and loops

Wrapping the task you want to run repetitively in a function

```r
calculate_boot_mean <- function(data, n, replacement){
  mean(sample(data, n, replace = replacement)) 
}
```

Using a `for` loop to perform task iteratively

```r
boot_mean <- numeric()

for(i in c(1:500)){
  boot_mean[i] <- calculate_boot_mean(
    data = df$cost, n = n_sample, replacement = TRUE)
}
```

✓ Easier to debug and correct error

__? What is R actually doing in each iteration of the `for` loop?__

---

## `for` loops

```r
boot_mean <- numeric()

for(i in c(1:500)){
  boot_mean[i] <- calculate_boot_mean(
    data = df$cost, n = n_sample, replacement = TRUE)
}
```

- Re-sizing vector `boot_mean` and re-allocating memory for it
- Temporary variable `i` created and gets updated in the global environment

```r
*boot_mean <- numeric(500)

for(i in c(1:500)){
  boot_mean[i] <- calculate_boot_mean(
    data = df$cost, n = n_sample, replacement = TRUE)
}
```

✓ Pre-allocate a vector that fits all values to avoid re-sizing/re-allocating memory

---

# Functional programming

- A functional is a function that takes a function as an input, applies the function iteratively to each element in a list and returns the results
- This can be achieved using functions from the `apply()` family<sup>†</sup>
- Runs `for` loop internally in C

```r
boot_mean <- lapply(c(1:500), function(i){
  calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE)
})
```

.footnote[
[†] Read more about the `apply()` family of functions in [Carlo Fanara | Datacamp tutorial on the R Apply Family](https://www.datacamp.com/community/tutorials/r-tutorial-apply-family). Functional programming can also be implemented using the `map()` family of functions in [`purrr`](https://purrr.tidyverse.org/). More details in [Garrett Grolemund & Hadley Wickham | R for Data Science - Chapter 21 Iteration](https://r4ds.had.co.nz/iteration.html)
]

✓ No need to pre-define vector for storing results

✓ Variables in working environment unaffected

✓ Makes parallelisation easy

---

__? Is it faster?__

```r
boot_mean <- numeric(500)

for(i in c(1:500)){
  boot_mean[i] <- calculate_boot_mean(
    data = df$cost, n = n_sample, replacement = TRUE)
}
```

Median time taken = 10.16ms

```r
boot_mean <- lapply(c(1:500), function(i){
  calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE)
})
```

Median time taken = 8.54ms

___Sometimes___

---

# Vectorisation

- Many operations in R are vectorised 
- Instead of working on each element of the vector individually (as in the case of loops), it works on the entire vector.

```r
vector1 <- c(1:4); vector2 <- c(6:9)

## Element-wise computation
sum <- numeric(length(vector1))
for(i in seq_along(vector1)) {
  sum[i] <- vector1[i] + vector2[i]
}

## Vectorised computation 
vector1 + vector2
```

.footnote[
1. [Hadley Wickham | Advanced R - Chapter 24.5 Vectorise](https://adv-r.hadley.nz/perf-improve.html#vectorise)
2. [Patrick Burns | The R Inferno - Chapter 3 Failing to Vectorize](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf)
]

---

Suppose we want to test in our dummy cost dataset if the cost of treatment is more than £50.

```r
## Element-wise computation
lapply(seq_along(df$cost), function(i){df$cost[i] > 50})
```
Median time taken = 141.59µs

```r
## Vectorised computation 
df$cost > 50
```
Median time taken = 1.51µs

---

# Parallelisation

- Running several processes simultaneously on multiple processors/cores of the computer
- Many different packages to achieve this. Some examples:
  + [parallel](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf)::mclapply()
  + [future](https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html)
      + [furrr](https://davisvaughan.github.io/furrr/)::future_map()
  + [foreach](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html) & [doParallel](https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf)
  + [snowfall](https://cran.r-project.org/web/packages/snowfall/vignettes/snowfall.pdf)

.footnote[
Some useful resources on parallelising
1. [Max Gordan | How-to go parallel in R – basics + tips](http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/)
2. [CRAN Task View: High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
]

---

# Parallelising your code

```r
boot_mean <- furrr::future_map(c(1:500), function(i){
  calculate_boot_mean(
    data = df$cost, n = n_sample, replacement = TRUE)
})
```

Median time taken = 23.65ms (previously 8.54ms)

```r
furrr::future_map(seq_along(df$cost), function(i){df$cost[i] > 50})
```

Median time taken = 12.01ms (previously 0.14159ms)

__? Parallelising here is actually _SLOWER_?__

---

### One of the caveats of parallelisation: overhead time<sup>†</sup> 
- Each parallel process has its own memory space and data/package needs to be loaded across them all
- A "start-up" time is required before actual computation occurs
- Need to weigh up the cost-benefit of using parallelisation
  + Usually cost > benefit for smaller task

.footnote[
[†] Read more about this in [Imre Gera | Parallelization caveats in R #1: performance issues](https://towardsdatascience.com/parallelization-caveats-in-r-1-the-basics-multiprocessing-and-multithreading-performance-eb584b7e850e)
]

---

# Other ways to optimise your code

### Break down the task

#### Make your function do as little as possible

```r
df_cost <- df$cost

boot_mean <- lapply(c(1:500), function(i){
  calculate_boot_mean(
    data = df_cost, n = n_sample, replacement = TRUE)
})
```

Median time taken = 8.1ms (previously 8.54ms)

- Dimension reduction

---

### Break down the task

#### Do pre-calculations

e.g. Very simple simulation model (finite combination of states) predicting the state patient is in, and performing costs and QALYs calculation in each cycle

- Pre-calculate costs/QALYs for all possible combinations of states 
- Apply pre-calculations after simulation

### Using a faster language

Sometimes R is just slow. But R has interfaces to other faster languages via packages.

- C++: [Rcpp](http://www.rcpp.org/) [probably the most popular]
- Python: rPython
- Java: rJava

---

# Case study with a markov model

### Acknowledgements

Materials used in this case study is based on the work conducted as part of the 2019 R for Health Economics Hackathon. The team behind this work comprise of:

- Howard Thom, University of Bristol
- Iryna Schlackow, University of Oxford
- Mi Jun Keng, University of Oxford
- Robert Ashton, Imperial College London
- Sam Abbott, London School of Hygiene and Tropical Medicine
- Amy Chang, University of Sheffield
- Han Fu, Imperial College London
- Houra Haghpanahan, University of Glasgow
- Zoe Turner, Nottinghamshire Healthcare NHS Foundation Trust

The R code for the original work is accessible from [GitHub](https://github.com/HealthEconomicsHackathon/hermes6).

.footnote[
For an implementation of this work, see Sam Abbott's package [`SpeedyMarkov`](https://github.com/seabbs/SpeedyMarkov) (presentation on 12th October 0910-0935)
]

---

Markov model with 10 states (time-homogeneous; all transitions permitted)

- `n.treatments = 2` | no. of treatment arms 
- `n.samples = 25000` | no. of PSA samples
- `n.cyles = 100` | number of cycles to run

---

Extract of code for Markov model (minimal, non-executable)

```r
# Loop over the treatment options
*for(i.treatment in 1:n.treatments){
  
  # Extract transition matrix for treatment
  transition.matrices_tr <- transition.matrices[i.treatment,,,]
    
  # Loop over the PSA samples
* for(i.sample in 1:n.samples){
    
    transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
    
    # Loop over the cycles
*   for(i.cycle in 2:n.cycles){
      
      # Markov update  
      cohort.vectors[i.treatment, i.sample,i.cycle,]<-
        cohort.vectors[i.treatment, i.sample,i.cycle-1,] %*%
        transition.matrices_tr_sample
    }
    
    cohort.vectors_tr_sample <- cohort.vectors[i.treatment,i.sample,,]
  }
}
```

---

---

### `lapply()` over treatments

```r
# Functional loop over treatment options
*lapply(c(1:n.treatments), function(i.treatment){
  
  transition.matrices_tr <- transition.matrices[i.treatment,,,]
  ## Dimension reduction 
  cohort.vectors_tr <- cohort.vectors[i.treatment,,,] 
  
  for(i.sample in 1:n.samples){ 
    
    transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
    ## Dimension reduction 
    cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,] 
    
    for(i.cycle in 2:n.cycles){ 
      
      cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample
    }
  }
})
```

---
class: center, middle

---

### `lapply()` over PSA samples

```r
for(i.treatment in c(1:n.treatments)){ 
  
  transition.matrices_tr <- transition.matrices[i.treatment,,,]
  cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
  
  # Functional loop over PSA samples
* lapply(c(1:n.samples), function(i.sample){
    
    transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
    cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]

for(i.cycle in 2:n.cycles){ 
      
      cohort.vectors_tr_sample[i.cycle,] <-
        cohort.vectors_tr_sample[i.cycle-1,] %*%
        transition.matrices_tr_sample
    }
  }) 
}
```

---
class: center, middle

---

### Parallelise over PSA samples

```r
for(i.treatment in c(1:n.treatments)){

transition.matrices_tr <- transition.matrices[i.treatment,,,]
  cohort.vectors_tr <- cohort.vectors[i.treatment,,,]

# Parallelised over PSA samples
* furrr::future_map(c(1:n.samples), function(i.sample){
  
    transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
    cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]

for(i.cycle in 2:n.cycles){ 
      
      cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample
    }
  }) 
}
```

---
class: center, middle

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Tx </th>
   <th style="text-align:left;"> PSA sample </th>
   <th style="text-align:left;"> Cycle </th>
   <th style="text-align:right;"> Median time across 20 iterations </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 16.1s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 11.3s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 14.3s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;" >parallel</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 14.8s </td>
  </tr>
</tbody>
</table>

---

### `lapply()` over treatments & vectorised operation over samples

for(i.cycle in 2:n.cycles){
    
    ## Reformulated such that vectorised operation over PSA samples
*   for(i.state in 1:n.states){
*      cohort.vectors_tr[ ,i.cycle,i.state]<-
*        rowSums(cohort.vectors_tr[ ,i.cycle-1, ] *
*                  transition.matrices_tr[,,i.state])
*   }
  }
})
```

---

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Tx </th>
   <th style="text-align:left;"> PSA sample </th>
   <th style="text-align:left;"> Cycle </th>
   <th style="text-align:right;"> Median time across 20 iterations </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 16.06s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 11.28s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 14.27s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;" >parallel</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 14.79s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;" >vectorised</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 9.21s </td>
  </tr>
</tbody>
</table>

---

### `lapply()` over treatments & `rcpp_loop()` over cycles

```r
# Functional loop over the treatment options
*lapply(c(1:n.treatments), function(i.treatment){
  
  transition.matrices_tr <- transition.matrices[i.treatment,,,]
  cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
  
  for(i.sample in 1:n.samples){ 
    
    transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
    
    cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]
    
    # Loop over the cycles using Rcpp
*   cohort.vectors_tr_sample <-
*     rcpp_loop(mat_in = cohort.vectors_tr_sample,
*               transition = transition.matrices_tr_sample,
*               n = n.cycles)
* }
})
```

---

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Tx </th>
   <th style="text-align:left;"> PSA sample </th>
   <th style="text-align:left;"> Cycle </th>
   <th style="text-align:right;"> Median time across 20 iterations </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 16.06s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 11.28s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 14.27s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;" >parallel</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 14.79s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;" >vectorised</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:right;"> 9.21s </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <span style="     color: blue !important;" >lapply</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;" >for loop</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;" >rcpp</span> </td>
   <td style="text-align:right;"> 5.85s </td>
  </tr>
</tbody>
</table>

---

### How to optimise?

Functionals | Vectorising | Parallelising etc.

## What should you optimise?

Always aim for readability | Profile your code to find bottlenecks<sup>†</sup>

[†] One way to easily profile your code is with [`profvis`](https://rstudio.github.io/profvis/). More about code profiling in [Hadley Wickham | Advanced R - Chapter 23 Measuring performance](https://adv-r.hadley.nz/perf-measure.html)

]

# Should you even optimise at all?

First make it work correctly. Then, perhaps.

<blockquote><i>
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. — Donald Knuth 
</i></blockquote>

---

# References

<p><cite>Gillespie, C. and R. Lovelace
(2017).
<em>Efficient R Programming : A Practical Guide to Smarter Programming</em>.
Eng.
First.
Sebastopol, CA: O'Reilly.
ISBN: 978-1-4919-5078-4.</cite></p>

<p><cite>Hadley Wickham and Garrett Grolemund
(2016).
<em>R for Data Science</em>.
Eng.
1st.
O'Reilly Media, Inc..</cite></p>

<p><cite>Wickham, H.
(2019).
<em>Advanced R</em>.
2nd edition..
Chapman &amp; Hall/CRC the r Series.
Boca Raton.
ISBN: 978-1-351-20129-2.</cite></p>