+ - 0:00:00
Notes for current slide
Notes for next slide

Optimizing your code in R

Mi Jun Keng

Health Economics Research Centre (HERC), University of Oxford

9th October 2020

1 / 32

Why optimise?

Many stochastic elements in health economics analysis such as bootstrapping, probabilistic sensitivity analyses, Monte-Carlo simulations etc.

  • Repetitive / messy code
  • Time consuming to run
  • Memory issues
2 / 32

Why optimise?

Many stochastic elements in health economics analysis such as bootstrapping, probabilistic sensitivity analyses, Monte-Carlo simulations etc.

  • Repetitive / messy code
  • Time consuming to run
  • Memory issues

Optimising your code may resolve these problems. Also makes it easier to correct error, and makes your code neater and more readable.

2 / 32

Consider a dummy example where we want to estimate the average cost and standard error around this cost of treatment from a sample of 100 patients.

## Create dummy dataset
n_sample <- 100
df <- data.frame(id = c(1:n_sample),
cost = rnorm(n_sample, 50, 3))
First 5 patients in dummy dataset
id cost
1 48.68526
2 50.33959
3 49.67819
4 49.41089
5 42.49260
3 / 32

To obtain the standard error, we perform bootstrapping with 500 re-samples.

## Create empty vector to store mean cost in each bootstrap sample
boot_mean <- numeric()
## Sample without replacement and calculate mean for each bootstrap sample
boot_mean[1] <- mean(sample(df$cost, n_sample, replace = FALSE))
boot_mean[2] <- mean(sample(df$cost, n_sample, replace = FALSE))
...
## Keep repeating this code 500 times
...
boot_mean[500] <- mean(sample(df$cost, n_sample, replace = FALSE))
4 / 32

To obtain the standard error, we perform bootstrapping with 500 re-samples.

## Create empty vector to store mean cost in each bootstrap sample
boot_mean <- numeric()
## Sample without replacement and calculate mean for each bootstrap sample
boot_mean[1] <- mean(sample(df$cost, n_sample, replace = FALSE))
boot_mean[2] <- mean(sample(df$cost, n_sample, replace = FALSE))
...
## Keep repeating this code 500 times
...
boot_mean[500] <- mean(sample(df$cost, n_sample, replace = FALSE))


But wait...
4 / 32

To obtain the standard error, we perform bootstrapping with 500 re-samples.

## Create empty vector to store mean cost in each bootstrap sample
boot_mean <- numeric()
## Sample without replacement and calculate mean for each bootstrap sample
boot_mean[1] <- mean(sample(df$cost, n_sample, replace = FALSE))
boot_mean[2] <- mean(sample(df$cost, n_sample, replace = FALSE))
...
## Keep repeating this code 500 times
...
boot_mean[500] <- mean(sample(df$cost, n_sample, replace = FALSE))


But wait...

After all that copying-and-pasting...
4 / 32

To obtain the standard error, we perform bootstrapping with 500 re-samples.

## Create empty vector to store mean cost in each bootstrap sample
boot_mean <- numeric()
## Sample without replacement and calculate mean for each bootstrap sample
boot_mean[1] <- mean(sample(df$cost, n_sample, replace = FALSE))
boot_mean[2] <- mean(sample(df$cost, n_sample, replace = FALSE))
...
## Keep repeating this code 500 times
...
boot_mean[500] <- mean(sample(df$cost, n_sample, replace = FALSE))


But wait...

After all that copying-and-pasting...

You now realized you should have sampled with replacement!
4 / 32
5 / 32

Using functions and loops

Wrapping the task you want to run repetitively in a function

calculate_boot_mean <- function(data, n, replacement){
mean(sample(data, n, replace = replacement))
}

Using a for loop to perform task iteratively

boot_mean <- numeric()
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}
6 / 32

Using functions and loops

Wrapping the task you want to run repetitively in a function

calculate_boot_mean <- function(data, n, replacement){
mean(sample(data, n, replace = replacement))
}

Using a for loop to perform task iteratively

boot_mean <- numeric()
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}

✓ Easier to debug and correct error

6 / 32

Using functions and loops

Wrapping the task you want to run repetitively in a function

calculate_boot_mean <- function(data, n, replacement){
mean(sample(data, n, replace = replacement))
}

Using a for loop to perform task iteratively

boot_mean <- numeric()
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}

✓ Easier to debug and correct error

? What is R actually doing in each iteration of the for loop?

6 / 32

for loops

boot_mean <- numeric()
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}
  • Re-sizing vector boot_mean and re-allocating memory for it
  • Temporary variable i created and gets updated in the global environment
7 / 32

for loops

boot_mean <- numeric()
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}
  • Re-sizing vector boot_mean and re-allocating memory for it
  • Temporary variable i created and gets updated in the global environment
boot_mean <- numeric(500)
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}

✓ Pre-allocate a vector that fits all values to avoid re-sizing/re-allocating memory

7 / 32

Functional programming

  • A functional is a function that takes a function as an input, applies the function iteratively to each element in a list and returns the results
  • This can be achieved using functions from the apply() family
  • Runs for loop internally in C
boot_mean <- lapply(c(1:500), function(i){
calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE)
})

[†] Read more about the apply() family of functions in Carlo Fanara | Datacamp tutorial on the R Apply Family. Functional programming can also be implemented using the map() family of functions in purrr. More details in Garrett Grolemund & Hadley Wickham | R for Data Science - Chapter 21 Iteration

8 / 32

Functional programming

  • A functional is a function that takes a function as an input, applies the function iteratively to each element in a list and returns the results
  • This can be achieved using functions from the apply() family
  • Runs for loop internally in C
boot_mean <- lapply(c(1:500), function(i){
calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE)
})

[†] Read more about the apply() family of functions in Carlo Fanara | Datacamp tutorial on the R Apply Family. Functional programming can also be implemented using the map() family of functions in purrr. More details in Garrett Grolemund & Hadley Wickham | R for Data Science - Chapter 21 Iteration

✓ No need to pre-define vector for storing results

✓ Variables in working environment unaffected

✓ Makes parallelisation easy

8 / 32

? Is it faster?

9 / 32

? Is it faster?

boot_mean <- numeric(500)
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}

Median time taken = 10.16ms

boot_mean <- lapply(c(1:500), function(i){
calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE)
})

Median time taken = 8.54ms

9 / 32

? Is it faster?

boot_mean <- numeric(500)
for(i in c(1:500)){
boot_mean[i] <- calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
}

Median time taken = 10.16ms

boot_mean <- lapply(c(1:500), function(i){
calculate_boot_mean(data = df$cost, n = n_sample, replacement = TRUE)
})

Median time taken = 8.54ms

Sometimes

9 / 32

Vectorisation

  • Many operations in R are vectorised
  • Instead of working on each element of the vector individually (as in the case of loops), it works on the entire vector.
vector1 <- c(1:4); vector2 <- c(6:9)
## Element-wise computation
sum <- numeric(length(vector1))
for(i in seq_along(vector1)) {
sum[i] <- vector1[i] + vector2[i]
}
## Vectorised computation
vector1 + vector2
10 / 32

Suppose we want to test in our dummy cost dataset if the cost of treatment is more than £50.

## Element-wise computation
lapply(seq_along(df$cost), function(i){df$cost[i] > 50})

Median time taken = 141.59µs

## Vectorised computation
df$cost > 50

Median time taken = 1.51µs

11 / 32

Parallelisation

  • Running several processes simultaneously on multiple processors/cores of the computer
  • Many different packages to achieve this. Some examples:
12 / 32

Parallelising your code

boot_mean <- furrr::future_map(c(1:500), function(i){
calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
})

Median time taken = 23.65ms (previously 8.54ms)

furrr::future_map(seq_along(df$cost), function(i){df$cost[i] > 50})

Median time taken = 12.01ms (previously 0.14159ms)

13 / 32

Parallelising your code

boot_mean <- furrr::future_map(c(1:500), function(i){
calculate_boot_mean(
data = df$cost, n = n_sample, replacement = TRUE)
})

Median time taken = 23.65ms (previously 8.54ms)

furrr::future_map(seq_along(df$cost), function(i){df$cost[i] > 50})

Median time taken = 12.01ms (previously 0.14159ms)

? Parallelising here is actually SLOWER?

13 / 32

One of the caveats of parallelisation: overhead time

  • Each parallel process has its own memory space and data/package needs to be loaded across them all
  • A "start-up" time is required before actual computation occurs
  • Need to weigh up the cost-benefit of using parallelisation
    • Usually cost > benefit for smaller task
14 / 32

Other ways to optimise your code

Break down the task

Make your function do as little as possible

df_cost <- df$cost
boot_mean <- lapply(c(1:500), function(i){
calculate_boot_mean(
data = df_cost, n = n_sample, replacement = TRUE)
})

Median time taken = 8.1ms (previously 8.54ms)

  • Dimension reduction
15 / 32

Break down the task

Do pre-calculations

e.g. Very simple simulation model (finite combination of states) predicting the state patient is in, and performing costs and QALYs calculation in each cycle

  • Pre-calculate costs/QALYs for all possible combinations of states
  • Apply pre-calculations after simulation
16 / 32

Break down the task

Do pre-calculations

e.g. Very simple simulation model (finite combination of states) predicting the state patient is in, and performing costs and QALYs calculation in each cycle

  • Pre-calculate costs/QALYs for all possible combinations of states
  • Apply pre-calculations after simulation

Using a faster language

Sometimes R is just slow. But R has interfaces to other faster languages via packages.

  • C++: Rcpp [probably the most popular]
  • Python: rPython
  • Java: rJava
16 / 32

Case study with a markov model

Acknowledgements

Materials used in this case study is based on the work conducted as part of the 2019 R for Health Economics Hackathon. The team behind this work comprise of:

  • Howard Thom, University of Bristol
  • Iryna Schlackow, University of Oxford
  • Mi Jun Keng, University of Oxford
  • Robert Ashton, Imperial College London
  • Sam Abbott, London School of Hygiene and Tropical Medicine
  • Amy Chang, University of Sheffield
  • Han Fu, Imperial College London
  • Houra Haghpanahan, University of Glasgow
  • Zoe Turner, Nottinghamshire Healthcare NHS Foundation Trust

The R code for the original work is accessible from GitHub.

For an implementation of this work, see Sam Abbott's package SpeedyMarkov (presentation on 12th October 0910-0935)

17 / 32

Markov model with 10 states (time-homogeneous; all transitions permitted)

  • n.treatments = 2 | no. of treatment arms
  • n.samples = 25000 | no. of PSA samples
  • n.cyles = 100 | number of cycles to run
18 / 32

Extract of code for Markov model (minimal, non-executable)

# Loop over the treatment options
for(i.treatment in 1:n.treatments){
# Extract transition matrix for treatment
transition.matrices_tr <- transition.matrices[i.treatment,,,]
# Loop over the PSA samples
for(i.sample in 1:n.samples){
transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
# Loop over the cycles
for(i.cycle in 2:n.cycles){
# Markov update
cohort.vectors[i.treatment, i.sample,i.cycle,]<-
cohort.vectors[i.treatment, i.sample,i.cycle-1,] %*%
transition.matrices_tr_sample
}
cohort.vectors_tr_sample <- cohort.vectors[i.treatment,i.sample,,]
}
}
19 / 32
Tx PSA sample Cycle Median time across 20 iterations
for loop for loop for loop 16.1s
20 / 32

lapply() over treatments

# Functional loop over treatment options
lapply(c(1:n.treatments), function(i.treatment){
transition.matrices_tr <- transition.matrices[i.treatment,,,]
## Dimension reduction
cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
for(i.sample in 1:n.samples){
transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
## Dimension reduction
cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]
for(i.cycle in 2:n.cycles){
cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample
}
}
})
21 / 32
Tx PSA sample Cycle Median time across 20 iterations
for loop for loop for loop 16.1s
lapply for loop for loop 11.3s
22 / 32

lapply() over PSA samples

for(i.treatment in c(1:n.treatments)){
transition.matrices_tr <- transition.matrices[i.treatment,,,]
cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
# Functional loop over PSA samples
lapply(c(1:n.samples), function(i.sample){
transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]
for(i.cycle in 2:n.cycles){
cohort.vectors_tr_sample[i.cycle,] <-
cohort.vectors_tr_sample[i.cycle-1,] %*%
transition.matrices_tr_sample
}
})
}
23 / 32
Tx PSA sample Cycle Median time across 20 iterations
for loop for loop for loop 16.1s
lapply for loop for loop 11.3s
for loop lapply for loop 14.3s
24 / 32

Parallelise over PSA samples

for(i.treatment in c(1:n.treatments)){
transition.matrices_tr <- transition.matrices[i.treatment,,,]
cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
# Parallelised over PSA samples
furrr::future_map(c(1:n.samples), function(i.sample){
transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]
for(i.cycle in 2:n.cycles){
cohort.vectors_tr_sample[i.cycle,] <- cohort.vectors_tr_sample[i.cycle-1,] %*% transition.matrices_tr_sample
}
})
}
25 / 32
Tx PSA sample Cycle Median time across 20 iterations
for loop for loop for loop 16.1s
lapply for loop for loop 11.3s
for loop lapply for loop 14.3s
for loop parallel for loop 14.8s
26 / 32

lapply() over treatments & vectorised operation over samples

# Functional loop over the treatment options
lapply(c(1:n.treatments), function(i.treatment){
transition.matrices_tr <- transition.matrices[i.treatment,,,]
cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
for(i.cycle in 2:n.cycles){
## Reformulated such that vectorised operation over PSA samples
for(i.state in 1:n.states){
cohort.vectors_tr[ ,i.cycle,i.state]<-
rowSums(cohort.vectors_tr[ ,i.cycle-1, ] *
transition.matrices_tr[,,i.state])
}
}
})
27 / 32
Tx PSA sample Cycle Median time across 20 iterations
for loop for loop for loop 16.06s
lapply for loop for loop 11.28s
for loop lapply for loop 14.27s
for loop parallel for loop 14.79s
lapply vectorised for loop 9.21s
28 / 32

lapply() over treatments & rcpp_loop() over cycles

# Functional loop over the treatment options
lapply(c(1:n.treatments), function(i.treatment){
transition.matrices_tr <- transition.matrices[i.treatment,,,]
cohort.vectors_tr <- cohort.vectors[i.treatment,,,]
for(i.sample in 1:n.samples){
transition.matrices_tr_sample <- transition.matrices_tr[i.sample,,]
cohort.vectors_tr_sample <- cohort.vectors_tr[i.sample,,]
# Loop over the cycles using Rcpp
cohort.vectors_tr_sample <-
rcpp_loop(mat_in = cohort.vectors_tr_sample,
transition = transition.matrices_tr_sample,
n = n.cycles)
}
})
29 / 32
Tx PSA sample Cycle Median time across 20 iterations
for loop for loop for loop 16.06s
lapply for loop for loop 11.28s
for loop lapply for loop 14.27s
for loop parallel for loop 14.79s
lapply vectorised for loop 9.21s
lapply for loop rcpp 5.85s
30 / 32

How to optimise?

Functionals | Vectorising | Parallelising etc.

31 / 32

How to optimise?

Functionals | Vectorising | Parallelising etc.

What should you optimise?

Always aim for readability | Profile your code to find bottlenecks

[†] One way to easily profile your code is with profvis. More about code profiling in Hadley Wickham | Advanced R - Chapter 23 Measuring performance

31 / 32

How to optimise?

Functionals | Vectorising | Parallelising etc.

What should you optimise?

Always aim for readability | Profile your code to find bottlenecks

[†] One way to easily profile your code is with profvis. More about code profiling in Hadley Wickham | Advanced R - Chapter 23 Measuring performance

Should you even optimise at all?

First make it work correctly. Then, perhaps.

31 / 32

How to optimise?

Functionals | Vectorising | Parallelising etc.

What should you optimise?

Always aim for readability | Profile your code to find bottlenecks

[†] One way to easily profile your code is with profvis. More about code profiling in Hadley Wickham | Advanced R - Chapter 23 Measuring performance

Should you even optimise at all?

First make it work correctly. Then, perhaps.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. — Donald Knuth
31 / 32

References

Gillespie, C. and R. Lovelace (2017). Efficient R Programming : A Practical Guide to Smarter Programming. Eng. First. Sebastopol, CA: O'Reilly. ISBN: 978-1-4919-5078-4.

Hadley Wickham and Garrett Grolemund (2016). R for Data Science. Eng. 1st. O'Reilly Media, Inc..

Wickham, H. (2019). Advanced R. 2nd edition.. Chapman & Hall/CRC the r Series. Boca Raton. ISBN: 978-1-351-20129-2.

32 / 32

Why optimise?

Many stochastic elements in health economics analysis such as bootstrapping, probabilistic sensitivity analyses, Monte-Carlo simulations etc.

  • Repetitive / messy code
  • Time consuming to run
  • Memory issues
2 / 32
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow