---
title: "Learning to parallelise"
urlcolor: blue
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

References that I found useful in trying to understand this vast topic:

* R-bloggers: [how to go parallel in R basics tips](https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/)
* this site on [running the same R script on many different input files](http://www.cureffi.org/2014/01/15/running-r-batch-mode-linux/)
* The many documents on HPC and parallel computing put up by various universities around the world, particularly
    * the [HPC User Guide provided by the University of Sussex](https://info.hpc.sussex.ac.uk/hpc-guide/how-to/index.html)
    * the webpage on [Parallel Processing in R at the Dept of Statistics, University of Michigan](https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html)

There's [more information on different ways of using parallel computing on Sheffield's HPC site](https://docs.hpc.shef.ac.uk/en/latest/parallel/index.html) (though I don't understand most of it...)

**When to parallelise**

* If you have to repeat the same steps several times. You might do this with a for loop, but the apply family of functions (`lapply`, `tapply` etc.) is a better starting point, because each call is independent of the others and `lapply` has parallel versions that can be swapped in almost directly.
* If you have multiple cores on your computer (most modern computers do) that you can utilise. By default R runs as a single process, which uses only one core.
* If you have access to an HPC cluster.
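
As a minimal sketch of the first point, the loop and the `lapply` call below do the same work. `squareIt` is a made-up function standing in for whatever step you need to repeat; it is not from any package.

```r
# hypothetical function standing in for the step you want to repeat
squareIt <- function(i) {
  i^2
}

# for loop version: pre-allocate a list and fill it in one element at a time
loopResult <- vector("list", 5)
for (i in 1:5) {
  loopResult[[i]] <- squareIt(i)
}

# lapply version: same result, and each call is clearly independent of the others
applyResult <- lapply(1:5, squareIt)

# check that the two approaches agree
identical(loopResult, applyResult)
```

Once the work is written as an `lapply` call, moving to a parallel version is mostly a matter of changing the function name.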

When you make a job request on the HPC cluster, you automatically get one core on one of the machines (nodes) in the cluster network. On your laptop or desktop, you may have just one processor, but it will have multiple cores. Check using the R package `parallel`:
```{R, eval=FALSE}
library(parallel) # comes with base, no need to install
detectCores()
```
On my MacBook Pro (Late 2011) with its 2.4 GHz Intel Core i5 processor, there are 4 cores: 2 physical (according to the system report on my Mac) and 4 virtual. The virtual (logical) cores come from hyper-threading, where each physical core presents itself to the operating system as two; `detectCores()` counts logical cores by default.
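
If you want to see both numbers from within R, `detectCores()` takes a `logical` argument. This is only a sketch: on some platforms R cannot tell the two apart, so the counts may be identical or the physical count may come back as `NA`.

```r
library(parallel)

# logical (virtual) cores: what detectCores() reports by default
logicalCores <- detectCores(logical = TRUE)

# physical cores only; not every platform can report this separately
physicalCores <- detectCores(logical = FALSE)

cat("Logical cores: ", logicalCores, "\n")
cat("Physical cores:", physicalCores, "\n")
```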


## Parallel computing on your own computer
You would basically be splitting an `lapply`-like function over the multiple cores in your computer. There are two main methods: 1) PSOCK (Parallel Socket Cluster) and 2) FORK.

PSOCK launches a new R session on each core and works on all systems including Windows, but each new session starts with an empty environment, so variables and packages need to be explicitly shared with it. It appears to be more time-consuming and more complicated.

FORK copies the entire current R session to a new core, so the environment is shared (no need to explicitly share variables and libraries again). It does not work on Windows, but otherwise it seems to be the generally preferred method.
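
To make the 'empty environment' point concrete, here is a small sketch that asks PSOCK workers whether they can see a variable before and after it is exported. The variable name `sharedValue` and the choice of 2 workers are just for illustration.

```r
library(parallel)

# a variable that exists only in the main R session
sharedValue <- 42

# makeCluster() creates a PSOCK cluster by default
cl <- makeCluster(2)

# ask each worker whether it can see the variable: fresh sessions cannot
beforeExport <- unlist(clusterEvalQ(cl, exists("sharedValue")))

# explicitly copy the variable to every worker, then ask again
clusterExport(cl, "sharedValue", envir = environment())
afterExport <- unlist(clusterEvalQ(cl, exists("sharedValue")))

# stop the cluster to free up resources
stopCluster(cl)

print(beforeExport)
print(afterExport)
```

With FORK there is no equivalent export step, because the workers start as copies of the main session.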

**Using base package `parallel`**

The base package `parallel` allows a lot of parallelisation to be done without installing additional packages. For example, you can use `parLapply` (PSOCK method) or `mclapply` (FORK method) instead of `lapply`.

Using the PSOCK method:
```{R, eval=FALSE}
library(parallel)
library(lme4)

# the function we will use as an example
ex <- function(x) {
  lmer(Petal.Width ~ Sepal.Width + Petal.Length + (1 | Species), data=iris)
}
# assume we also need a variable
x <- c(1:5)

# number of cores to use (-1 so as not to freeze the computer)
noCores <- detectCores() - 1
# Initiate a cluster
cl <- makeCluster(noCores)
# 'share' library to each core
clusterEvalQ(cl, library(lme4))
# if there are variables, they also need to be shared
clusterExport(cl, "x") # variables are passed by name, as a character vector
# call the parallel version of lapply to run the function 10 times
parLapply(cl, 1:10, ex)
# stop the cluster to free up resources
stopCluster(cl)
```

Or using `mclapply`
```{R, eval=FALSE}
library(parallel)
library(lme4)

# the function we will use as an example
ex <- function(x) {
  lmer(Petal.Width ~ Sepal.Width + Petal.Length + (1 | Species), data=iris)
}

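# call the parallel version of lapply to run the function 10 times
# (mc.cores defaults to 2 unless the "mc.cores" option is set, so set it explicitly)
mclapply(1:10, ex, mc.cores = detectCores() - 1)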

```
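
If the same script has to run on both Windows and non-Windows machines, one option is to pick the method at run time. The helper below is a hypothetical sketch (`parallelApply` and its `nCores` argument are my own names, not part of `parallel`), and it uses a trivial function so that nothing needs to be exported in the PSOCK case.

```r
library(parallel)

# hypothetical helper: use FORK (mclapply) where available, PSOCK otherwise
parallelApply <- function(X, FUN, nCores = 2) {
  if (.Platform$OS.type == "windows") {
    cl <- makeCluster(nCores)
    on.exit(stopCluster(cl)) # always free the workers, even if FUN fails
    # note: any packages or variables FUN needs would still have to be shared
    parLapply(cl, X, FUN)
  } else {
    mclapply(X, FUN, mc.cores = nCores)
  }
}

# same call on every system
squares <- parallelApply(1:10, function(i) i^2)
```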

**Other packages**

Alternatively, there are also packages you can install, like `foreach`, which needs the `doParallel` package as well. The `doParallel` package is a wrapper for `parallel`.
```{R, eval=FALSE}
# install.packages(c('foreach', 'doParallel'))
library(parallel)
library(foreach)
library(doParallel) # provides registerDoParallel
library(lme4)

# the function we will use as an example
ex <- function(x) {
  lmer(Petal.Width ~ Sepal.Width + Petal.Length + (1 | Species), data=iris)
}
# assume we also need a variable
x <- c(1:5)

# number of cores to use (-1 so as not to freeze the computer)
noCores <- detectCores() - 1
# start a cluster
cl <- makeCluster(noCores)
registerDoParallel(cl) # this is necessary for the parallelisation to work

# use foreach to run the function 10 times
# results come back as a list by default; set .combine to 'c' for a vector, or 'rbind' or 'cbind' for a matrix
foreach(i = 1:10,
        .export='x', # .export is the same as clusterExport
        .packages='lme4') %dopar% { # .packages is the same as clusterEvalQ
  ex(i)
}

stopCluster(cl)
```
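
The `.combine` argument is easier to see with a small example. This sketch assumes `foreach` and `doParallel` are installed, uses 2 workers, and uses a toy calculation rather than the model above.

```r
library(foreach)
library(doParallel)

# start a small cluster and register it with foreach
cl <- makeCluster(2)
registerDoParallel(cl)

# each iteration returns a small named vector; rbind stacks them into a matrix
resultMatrix <- foreach(i = 1:4, .combine = rbind) %dopar% {
  c(input = i, squared = i^2)
}

stopCluster(cl)

# one row per iteration, one column per named element
dim(resultMatrix)
```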

## Parallel computing on the HPC cluster
There are several different ways to do this on the cluster. The simplest one, which is perhaps most like how you would do it on your computer, is a **task array**, where each task is independent of the others and may run on different cores on different nodes at different times (as determined by the job scheduler, SGE). Then there is **Shared Memory Parallelism** (SMP), where the tasks share the same memory and so run on different cores of the same node. A little more complicated is **Message Passing Interface** (MPI), where the tasks may run on cores on different nodes, each with its own memory, and so exchange data and other information by passing messages. Then there's a hybrid SMP/MPI way. Unfortunately, I've not quite got the hang of SMP and/or MPI at the moment, so this guide won't include those.

### Setting up a task array job
This is (I think) what one is most likely to use the HPC for when it comes to parallel computing. It takes multiple input files and runs the same R script on each of them to get the corresponding output files. From my search, there seem to be many different ways to do this and to write the necessary scripts, but this is what I've done, and it works for me on ShARC.

#### Writing the bash script
Similar to submitting a batch serial job, this requires a bash script (that I'll call `taskArray.sh`) but with modifications.

```{bash, eval=FALSE}
#!/bin/bash
#$ -l rmem=XXG
#$ -t 1-N
#$ -N jobName
#$ -M UserName@email.address
#$ -m bea
#$ -o /file/path/to/Output.txt
#$ -j y

cd /working/directory

# specify input files in a text file called Index, one file per line
INPUT_FILE=$(sed -n -e "$SGE_TASK_ID p" Index.txt)

# output the input file name and task ID to keep track
echo "The input file is $INPUT_FILE"
echo "Task ID is $SGE_TASK_ID"

module load apps/R/3.6.3/gcc-8.2.0

Rscript myScript.R "$INPUT_FILE"
```

The specifications are the same; the only additional argument is `-t 1-N`, which tells SGE that this is a task array job and to run the R script `myScript.R` N times, once for each of your N input files. Each task can only tell which input file is its own from the `$SGE_TASK_ID` (1 to N) it has been assigned. I found it easier to list my input file names in a plain text document (called `Index.txt`) than to rename my files (e.g. input.1, input.2, ... input.N).

My plain text file contains the list of my input files, including the file path (if needed) relative to the working directory I specified in my bash script, and looks like this:
```{bash, eval=FALSE}
directory/firstFile.csv
directory/secondFile.csv
directory/thirdFile.csv
```

The line `INPUT_FILE=$(sed -n -e "$SGE_TASK_ID p" Index.txt)` prints the line of `Index.txt` whose line number matches the task ID and stores it in `INPUT_FILE`, which is then fed into the `Rscript` command. This bash script has to be made executable as well, and both the bash script `taskArray.sh` and the text file with the list of input files `Index.txt` have to be in the same directory.
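
If the `sed` syntax is unfamiliar, the same lookup can be sketched in R. This is not what my scripts do, just an illustration of the idea: the index file is created in a temporary directory and `SGE_TASK_ID` is set by hand so that the example runs anywhere, whereas on the cluster SGE sets it for you.

```r
# for illustration only: create a small index file in a temporary directory
indexPath <- file.path(tempdir(), "Index.txt")
writeLines(c("directory/firstFile.csv",
             "directory/secondFile.csv",
             "directory/thirdFile.csv"),
           indexPath)

# pretend we are the second task in the array
Sys.setenv(SGE_TASK_ID = "2")

# read the task ID and pick out the matching line of the index file
taskId <- as.integer(Sys.getenv("SGE_TASK_ID"))
inputFile <- readLines(indexPath)[taskId]
inputFile
```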

If you don't have multiple different input files and just want to run the same script repeatedly, you can drop the lines about `INPUT_FILE`. Within your R script though, assuming you have different output for each run, you will likely need to include a line to get the array number: `arrayid <- as.numeric(Sys.getenv('SGE_TASK_ID'))`.
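
One way to use the array number is to build a different output file name (and random seed) for each task. The helper `makeOutputName` and the `output_001.csv` naming pattern are hypothetical; the fallback to 1 is only there so the sketch also runs outside the cluster.

```r
# SGE sets SGE_TASK_ID on the cluster; fall back to 1 when testing elsewhere
arrayid <- as.numeric(Sys.getenv("SGE_TASK_ID", unset = "1"))

# hypothetical helper: zero-padded file name so the outputs sort nicely
makeOutputName <- function(id) {
  sprintf("output_%03d.csv", id)
}

# give each task its own reproducible random number stream
set.seed(arrayid)

outFile <- makeOutputName(arrayid)
outFile
```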

#### Writing the R script
The R script has to be modified slightly since you're not hard-coding the input file into the script itself. This R script should be in the same directory as the `taskArray.sh` and `Index.txt` files.

```{R, eval=FALSE}
# suppress package start-up messages so they don't clutter the output file
# (a tip I encountered while searching through StackOverflow)
shhh <- suppressPackageStartupMessages
shhh(library('packageName'))
shhh(library('packageName'))

## read in dataset
args <- commandArgs(trailingOnly = TRUE) # to get the input files from command line
dat <- read.csv(args[1]) # feed the input file into the read.csv function
```

Submitting the batch job is basically the same process: connect to the VPN, log on to the HPC system, then run `qsub taskArray.sh`. I find the `commandArgs` function quite useful for feeding in different input arguments from a bash script without having to change the R script.
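
Since the input now comes from the command line, it can be worth checking it before reading anything. The helper `getInputFile` below is a hypothetical sketch, and the temporary file just stands in for a real input so the example runs on its own.

```r
# hypothetical helper: check the arguments before trying to read anything
getInputFile <- function(args) {
  if (length(args) < 1) {
    stop("Usage: Rscript myScript.R <input file>")
  }
  if (!file.exists(args[1])) {
    stop("Input file not found: ", args[1])
  }
  args[1]
}

# in the real script this would be:
# dat <- read.csv(getInputFile(commandArgs(trailingOnly = TRUE)))

# quick check with a temporary file standing in for a real input
tmpFile <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3), tmpFile, row.names = FALSE)
dat <- read.csv(getInputFile(tmpFile))
```

A task that is given a missing or misspelt file name then stops with a clear message in the output file, rather than failing somewhere further down the script.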