---
title: "Cluster Computing"
author: "Brian High"
date: "May 20, 2020"
---
Today's presentation addresses these objectives:
- Know what a compute cluster is and when you would use it
- Differentiate between a compute node and the head node
- Know the resources available on "deohs-brain"
- Know how to connect to the compute cluster "deohs-brain"
- Know how to start and manage jobs on the cluster
- Know how to use the cluster for parallel processing
A compute cluster is a collection of computers configured with:
- One or more compute nodes
- A head node that runs a job scheduler
- Access control to limit access to the cluster
You will want to use a compute cluster when:
- Your work is too resource intensive for your other systems
- Your work would benefit from more cores and memory
- You can configure your software (code) to use more cores
- You want to use the resources of multiple machines simultaneously
The head node is where you:
- Connect to the cluster
- Configure your software environment (e.g., install packages)
- Configure, launch, and manage batch jobs
- Launch interactive sessions
- Transfer data into and out of the cluster
The compute nodes are where you run your jobs and interactive sessions.
| Resource | Value |
|---|---|
| Head node | deohs-brain |
| # Queues | 5+ |
| Total # cores | 500+ |
| # Cores/node | 24-32 |
| # Slots/node | 48-64 |
| Total memory (RAM) | 4+ TB |
| Memory/node | 384 GB |
| Documentation | wiki |
You can connect to "deohs-brain" through:
- SSH via a terminal app like PuTTY or from the command line
- SCP or SFTP via a terminal app or a GUI app like CyberDuck
- Remote Desktop or X2Go (for interactive "desktop" sessions)
- X2Go will likely be preferred for Windows users, as it will be easier to configure
- Mac users will have a better experience using Remote Desktop versus X2Go
- Or you can run X2Go from within a Remote Desktop server session
- Documented in the wiki
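For example, from the command line (the full hostname is an assumption here; check the wiki for the actual address, and replace `netid` with your username):

```shell
# Log in to the head node over SSH
ssh netid@deohs-brain.deohs.washington.edu

# Copy a local file to your home directory on the cluster with SCP
scp mydata.csv netid@deohs-brain.deohs.washington.edu:~/
```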
You can launch either interactive sessions or batch jobs:
- Interactive sessions are launched with `qlogin`, e.g.: `qlogin -q QUEUE -pe smp NSLOTS`
- Yes, you can run RStudio Desktop on a compute node
- Batch jobs are launched with `qsub`
- Batch jobs can use a single machine (smp) or more (mpi)
- You can view jobs with `qstat` (yours) or `qstat -f -u "*"` (all)
- You can delete jobs with `qdel`
- Batch jobs are often launched using a job file:
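A minimal SGE job file sketch (the job name, queue, slot count, and script name are placeholders, not values from the wiki):

```shell
#!/bin/bash
#$ -N boot_ci        # job name (hypothetical)
#$ -q QUEUE          # queue to submit to (placeholder)
#$ -pe smp 8         # parallel environment: 8 slots on a single node
#$ -cwd              # run from the current working directory
#$ -j y              # merge stderr into stdout
#$ -o boot_ci.log    # log file for the job's output

Rscript my_script.R  # my_script.R is a hypothetical R script
```

Submit it with `qsub my_job.sh`, then check its progress with `qstat`.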
- Some software has multicore capabilities built-in.
- For R, you can use packages like parallel and BiocParallel.
- Batch jobs that run across multiple nodes also need Rmpi.
Installation of some packages may be a little tricky; see the wiki.
- MPI is required for use across multiple nodes (use `-pe mpi`)
- FORK and SOCK only run on single nodes (use `-pe smp`)
- The incremental speedup from additional cores will diminish
- Test and tune your code before running full workload
- Performance varies with communications overhead
- With a test script, we found:
- Cluster types FORK, SOCK, and MPI are comparable
- BiocParallel is slower when using many workers (> 16)
- Your code may perform differently, so do your own tests.
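As a sketch, the three cluster types compared above are created with `parallel::makeCluster` (the MPI line is commented out, since it needs the Rmpi package installed):

```r
library(parallel)

# FORK: copies the current R session; fast, but single-node and Unix-only
cl <- makeCluster(4, type = "FORK")
stopCluster(cl)

# SOCK ("PSOCK"): launches fresh R workers over sockets; single node with -pe smp
cl <- makeCluster(4, type = "PSOCK")
stopCluster(cl)

# MPI: can span nodes under -pe mpi; requires Rmpi
# cl <- makeCluster(4, type = "MPI")
```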
It's like the benefits of mass transit and carpooling...
Each vehicle has overhead: fuel, maintenance, insurance, congestion, etc. So, you want to fit passengers taking the same route at the same time on as few vehicles as possible.
You can reduce overhead by splitting the total replications ("passengers") into one chunk per worker ("vehicle"). That way, each worker only gets initialized once.
```r
library(parallel)

# Setup (f is a previously defined function computing one bootstrap replicate)
workers <- 8
R <- 10000

# Total replicates (R) is much greater than the number of workers
ci_boot <- mclapply(1:R, f, mc.cores = workers)

# Splitting replicates by number of workers will speed up processing
X.split <- split(1:R, rep_len(1:workers, R))
ci_boot <- mclapply(X.split, f, mc.cores = workers)
```
Alternatives to using `split` are `clusterSplit` and `parLapply`.
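A sketch of those alternatives, with a toy function standing in for the bootstrap function `f`:

```r
library(parallel)

R <- 10000
workers <- 8
f <- function(i) sqrt(i)  # toy stand-in for one bootstrap replicate

cl <- makeCluster(workers)          # PSOCK cluster
clusterExport(cl, "f")              # PSOCK workers need f copied to them

# clusterSplit: make one chunk of 1:R per worker, then run one task per chunk
chunks <- clusterSplit(cl, 1:R)
res1 <- clusterApply(cl, chunks, function(chunk) lapply(chunk, f))

# parLapply: chunks 1:R across the workers for you
res2 <- parLapply(cl, 1:R, f)

stopCluster(cl)
```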
In one test, using 8 workers and 10,000 total replications, splitting improved speed by 36-38%.
pkg | fun | # chunks | elapsed (s) |
---|---|---|---|
base | lapply | R | 94.321 |
parallel | mclapply | R | 14.648 |
parallel | mclapply | workers | 9.071 |
BiocParallel | bplapply | R | 15.696 |
BiocParallel | bplapply | workers | 10.001 |
- Without splitting, MPI can take over 6x longer than single core
- With splitting, MPI may be slower than FORK and SOCK
- Splitting may speed up FORK ("mclapply") by 40% or more
- Splitting may speed up SOCK ("clusterApply") by 65% or more
- "split" and "clusterSplit" methods perform equally well
- "parLapply" does the splitting for you, but is slower
- Keep track of your sessions and jobs
- Close any unused/idle sessions and jobs
- Verify that what you have closed has actually ended
- Install packages on the head node
- Run heavy-duty jobs on compute nodes
- Use one more slot than the number of workers
- Number of parallel tasks should equal the number of workers
- Clean up your "home" and "scratch" folders regularly
- Use terminal sessions when you don't really need a GUI
- If you use a graphical desktop, log out when finished
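For example, the slot guideline above for 8 parallel workers (queue name is a placeholder, as before):

```shell
# Request 9 slots for 8 parallel R workers: one extra for the master process
qlogin -q QUEUE -pe smp 9
```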