---
title: "Cluster Computing"
author: "Brian High"
date: "May 20, 2020"
---
Today's presentation addresses these objectives:
- Know what a compute cluster is and when you would use it
- Differentiate between a compute node and the head node
- Know the resources available on "deohs-brain"
- Know how to connect to the compute cluster "deohs-brain"
- Know how to start and manage jobs on the cluster
- Know how to use the cluster for parallel processing
A compute cluster is a collection of computers configured with:
- One or more compute nodes
- A head node that runs a job scheduler
- Access control to limit access to the cluster
You will want to use a compute cluster when:
- Your work is too resource intensive for your other systems
- Your work would benefit from more cores and memory
- You can configure your software (code) to use more cores
- You want to use the resources of multiple machines simultaneously
The head node is where you:
- Connect to the cluster
- Configure your software environment (e.g., install packages)
- Configure, launch, and manage batch jobs
- Launch interactive sessions
- Transfer data into and out of the cluster
The compute nodes are where you run your jobs and interactive sessions.
| Resource | Value |
|---|---|
| Head node | deohs-brain |
| # Queues | 5+ |
| Total # cores | 500+ |
| # Cores/node | 24-32 |
| # Slots/node | 48-64 |
| Total memory (RAM) | 4+ TB |
| Memory/node | 384 GB |
| Documentation | wiki |
You can connect to "deohs-brain" through:
- SSH via a terminal app like PuTTY or from the command line
- SCP or SFTP via a terminal app or a GUI app like CyberDuck
- Remote Desktop or X2Go (for interactive "desktop" sessions)
- X2Go will likely be preferred for Windows users, as it will be easier to configure
- Mac users will have a better experience using Remote Desktop versus X2Go
- Or you can run X2Go from within a Remote Desktop server session
- Documented in the wiki
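For example, from the command line (the full hostname is an assumption here; check the wiki for the actual address, and replace `netid` with your username):

```shell
# Log in to the head node over SSH
ssh netid@deohs-brain.deohs.washington.edu

# Copy a local file to your home directory on the cluster with SCP
scp mydata.csv netid@deohs-brain.deohs.washington.edu:~/
```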
You can launch either interactive sessions or batch jobs:
- Interactive sessions are launched with `qlogin`, e.g.: `qlogin -q QUEUE -pe smp NSLOTS`
- Yes, you can run RStudio Desktop on a compute node
- Batch jobs are launched with `qsub`
- Batch jobs can use a single machine (smp) or more (mpi)
- You can view jobs with `qstat` (yours) or `qstat -f -u "*"` (all)
- You can delete jobs with `qdel`
- Batch jobs are often launched using a job file:
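A minimal SGE job file sketch (the job name, queue, slot count, and script name are placeholders, not values from the wiki):

```shell
#!/bin/bash
#$ -N boot_ci        # job name (hypothetical)
#$ -q QUEUE          # queue to submit to (placeholder)
#$ -pe smp 8         # parallel environment: 8 slots on a single node
#$ -cwd              # run from the current working directory
#$ -j y              # merge stderr into stdout
#$ -o boot_ci.log    # log file for the job's output

Rscript my_script.R  # my_script.R is a hypothetical R script
```

Submit it with `qsub my_job.sh`, then check its progress with `qstat`.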
- Some software has multicore capabilities built-in.
- For R, you can use packages like parallel and BiocParallel.
- Batch jobs that run across multiple nodes also need Rmpi.
Installation of some packages may be a little tricky; see the wiki.
- MPI is required for use across multiple nodes (use `-pe mpi`)
- FORK and SOCK only run on single nodes (use `-pe smp`)
- The incremental speedup from additional cores will diminish
- Test and tune your code before running full workload
- Performance varies with communications overhead
- With a test script, we found:
- Cluster types FORK, SOCK, and MPI are comparable
- BiocParallel is slower when using many workers (> 16)
- Your code may perform differently, so do your own tests.
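As a sketch, the three cluster types compared above are created with `parallel::makeCluster` (the MPI line is commented out, since it needs the Rmpi package installed):

```r
library(parallel)

# FORK: copies the current R session; fast, but single-node and Unix-only
cl <- makeCluster(4, type = "FORK")
stopCluster(cl)

# SOCK ("PSOCK"): launches fresh R workers over sockets; single node with -pe smp
cl <- makeCluster(4, type = "PSOCK")
stopCluster(cl)

# MPI: can span nodes under -pe mpi; requires Rmpi
# cl <- makeCluster(4, type = "MPI")
```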
It's like the benefits of mass transit and carpooling...
Each vehicle has overhead: fuel, maintenance, insurance, congestion, etc. So, you want to fit passengers taking the same route at the same time on as few vehicles as possible.
You can reduce overhead by splitting the total replications ("passengers") into one chunk per worker ("vehicle"). That way, each worker only gets initialized once.
```r
library(parallel)

# Setup (f is a previously defined function computing one bootstrap replicate)
workers <- 8
R <- 10000

# Total replicates (R) is much greater than the number of workers
ci_boot <- mclapply(1:R, f, mc.cores = workers)

# Splitting replicates by number of workers will speed up processing
X.split <- split(1:R, rep_len(1:workers, R))
ci_boot <- mclapply(X.split, f, mc.cores = workers)
```
Alternatives to using `split` are `clusterSplit` and `parLapply`.
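A sketch of those alternatives, with a toy function standing in for the bootstrap function `f`:

```r
library(parallel)

R <- 10000
workers <- 8
f <- function(i) sqrt(i)  # toy stand-in for one bootstrap replicate

cl <- makeCluster(workers)          # PSOCK cluster
clusterExport(cl, "f")              # PSOCK workers need f copied to them

# clusterSplit: make one chunk of 1:R per worker, then run one task per chunk
chunks <- clusterSplit(cl, 1:R)
res1 <- clusterApply(cl, chunks, function(chunk) lapply(chunk, f))

# parLapply: chunks 1:R across the workers for you
res2 <- parLapply(cl, 1:R, f)

stopCluster(cl)
```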
In one test, using 8 workers and 10,000 total replications, splitting improved speed by 36-38%.
pkg | fun | # chunks | elapsed (s) |
---|---|---|---|
base | lapply | R | 94.321 |
parallel | mclapply | R | 14.648 |
parallel | mclapply | workers | 9.071 |
BiocParallel | bplapply | R | 15.696 |
BiocParallel | bplapply | workers | 10.001 |
- Without splitting, MPI can take over 6x longer than single core
- With splitting, MPI may be slower than FORK and SOCK
- Splitting may speed up FORK ("mclapply") by 40% or more
- Splitting may speed up SOCK ("clusterApply") by 65% or more
- "split" and "clusterSplit" methods perform equally well
- "parLapply" does the splitting for you, but is slower
- Keep track of your sessions and jobs
- Close any unused/idle sessions and jobs
- Verify that what you have closed has actually ended
- Install packages on the head node
- Run heavy-duty jobs on compute nodes
- Use one more slot than the number of workers
- Number of parallel tasks should equal the number of workers
- Clean up your "home" and "scratch" folders regularly
- Use terminal sessions when you don't really need a GUI
- If you use a graphical desktop, log out when finished
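For example, the slot guideline above for 8 parallel workers (queue name is a placeholder, as before):

```shell
# Request 9 slots for 8 parallel R workers: one extra for the master process
qlogin -q QUEUE -pe smp 9
```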