Running a Compute Job
Brain supports two types of compute jobs: batch and interactive. Which type to use, and when, will depend on your needs. For example, if you are performing exploratory data analysis, you will want an interactive job. On the other hand, if you have a long-running task that, once started, will run without user input and eventually produce a result, you should use a batch job.
Ideally, you should aim to convert as many of your tasks as possible to batch jobs. Batch jobs help maximize cluster resource utilization by allowing the job scheduler to start your task as soon as resources are available. In other words, you don't have to be around to start or stop the job; the scheduler does this automatically, with the goal of completing every job as soon as it can.
PE (Parallel Environment)
There are two types of parallel environments on Brain: MPI and SMP.
- mpi: Jobs utilizing MPI (typically distributed across nodes)
- smp: Jobs utilizing 1 or more CPU cores on a single node
Slot
A slot is a fraction of a compute node. From a CPU standpoint, a slot represents 1 virtual CPU core, or hyperthread.
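For example (the slot count here is arbitrary), a job script that requests the smp parallel environment with 4 slots asks for 4 hyperthreads on a single node:
# Request 4 slots (4 hyperthreads) on one node via the smp parallel environment
#$ -pe smp 4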
JOBID
The resource scheduler assigns a unique ID to every job. You can use the ID number to check on job status, terminate the job, etc.
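For example, assuming the scheduler assigned your job the ID 12345 (an illustrative number), you could check its detailed status or terminate it:
qstat -j 12345     # show detailed status for job 12345
qdel 12345         # terminate job 12345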
IMPORTANT: When scheduling a compute job, you must specify the amount of CPU/memory your job will require. If your job exceeds those limits, the scheduler may terminate it without warning.
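How memory is requested can vary by cluster configuration; on Grid Engine-style schedulers it is commonly done with a per-slot memory resource such as h_vmem. A minimal sketch, assuming your queue uses h_vmem (the resource name and the 4G value are assumptions to adapt to your queue's settings):
# Request 4 GB of memory per slot (assumed resource name; check your queue's configuration)
#$ -l h_vmem=4G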
Each cluster node is configured with approximately 180GB of high-speed scratch space. The scratch space is local to the node itself. For maximum performance, your job should copy any required data to the local scratch disk during startup. Then, as the job runs, it should write its output to that local scratch disk. Once the job completes, it should copy the data from scratch back to your home folder. After the copy completes, be sure to clean up (remove) any files you have on scratch.
Scratch is conveniently located at /scratch.
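A sketch of that workflow inside a job script is below; the input path, program name, and output directory are placeholders for your own. Grid Engine sets the JOB_ID variable for each job, which keeps scratch directories from colliding:
# Create a per-job directory on the node-local scratch disk
SCRATCH_DIR=/scratch/$USER/$JOB_ID
mkdir -p "$SCRATCH_DIR"

# Stage input data from your home folder onto scratch
cp -r "$HOME/mydata" "$SCRATCH_DIR/"

# Run the analysis, writing output to scratch (my_program is a placeholder)
cd "$SCRATCH_DIR"
my_program --input mydata --output results

# Copy results back to your home folder, then clean up scratch
cp -r results "$HOME/results_$JOB_ID"
rm -rf "$SCRATCH_DIR"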
NOTE: For long-running (several days or weeks) batch jobs, your job should be designed to save checkpoints that it can resume from. That way, in the event the job is terminated due to resource usage, a node crash, etc., you don't lose all of your work.
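A minimal sketch of that idea in a bash job script, where the step count and the do_one_step command stand in for your real workload:
# Resume from the last saved step if a checkpoint file exists
CHECKPOINT="$HOME/myjob.checkpoint"
START=0
if [ -f "$CHECKPOINT" ]; then
    START=$(cat "$CHECKPOINT")
fi

for ((i = START; i < 1000; i++)); do
    do_one_step "$i"                     # placeholder for one unit of real work
    echo "$((i + 1))" > "$CHECKPOINT"    # record progress after each step
done

rm -f "$CHECKPOINT"                      # finished; remove the checkpoint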
A batch job starts with a job file that defines where (what directory) the job should run, the job queue to use, and the amount of resources it requires. A simple example script, which we'll call test.sh, is below:
#!/bin/bash
#
# Run the job from the directory where it was submitted
#$ -cwd
# Queue to submit the job to
#$ -q YOURQUEUE.q
# Parallel environment (mpi or smp) and number of slots
#$ -pe PE SLOTS
# Shell used to interpret the job script
#$ -S /bin/bash
# The actual work: print the time, wait a minute, print it again
date
sleep 60
date
Once the job script has been created, you can submit it to the scheduler:
qsub test.sh
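qsub prints a confirmation that includes the job ID the scheduler assigned. It typically looks something like this (the ID shown is illustrative):
Your job 12345 ("test.sh") has been submitted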
Your job will be executed as soon as there is available cluster capacity. For most jobs, this is likely to be immediate. However, if you submit several jobs within a short period of time, your total resource demand may exceed your queue limit, in which case some jobs will be delayed while they wait for resources.
Requesting an interactive session on a compute node:
qlogin -q YOURQUEUE.q -pe PE SLOTS
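For example, to request an interactive session with 4 slots in the smp parallel environment (the queue name here is a placeholder for your own):
qlogin -q mylab.q -pe smp 4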
Once your request is submitted, it should be processed in seconds. If there is insufficient available capacity, the resource scheduler will deny your request.
List jobs you have running:
qstat
Terminating a running job:
qdel JOBID