Running a Compute Job

John Yocum edited this page Apr 19, 2019 · 23 revisions

Brain supports two types of compute jobs: batch and interactive. Which type to use depends on your needs. For example, if you are performing exploratory data analysis, you will want an interactive job. On the other hand, if you have a long-running task that, once started, runs without user input and eventually outputs a result, you should use a batch job.

Ideally, you should work towards converting as many of your tasks as possible to batch jobs. Batch jobs help maximize cluster resource utilization by allowing the job scheduler to start your task as soon as resources are available. In other words, you don't have to be around to start or stop the job; the scheduler does this automatically, with the goal of getting every job done as soon as it can.

Terms

PE (Parallel Environment)

There are two types of parallel environments on Brain: MPI and SMP.

  • mpi: Jobs utilizing MPI (typically distributed across nodes)
  • smp: Jobs utilizing 1 or more CPU cores on a single node

Slot

A slot is a fraction of a compute node. From a CPU standpoint, a slot represents 1 virtual CPU core, or hyperthread.

JOBID

The resource scheduler assigns a unique ID to every job. You can use the ID number to check on job status, terminate the job, etc.

Scheduling a Job

IMPORTANT: When scheduling a compute job, you must specify the amount of CPU and memory your job will require. If your job exceeds those limits, the scheduler may terminate it without warning.
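One way to stay within the CPU you requested is to size your program's thread count from the slots the scheduler granted. The sketch below assumes Brain's scheduler is a Grid Engine variant (as the qlogin, qstat and qdel commands on this page suggest), in which case the NSLOTS environment variable holds the granted slot count inside a job. OMP_NUM_THREADS only affects OpenMP programs; other tools need their own thread setting.

```shell
#!/bin/sh
# Assumption: NSLOTS is set by the scheduler inside a job (Grid Engine behavior).
# Fall back to 1 when it is unset, for example when run outside a job.
THREADS="${NSLOTS:-1}"

# OpenMP programs read this variable to decide how many threads to start.
export OMP_NUM_THREADS="$THREADS"

echo "Using $THREADS thread(s)"
```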

Scratch Space

Each cluster node is configured with approximately 180GB of high-speed scratch space. The scratch space is local to the node itself. For maximum performance, your job should copy any required data to the local scratch disk during startup. Then, as the job runs, it should write its output to that local scratch disk. Once the job completes, it should copy the data from scratch to your home folder. After the copy completes, be sure to clean up (remove) any files you have on scratch.

Scratch is conveniently located at /scratch.
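The copy-in, compute, copy-out, clean-up sequence described above can be sketched as a few shell functions. This is a minimal illustration, not a site-provided script: the function names, the input/ and output/ subdirectory layout, and the example paths in the trailing comments are all hypothetical. Only the /scratch location comes from this page.

```shell
#!/bin/sh
# Hypothetical helpers for using node-local scratch space.

# Create a private working directory under scratch and print its path.
# $1: scratch root (defaults to /scratch)
make_workdir() {
    scratch_root="${1:-/scratch}"
    mktemp -d "${scratch_root}/${USER:-job}.XXXXXX"
}

# Copy the contents of a source directory into <workdir>/input.
# $1: source directory, $2: working directory
stage_in() {
    mkdir -p "$2/input" && cp -R "$1/." "$2/input/"
}

# Copy <workdir>/output to a destination, then remove the working directory.
# The working directory is removed only if the copy succeeded.
# $1: working directory, $2: destination directory
stage_out() {
    mkdir -p "$2" && cp -R "$1/output/." "$2/" && rm -rf "$1"
}

# Example use inside a job (paths are placeholders):
#   workdir=$(make_workdir /scratch)
#   stage_in "$HOME/mydata" "$workdir"
#   mkdir -p "$workdir/output"
#   ... run your program, reading $workdir/input and writing $workdir/output ...
#   stage_out "$workdir" "$HOME/results"
```

Chaining the copy and the removal with && means a failed copy leaves the scratch data in place, so results are not deleted before they are safely in your home folder.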

Batch

NOTE: For long-running batch jobs (several days or weeks), your job should be designed to save checkpoints that it can resume from. That way, if the job is terminated due to resource usage, a node crash, etc., you don't lose all of your work.
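As a rough illustration of checkpointing, the job script below records the last completed step in a file and resumes from it on the next run. It is a sketch under several assumptions: the step loop stands in for real work, the file name and function names are hypothetical, and the `#$` lines are Grid Engine directives that only take effect if the script is submitted with Grid Engine's qsub, which this page does not show.

```shell
#!/bin/sh
#$ -N checkpoint_demo
#$ -cwd
# The two "#$" lines above are scheduler directives (job name, run in the
# submission directory); a plain shell treats them as comments.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"  # hypothetical file name
TOTAL_STEPS="${TOTAL_STEPS:-5}"                       # placeholder workload size

# Print the last completed step, or 0 if no checkpoint exists yet.
load_checkpoint() {
    if [ -f "$CHECKPOINT_FILE" ]; then
        cat "$CHECKPOINT_FILE"
    else
        echo 0
    fi
}

# Write to a temporary file first, then rename, so an interrupted write
# never leaves a half-written checkpoint behind.
save_checkpoint() {
    printf '%s\n' "$1" > "${CHECKPOINT_FILE}.tmp" &&
        mv "${CHECKPOINT_FILE}.tmp" "$CHECKPOINT_FILE"
}

# Resume after the last completed step and checkpoint after each new one.
run_job() {
    step=$(load_checkpoint)
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        echo "processing step $step"   # replace with the real work
        save_checkpoint "$step"
    done
}

run_job
```

If the job is killed after step 3, resubmitting the same script continues at step 4 rather than starting over. Keep the checkpoint file somewhere that survives the job, such as your home folder, since /scratch is local to a single node.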

Interactive

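To request an interactive session on a compute node, replace YOURQUEUE.q with your queue, PE with the parallel environment (mpi or smp), and SLOTS with the number of slots you need: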

qlogin -q YOURQUEUE.q -pe PE SLOTS

Once your request is submitted, it should be processed in seconds. If there is insufficient available capacity, the resource scheduler will deny your request.

Managing Jobs

List jobs you have running:

qstat

Terminating a running job:

qdel JOBID