Running a Compute Job
Brain supports two types of compute jobs: batch and interactive. Which type to use depends on your needs. For example, if you are performing exploratory data analysis, you will want an interactive job. On the other hand, if you have a long-running task that, once started, runs without user input and eventually outputs a result, you should use a batch job.
Ideally, you should work towards converting as many of your tasks as possible to batch jobs. Batch jobs help maximize cluster resource utilization by allowing the job scheduler to start your task as soon as resources are available. In other words, you don't have to be around to start or stop the job; the scheduler does this automatically, with the goal of getting every job done as soon as it can.
Each cluster node is configured with approximately 180 GB of high-speed scratch space. The scratch space is local to the node itself. For maximum performance, your job should copy any required data to the local scratch disk during startup. Then, as the job runs, it should write its output to that local scratch disk. Once processing completes, your job should copy the results from scratch to your home folder. After the copy completes, be sure to clean up (remove) any files you have on scratch.
Scratch is conveniently located at /scratch.
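The stage-in, process, stage-out, and cleanup steps described above can be sketched as a small shell function. This is only an illustration: the names run_on_scratch and process_data and the SCRATCH_ROOT override are hypothetical and not part of Brain's configuration, and process_data stands in for your real workload.

```shell
#!/bin/bash
# Sketch of the scratch workflow: stage in, process, stage out, clean up.
# SCRATCH_ROOT defaults to /scratch (the location given on this page);
# run_on_scratch and process_data are hypothetical names for illustration.

process_data() {
    # Placeholder workload: upper-case every regular file in the input dir.
    local in_dir="$1" out_dir="$2" f
    for f in "$in_dir"/*; do
        [ -f "$f" ] || continue
        tr '[:lower:]' '[:upper:]' < "$f" > "$out_dir/$(basename "$f")"
    done
}

run_on_scratch() {
    local input_dir="$1" output_dir="$2"
    local scratch_root="${SCRATCH_ROOT:-/scratch}"
    local work_dir status=0

    # Private working directory on the node-local scratch disk.
    work_dir=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1
    mkdir -p "$work_dir/in" "$work_dir/out" "$output_dir"

    # 1. Copy input to scratch, 2. process on scratch, 3. copy results back.
    cp -r "$input_dir"/. "$work_dir/in/" &&
        process_data "$work_dir/in" "$work_dir/out" &&
        cp -r "$work_dir/out"/. "$output_dir/" || status=$?

    # 4. Remove the scratch files whether or not the steps above succeeded.
    rm -rf "$work_dir"
    return "$status"
}
```

Inside a job script, a call might look like `run_on_scratch "$HOME/data/input" "$HOME/results"`, where both paths are placeholders. The cleanup step runs even when processing fails, so a failed job does not leave files behind on the node.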
IMPORTANT: When scheduling a compute job, you must specify the amount of CPU and memory your job will require. If your job exceeds those limits, the scheduler may terminate it without warning.
PE (Parallel Environment)
There are two types of parallel environments on Brain: MPI and SMP.
- mpi: Jobs utilizing MPI (typically distributed across nodes)
- smp: Jobs utilizing 1 or more CPU cores on a single node
NOTE: Long-running batch jobs (several days or weeks) should be designed to save checkpoints that they can resume from. That way, if the job is terminated due to resource usage, a node crash, etc., you don't lose all of your work.
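One possible shape for such a checkpoint is a file recording the last completed step, which the job reads on startup. The sketch below assumes the work can be split into numbered steps; run_with_checkpoints, do_step, and the file arguments are hypothetical names for illustration only.

```shell
#!/bin/bash
# Sketch of a resumable loop driven by a checkpoint file.
# run_with_checkpoints and do_step are hypothetical names; do_step stands in
# for one unit of real work.

do_step() {
    # Placeholder: record which step ran.
    echo "step $1" >> "$2"
}

run_with_checkpoints() {
    local checkpoint="$1" total_steps="$2" log="$3"
    local step=0

    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    fi

    while [ "$step" -lt "$total_steps" ]; do
        do_step "$((step + 1))" "$log" || return 1
        step=$((step + 1))
        # Write to a temporary file, then rename, so an interrupted write
        # does not leave a truncated checkpoint behind.
        echo "$step" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
    done
}
```

Because scratch is local to each node, keeping the checkpoint file in your home folder is probably the safer choice: a restarted job may land on a different node.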
Starting an interactive job:
qlogin -q YOURQUEUE.q -pe PE SLOTS
List jobs you have running:
qstat
Terminating a running job:
qdel JOBID