Working on a cluster or a supercomputer necessitates playing well with others. The workstation you sit at during work is rarely shared with more than a few people, but large shared resources can serve tens to thousands of users. The computations you run must be submitted to a "scheduler", which determines when each computation runs. A scheduler sets up the order of "job" execution.
The Portable Batch System (PBS) is the de facto standard on NSF, DOE and many other computer systems. PBS defines a set of commands and directives that can be used to control a job. A (not exhaustive) list of the main commands is: qsub, qstat, qhold, qalter and qdel. Until your jobs get a little more advanced, you will probably only need qsub, qstat and qdel.
To describe each command briefly: when creating a job, you can submit it from the command line with a command such as "qsub -l nodes=1 ...", or pass a file that describes the job, as in "qsub filename". I prefer the second form, and suggest you use a file too. Once you've submitted a job, you can check its place in line and its status using the qstat command. And in the off chance that you want to delete a job you've already submitted, you can issue "qdel jobid", where the job ID is the one reported by qstat.
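As a quick sketch, a typical round trip with these three commands looks like the following (the script name and job ID are made up, and the commands of course require a live PBS scheduler):

```shell
# Submit a job described by a script file; qsub prints the new job ID
qsub my_job.pbs

# Check where the job sits in the queue and what state it is in
qstat

# Changed your mind? Delete the job using the ID that qstat reported
qdel 87725
```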
Job Submission
Submitting a job is the most important step, and having a good submission script can really make your life easy; typos in a one-line submission cause a lot of aggravation, and I've spent dozens of hours debugging crappy qsub lines. The general form of a queue script is:
<directives>
<environment setup>
<job submission>
Each of these subsections is rather important. Directives tell the scheduler how to run the job. These include the resources needed, the queue to use (if there are multiple queues available), the job name and any reporting of errors. I usually use a set of directives like:
#!/bin/sh
#PBS -N <job name>
#PBS -l nodes=<number of nodes>,walltime=<HH>:<MM>:<SS>
#PBS -j oe
#PBS -o <job output file>
#PBS -m abe
#PBS -M <email@site.domain>
#PBS -q <queue name>
Here, each of the variables between < > must be specified. The first is the job name that shows up in the qstat command. Next you specify the number of nodes used as well as the number of hours, minutes and seconds the job may run; for example, 11:59:59 will run for almost 12 hours. The -j oe directive joins the normal and error streams into one, and -o names the file they are written to. This output is not the normal output from the job (we define that later in the file); the output and error logging from the batch scheduler and any job issues are written to this file. The -m directive says to email the address given with -M when the job begins ("b"), ends ("e") or aborts due to abnormal execution ("a"). This option will spam your email address, and you may want to specify only the "e" or "b" option. The final option specifies which queue to submit the job to (in the case that there are multiple queues; otherwise leave this option out).
The next two sections of the file that you may want to specify are the environment and the actual command to run the simulation. The environment probably looks like:
# Set up environment
export PATH=/Paths/to/different/programs:$PATH
export LD_LIBRARY_PATH=/Paths/to/required/libraries:$LD_LIBRARY_PATH
module unload <list of modules>
module load <list of modules>
Generally, when you set up an environment on a supercomputer, the first two lines of the previous block are desirable. You can use these to specify the locations of the executables used by your simulation and of the libraries your simulation needs. The following lines are used mostly on large, professionally run supercomputers, such as those provided by the NSF, DOE, etc., and allow you to load specific software into your environment. Modules will be the topic of a later post. Next you will want to define output files and actually run the job.
# Define Program
PROGRAM=/path/to/program/program.executable
# Define output file
OUTPUT=outfilename.log
mpirun -n <processors> $PROGRAM > $OUTPUT 2>&1
The first line is optional, but I think it makes the script more modular: it defines the program to be executed. Next we define the file that the output from the program will be written to. This is different from the file specified earlier with the #PBS -o option in that it contains only output from the program itself and not from the scheduler. Finally, the mpirun line runs the simulation with the number of processors specified. This of course requires that your code be written to take advantage of multiple processors/nodes with MPI, though I assume you've already taken care of that. (NOTE: some supercomputers have their own wrapper around mpirun; for example, DOE supercomputers use the "aprun" variant, which makes life a little easier, especially if you are distributing many instances of your program over many nodes.)
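Putting the three sections together, a complete queue script might look like the following sketch. All names, paths and counts here are hypothetical; substitute your own program, resources and email address.

```shell
#!/bin/sh
#PBS -N water_md
#PBS -l nodes=2,walltime=11:59:59
#PBS -j oe
#PBS -o water_md.pbslog
#PBS -m e
#PBS -M user@site.domain

# Set up environment (hypothetical install locations)
export PATH=/opt/simulation/bin:$PATH
export LD_LIBRARY_PATH=/opt/simulation/lib:$LD_LIBRARY_PATH

# Define program and output file
PROGRAM=/opt/simulation/bin/water_md.x
OUTPUT=water_md.log

# Run on 16 processors; program output goes to $OUTPUT,
# while scheduler output goes to water_md.pbslog
mpirun -n 16 $PROGRAM > $OUTPUT 2>&1
```

Keeping the program path and output name in variables at the top makes the script easy to reuse for the next simulation.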
Job Monitoring
Once you have submitted a job, you need to monitor its progress (or, more likely, where it is in the queue). The two commands to do this are qstat and showq. qstat is standard among all PBS systems; although I have seen showq on most of the systems I've been on, it is not standard. qstat gives just a list of all the jobs. The output looks like:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
87725 0.39704 d2_eq03.co user1 r 09/05/2013 16:22:36 ge3@compute-0-11.local 32
87826 0.30814 CoPy5H_Co3 user2 r 09/05/2013 14:23:54 ge3@compute-0-10.local 4
...
The columns are pretty self-explanatory. The first, second and fifth columns are the most important. The first column is the job ID, which is used for all job modification operations. The job priority describes how close the job is to the front of the queue; I haven't quite figured out exactly what the number means, though I think it runs from 0 to 1. The state can be r, qw or h, for "running", "queued waiting" and "hold". Perhaps the most useful invocation is "qstat -u username", which lists only the jobs belonging to a user; I usually alias it, as in: alias qme='qstat -u myusername'.
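If your system lacks the -u option, you can always filter the listing yourself. The sketch below runs awk over two sample rows copied from the output above (hard-coded here so it works without a live scheduler; on a real system you would pipe qstat into awk instead):

```shell
#!/bin/sh
# Two sample rows of qstat output; field 1 is the job ID, field 3 the
# job name, field 4 the user and field 5 the state.
sample='87725 0.39704 d2_eq03.co user1 r 09/05/2013 16:22:36 ge3@compute-0-11.local 32
87826 0.30814 CoPy5H_Co3 user2 r 09/05/2013 14:23:54 ge3@compute-0-10.local 4'

# Keep only user1's jobs and print their ID, name and state
printf '%s\n' "$sample" | awk '$4 == "user1" { print $1, $3, $5 }'
# prints: 87725 d2_eq03.co r
```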
I find showq more interesting in that it shows the currently running jobs, the jobs that are in the queue (with their relative time to execution) and the ones that are on hold. Up to now I haven't described what hold means. Jobs are held for one of two reasons: 1) the administrators of the supercomputer put the job on hold (THIS IS VERY BAD, FIND OUT WHY YOUR JOB IS ON HOLD), or 2) you put it on hold. The second option often occurs when you have one simulation that depends on another; for instance, you want a simulation to restart from the previous one. In that case you would hold it until the first finished. I will cover how this works in a later post.
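As a quick sketch of the second case, a user hold can be set and released by hand with qhold and its standard PBS counterpart qrls (the job ID here is made up; use whatever qstat reports):

```shell
# Put the dependent job on user hold until its restart files exist
qhold 88012

# Later, once the first simulation has finished, release it
qrls 88012
```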
Anyway, showq gives you output (pretty self-explanatory) like this:
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
388243 user1 Running 16 3:23:39 Thu Oct 3 16:31:59
388200 user2 Running 16 5:01:11 Thu Oct 3 18:09:31
...
45 active jobs 616 of 1880 processors in use by local jobs (32.77%)
77 of 235 nodes active (32.77%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
385175 user1 Idle 4 1:00:00:00 Thu Sep 26 14:37:15
385176 user2 Idle 4 1:00:00:00 Thu Sep 26 14:37:24
388412 user3 Idle 160 23:00:00 Wed Oct 2 10:06:41
3 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
388201 user1 Hold 16 6:00:00 Tue Oct 1 18:05:25
388202 user2 Hold 16 6:00:00 Tue Oct 1 18:05:25
...
64 blocked jobs
Total jobs: 112
Finally, one other useful command that comes with many supercomputers (though not all of them) is showstart. You pass it the job ID, and based on the job's priority it estimates when the job will begin, in the format 'dd:hh:mm:ss'.
Job Modification
Job modification comes in two forms. You can modify the state of a job, for instance put it on hold or delete it. These are accomplished with:
qalter -h u jobid
qdel jobid
The other way to modify a job is to have the job script execute other scripts. This way the actual job is not run by the queue script, and if you discover a change you need to make while the job is sitting in the queue, you can change the secondary script without having to wait through the queue again. This becomes especially important when you are running on supercomputers at national resources, where you can wait in the queue for a few days.
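A minimal sketch of this pattern (the file names are made up): the queue script does nothing but hand off to a second script, which you remain free to edit any time before the job actually starts.

```shell
#!/bin/sh
#PBS -N wrapper_job
#PBS -l nodes=1,walltime=01:00:00

# Move to the directory the job was submitted from
cd $PBS_O_WORKDIR

# All the real work lives in run_simulation.sh; edit that file,
# not this one, while the job waits in the queue
sh ./run_simulation.sh
```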
Those are sort of the main commands you will need. I'll have a few more advanced topics in a near future post.