This page explains how to run jobs under the batch system on kali; sample codes are available from kali's webpage at http://www.math.umbc.edu/kali. The batch system on kali consists of several cooperating software components; I will collectively refer to them as "the scheduler" in the following.
This page also contains some suggestions on how to supervise your runs (and kill them if necessary) and how to monitor the performance of your code. If you find mistakes on this page or have suggestions, please contact me.
All scheduler commands have man pages; additionally, the man page pbs_resources (man pbs_resources) contains particularly useful information. Also look under the "See Also" heading at the bottom of each man page for cross-references to other pages.
#!/bin/bash
:
: The following is a template for job submission for the
: Scheduler on kali.math.umbc.edu
:
: This defines the name of your job
#PBS -N MPI_Aout
: This is the path
#PBS -o .
#PBS -e .
#PBS -q workq
#PBS -l nodes=4:myrinet:ppn=2

cd $PBS_O_WORKDIR
mpiexec -nostdout a.out
Assuming this script is saved in a file called qsub-script, you submit the job by entering qsub qsub-script at the Linux prompt.
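Upon successful submission, qsub responds with the job-ID that the scheduler assigned. A session might look like the following sketch; the job number 705 is made up for illustration, and the exact suffix after the number depends on the server's hostname (compare the sample outputs further below):

qsub qsub-script
705.kali.cl.math.umbc.edu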
When the scheduler starts executing this script, its working directory is your home directory. But the environment variable PBS_O_WORKDIR holds the directory in which you started your job, which is typically not your home directory. To get back to this directory, the script first of all executes the line cd $PBS_O_WORKDIR. From then on, you are again in the directory where the file qsub-script is located and where you issued the qsub qsub-script command. Hence, we can access the executable in that directory simply as a.out.
This directory change is crucial, in particular if your code reads an input file and/or creates output files. Without the cd command, your executable would not be found; moreover, input files could not be accessed, and output files would all end up in your home directory.
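If you want to see this mechanism in action, a minimal sketch along the following lines (my own suggestion, not part of the template) could replace the plain cd line in qsub-script; it prints the working directory before and after the change:

echo "job started in directory $(pwd)"
cd $PBS_O_WORKDIR
echo "job changed to directory $(pwd)"

The first echo should report your home directory and the second your submission directory; both lines appear in the job's OU file discussed below.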
The option -q workq specifies which queue to submit your job to. The queue workq is the only one set up on kali.
You choose the name of your job with the option -N; this name will appear in the queue listing that you obtain with qstat. Choose a meaningful name for your own code here.
The options -o and -e tell the scheduler in which directory to place the stdout and stderr files, respectively. At present, these files have the form jobnumber.kali.cl.OU and jobnumber.kali.cl.ER, respectively, since the jobnumber is a three-digit number; if it becomes a four-digit number, we will likely lose the letter "l" and get jobnumber.kali.c.OU and jobnumber.kali.c.ER. These files are created and accumulated in a temporary place and only moved to your directory after completion of the job. See below for a remark on this important issue.
In this example, you want to run on 8 processors, as indicated by the crucial line #PBS -l nodes=4:myrinet:ppn=2. It requests 4 nodes with 2 processors per node (ppn=2), all connected by Myrinet. Your job will execute on the 4 nodes returned by the scheduler. Note that the run-line beginning with mpiexec does not specify the number of processes, so your job will run on all the processors returned by the scheduler (8 in this case).
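To connect the resource request to other process counts, here are two variations written in the same style as the template; the node counts are made-up examples, and whether the larger request can be satisfied depends on how many Myrinet nodes kali has available:

: request 8 nodes with 1 processor each, giving 8 MPI processes
#PBS -l nodes=8:myrinet:ppn=1
: request 16 nodes with 2 processors each, giving 32 MPI processes
#PBS -l nodes=16:myrinet:ppn=2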
The line starting with mpiexec actually starts the job. The -nostdout flag indicates that you do not want output sent to the stdout stream, but rather want it redirected, as explained above. The final item on this line is the name of your executable, a.out.
The above example showed how to pipe stderr and stdout into two separate files. One can also join them together. To accomplish this, replace the -e . by -j oe. (To make this clear, in case it is hard to read in the previous sentence: you replace -e by -j and the period . by oe.) See the man page for qsub.
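To spell this out, the header of a variant of qsub-script with joined output might read as follows (a sketch; everything except the -j oe line is unchanged from the template above):

#PBS -N MPI_Aout
#PBS -o .
#PBS -j oe
#PBS -q workq
#PBS -l nodes=4:myrinet:ppn=2

With this header, you get a single OU file containing both streams and no ER file.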
If your executable is not in the current directory, you would simply replace the a.out in the mpiexec line of the qsub-script above by the full path of the executable. For instance, if you have an executable DG-mpigm-O in the directory $HOME/develop/Applications/DG/bin/x86_linux-icc, the mpiexec line could read

mpiexec -nostdout \
$HOME/develop/Applications/DG/bin/x86_linux-icc/DG-mpigm-O

which uses the predefined environment variable HOME to make the script a little more general. Notice that I used the backslash "\" to continue the line; keeping the line length below 80 columns makes the script file more readable. This is particularly useful if you have a long list of command-line arguments.
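As an illustration of that last point, a run-line with several command-line arguments could be continued like this; the argument names input.dat and results.out are hypothetical placeholders for whatever your executable actually expects:

mpiexec -nostdout \
$HOME/develop/Applications/DG/bin/x86_linux-icc/DG-mpigm-O \
input.dat \
results.out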
Advanced options are also available, such as using only one processor per node (even on a multiprocessor node) by mpiexec -pernode, overriding the communication libraries built into the executable, or supplying a customized config file (to specify exactly which nodes to use). Consult the man page for mpiexec for full details.
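For instance, a sketch of a run-line that starts only one process on each of the 4 requested nodes (4 processes in total, despite ppn=2) would be

mpiexec -pernode -nostdout a.out

which can be useful if each process needs the memory of an entire node; again, check the man page before relying on this, since option details may differ between mpiexec versions.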
A few important facts about mpiexec must be highlighted:

- The mpiexec utility was designed to run within the scheduler environment; it cannot be executed from the Linux prompt.

- Exactly one MPI process is started per requested processor by mpiexec (unless your code explicitly spawns threads of its own).

- mpiexec provides a clean wrap-up of MPI jobs. For instance, when any process associated with a job is killed, all processes associated with that job are automatically terminated. The behavior is similar when a job is deleted using the scheduler command qdel. Using qdel (usage explained below) is the clean way to terminate MPI jobs.

- On kali, mpiexec takes the place of the conventional mpirun command.
A useful caveat: If you use the csh or tcsh shell, the stdout file (the file ending with OU) will contain a couple of lines of error messages, namely

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

This is due to some conflict between our shell, mpiexec, etc. You can safely ignore these messages.
You can check on your job with the scheduler command qstat, which produces a queue listing such as the following:

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
635.mgtnode     gobbert  workq    MPI_DG       2320   8   1    --  10000 R 716:0
636.mgtnode     gobbert  workq    MPI_DG       2219   8   1    --  10000 R 716:1
665.mgtnode     gobbert  workq    MPI_Nodesu     --  16   1    --  10000 Q    --
704.mgtnode     gobbert  workq    MPI_Nodesu  12090  15   1    --  10000 E 00:00
705.mgtnode     kallen1  workq    MPI_Aout       --   1   1    --  10000 Q    --
706.mgtnode     gobbert  workq    MPI_Nodesu     --  15   1    --  10000 Q    --
707.mgtnode     gobbert  workq    MPI_Nodesu     --  15   1    --  10000 Q    --

The column titled Jobname shows the name that you chose for your job (with the option -N in qsub-script).
The most interesting column is the one titled S for "status". It shows what your job is doing at this point in time: The letter Q indicates that your job has been queued, that is, it is waiting for resources to become available and will then be executed. The letter R indicates that your job is currently running. Finally, the letter E says that your job is exiting; this will appear during the shut-down phase, after the job has actually finished execution. See man qstat for more information.
Personal suggestions: I feel that qstat -a gives me a little more information. Oftentimes, it is necessary to know which nodes your job is running on; you can see that with qstat -n, which implies the -a option. Finally, if your job is listed as queued (a Q in the S column of qstat), you can find out why it is not running using qstat -f; look for the comment field, which might say something like "not sufficient nodes of requested type available".
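As a concrete illustration of the last suggestion, assuming your queued job carries the (made-up) number 705, you could extract just the comment field from the full status output by

qstat -f 705 | grep -i comment

substituting your own job number from qstat.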
There are two other visual tools available on kali to monitor performance and activity.
Here is sample output of qstat -n, showing the list of nodes assigned to each running job:

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
6307.kali.math. kali-g2  workq    MPI_SPARK    8472  16   1    --  10000 R 133:5
   node018/1+node018/0+node027/1+node027/0+node026/1+node026/0+node025/1
   +node025/0+node024/1+node024/0+node023/1+node023/0+node022/1+node022/0
   +node020/1+node020/0+node019/1+node019/0+node031/1+node031/0+node030/1
   +node030/0+node029/1+node029/0+node028/1+node028/0+node005/1+node005/0
   +node004/1+node004/0+node003/1+node003/0
6308.kali.math. kali-g2  workq    MPI_SPARK    8894   2   1    --  10000 R 16:44
   node007/1+node007/0+node006/1+node006/0
6309.kali.math. kali-g2  workq    MPI_SPARK    8126   8   1    --  10000 R 07:37
   node014/1+node014/0+node001/1+node001/0+node013/1+node013/0+node012/1
   +node012/0+node011/1+node011/0+node010/1+node010/0+node009/1+node009/0
   +node008/1+node008/0
6310.kali.math. kali-g2  workq    MPI_SPARK    8304   1   1    --  10000 R 00:54
   node015/1+node015/0
Here, kali-g2 is one of the group accounts on kali, used by myself (Gobbert). Notice that this sample output shows the job-IDs as they appeared in November 2004, truncated to the 15 characters available in that column. While the job-IDs look different now, the software still works the same way.
Recall that the scheduler accumulates the stderr and stdout files (ending in ER and OU, respectively) in a temporary location and moves them to your current directory ($PBS_O_WORKDIR) only after your code has finished running.
Often, it is vital to be able to look at these files while the job is still running, for instance, to determine how far your simulation has progressed or whether your code has encountered a problem. Before the operating system upgrade in March 2005, the temporary location of these files was the user's home directory. This has changed, and the files are now accumulated locally on the node on which Process 0 of your job is running.

We are actually not satisfied with this situation, because it makes it very cumbersome to look at these files during a run, and we are investigating how to control the location of these files. In the meantime, the following explains how to find these files in the present situation.
First, you have to find out on which nodes your job is running; use qstat -n to get information such as

803.kali.cl.mat gobbert  workq    MPI_Testio   9069   2   1    --  10000 R 00:00
   node31+node31+node30+node30

telling you that your job runs on nodes node30 and node31, using both CPUs on each.
So, in MPI jargon, there are Processes 0, 1, 2, and 3. The order of the nodes in the list returned by qstat -n tells you that, in fact, Processes 0 and 1 are on node31 and Processes 2 and 3 are on node30. (Confirming this fact is the point of having the process numbers and their hostnames printed out by the sample codes available from kali's webpage.)
Having determined that Process 0 is on node31, ssh to that node by saying ssh node31. Change directory to /var/spool/pbs/spool by saying cd /var/spool/pbs/spool. A directory listing (using ls) should show the ER and OU files there (or only the OU file if you requested them to be joined). You can now look at the files using various Linux commands, for instance more or less, or tail or even tail -f; see their man pages for more information.
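Putting these steps together, a sample session for the job shown above might look like the following sketch; the file name in the last line is an assumption patterned on the naming discussed earlier, so take the actual name from the ls listing rather than typing it blindly:

ssh node31
cd /var/spool/pbs/spool
ls
tail -f 803.kali.cl.OU

Press Ctrl-C to get out of tail -f again.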
To kill a job, use the qdel command with the jobnumber from qstat, as in qdel 636, for instance. See man qdel for more information.
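If you ever need to clean out several of your own jobs at once, a hedged one-liner such as

qdel $(qselect -u $USER)

selects all jobs belonging to you with the companion command qselect and passes them to qdel; test it carefully (for instance, run the qselect part by itself first) before relying on it.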