Running Jobs Using Torque and Maui on kali


This page can be reached via kali's webpage at http://www.math.umbc.edu/kali.

Purpose of this Document

This document explains how to run jobs on kali using the Torque batch processing system and the Maui scheduler; I will collectively refer to them as "the scheduler" in the following. It also contains some suggestions on how to supervise your runs (and kill them if necessary) and how to monitor the performance of your code.

If you find mistakes on this page or have suggestions, please contact me.


Starting Point

I assume that you have compiled your program. Let me call the executable a.out in the following examples. For simplicity, I will also assume that this executable sits in your current directory, in which you want to run your code, where possible input files are located, and where you wish to collect the output files (both stdout and stderr captured by the scheduler and any other files that your code might create). See the remark below if your executable is located somewhere else.


Overview of the Scheduler

A job, that is, an executable with its command-line arguments, is submitted to the scheduler with the qsub command. With qstat, you can see the status of the queue at any time. If you wish to delete a job for any reason, use the qdel command. There are additional commands, but these should get you started. They are explained in more detail below.
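
For quick reference, a typical sequence of these commands at the Linux prompt might look like this; the job ID 705 is just an illustration:

qsub qsub-script
qstat
qdel 705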

All scheduler commands have man pages; additionally, the man page pbs_resources (reached by man pbs_resources) has particularly useful information. Also look under the "See Also" heading at the bottom of each man page for cross-references to other pages.


The Command qsub and Required Script File

In the directory in which you want to run your code, you need to create a script file that tells the scheduler more details about how to start the code, what resources you need, where to send output, and some other items. Let's call this file qsub-script in this example. It should look like this:
#!/bin/bash
:
: The following is a template for job submission for the
: Scheduler on kali.math.umbc.edu
:
: This defines the name of your job
#PBS -N MPI_Aout
: This is the path for the stdout and stderr files (the current directory)
#PBS -o .
#PBS -e .
: This selects the queue to submit to
#PBS -q workq
: This requests 4 nodes with two processors each, connected by Myrinet
#PBS -l nodes=4:myrinet:ppn=2

cd $PBS_O_WORKDIR

mpiexec -nostdout a.out

This script is used as a command-line argument to the qsub command by saying

qsub qsub-script

at the Linux prompt.
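
qsub responds by printing the job ID assigned to your job, which you will need for qstat and qdel; the exact form of the ID is system-dependent, for instance:

qsub qsub-script
705.kali.cl.math.umbc.edu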

When the scheduler starts executing this script, its working directory is your home directory. But the environment variable PBS_O_WORKDIR holds on to the directory in which you submitted your job, which is typically not your home directory. To get back to this directory, the script first of all executes the line cd $PBS_O_WORKDIR. From then on, you are again in the directory where this file qsub-script is located and where you issued the qsub qsub-script command. Hence, we can access the executable in that directory as a.out. This directory change is crucial in particular if your code reads an input file and/or creates output files: without the cd command, your executable will not be found, input files cannot be accessed, and output files will all be put in your home directory.

The -q workq specifies which queue to submit your job to. The queue workq is the only one set up on kali.

You choose the name of your job with the option -N; this name will appear in the queue listing shown by qstat. Choose a meaningful name for your own code here.

The options -o and -e tell the scheduler in which directory to place the stdout and stderr files, respectively. At present, these files have the form jobnumber.kali.cl.OU and jobnumber.kali.cl.ER, respectively, since the jobnumber is a three-digit number; if it becomes a four-digit number, we will likely lose the letter "l" and get jobnumber.kali.c.OU and jobnumber.kali.c.ER. These files are created and accumulated in a temporary place and only moved to your directory after completion of the job. See below for a remark on this important issue.
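
For example, a job with number 636 would then leave two files in the directory specified by -o and -e:

636.kali.cl.OU
636.kali.cl.ER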

In this example, you want to run on 8 processors as indicated by the crucial line

#PBS -l nodes=4:myrinet:ppn=2

This line specifies that you request 4 nodes, each with two processors (ppn=2), all connected by Myrinet. Your job will execute on the 4 nodes returned by the scheduler. Note that the run-line beginning with mpiexec does not specify the number of processes, so your job will run on all of the processors returned by the scheduler (8 in this case).
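
The same syntax covers other configurations. For instance, the following alternative request lines would ask for both processors of a single node (2 processes total) or for one processor on each of 8 nodes (8 processes total), respectively:

#PBS -l nodes=1:myrinet:ppn=2
#PBS -l nodes=8:myrinet:ppn=1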

The line starting with mpiexec actually starts the job. The -nostdout flag indicates that you do not want output sent to the stdout stream, but rather want it redirected, as explained above.


Some Additional Remarks on the qsub Submission Script

If your executable expects any command-line arguments, you would put them on this line, immediately following the a.out.
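
For instance, if a.out expected an input file name and an iteration count (both purely hypothetical arguments here), the run-line in qsub-script might read:

mpiexec -nostdout a.out input.dat 100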

The above example showed how to pipe stderr and stdout into two separate files. One can also join them together into a single file. To accomplish this, replace the -e . by -j oe. (To make this clear, in case it is hard to read in the previous sentence: you replace -e by -j and the period "." by "oe".) See the man page for qsub.
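
That is, the two output-related lines in qsub-script would become:

#PBS -o .
#PBS -j oe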

If your executable is not in the current directory, you would simply replace the a.out in the mpiexec line of the qsub-script above by the full path of the executable. For instance, if you have an executable DG-mpigm-O in the directory $HOME/develop/Applications/DG/bin/x86_linux-icc, the mpiexec line could read

mpiexec -nostdout \
  $HOME/develop/Applications/DG/bin/x86_linux-icc/DG-mpigm-O
which uses the predefined environment variable HOME to make the script a little more general. Notice that I used the backslash "\" to continue the line, keeping the line length below 80 columns and the script file more readable. This is particularly useful if you have a long list of command-line arguments.

Advanced options such as using only one processor per node (even on a multiprocessor node) by mpiexec -pernode, overriding the communication libraries built into the executable, or a customized config file (to specify exactly which nodes to use) are available. Consult the man page for mpiexec for full details.
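
As a sketch of the first of these options: assuming the request line nodes=4:myrinet:ppn=2 from above, the following run-line would start only one process on each of the 4 nodes (4 processes total instead of 8):

mpiexec -pernode -nostdout a.out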

A few important facts about mpiexec must be highlighted:

Please note: Users are not allowed to use the mpirun command; always start your jobs with mpiexec as shown above.

A useful caveat: If you use the csh or tcsh shell, the stdout file (the file ending in OU) will contain a couple of lines of error messages, namely

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
This is due to a conflict between the shell and mpiexec when the job runs without a controlling terminal. You can safely ignore these messages.


The Scheduler Command qstat

Once you have submitted your job to the scheduler, you will want to confirm that it has been entered into the queue. Use qstat at the command-line to get output similar to this:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
635.mgtnode     gobbert  workq    MPI_DG       2320   8   1    --  10000 R 716:0
636.mgtnode     gobbert  workq    MPI_DG       2219   8   1    --  10000 R 716:1
665.mgtnode     gobbert  workq    MPI_Nodesu    --   16   1    --  10000 Q   -- 
704.mgtnode     gobbert  workq    MPI_Nodesu  12090  15   1    --  10000 E 00:00
705.mgtnode     kallen1  workq    MPI_Aout      --    1   1    --  10000 Q   -- 
706.mgtnode     gobbert  workq    MPI_Nodesu    --   15   1    --  10000 Q   -- 
707.mgtnode     gobbert  workq    MPI_Nodesu    --   15   1    --  10000 Q   -- 

(This is an older example; the job-IDs look a little different by now; also, this output was actually obtained with qstat -a.) The first column shows the job-ID assigned by the scheduler. The columns Username and Jobname show the username of the person who submitted the job and the name of the job (chosen by -N in qsub-script).

The most interesting column is the one titled S for "status". It shows what your job is doing at this point in time: The letter Q indicates that your job has been queued, that is, it is waiting for resources to become available and will then be executed. The letter R indicates that your job is currently running. Finally, the letter E says that your job is exiting; this will appear during the shut-down phase, after the job has actually finished execution. See man qstat for more information.

Personal suggestions: I feel that qstat -a gives a little more information. Oftentimes, it is necessary to know which nodes your job is running on; you can see that with qstat -n, which implies the -a option. Finally, if your job is listed as queued (a Q in the S column of qstat), you can find out why it is not running using qstat -f; look for the comment field, which might say something like "not sufficient nodes of requested type available".
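
Since the full output of qstat -f is long, it can help to filter for the comment field directly; here 705 stands for your own job number:

qstat -f 705 | grep comment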

There are two other visual tools available on kali to monitor performance and activity:

  1. xpbsmon: By calling xpbsmon& you get a graphical monitoring window. (Note: This command responds with the message "site_cmds: entered"; I have no idea what this means; just hit return to get your prompt back.) Essentially, this shows the same information as qstat. This example corresponds to the output from qstat -an
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    6307.kali.math. kali-g2  workq    MPI_SPARK    8472  16   1    --  10000 R 133:5
       node018/1+node018/0+node027/1+node027/0+node026/1+node026/0+node025/1
       +node025/0+node024/1+node024/0+node023/1+node023/0+node022/1+node022/0
       +node020/1+node020/0+node019/1+node019/0+node031/1+node031/0+node030/1
       +node030/0+node029/1+node029/0+node028/1+node028/0+node005/1+node005/0
       +node004/1+node004/0+node003/1+node003/0
    6308.kali.math. kali-g2  workq    MPI_SPARK    8894   2   1    --  10000 R 16:44
       node007/1+node007/0+node006/1+node006/0
    6309.kali.math. kali-g2  workq    MPI_SPARK    8126   8   1    --  10000 R 07:37
       node014/1+node014/0+node001/1+node001/0+node013/1+node013/0+node012/1
       +node012/0+node011/1+node011/0+node010/1+node010/0+node009/1+node009/0
       +node008/1+node008/0
    6310.kali.math. kali-g2  workq    MPI_SPARK    8304   1   1    --  10000 R 00:54
       node015/1+node015/0
    

    which also shows the effect of the -n option. Here, kali-g2 is one of the group accounts on kali used by me (Gobbert). Notice that this sample output shows the job-IDs as they appeared in November 2004, truncated to the 15 characters available for output in that column. While the job-IDs look different now, the software still works the same way.

  2. Ganglia: To see the loads more precisely along with lots of other information like memory usage and historical usage data, we have the web based monitoring software Ganglia. The default screen shows the nodes ordered from the heaviest to the lightest load. You can click on "Physical View" (top right corner) for a display that shows the nodes as they are arranged in the two racks.


Remark on the Temporary Location of the Redirected Output Streams

As explained above already, the output streams stderr and stdout are accumulated in files (with names ending in ER and OU, respectively) in a temporary location and moved to your current directory ($PBS_O_WORKDIR) only after your code has finished running.

Often, it is vital to be able to look at these files while the job is still running, for instance, to determine how far your simulation has progressed or whether your code has encountered a problem. Before the operating system upgrade in March 2005, the temporary location of these files was the user's home directory. This has changed, and the files are now accumulated locally on the node on which Process 0 of your job is running.

We are actually not satisfied with this situation, because it makes it very cumbersome to look at these files during a run. We are investigating how to control the location of these files. But in the meantime, the following explains how to find these files in the present situation.

First you have to find out on which nodes your job is running; use qstat -n to get information such as

  803.kali.cl.mat gobbert  workq    MPI_Testio   9069   2   1    --  10000 R 00:00
   node31+node31+node30+node30
telling you that your job runs on nodes node30 and node31, using both CPUs on each. So, in MPI jargon, there are Processes 0, 1, 2, and 3. The order of the nodes in the list returned by qstat -n tells you that in fact Processes 0 and 1 are on node31 and Processes 2 and 3 are on node30. (The confirmation of this fact is the point of having the process numbers and their hostnames printed out in the sample codes available from kali's webpage.)

Having determined that Process 0 is on node31, ssh to that node by saying ssh node31. Change directory to /var/spool/pbs/spool by saying cd /var/spool/pbs/spool. A directory listing (using ls) should show the ER and OU files here (or only the OU file if you requested them to be joined). You can now look at the files using various Linux commands, for instance more or less, or tail or even tail -f; see their man pages for more information.
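
Put together, the steps for the example job above might look like this; the file name assumes job number 803 and the naming pattern described earlier, so adjust both the node and the file name to your own job:

ssh node31
cd /var/spool/pbs/spool
ls
tail -f 803.kali.cl.OU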


The Scheduler Command qdel

To delete a job from the queue or to kill a running job cleanly, use the qdel command with the jobnumber from qstat, as in qdel 636, for instance. See man qdel for more information.


Copyright © 2003-2005 by Matthias K. Gobbert. All Rights Reserved.
This page version 3.7, September 2005.