Using the Batch System

4Pack is now the first cluster in the SCL, and one of the first in the world, to run the new Scalable Systems Software resource management suite. This software is designed for scalability and fault tolerance and removes many of the limitations of PBS. However, the suite is still very young and is likely to have many bugs and unimplemented features. The purpose of its use on 4Pack is to gain real-world experience with the software before installing it on more production-oriented systems.

All jobs must be run within the batch system. Violators will have their jobs terminated and may have their accounts deactivated.

Almost all commands related to the batch system are from Bamboo. Man pages for those commands can be found on the login node for more detailed help. References to all of these man pages can be found by starting on the main page ("man bamboo").

The following three sections divide up the user interface to the batch system:


Submitting Jobs

Jobs are submitted with the 'qsub' command. Using this command you tell the batch system the name of the execution script or command, the name of the output file(s), and what resources your job needs to run. You should run qsub from the server (4pack). A common form of the command is:

qsub -j oe -o Your-output-file -l nodes=8,walltime=2:00:00 Your_Script

-j oe : This flag causes the error file to be combined into the output file.

-o Your-output-file: This is where any output to stdout from your execution script will go. stdout is normally for debugging or status type messages. If you explicitly create an output file in your code, you don't need to worry about this option. NOTE: This file is not copied to its final location until after the job finishes. If you need to monitor the output from a job while it is running you should redirect stdout in your execution script directly to a file in your home directory structure.

-l nodes=8,walltime=2:00:00 : This gives your resource request. First is the requested number of nodes (1-16 for inferno, 1-64 on ALICE). Second is the requested wall time in the format days:hours:minutes:seconds. You may leave off leading zeros, but not trailing zeros. The example requests 2 hours.

Your_Script: This is the name of an executable file (shell script or binary) which will be run when the job is started.
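The note under -o above suggests redirecting stdout inside the execution script when you want to watch a job while it runs. Here is a minimal sketch of that idea. The log file name (myjob.progress) and the echo lines are placeholders chosen for illustration, not anything the batch system requires.

```shell
#!/bin/sh
# Hypothetical log location; any file under your home directory would do.
LOGFILE=${LOGFILE:-${HOME:-/tmp}/myjob.progress}

# Everything written to stdout or stderr inside the braces goes to $LOGFILE
# as it is produced, instead of waiting for the batch system to copy the
# -o file at the end of the job.
{
    echo "starting step 1"
    # ... your real commands go here ...
    echo "finished"
} > "$LOGFILE" 2>&1
```

While the job is running you could then follow the file from the login node with "tail -f" on the same path.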

Most of the flags may also be placed at the beginning of the execution script, one per line, preceded on each line by "#PBS ". Please see the qsub man page for more information.
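As an illustration of the "#PBS " form, the sketch below carries the same flags as the qsub example above inside the script itself. It assumes the flags are accepted unchanged in this position; check the qsub man page on 4Pack before relying on it. The body of the script is a placeholder.

```shell
#!/bin/sh
#PBS -j oe
#PBS -o Your-output-file
#PBS -l nodes=8,walltime=2:00:00
#
# The three "#PBS " lines above repeat the flags from the qsub example,
# so this file could be submitted with just:  qsub Your_Script
# To the shell they are ordinary comments, so the script also runs by hand.
started="job script started at $(date)"
echo "$started"
```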


Information for the execution script (What is a job?)

Once your job is submitted and scheduled for execution on a node or nodes, the batch system will set up the nodes for your job and then execute the file you provided to qsub. By default, any output from the script or the commands it executes is sent to Your-output-file. If you are running on only one node and do not need local disk storage, you may not need any more information about the execution environment. However, in most cases you will need to know at least the path to the local scratch disk space where you are allowed to write temporary files, and the list of nodes that your job is assigned to run on. The execution environment for jobs includes a couple of variables for discovering these.

First is the variable $SSS_JOBID. This is the name of the job in the batch system. On the SCL clusters it is used to obtain the path to your local scratch space. On 4Pack it is /scratch/$SSS_JOBID. This directory is created on each node assigned to your job immediately before your job is started and is removed (including all of its contents!) upon termination of your job.
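Because the scratch directory is removed when the job ends, anything worth keeping has to be copied out before the script exits. The following is a rough sketch of that pattern. The destination under $HOME/results and the helper name save_scratch are made up for illustration, and the copy only covers the node on which the script itself runs.

```shell
#!/bin/sh
# Copy the contents of directory $1 into directory $2, creating $2 if needed.
save_scratch() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

if [ -n "${SSS_JOBID:-}" ]; then
    SCR=/scratch/$SSS_JOBID          # per-job scratch directory on 4Pack
    DEST=$HOME/results/$SSS_JOBID    # hypothetical permanent location
    cd "$SCR" || exit 1
    # ... run your program here, writing temporary files into $SCR ...
    save_scratch "$SCR" "$DEST"
else
    echo "SSS_JOBID is not set; this script is meant to run inside a batch job"
fi
```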

Second, the list of nodes assigned to your job is given as a space-separated list in the environment variable $SSS_HOSTLIST. For your convenience this information is also given in the file /scratch/$SSS_JOBID/SSS_HOSTLIST on each node of the job. This file lists the nodes one per line, so it may need some modification depending on your needs. In order to run MPICH programs using the mpirun command you need to create a script which takes the node list and converts it into a form acceptable to mpirun. Here is an example sh script, modified for use on the G4s:

#!/bin/sh
#
#  this is a simple script for running an mpi program via the batch system
#
#  If you need to use disk for scratch use the following
export SCR=/scratch/$SSS_JOBID
#
#  Set the following to the name (and path) of your executable
export EXECUTABLE_NAME=/path/to/your/executable/yourexe
#
#  Compute the # of hosts from the dynamic hostlist
export nhosts=`wc -l $SCR/SSS_HOSTLIST | awk '{print $1}'`
#
# the following causes the subsequent commands to be echoed to stderr
#set -o xtrace
#
# run the mpi program
/usr/local/mpich-gm/bin/mpirun -np $nhosts -machinefile /scratch/$SSS_JOBID/SSS_HOSTLIST \
    $EXECUTABLE_NAME
#
# finally exit
exit

The script dynamically determines the number of processors assigned to the job from the host list file, then passes the number of processes and the machinefile to mpirun. NOTE: mpirun will not run correctly if you don't pass in the number of processes and the machinefile in this fashion.

The above script is available on 4pack in the /usr/local/samples directory (/usr/local/samples/mpiscript.sh).
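If you would rather work from the $SSS_HOSTLIST environment variable than from the file, the space-separated list can be split and counted in plain sh. This is only a sketch: the helper names and the sample node names (node1, node2, node3) are invented so that the script can be tried outside a job.

```shell
#!/bin/sh
# Print a space-separated host list one host per line.
hosts_per_line() {
    # $1 is deliberately unquoted so the shell splits it on whitespace.
    printf '%s\n' $1
}

# Print the number of hosts in a space-separated host list.
count_hosts() {
    set -- $1
    echo $#
}

# Inside a job SSS_HOSTLIST is set by the batch system; the fallback value
# below is a made-up sample for trying the script by hand.
SSS_HOSTLIST=${SSS_HOSTLIST:-"node1 node2 node3"}

hosts_per_line "$SSS_HOSTLIST"
nhosts=$(count_hosts "$SSS_HOSTLIST")
echo "nhosts=$nhosts"
```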


Monitoring and deleting jobs and obtaining information on idle nodes

There are a couple of commands useful for monitoring the status of queued and running jobs. The first is qstat. By itself, qstat will list the job ID, the job name, the user, the time used, the status (Q=queued, R=running, E=exiting) and the queue name (not generally useful information here). qstat -n will also include the requested number of nodes and time and, if the job is running, the nodes assigned to the job.

Thus qstat will tell you the status of your job, but it doesn't tell you much about how many nodes are free or how long it will take to start a job. For that information you can use the showq and showbf commands. The showq command lists all the running and queued jobs in priority order, while the showbf command lists how many nodes are available and for how long. NOTE: submitting a job requesting less time and fewer processors than showbf says are available will result in the job running immediately, and thus is very useful for small jobs.

To delete a job (running or queued) you can use the qdel command as:
qdel Your_JOBID
where Your_JOBID can be obtained from qstat (you can enter just the number or the full #.servername form).
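Since either form of the job ID is accepted, a script that handles IDs may want to reduce them to the bare number. A small sketch follows; the helper name job_number and the sample ID 1234.4pack are made up for illustration.

```shell
#!/bin/sh
# Strip an optional ".servername" suffix from a job ID.
job_number() {
    echo "${1%%.*}"
}

# Hypothetical job ID in the "#.servername" form.
jobid=1234.4pack
echo "to delete this job: qdel $(job_number "$jobid")"
```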