Using the Batch System

Several of the clusters within the SCL, currently inferno and ALICE, use the Portable Batch System (PBS) with the Maui Scheduler as the batch queueing system. On those clusters jobs are restricted to the batch environment except by special permission from the system administrators. Violators will have their jobs terminated and may have their accounts deactivated. Almost all commands related to the batch system come from PBS; man pages for those commands are available on the login node of each system for more detailed help.

The following sections divide up the user interface to the batch system: submitting jobs, the execution script environment, and monitoring and deleting jobs.


Submitting Jobs

Jobs are submitted with the 'qsub' command. With this command you tell the batch system the name of the execution script or command, the name of the output file(s), and what resources your job needs to run. You should run qsub from the front-end node (i.e., doormouse for ALICE; alpha1 for inferno). A common form of the command is:

qsub -j oe -o Your-output-file -l nodes=8,walltime=2:00:00 Your_Script

-j oe: This flag causes the error output (stderr) to be combined into the output file.

-o Your-output-file: This is where any output to stdout from your execution script will go. stdout is normally used for debugging or status-type messages. If you explicitly create an output file in your code, you don't need to worry about this option. NOTE: This file is not copied to its final location until after the job finishes. If you need to monitor the output from a job while it is running, redirect stdout in your execution script directly to a file in your home directory structure (see the sketch after these flag descriptions).

-l nodes=8,walltime=2:00:00: This gives your resource request. First is the requested number of nodes (1-16 on inferno, 1-64 on ALICE). Second is the requested wall time in the format days:hours:minutes:seconds. You may leave off leading fields, but not trailing ones; the example requests 2 hours.

Your_Script: This is the name of an executable file (shell script or binary) which will be run when the job is started.
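
As a minimal sketch of the stdout redirection mentioned above (the program name and paths are placeholders), an execution script might contain:

#!/bin/csh
# send stdout and stderr to a file in your home directory so you can
# watch progress (e.g., with tail -f) while the job runs
/home/username/myprog >& /home/username/myprog.log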

Additional flags you might find useful are documented in the qsub man page. Most of the flags may also be placed at the beginning of the execution script, one per line, each line preceded by "#PBS ".
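
For example, the resource requests from the qsub command above could be embedded in the script itself instead of given on the command line (a sketch using the flags already described; the output file name is a placeholder):

#!/bin/csh
#PBS -j oe
#PBS -o Your-output-file
#PBS -l nodes=8,walltime=2:00:00
#
# commands to run your job follow here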


Information for the execution script (What is a job?)

Once your job is submitted and scheduled for execution on a node or nodes, the batch system will set up the nodes for your job and then execute the file you provided to qsub. By default any output from the script, or from commands the script executes, is sent to Your-output-file. If you are running on only one node and do not need local disk storage, you may not need any more information about the execution environment. In most cases, however, you will need to know at least the path to the local scratch disk space where you are allowed to write temporary files, and the list of nodes that your job is assigned to run on. The execution environment for jobs includes a couple of variables to discover these features.

First is the variable $PBS_JOBID. This is the name of the job in the batch system. On the SCL clusters it is used to obtain the path to your local scratch space. On inferno the path is /scr/$PBS_JOBID and on ALICE it is /scratch/$PBS_JOBID. This directory is created on each node assigned to your job immediately before your job is started and is removed (including all of its contents!) upon termination of your job.
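
A minimal sketch of using the scratch space (the file names and home directory path are placeholders; remember that scratch is deleted when the job ends, so copy results back first):

#!/bin/csh
# use /scr/$PBS_JOBID on inferno, /scratch/$PBS_JOBID on ALICE
set SCR=/scratch/$PBS_JOBID
cp /home/username/input.dat $SCR/       # stage input to local scratch
cd $SCR
# ... run your program here ...
cp $SCR/results.dat /home/username/     # save results before scratch is removed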

Second, the list of nodes assigned to your job is given by the file pointed to by $PBS_NODEFILE, as long as you specify -l nodes=something. If you do not specify nodes=n at submission you will be assigned one node, but $PBS_NODEFILE won't exist. This file lists the nodes one per line, so it may need some modification depending on your needs. In order to run MPICH programs using the mpirun command, you need to create a script which takes the node list and converts it into a form acceptable to mpirun. Here is an example csh script:


#!/bin/csh
#
#  NOTE: The following starts with 'scratch' on ALICE and 'scr' on Inferno
#
set SCR=/scratch/$PBS_JOBID

#Set the following to the name and path to your executable
set EXECUTABLE_NAME=/home/yourname/yourexe

#default the hostlist to something sensible
echo `hostname` > $SCR/mpi-hostlist
set nmax=1

#most cases will use the following
if (-e $PBS_NODEFILE) then # This file will always exist if more than 1 node is requested
	rm -f $SCR/mpi-hostlist
	set nmax=`wc -l $PBS_NODEFILE`	# wc prints "count filename"
	set nmax=$nmax[1]		# keep only the count
	@ NNODE=1
	while ($NNODE <= $nmax)
		set NODENAME=`sed -n -e "$NNODE p" $PBS_NODEFILE`
		echo $NODENAME >> $SCR/mpi-hostlist
		@ NNODE++
	end
# MPICH requires the file on every node so distribute it out
	@ NNODE=2
	while ($NNODE <= $nmax)
		set NODENAME=`sed -n -e "$NNODE p" $PBS_NODEFILE`
		scp $SCR/mpi-hostlist ${NODENAME}:$SCR/mpi-hostlist
		@ NNODE++
	end
endif

mpirun -np $nmax -machinefile $SCR/mpi-hostlist $EXECUTABLE_NAME

The script dynamically determines the number of processors assigned to the job, creates a hostfile, and distributes it to each node. It then passes the number of processes and the machinefile to mpirun. NOTE: mpirun will not run correctly if you don't pass in the number of processes and the machinefile in this fashion.
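
To submit the script above you would use qsub as shown earlier, for example (the node count and walltime are illustrative, and your_mpi_script is a placeholder for the file containing the script):

qsub -j oe -o Your-output-file -l nodes=8,walltime=2:00:00 your_mpi_script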

GAMESS users can find a gms and rungms script in ~brett/gamess that when copied to your own gamess directory can be used to submit and run GAMESS jobs in the queue system.

Important Note: ALICE users must use scp rather than rcp for the hostlist copy, as shown in the script above.


There are a few things to keep in mind if your program deals with files (reading or writing data files).

The current working directory when your script runs is the base of your home directory (/home/username/). If you want to use data files in a directory other than your base home directory, you'll need to do one of two things:

1) Modify your code so it looks for the data files relative to the current working directory ("."). For example (in C):

fopen("datafile","r")

becomes:

fopen("./datafile","r")

This makes the executable look for data files in the directory it is running in.

2) Add a "cd datafiles_directory" before the mpirun statement in the default script.

Example:

{cut}
mpirun -np $nmax -machinefile $SCR/mpi-hostlist $EXECUTABLE_NAME

becomes:

{cut}
cd /home/username/datafiles/directory/name
mpirun -np $nmax -machinefile $SCR/mpi-hostlist $EXECUTABLE_NAME

Monitoring and deleting jobs and obtaining information on idle nodes

There are a couple of commands useful for monitoring the status of queued and running jobs. The first is qstat. By itself, qstat lists the job ID, the job name, the user, the time used, the status (Q=queued, R=running, E=exiting), and the queue name (not generally useful information here). qstat -n also includes the requested number of nodes and time, and, if the job is running, the nodes assigned to the job.

Thus qstat will tell you the status of your job, but it doesn't tell you much about how many nodes are free or how long it will take to start a job. For that information you can use the showq and showbf commands. showq lists all running and queued jobs in priority order, while showbf lists how many nodes are available and for how long. NOTE: submitting a job that requests less time and fewer processors than showbf says are available will result in the job running immediately, which is very useful for small jobs.
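
A sketch of this workflow (the numbers are illustrative):

showbf                                          # suppose it reports 4 nodes free for 3:00:00
qsub -l nodes=4,walltime=2:00:00 Your_Script    # fits within the free slot, so it should start immediately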

To delete a job (running or queued), use the qdel command:
qdel Your_JOBID
where Your_JOBID can be obtained from qstat (you may enter just the number, or the full number.servername form).
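
For example, if qstat lists your job as 1234.alpha1 (an illustrative job ID), either form deletes it:

qdel 1234
qdel 1234.alpha1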