JOB SUBMISSION ON THE Cray XT4

To ensure fair use of the clusters, all users must run their computations via the batch system. A batch system is a program that manages the jobs users run on the cluster. To start a job, users tell the batch system which executable(s) they want to run, the number of processors and the amount of memory needed, and approximately how long the job will run.

The batch system on the Cray XT4 (hexagon) is called "Torque" (a PBS version) and is the same batch system that runs on fimm. However, because of the Cray XT4 architecture, the syntax is slightly different. The scheduler on hexagon is called Moab, which is the commercial version of the Maui scheduler that runs on fimm. In addition, hexagon uses aprun to execute jobs on the compute nodes, regardless of whether the job is an MPI job or a sequential job. The user therefore has to make sure to call "aprun ./executable", and not just the executable.

Jobs are normally submitted to PBS via shell scripts, which are often called job scripts or batch scripts. Lines in the scripts that start with #PBS are interpreted by Torque as instructions for the batch system. (These lines are interpreted as comments when the script is run in the shell, so there is no magic here: a batch script is a shell script.)
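
The point that a batch script is an ordinary shell script can be checked directly. The following is a minimal sketch; the file name demo_job.sh and the job name "demo" are made up for illustration and the script does not call aprun, so it can be run anywhere.

```shell
# Create a tiny job script; the #PBS line is a directive for Torque only.
cat > demo_job.sh <<'EOF'
#!/bin/sh
#PBS -N "demo"
echo "hello from the job script"
EOF

# Run it directly with the shell: the #PBS line is skipped as a comment.
sh demo_job.sh

# List the lines that Torque would read as directives.
grep '^#PBS' demo_job.sh
```

Run this way, the script prints only the echo line; the #PBS line has no effect until the same file is handed to qsub.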

There are essentially two ways to execute jobs via the batch system.

  • Interactive. The batch system allocates the requested resources or waits until they are available. Once the resources are allocated, you interact with them and with your application via the command line, much as you would on your local (Linux) desktop.
  • Batch. One writes a job script that specifies the required resources, the executables and their arguments. This script is then given to the batch system, which schedules the job and starts it as soon as the resources are available.

Running jobs in batch is the more common way on a compute cluster. Here, one can log off and log on again later to check the status of a job. We recommend running jobs in batch mode.


INTERACTIVE USE

PBS allows the user to use compute nodes interactively, to off-load the master (front) node. Interactive jobs are typically used for code development that involves, for example, a sequence of short test runs. To request one processor and 1000 MB of memory and reserve them for 2 hours, use the command

   qsub -l mppwidth=1,walltime=2:00:00,mppmem=1000mb -A account_no -I

To request 2 CPUs with 1000 MB per process/CPU and reserve them for 2 hours, use the command

   qsub -l mppwidth=2,walltime=2:00:00,mppmem=1000mb -A account_no -I

Here, account_no is your project name for accounting. The -I option requests interactive use. Note that you will be charged for the full time this job holds the CPUs/nodes, even if you are not actively using them. Hence, exit the job (shell) as soon as the interactive work is done.

To launch your program on the compute node, go to /work/$USER and start it with "aprun"; this is mandatory. If "aprun" is omitted, the program is executed on the login node, which in the worst case can crash the login node. Since /home is not mounted on the compute nodes, the job has to be started from /work/$USER.
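
The two rules above (start from /work/$USER, always go through aprun) can be wrapped in a small guard. This is only a sketch: the function names in_work_dir and safe_aprun are hypothetical helpers invented for this illustration and are not part of PBS or the Cray tools.

```shell
# Hypothetical helper: succeed only if directory $1 is /work/<user> or
# lies below it, where <user> is given as $2.
in_work_dir() {
    case "$1" in
        "/work/$2"|"/work/$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: refuse to launch unless the current directory is
# under /work/$USER, then pass all arguments on to aprun unchanged.
safe_aprun() {
    if ! in_work_dir "$PWD" "$USER"; then
        echo "not under /work/$USER: refusing to start $*" >&2
        return 1
    fi
    aprun "$@"
}
```

Inside an interactive job one would then type, for example, "safe_aprun -n 1 ./program" instead of calling aprun directly.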


BATCH USE

We illustrate the use of batch job scripts and submission with a few examples.

Sequential jobs. To use 1 processor (CPU) for at most 60 hours of wall-clock time and 1000 MB of memory, the PBS job script must contain the line

#PBS -l mppwidth=1,walltime=60:00:00,mppmem=1000mb

Please note that Fimm, Stallo and Titan are much better suited for sequential jobs. Hexagon should therefore only be used for parallel jobs.

Below is a complete example of a PBS script for executing a sequential job.

#! /bin/sh -
#
# Give the job a name (optional)
#PBS -N "seqjob"
#
# Specify the project the job should be accounted on (obligatory)
#PBS -A account_no
#
# The job needs at most 60 hours wall-clock time on 1 CPU (obligatory)
#PBS -l mppwidth=1,walltime=60:00:00
#
# The job needs at most 1000mb of memory (obligatory)
#PBS -l mppmem=1000mb
#
# Write the standard output of the job to file 'seqjob.out' (optional)
#PBS -o seqjob.out
#
# Write the standard error of the job to file 'seqjob.err' (optional)
#PBS -e seqjob.err
#
# Make sure I am in the correct directory (must be under /work/$USER)
cd /work/$USER/seqwork

# Invoke the (sequential!) executable
aprun -n 1 -m 1000M ./program
 
Parallel/MPI jobs. To use 16 CPUs for at most 60 hours of wall-clock time, the PBS job script must contain the line

#PBS -l mppwidth=16,walltime=60:00:00

Below is an example of a PBS script for executing an MPI job.

#! /bin/sh -
#
#  Make sure I use the correct shell.
#
#PBS -S /bin/sh
#
#  Give the job a name
#
#PBS -N "mpijob"
#
#  Specify the project the job belongs to
#
#PBS -A account_no
#
#  We want 60 hours on 16 cpu's:
#
#PBS -l walltime=60:00:00,mppwidth=16
#
#  The job needs 1000 MB memory per process:
#PBS -l mppmem=1000mb
#
#  Send me an email on  a=abort, b=begin, e=end
#
#PBS -m abe
#
#  Use this email address (check that it is correct):
#PBS -M your@email.address.com
#
#  Write the standard output of the job to file 'mpijob.out' (optional)
#PBS -o mpijob.out
#
#  Write the standard error of the job to file 'mpijob.err' (optional)
#PBS -e mpijob.err
#
#  Make sure I am in the correct directory
mkdir -p /work/$USER/mpiwork
cd /work/$USER/mpiwork

# For hexagon use:
aprun -n 16 -m 1000M ./program

# Return the exit status of aprun as the exit status of the job script:
exit $?


IMPORTANT BATCH JOB ATTRIBUTES AND RESTRICTIONS

-A : a job script must specify a valid project name for accounting, otherwise it will not be possible to submit jobs to the batch system.

-l : resources are specified with the -l option. A number of resources can be specified; see the examples above for the correct syntax. Jobs must specify the number of processors (CPUs) and the maximum allowed wall-clock time for execution. If they are not specified, the default values are 1 CPU and 60 minutes, respectively. Make sure that you request a sufficient amount of memory, or you risk crashing the node for lack of memory. Note that mppmem=XXXmb is a per-process amount.

  • mppwidth : the number of processing elements. So mppwidth=16 means that sixteen processing elements are requested. This argument must match the -n argument for aprun.
  • mppnppn : the number of processing elements per node. If one needs fewer processes on each compute node than the default, one can set this attribute (for example mppnppn=1 or mppnppn=2). mppnppn=4 is the default. If included, this argument has to match the -N argument for aprun. Note that even if mppnppn=1 is set, the job will be accounted as mppnppn=4 in the CPU accounting.
  • walltime : the maximum allowed wall-clock time for execution of the job. If the specified time is too short, the job will be killed before it completes.
  • mppmem : an upper limit for the memory usage per process for a job. An explanation of how to request more memory can be found here. If the memory limit is exceeded, the job may be killed by the system.
  • mppdepth : the number of threads (for example OpenMP threads) per processing element. This must match the -d argument for aprun.
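
Because mppwidth and the -n argument of aprun are written in two different places, they are easy to get out of step. The following sketch checks a job script before submission. The function name check_width is a hypothetical helper, and the check assumes that the aprun call starts at the beginning of a line, as in the examples above.

```shell
# Hypothetical checker: compare mppwidth on the #PBS lines of job script $1
# with the -n value on its aprun line.
# Returns 0 if they agree, 1 on a mismatch, 2 if either value is missing.
check_width() {
    script=$1
    # First mppwidth=<number> found on a #PBS line.
    width=$(sed -n 's/^#PBS.*mppwidth=\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    # First "-n <number>" found on a line that starts with aprun.
    n=$(sed -n 's/^aprun.* -n  *\([0-9][0-9]*\).*/\1/p' "$script" | head -n 1)
    if [ -z "$width" ] || [ -z "$n" ]; then
        echo "could not find mppwidth or aprun -n in $script" >&2
        return 2
    fi
    if [ "$width" -ne "$n" ]; then
        echo "mismatch: mppwidth=$width but aprun -n $n" >&2
        return 1
    fi
    echo "ok: mppwidth=$width matches aprun -n $n"
}
```

A possible use is "check_width job.pbs && qsub job.pbs", so that a script with inconsistent values is never submitted.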

-o, -e : see the examples above. If these options are not used, so that no file names are specified, the standard output and standard error of the job are stored in files named after the job, for example mpijob.o## and mpijob.e##, where ## is the job number assigned by PBS when the job is submitted.
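
As a small illustration of this naming rule, the sketch below derives the default file names from a job name and a job identifier. The function name default_output_files is hypothetical, and the identifier format "number.server" (here "12345.server") is an assumption made for the example; check the identifier that qsub actually prints on your system.

```shell
# Hypothetical helper: print the default stdout/stderr file names of a job.
# $1 = job name (as given with -N), $2 = job identifier printed by qsub.
default_output_files() {
    # Keep only the part of the identifier before the first dot.
    num=${2%%.*}
    echo "$1.o$num $1.e$num"
}

default_output_files mpijob 12345.server
# prints: mpijob.o12345 mpijob.e12345
```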

Finally, the examples on this page only show a small number of the attributes that PBS has. See 'man pbs_resources_linux' for more attributes.


JOB CONTROL AND MONITORING

The most important batch system commands are:

  • qsub. A PBS script called job.pbs is submitted with the command
      qsub job.pbs   : submit the script job.pbs
    
    Queues and priorities are chosen automatically by the system. The command qsub returns a job identifier (number) that can be used to monitor the status of the job (in the queue and during execution).
  • qstat. The job status can be shown by the command
      qstat          : display jobs
      qstat -a       : display all jobs in alternative format
      qstat -f       : display full status of all jobs (long output) 
      qstat -f jobid : display full status of a specific job 
    
    or by using the graphical front-end xpbs.
  • qdel. A queued or running job can be killed (or cancelled) by the command
      qdel jobid     : delete a specific batch job
    
  • showq. Displays the actual job ordering for the scheduler, separated into three lists: active, eligible, and blocked jobs.
      showq          : display jobs
      showq -u $USER : display jobs for $USER
    
  • checkjob. Displays detailed job state information.
      checkjob jobid : display status for job
    
Here is a list of the most important commands in tabular form (manual pages are recommended):

  PBS                  Purpose
            
  qsub                 Submit a job
  qdel                 Cancel a job
  qstat                Get job status
  qstat -Q             Get available queues
  qstat -Q -f          Show queue information
  qstat -B -f          Show PBS Server status
  qhold                Temporarily stop job
  qrls                 Resume job
  qhold                Checkpoint job
  qrls                 Restart from checkpoint
  showq                Displays the job ordering of the scheduler
  showq -u $USER       Displays the job ordering for $USER
  showstart jobid      Displays estimated start time of job
  checkjob  jobid      Displays status for job
  apstat               Provides status information for Cray XT systems applications
  xtprocadmin          Display or set processor flags in the Cray XT series system database
  xtshowcabs           Shows information about compute and service partition processors and the jobs running in each partition
  pbsnodes -a          Show status of nodes in cluster
  tracejob             Trace job information from PBS log files
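
The long output of "qstat -f jobid" contains a line of the form "job_state = X", where X is a one-letter state such as Q (queued) or R (running). The sketch below extracts that letter so it can be used in scripts. The function name job_state is a hypothetical helper, and the input line in the example is illustrative rather than captured from a real job.

```shell
# Hypothetical helper: read "qstat -f" output on standard input and print
# the one-letter job state found on the "job_state = X" line.
job_state() {
    sed -n 's/^[[:space:]]*job_state = \([A-Z]\).*$/\1/p'
}

# Illustrative input in the format used by qstat -f.
printf '    job_state = Q\n' | job_state
# prints: Q
```

On the system itself the helper would be fed from the real command, for example "qstat -f jobid | job_state".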


APRUN ARGUMENTS

The resources you request in PBS have to match the arguments for aprun. So if you ask for "#PBS -l mppmem=900mb", you need to add the argument "-m 900M" to aprun.

    -N processing elements per node   should equal the value of mppnppn
    -n processing elements            should equal the value of mppwidth
    -d number of threads              should equal the value of mppdepth
    -m memory per element             should equal the amount of memory requested with mppmem; the suffix should be M

A complete list of aprun arguments can be found on the man page of aprun.
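
One way to keep the two sets of values in step is to build the aprun command line from the same numbers that go into the #PBS request. The following is a sketch under that assumption; the function name aprun_line is a hypothetical helper, and it only prints the command rather than running it.

```shell
# Hypothetical helper: print an aprun command line built from the values
# used in the #PBS -l request.
# Usage: aprun_line MPPWIDTH MPPNPPN MPPDEPTH MPPMEM COMMAND...
aprun_line() {
    width=$1 nppn=$2 depth=$3 mem=$4
    shift 4
    # Strip a trailing "mb" (any case) so that "900mb" becomes "900".
    mem_mb=${mem%[mM][bB]}
    echo "aprun -n $width -N $nppn -d $depth -m ${mem_mb}M $*"
}

aprun_line 16 4 1 900mb ./program
# prints: aprun -n 16 -N 4 -d 1 -m 900M ./program
```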

Questions on how to make scripts, general use of the system, job-dependencies, etc. can be sent to hpc-support@hpc.uib.no