JOB SUBMISSION ON THE Cray XT4
To ensure fair use of the clusters, all users must run their computations via the batch system. A batch system is a program that manages the jobs (programs) that users run on the cluster. To start a job, the user tells the batch system which executable(s) to run, the number of processors and the amount of memory needed, and approximately how long the job will run.
The batch system on the Cray XT4 is called Torque (a PBS version) and is the same batch system that runs on fimm. However, because of the Cray XT4 architecture, the syntax is slightly different. The scheduler on hexagon is called Moab, which is the commercial version of the Maui scheduler that runs on fimm. In addition, hexagon uses aprun to execute jobs on the compute nodes, regardless of whether the job is an MPI job or a sequential job. The user therefore has to call "aprun ./executable", and not just the executable.
Jobs are normally submitted to PBS via shell scripts, which are often called job scripts or batch scripts. Lines in the scripts that start with #PBS are interpreted by Torque as instructions for the batch system. (These lines are interpreted as comments when the script is run in the shell, so there is no magic here: a batch script is a shell script.)
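That a batch script is an ordinary shell script can be checked without the batch system at all. The sketch below is only an illustration: it writes a small script with #PBS lines to a temporary file and runs it with plain sh. The job name "demo" and the five-minute walltime are made up for the example, and account_no is the placeholder used elsewhere on this page.

```shell
# Write a tiny script containing #PBS directives to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
#PBS -N "demo"
#PBS -A account_no
#PBS -l mppwidth=1,walltime=00:05:00
echo "hello from the shell"
EOF

# Run it with plain sh: the #PBS lines are skipped like any other comment,
# so only the echo produces output.
output=$(sh "$script")
echo "$output"
rm -f "$script"
```

Only Torque gives the #PBS lines a meaning, and only when the same file is handed to qsub.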
There are essentially two ways to execute jobs via the batch system.
- Interactive. The batch system allocates the requested resources or waits until these are available. Once the resources are allocated, interaction with these resources and your application is via the command-line and very similar to what you normally would do on your local (Linux) desktop.
- Batch. One writes a job script that specifies the required resources and executables and arguments. This script is then given to the batch system that will then schedule this job and start it as soon as the resources are available.
INTERACTIVE USE
PBS allows the user to use compute nodes interactively, to off-load the master (front) node. Interactive jobs are typically used for code development that involves, for example, a sequence of short runs. To request one processor with 1000 MB of memory and reserve it for 2 hours, use the command
    qsub -l mppwidth=1,walltime=2:00:00,mppmem=1000mb -A account_no -I
To request 2 CPUs, with 1000 MB per process/CPU, and reserve them for 2 hours, use the command
    qsub -l mppwidth=2,walltime=2:00:00,mppmem=1000mb -A account_no -I
Here, account_no is your project name for accounting. The option that specifies interactive use is -I.
Note that you will be charged for the full time this job
allocates the CPUs/nodes, even if you are not actively using
these resources. Hence, exit the job (shell) as soon as
the interactive work is done.
To launch your program on the compute node you go to /work/$USER and then you HAVE to use "aprun".
If "aprun" is omitted the program is executed on the login node, which in the worst case can crash the login node.
Since /home is not mounted on the compute node, the job has to be started from /work/$USER.
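Because a job started from the wrong directory fails on the compute node, it can help to check the working directory before calling aprun. The following is a minimal sketch, not a site-provided tool: under_work is a hypothetical helper name, and the user name someuser is made up for the demonstration.

```shell
# Return success only if directory $1 is /work/<user> or somewhere below it.
# under_work is a hypothetical helper; directory and user are passed explicitly.
under_work() {
    case "$1" in
        "/work/$2"|"/work/$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use inside an interactive job, before launching the program:
#   under_work "$PWD" "$USER" && aprun -n 1 ./program
if under_work "/home/someuser" "someuser"; then
    echo "ok to launch"
else
    echo "change to /work/someuser first"
fi
```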
BATCH USE
We illustrate the use of batch job scripts and submission with a few examples.
Sequential jobs. To use 1 processor (CPU) for at most 60 hours of wall-clock time and 1000 MB of memory, the PBS job script must contain the line
    #PBS -l mppwidth=1,walltime=60:00:00,mppmem=1000mb
Please note that Fimm, Stallo and Titan are much better suited for sequential jobs. Hexagon should therefore only be used for parallel jobs. Below is a complete example of a PBS script for executing a sequential job.
    #! /bin/sh -
    #
    # Give the job a name (optional)
    #PBS -N "seqjob"
    #
    # Specify the project the job should be accounted on (obligatory)
    #PBS -A account_no
    #
    # The job needs at most 60 hours wall-clock time on 1 CPU (obligatory)
    #PBS -l mppwidth=1,walltime=60:00:00
    #
    # The job needs at most 1000mb of memory (obligatory)
    #
    #PBS -l mppmem=1000mb
    #
    # Write the standard output of the job to file 'seqjob.out' (optional)
    #PBS -o seqjob.out
    #
    # Write the standard error of the job to file 'seqjob.err' (optional)
    #PBS -e seqjob.err
    #
    # Make sure I am in the correct directory
    cd /work/utby/seqwork
    # Invoke the (sequential!) executable
    aprun -n 1 -m 1000M ./program
Parallel/MPI jobs. To use 16 CPUs for at most 60 hours of wall-clock time, the PBS job script must contain the line
    #PBS -l mppwidth=16,walltime=60:00:00
Here we request 16 CPUs. Below is an example of a PBS script for executing an MPI job.
    #! /bin/sh -
    #
    # Make sure I use the correct shell.
    #
    #PBS -S /bin/sh
    #
    # Give the job a name
    #
    #PBS -N "mpijob"
    #
    # Specify the project the job belongs to
    #
    #PBS -A account_no
    #
    # We want 60 hours on 16 cpu's:
    #
    #PBS -l walltime=60:00:00,mppwidth=16
    #
    # The job needs 1000 MB memory per process:
    #PBS -l mppmem=1000mb
    #
    # Send me an email on a=abort, b=begin, e=end
    #
    #PBS -m abe
    #
    # Use this email address (check that it is correct):
    #PBS -M your@email.address.com
    #
    # Write the standard output of the job to file 'mpijob.out' (optional)
    #PBS -o mpijob.out
    #
    # Write the standard error of the job to file 'mpijob.err' (optional)
    #PBS -e mpijob.err
    #
    # Make sure I am in the correct directory
    mkdir -p /work/$USER/mpiwork
    cd /work/$USER/mpiwork
    # For hexagon use:
    aprun -n 16 -m 1000M ./program
    # Return the exit status of aprun as the exit status of the job script:
    exit $?
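Since the process count and memory limit appear both in the #PBS lines and on the aprun line, one way to keep them consistent is to generate the job script from a single set of values. The sketch below assumes nothing beyond the options already shown in the MPI example; make_mpi_job is a hypothetical helper name, and account_no and ./program are the same placeholders as above.

```shell
# Write an MPI job script in which mppwidth and the aprun -n value come from
# the same variable, so the two cannot drift apart.
# make_mpi_job is a hypothetical helper; account_no and ./program are placeholders.
make_mpi_job() {
    nprocs=$1 walltime=$2 mem_mb=$3 outfile=$4
    cat > "$outfile" <<EOF
#! /bin/sh -
#PBS -S /bin/sh
#PBS -N "mpijob"
#PBS -A account_no
#PBS -l walltime=${walltime},mppwidth=${nprocs}
#PBS -l mppmem=${mem_mb}mb
mkdir -p /work/\$USER/mpiwork
cd /work/\$USER/mpiwork
aprun -n ${nprocs} -m ${mem_mb}M ./program
exit \$?
EOF
}

# 16 processes, 60 hours, 1000 MB per process, written to a temporary file.
jobfile=$(mktemp)
make_mpi_job 16 60:00:00 1000 "$jobfile"
cat "$jobfile"
```

The generated file can then be submitted in the usual way with qsub.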
IMPORTANT BATCH JOB ATTRIBUTES AND RESTRICTIONS
- mppwidth : the number of processing elements. So mppwidth=16 means that sixteen processing elements are requested. This argument must match the -n argument for aprun.
- mppnppn : the number of processing elements per node. If one needs only one or two processing elements on each compute node, one can set mppnppn=1 or mppnppn=2; mppnppn=4 is the default. If included, this argument has to match the -N argument for aprun. Note that even if mppnppn=1 is set, the job will be accounted as mppnppn=4 in the CPU accounting.
- walltime : the maximum allowed wall-clock time for execution of the job. If the specified time is too short, the job will be killed before it completes.
- mppmem : an upper limit for the memory usage per process for a job. An explanation of how to request more memory can be found here. If the memory requirement is exceeded, the job may get killed by the system.
- mppdepth : the number of OpenMP threads per processing element. This must match the -d argument for aprun.
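The accounting remark in the mppnppn item can be made concrete with a little arithmetic. The sketch below ASSUMES that whole nodes of four cores are charged, which is our reading of "accounted as mppnppn=4" and not a statement of the actual accounting rules; accounted_cores is a hypothetical helper name.

```shell
# Estimate how many cores a job is accounted for, ASSUMING whole nodes of
# four cores each are charged (our reading of the mppnppn note).
# accounted_cores is a hypothetical helper: arguments are mppwidth and mppnppn.
accounted_cores() {
    width=$1 nppn=$2
    nodes=$(( (width + nppn - 1) / nppn ))   # round up to whole nodes
    echo $(( nodes * 4 ))
}

accounted_cores 16 4   # 16 processes packed four per node
accounted_cores 16 1   # 16 processes spread one per node
```

Under this assumption, spreading the same 16 processes one per node occupies four times as many nodes and is accounted accordingly.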
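Unless -o and -e are given, the standard output and standard error of a job named mpijob are written to the files mpijob.o## and mpijob.e##, where ## is the job number assigned by PBS when the job is submitted.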
Finally, the examples on this page only show a small number of the
attributes that PBS has. See 'man pbs_resources_linux' for more
attributes.
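Because mppwidth must match the -n argument of aprun, a quick check before submission can catch a mismatch. The following is a rough sketch that handles only the simple script layout used in the examples on this page; check_width is a hypothetical helper name, and the two temporary files stand in for real job scripts.

```shell
# Compare the mppwidth value in the #PBS lines of a job script with the
# -n value on its aprun line. check_width is a hypothetical helper that
# handles only the simple layout used in the examples on this page.
check_width() {
    width=$(sed -n 's/^#PBS .*mppwidth=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    nproc=$(sed -n 's/^aprun .*-n  *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    if [ -n "$width" ] && [ "$width" = "$nproc" ]; then
        echo "ok: mppwidth=$width matches aprun -n $nproc"
    else
        echo "mismatch: mppwidth=$width, aprun -n $nproc"
        return 1
    fi
}

# Two throwaway scripts: one consistent, one not.
good=$(mktemp); bad=$(mktemp)
printf '%s\n' '#PBS -l walltime=60:00:00,mppwidth=16' 'aprun -n 16 -m 1000M ./program' > "$good"
printf '%s\n' '#PBS -l walltime=60:00:00,mppwidth=16' 'aprun -n 8 -m 1000M ./program' > "$bad"
check_width "$good"
check_width "$bad" || echo "fix the script before submitting"
```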
JOB CONTROL AND MONITORING
The most important batch system commands are:
- qsub. A PBS script called job.pbs is submitted with the command
    qsub job.pbs        : submit the script job.pbs
  Queues and priorities are chosen automatically by the system. The command qsub returns a job identifier (number) that can be used to monitor the status of the job (in the queue and during execution).
- qstat. The job status can be shown by the command
    qstat               : display jobs
    qstat -a            : display all jobs in alternative format
    qstat -f            : display full status of all jobs (long output)
    qstat -f jobid      : display full status of a specific job
  or by using the graphical front-end xpbs.
- qdel. A queued or running job can be killed (or cancelled) by the command
    qdel jobid          : delete a specific batch job
- showq. Displays the actual job ordering for the scheduler, separated into three lists: active, eligible, and blocked jobs.
    showq               : display jobs
    showq -u $USER      : display jobs for $USER
- checkjob. Displays detailed job state information.
    checkjob jobid      : display status for job
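The identifier returned by qsub is what the other commands take as jobid. Torque typically prints it as a number followed by a server name; the sketch below assumes that form and strips the suffix to obtain the bare number. The identifier 12345.sdb is a made-up example, and job_number is a hypothetical helper name.

```shell
# Strip the server suffix from a job identifier such as "12345.sdb",
# leaving the bare job number.
# The identifier is a made-up example; job_number is a hypothetical helper.
job_number() {
    echo "${1%%.*}"
}

# In a real session one would capture the identifier from qsub:
#   jobid=$(qsub job.pbs)
jobid="12345.sdb"
num=$(job_number "$jobid")

# Print the monitoring commands one would then run for this job.
echo "qstat -f $num"
echo "checkjob $num"
```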
PBS                 Purpose
qsub                Submit a job
qdel                Cancel a job
qstat               Get job status
qstat -Q            Get available queues
qstat -Q -f         Show queue information
qstat -B -f         Show PBS Server status
qhold               Temporarily stop job
qrls                Resume job
qhold               Checkpoint job
qrls                Restart from checkpoint
showq               Displays the job ordering of the scheduler
showq -u $USER      Displays the job ordering for $USER
showstart jobid     Displays estimated start time of job
checkjob jobid      Displays status for job
apstat              Provides status information for Cray XT systems applications
xtprocadmin         Display or set processor flags in the Cray XT series system database
xtshowcabs          Shows information about compute and service partition processors and the jobs running in each partition
pbsnodes -a         Show status of nodes in cluster
tracejob            Trace job information from PBS log files
APRUN ARGUMENTS
The resources you request in PBS have to match the arguments for aprun. So if you ask for "#PBS -l mppmem=900mb", you need to add the argument "-m 900M" to aprun.
    -N  processing elements per node; should equal the value of mppnppn
    -n  processing elements; should equal the value of mppwidth
    -d  number of threads; should equal the value of mppdepth
    -m  memory per processing element; should equal the amount of memory requested by mppmem. The suffix should be M.
A complete list of aprun arguments can be found on the man page of aprun.
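The correspondence between the PBS attributes and the aprun arguments can be written down as a small helper that prints the matching command line. This is a sketch only: aprun_cmd is a hypothetical name, it prints the command rather than running it, and it assumes mppmem is given with the mb suffix as in the examples on this page.

```shell
# Build the aprun command line from the values requested in the #PBS lines.
# aprun_cmd is a hypothetical helper; it prints the command instead of running it.
# Arguments: mppwidth, mppnppn, mppdepth, mppmem (e.g. 900mb), program
aprun_cmd() {
    mem=${4%mb}                      # "900mb" -> "900"
    echo "aprun -n $1 -N $2 -d $3 -m ${mem}M $5"
}

cmd=$(aprun_cmd 16 4 1 900mb ./program)
echo "$cmd"
```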
Questions on how to make scripts, general use of the system, job-dependencies, etc. can be sent to hpc-support@hpc.uib.no
