Job execution (Fimm)

Batch system

To ensure fair use of the clusters, all users must run their computations via the batch system. A batch system is a program that manages the queuing, scheduling, starting and stopping of the jobs/programs that users run on the cluster. Usually it is divided into a resource-manager part and a scheduler part. To start jobs, users tell the batch system which executable(s) they want to run, the number of processors and the amount of memory needed, and the maximum time the execution should take.

Fimm uses "Torque" as the resource manager, the same as on the Hexagon cluster. To schedule jobs, Fimm uses "Maui", whereas Hexagon uses "Moab", the commercial version of "Maui".

Batch job submission

There are essentially two ways to execute jobs via the batch system: in batch mode and interactively.
Running jobs in batch is the more common way on a compute cluster. Here one can, for example, log off and log on again later to check the status of a job. We recommend running jobs in batch mode.

Create a job (scripts)

Jobs are normally submitted to the batch system via shell scripts, often called job scripts or batch scripts. Lines in a script that start with #PBS are interpreted by Torque as instructions for the batch system. (Please note that these lines are interpreted as comments when the script is run in the shell, so there is no magic here: a batch script is a shell script.)

A script can be created in any text editor, such as vim or emacs. A job script should start with an interpreter line, like:

#!/bin/bash

Next it should contain directives for the queue system, at least the execution time and the number of CPUs requested:

#PBS -l walltime=00:60:00
#PBS -l nodes=1:ppn=1,mem=500mb

The rest is regular shell commands. All commands written in the script will be executed on the login node. This is important to remember for several reasons:
With this in mind, all IO/CPU-intensive tasks should be prefixed with the aprun command. aprun will execute the command on the compute nodes, resulting in higher performance. Note that this should also reduce the charge for the job, since the total time the script runs should be shorter (charging does not take into account whether the compute nodes are actually used while the script is running). Real computational tasks (the main program) should of course be prefixed with aprun as well. You can find examples below.

Manage a job (submission, monitoring, suspend/resume, canceling)

Please find below the most important batch system job management commands.

To submit a job use the qsub command:

qsub job.pbs # submit the script job.pbs

Queues and priorities are chosen automatically by the system. The qsub command returns a job identifier (number) that can be used to monitor the status of the job (in the queue and during execution). This number may also be requested by the support staff.

To monitor job status use the "qstat" command:

qstat # display a listing of jobs
qstat -a # display all jobs in alternative format
qstat -f # display full status of all jobs (long output)
qstat -f <jobid> # display full status of a specific job

To cancel a job use the "qdel" command:

qdel <jobid> # delete a specific batch job

To display the actual job ordering for the scheduler, separated into three lists (active, eligible, and blocked jobs):

showq # display jobs
showq -u $USER # display jobs for $USER
showq -i # display only jobs in eligible (idle) queue waiting for execution

To display detailed job state information use:

checkjob <jobid> # display status for job

List of useful commands, incl. short description

Here is a list of the most important commands in tabular form (manual pages are recommended):
List of useful job script parameters

-A : a job script must specify a valid project name for accounting, otherwise it will not be possible to submit jobs to the batch system.

-l : resources are specified with the -l option (lowercase L). There are a number of resources that can be specified. See the example above for the correct syntax. Jobs must specify the number of processors (CPUs) and the maximum allowed wall-clock time for execution. Make sure that you specify a correct amount of memory, or you risk crashing the node for lack of memory. Note that mppmem=XXXmb is a per-process amount and that the nodes have 4000 or 8000mb total (not 4096 and 8192). You can find all attributes and their descriptions in:

man pbs_resources_linux

Below are the most important attributes:
-l walltime : the maximum allowed wall-clock time for execution of the job. If the specified time is too short, the job will be killed before it completes.

-l mppmem : an upper limit for the memory usage per process for a job. An explanation of how to request more memory can be found here. If the memory requirement is exceeded, the job may get killed by the system.

-l mppdepth : the number of OpenMP threads per node. This must match the -d argument for aprun.

-o, -e : see the example below. If these attributes are not used, and thus no filenames are specified, the standard output and standard error from the job will be stored in the files mpijob.o## and mpijob.e##, where ## is the job number assigned by PBS when the job is submitted.

-c enabled : enables the checkpoint feature for the job. When this option is specified, the job can be checkpointed during execution and switched to the hold state; later it can be "unpaused" and execution continues from the place where it was stopped (or after a machine/node crash). To use this option the application must be compiled with checkpointing libraries. See Application development (Hexagon)#Checkpoint and restart of applications and man qsub for more info.

-c periodic,interval=120,depth=2 : enables periodic checkpoints for the job, with an interval of 2 hours, keeping only the two latest checkpoint images. See man qsub and Application development (Hexagon)#Checkpoint and restart of applications for more info.

For additional PBS switches please refer to:

man qsub

List of classes/queues, incl. short description and limitations

Hexagon uses a default batch queue named "batch". It is a routing queue which, based on job attributes, can forward jobs to the debug, small or normal queues. Therefore there is no need to specify any execution queue in the PBS script.

Please keep in mind that we have priority-based job scheduling. This means that, based on the requested number of CPUs and amount of time, as well as previous usage history, jobs will get higher or lower priority in the queue. Please find a more detailed explanation in Job execution (Hexagon)#Scheduling policy on the machine.
NOTE: There is no need to specify a queue in the job script; the correct queue will be selected automatically.

Relevant examples

We illustrate the use of batch job scripts and submission with a few examples.

Sequential jobs

To use 1 processor (CPU) for at most 60 hours wall-clock time and 900MB of memory, the PBS job script must contain the line:

#PBS -l nodes=1:ppn=1,walltime=60:00:00,mem=900mb

Please note that Fimm, Stallo and Titan are much better suited for sequential jobs. Hexagon should therefore only be used for parallel jobs.

Below is a complete example of a PBS script for executing a sequential job.

#! /bin/sh -
#
# Give the job a name (optional)
#PBS -N "seqjob"
#
# Specify the project the job should be accounted on (obligatory).
# The "cost" command will tell you which account you should use.
#PBS -A account_no
#
# The job needs at most 60 hours wall-clock time on 1 core on one node. (obligatory)
#PBS -l nodes=1:ppn=1,walltime=60:00:00
#
# The job needs at most 900mb of memory (obligatory)
#
#PBS -l mem=900mb
#
# Write the standard output of the job to file 'seqjob.out' (optional)
#PBS -o seqjob.out
#
# Write the standard error of the job to file 'seqjob.err' (optional)
#PBS -e seqjob.err
#
# Make sure I am in the correct directory
cd /work/janfrode/seqwork
# Invoke the (sequential!) executable
./program
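
The sequential example can be tried end to end with a small wrapper like the one below. This is only a sketch: the file name seqjob.pbs and the project name my_project are placeholders, and the generated script changes to $PBS_O_WORKDIR (the Torque variable holding the directory from which qsub was run) instead of a fixed /work path. The script is written in any case and only submitted when qsub is found.

```shell
#!/bin/sh
# Sketch: write a minimal sequential job script, then submit it if possible.
# seqjob.pbs, my_project and ./program are placeholders, not site defaults.
jobfile="seqjob.pbs"

# The quoted 'EOF' keeps $PBS_O_WORKDIR literal in the generated file.
cat > "$jobfile" <<'EOF'
#!/bin/sh
#PBS -N seqjob
#PBS -A my_project
#PBS -l nodes=1:ppn=1,walltime=60:00:00
#PBS -l mem=900mb
cd "$PBS_O_WORKDIR"
./program
EOF

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the job identifier on standard output.
    if jobid=$(qsub "$jobfile"); then
        echo "Submitted job: $jobid"
        qstat -f "$jobid"
    else
        echo "qsub failed for $jobfile" >&2
    fi
else
    echo "qsub not found; $jobfile was written but not submitted"
fi
```

Keeping the job identifier in a variable makes it easy to pass to qstat, checkjob or qdel later, and to quote it in a support request.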

Parallel/MPI jobs

To use 6 processors (CPUs) for at most 60 hours wall-clock time, the PBS job script must contain the line:

#PBS -l nodes=3:ppn=2,walltime=60:00:00
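
The processor count in this line is nodes times ppn (3 x 2 = 6). Since this page defines CPU-hour usage as wall-clock time multiplied by the number of processors, the same numbers give an upper bound on what a job can consume. The helper below is a rough sketch of that arithmetic; the function name cpu_hours is hypothetical, and it assumes walltime is given as HH:MM:SS with at most one leading zero per field.

```shell
#!/bin/sh
# cpu_hours NODES PPN HH:MM:SS
# Prints the maximum CPU-hours a job could consume, rounded up.
# The body runs in a subshell so IFS and the variables stay local.
cpu_hours() (
    nodes=$1
    ppn=$2
    IFS=:
    set -- $3            # split HH:MM:SS into $1 $2 $3
    # Strip one leading zero so "08" is not read as an octal number.
    h=${1#0}; m=${2#0}; s=${3#0}
    seconds=$(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
    echo "$(( (nodes * ppn * seconds + 3599) / 3600 ))"
)

# nodes=3:ppn=2,walltime=60:00:00 -> 6 processors for 60 hours
cpu_hours 3 2 60:00:00   # prints 360
```

Comparing this number with the remaining quota shown by the cost command is a quick check before submitting a long job.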
#! /bin/sh -
#
# Make sure I use the correct shell.
#
#PBS -S /bin/sh
#
# Give the job a name
#
#PBS -N "mpijob"
#
# Specify the project the job belongs to
#
#PBS -A nn2117k
#
# We want 60 hours on 6 cpu's:
#
#PBS -l walltime=60:00:00,nodes=3:ppn=2
#
# The job needs 900 MB memory per process:
#PBS -l pmem=900mb
#
# Send me an email on a=abort, b=begin, e=end
#
#PBS -m abe
#
# Use this email address (check that it is correct):
#PBS -M your@email.address.com
#
# Write the standard output of the job to file 'mpijob.out' (optional)
#PBS -o mpijob.out
#
# Write the standard error of the job to file 'mpijob.err' (optional)
#PBS -e mpijob.err
#
# Make sure I am in the correct directory
mkdir -p /work/$USER/mpiwork
cd /work/$USER/mpiwork
# For fimm use:
/usr/bin/mpiexec ./program
# Return the exit status of mpiexec as the exit status of the job:
exit $?

Interactive job submission

PBS allows the user to use compute nodes interactively. Interactive jobs are typically used for:
A job run with the interactive option will run normally, but stdout and stderr will be connected directly to the user's terminal. This also allows stdin from the user to be sent directly to the application.

To request one processor and 1000MB of memory and reserve them for 2 hours, use the command:

qsub -l nodes=1:ppn=1,walltime=2:00:00,pmem=1000mb -A replace_with_correct_cpuaccount -I

To request 2 CPUs with 1000mb per process/CPU and reserve them for 2 hours, use the command:

qsub -l nodes=1:ppn=2,walltime=2:00:00,pmem=1000mb -A replace_with_correct_cpuaccount -I

Here replace_with_correct_cpuaccount must be replaced with your project name for accounting. The option that specifies interactive use is -I. You can use it with scripts as well:

qsub -I ~/myscript

Note that you will be charged for the full time this job allocates the CPUs/nodes, even if you are not actively using these resources. Therefore, exit the job (shell) as soon as the interactive work is done.

To launch your program on the compute node, go to /work/$USER and then you HAVE to use "aprun". If "aprun" is omitted the program is executed on the login node, which in the worst case can crash the login node. Since /home is not mounted on the compute node, the job has to be started from /work/$USER.

General job limitations

If not specified, the default values are 1 CPU with 500mb of memory for 15 minutes.

Maximum amount of resources per user:

In default queue :
In Idle queue :
Default CPU and job maximums may be changed by sending an application to Support.

Recommended environment variable settings

All regular shell recommended environment variables are loaded automatically. The exception is when your default shell is tcsh and your job script has the header #!/bin/bash or #!/bin/sh; in that case you have to add to the job script:

#PBS -S /bin/bash

If your job script is written in tcsh you do not need to apply the procedure above.

Sometimes the module functions are not exported properly. If you get "module: command not found", try adding to your job script:

export -f module

If you still cannot get the module functions in your job script, try adding this:

#PBS -V
source ${MODULESHOME}/init/REPLACE_WITH_YOUR_SHELL_NAME
# ksh example:
# source ${MODULESHOME}/init/ksh
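
The fallback above can also be wrapped in a guard that only sources the init file when the module command is really missing. The following is a sketch, not a site-provided helper: the function name ensure_module is hypothetical, and it assumes that MODULESHOME is set and that an init file named after your shell (here sh) exists under ${MODULESHOME}/init.

```shell
#!/bin/sh
# ensure_module [SHELL_NAME]
# Make the "module" command available; returns non-zero if that fails.
ensure_module() {
    shell_name=${1:-sh}
    # Nothing to do if module is already defined.
    if type module >/dev/null 2>&1; then
        return 0
    fi
    init_file="${MODULESHOME:-}/init/$shell_name"
    if [ -n "${MODULESHOME:-}" ] && [ -r "$init_file" ]; then
        # Source the init file so that it defines module in this shell.
        . "$init_file"
    else
        echo "module init file not found (is MODULESHOME set?)" >&2
        return 1
    fi
}

ensure_module sh || echo "continuing without the module command" >&2
```

In a job script the call would go before the first module load line, so that a missing module function is reported once instead of on every module command.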

Scheduling policy on the machine

The scheduler on Hexagon has a fairshare setup in place. This ensures that all users get adjusted priorities, based on initial and historical data from running jobs.

Types of jobs that are not allowed (will be rejected or never start)

The following types of jobs will never start:
CPU-hour quota and accounting

To execute jobs on the supercomputer facilities one needs a user account and password. People working at UiB, Uni Research AS or IMR can apply via this link. Others can apply at http://www.notur.no.

Each user account is connected to at least one project (account). Each project is allocated a number of CPU hours (a quota). CPU-hour usage is defined as the elapsed (wall-clock) time of the user's job multiplied by the number of processors used. The quota is the maximum number of CPU hours that all users connected to the project together can consume. After the quota is exhausted it is no longer possible to submit jobs, and one must first apply for additional quota.

How to list quota and usage per user and per project

One can check the number of CPU hours that can be used by issuing the following command:

cost

On Fimm we do not account CPU hours per user; CPU hours are counted per project/group. Thus you can only see how many hours are left for your CPU account. You can see your available CPU accounts and how much quota they have by using either the "cost" command or:

qbalance -a $CPU_ACOUNT

Cost command

The cost command shows the CPU hours used and the CPU hours still available in the CPU account you belong to:

cost

Idle queue

To use the computing resources efficiently we have set up a special "idle" queue in the cluster, which includes all computing nodes, including those nodes that are normally dedicated to specific groups. Jobs submitted to the "idle" queue will be able to run on dedicated nodes if they are free.

Important: if the dedicated nodes are needed by the groups that own them (they submit a job to them), the "idle" queue jobs using the needed nodes will be killed and re-queued to run at a later time.

The "idle" queue is accessible to everyone who has an account on fimm.bccs.uib.no. The "idle" queue gives you access to the following extra resources:
The best situation to use the "idle" queue is:
You can do the following to check which queues are available on Fimm:

qstat -q

The following will submit your job to the "idle" queue in interactive mode:

qsub -I -q idle

In your PBS script you can add the following to submit your job to the "idle" queue:

#PBS -q idle

Please keep in mind that when you submit your job to the "idle" queue it is not guaranteed that it will finish successfully, since the owners of the hardware can "take the resources back" any time they submit a job to their specific queues.

Using infiniband in idle queue

We have 16 nodes with Mellanox Technologies MT25204 [InfiniHost III Lx HCA] cards, connected to each other through a 24-port "MT47396 Infiniscale-III Mellanox" infiniband switch. These nodes belong to the nanobasic group. If you do not belong to the nanobasic group, the only way to access the infiniband nodes is through the idle queue, using the following line in your PBS script:

#PBS -l nodes=2:ppn=8:ib

All infiniband nodes have "ib" as a node feature. When your job lands on infiniband nodes, mpiexec will automatically pick up the infiniband connection instead of the regular ethernet connection.

Other commands
For all commands mentioned above, the gold module should be loaded.

FAQ / trouble shooting

Please refer to our general FAQ (Fimm)