HPC User Guide

Open Grid Scheduler

Rosalind's HPC environment runs on the Linux operating system, and a basic knowledge of interacting with Linux via the command line is essential. Users who are new to Linux may find useful advice on the Getting Started with Linux page.

In order to use Rosalind you must also become familiar with Rosalind's job scheduler, Open Grid Scheduler version 2011.11p1_155. After logging on to Rosalind you don't run jobs directly on the head node; instead you submit them to the job scheduler's batch queuing system, which then sends them to one or more compute nodes to do the actual work. Open Grid Scheduler was developed from the Sun Grid Engine codebase and the syntax is almost identical.

This guide isn't an exhaustive instruction manual, but it should provide enough guidance to get you started; many additional resources are available online.

To launch a batch job using Open Grid Scheduler you first need to create a shell script containing the commands you want to run, along with some instructions to Open Grid Scheduler. This may at first seem like more work than necessary, but after you've run jobs this way a few times it becomes straightforward, and getting into the habit of running things as batch scripts makes more complicated workflows and pipelines much easier to set up.

Anatomy of an Open Grid Scheduler Script

#!/bin/sh
#$ -S /bin/sh
#$ -pe smp 6
#$ -cwd
#$ -N jobname
#$ -j y
#$ -l h_vmem=12G
#$ -l h_rt=02:00:00
module load bioinformatics/R/3.2.1
Rscript /users/k123456/brc_scratch/myRscript.R

The above script is a simple example of an OGS script and it consists of three parts. The first line, #!/bin/sh, indicates that the file is a shell script. The following lines which begin with #$ are instructions to OGS. The remaining lines are the shell script that is executed on the compute nodes; this can be a fully functioning pipeline or just a simple one-line command to run a single program.

In the above example these are:
#$ -S /bin/sh

Indicates the type of shell that will be used when the commands are run on the compute node(s).

#$ -pe smp 6

Use parallel environment “smp” and reserve 6 slots on the compute node.

#$ -cwd

Work in the current directory. This will cause the shell script to be executed from the directory that the OGS script is launched from, stdout and stderr output will also be written to that directory.

#$ -N jobname

Set a name for the job. This will be the name displayed when the list of running and queued jobs is displayed with qstat and the name of the stdout and stderr files.

#$ -j y

Write the stdout and stderr to the same file. Normally OGS captures the stdout and stderr from the shell script and sends them to separate files, jobname.o<jobID> and jobname.e<jobID> respectively. For example, if the jobname was “align_to_transcripts” and the jobID was 112445, the files “align_to_transcripts.o112445” and “align_to_transcripts.e112445” would be created; with the above -j y option both streams are combined into “align_to_transcripts.o112445”.

#$ -l h_vmem=12G

Apply a memory limit of 12G per slot for the job. Jobs will normally be killed by the scheduler if the memory consumed is greater than this. The default is 8G, so it pays to increase the value if you know that you'll need more RAM. (See “Reserving Memory” below.)

#$ -l h_rt=02:00:00

By default, jobs on Rosalind have a run time limit of 72 hours. If you require more time you must set the limit as in the above example, h_rt=<hours:minutes:seconds>. Jobs with shorter run time limits are given a higher dispatch priority than longer jobs, and setting the time limit allows the scheduler to perform backfilling, so during busy times it pays to set the limit even if you need less than the default.
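Since h_rt takes an hours:minutes:seconds string, a small helper can convert a run time in seconds into that format. The function below is a hypothetical convenience, not part of Open Grid Scheduler:

```shell
# Hypothetical helper: format a run-time limit given in seconds as the
# hours:minutes:seconds string expected by h_rt.
format_h_rt() {
  local secs=$1
  printf '%02d:%02d:%02d' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

format_h_rt 9000   # 2.5 hours -> 02:30:00
```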

qsub - Launching a job

The command “qsub” is used to launch Open Grid Scheduler scripts. It can be followed by options which are passed to Open Grid Scheduler, such as those above, and this is often where the desired queue is specified. In the example below a script is sent to the LowMemShortterm.q queue using the “-q” option, and then the qstat command is used to query the status of the jobs.

[k1214122@login1(rosalind) test]$ qsub -q LowMemShortterm.q
Your job 22044 (“”) has been submitted
[k1214122@login1(rosalind) test]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
22043 0.36279 k1214122 r 11/26/2015 14:12:39 HighMemLongterm.q@nodea21.prv. 1
22044 0.36279 k1214122 r 11/26/2015 14:19:58 LowMemShortterm.q@nodeb19.prv. 1
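Putting the pieces together, a complete submission might look like the sketch below. The script name myscript.sh and its contents are hypothetical; adjust the queue and resources to suit your job:

```shell
#!/bin/sh
# myscript.sh -- a minimal, hypothetical job script
#$ -S /bin/sh
#$ -cwd
#$ -N myjob
#$ -l h_rt=01:00:00
echo "Running on $HOSTNAME"
```

It would then be submitted with `qsub -q LowMemShortterm.q myscript.sh`, and qstat used to watch its progress.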

qstat – querying job status

When the qstat command is run it will give the status of jobs launched by the current user. To get a more detailed view the “-f” flag can be used. For a full listing of all users on all queues:

qstat -u "*"

To examine the list without the queued (pending) jobs:

qstat -u "*" -s r

Most of the output from qstat is self-explanatory; however, a few of the columns need further explanation. The column labelled state can have the values qw, t, r, dr, which correspond to the following:

qw - queue wait. The job is in the queue and either waiting for suitable resources to become available or waiting for the next scheduling run (every 15 seconds) to send the job to the relevant compute node/s.
t - transferring. The job/task is currently being transferred to the node and is waiting for the exec daemon to run it.
r - running. The job/task is running on the node.
dr - delete running. The job/task running on the node is being deleted and should shortly be terminated.
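Because qstat's default output puts the state in a fixed column, it can be summarised with standard text tools. The snippet below is an illustrative one-liner using a hard-coded sample in place of real qstat output; on Rosalind you would pipe the real thing, e.g. `qstat | awk 'NR > 2 { print $5 }' | sort | uniq -c`:

```shell
# Count jobs by state (5th column of qstat's default output).
# The sample text below stands in for real qstat output.
qstat_sample='job-ID prior name user state submit/start at queue slots
---------------------------------------------------------------------
22043 0.36279 job1 k1214122 r 11/26/2015 14:12:39 HighMemLongterm.q 1
22044 0.36279 job2 k1214122 qw 11/26/2015 14:19:58 1'
printf '%s\n' "$qstat_sample" | awk 'NR > 2 { print $5 }' | sort | uniq -c
```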

Parallel Environments

The current parallel environments are:

  • smp:
    • Symmetric Multiprocessing - this environment is suited to multi-threaded jobs which require all of the processes to execute on the same node. Most of the nodes on the Rosalind cluster have 20 cores, so jobs requesting more than 20 slots will likely never be scheduled.
  • mpislots:
    • This environment is suited to MPI (Message Passing Interface) jobs. Individual processes can be scheduled on different nodes. It uses the $fill_up allocation_rule.

Reserving Memory

Using multiple CPU cores isn't the only reason to reserve more than one slot on a compute node. If a job is likely to consume a large share of memory (e.g. half the memory on the node) it's a good idea to reserve at least half of the slots on that node using a parallel environment. Grid engine will kill jobs that exceed their share of the memory. By default jobs will be allowed 8GB per slot, but you can change this by altering the h_vmem limit. For example, if you wanted to use the 380GB available on the HighMem* queues you would reserve 20 slots and add:
#$ -l h_vmem=19G
to your script.

Bear in mind when working out how many cores to reserve that the memory-to-core ratio varies between queues: for example, the nodes in the LowMemShortterm.q queue have 9GB per core and the nodes in the HighMemShortterm.q queue have 19GB per core.
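The slot count follows from the total memory you need and the per-slot figure for the queue. The helper below is a hypothetical sketch of that ceiling division, using the per-core figures quoted above:

```shell
# Hypothetical helper: how many slots must be reserved to obtain
# total_gb of memory at per_slot_gb of memory per slot?
# Uses integer ceiling division.
slots_needed() {
  local total_gb=$1 per_slot_gb=$2
  echo $(( (total_gb + per_slot_gb - 1) / per_slot_gb ))
}

slots_needed 380 19   # HighMem* queues: 380 GB at 19 GB/slot -> 20
slots_needed 100 9    # LowMemShortterm.q: 100 GB at 9 GB/slot -> 12
```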

In summary, try to reserve an appropriate portion of the compute nodes for the job that you're running. If a job has failed unexpectedly, it could be that it exceeded its memory quota. The qacct command is useful for finding the maximum amount of RAM that the job used.

qacct -j <jobID>

The maxvmem field is the high-water mark for the amount of memory that a job used.
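The maxvmem figure can be pulled out of the qacct report with a simple filter. The snippet below works on a hard-coded sample that mimics qacct's key/value layout (the field names and values here are illustrative); on the cluster you would pipe the real output, e.g. `qacct -j <jobID> | grep maxvmem`:

```shell
# Extract the maxvmem line from qacct-style "key  value" output.
# qacct_sample stands in for real qacct -j <jobID> output.
qacct_sample='jobname      align_to_transcripts
ru_wallclock 5130
maxvmem      10.213G'
printf '%s\n' "$qacct_sample" | awk '$1 == "maxvmem" { print $2 }'
```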

Reserving resources for demanding jobs (backfilling)

If you request resources that are considered demanding when the system is busy, you may wait a long time. Under the bonnet, the system waits until the exact number of resources, e.g. 40 cores, is free before your job can start, and that can take quite a while. If you enable the reservation option with:

#$ -R y

the queue system will do some backfilling for you: when 20 slots are free, the system will hold them for you, and when another 20 slots become available it will add them and give all 40 to your job to run.

Interactive jobs

Interactive sessions are very useful for debugging, code development or running quick tests, and can be requested without preparing a job script.

Interactive sessions can be started on compute nodes using the grid engine command qrsh. This is similar to using ssh to log on to the node, except that it reserves a slot (or several) on the target node. For example, to start an interactive session:

qrsh -q HighMemShortterm.q

If more than one slot is required, interactive jobs can be opened with the smp parallel environment.

qrsh -q HighMemShortterm.q -pe smp 4

Note: If you want your interactive session to last for many hours, you need to connect with ssh using this command: ssh -o ServerAliveInterval=900 -o ServerAliveCountMax=XX, where XX = 32 + 4 × the number of hours you want to keep the connection alive (for sessions longer than 8 hours).
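The formula in the note above can be evaluated directly in the shell. The function name below is a hypothetical convenience:

```shell
# Compute the ServerAliveCountMax value from the formula in the note
# above: XX = 32 + 4 * (hours the connection should stay alive).
alive_count_max() {
  echo $(( 32 + 4 * $1 ))
}

alive_count_max 12   # a 12-hour session -> 80
```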

There is currently a dedicated interactive queue on Rosalind containing two lowmem nodes. The queue currently has a 72 hour time limit. It previously had a 24 hour limit; since all jobs on Rosalind are launched with a default 72 hour run time, if we revert to that limit in the future you will need to set the run time to 24 hours or less (with the -l h_rt parameter) in order to launch a job on this queue. The queue does not currently accept parallel interactive sessions.

You can log on to the interactive queue using the following example:

qrsh -q InterLowMem.q


Projects

When you buy CPU resources on Rosalind, you can then associate a group of users who will “consume” that resource allocation. Each CPU allocation will be assigned to an Open Grid Scheduler project. Users will be given a default project against which their CPU consumption is tracked; however, users who are involved in more than one project can specify the project to use for a particular job submission, e.g.

qsub -P project_name

Tracking Usage

In the future you will be able to use our online usage tracking system; at the moment you can use the qacct command to view details of finished jobs (see the qacct man page for details). We will use the ru_wallclock metric (reported in seconds) in usage calculations, since this is a measure of your occupancy on the system. So for a given job the usage is the ru_wallclock metric multiplied by the number of slots that were reserved.
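That calculation is simple to script. The function below is a hypothetical sketch using illustrative numbers, with ru_wallclock taken from qacct output as described above:

```shell
# Usage for one job: ru_wallclock (seconds) times the number of
# reserved slots. The values in the example call are illustrative.
job_usage() {
  local ru_wallclock=$1 slots=$2
  echo $(( ru_wallclock * slots ))
}

job_usage 7200 4   # a 2-hour job on 4 slots -> 28800 core-seconds
```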

hpc_user_guide/start.txt · Last modified: 2018/06/05 09:39 by alan