To run compute tasks such as simulations, weather models, and sequence assemblies on Gadi, users need to submit them as ‘jobs’ to ‘queues’. Each queue has different hardware capabilities and limits. The overall procedure takes a few steps, which we will outline here, but there are a few key points to note before moving on.
If any tasks in the job need access to the internet at any stage, they have to be packaged separately into a job on a copyq node, as none of the standard compute nodes have external network access outside of Gadi.
Once you have saved this script as a '.sh' file, you will be able to submit it using the 'qsub' command, followed by the file name:
$ qsub <jobscript.sh>
After your job has been successfully submitted, you will be given a jobID. This will be a string of numbers ending in .gadi-pbs, for example:
12345678.gadi-pbs
You can then use this jobID to monitor and enquire about the job that is running. There are several ways to monitor your job over its lifespan on Gadi. Please see our job monitoring guide for ways to obtain information about your job.
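For example, with the jobID above you could query the job's state and any scheduler comments using PBS's qstat command (the -s flag adds the comment line, -w widens the output, and -x also lists recently finished jobs):
$ qstat -swx 12345678.gadi-pbs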
When writing a PBS script, users are encouraged to request only the resources that they need, so that their tasks run close to the 'sweet spot'. This is where the job can take advantage of parallelism and achieve a shorter execution time, while utilising at least 80% of the resources requested. Finding this sweet spot can take time and experimentation; some code will need several iterations before that efficiency is reached.
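As a starting point for that experimentation, the sketch below shows the general shape of a PBS script; the queue name normal, project code a00, the resource values, and the application command are placeholders that you would replace and then tune towards your own job's sweet spot.
#!/bin/bash
#PBS -P a00
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=190GB
#PBS -l jobfs=100GB
#PBS -l walltime=02:00:00
#PBS -l storage=gdata/a00
#PBS -l wd

# Placeholder application: replace with the actual task, then adjust
# ncpus/mem/walltime above until utilisation stays high.
./my_program > output.log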
On job completion, by default, the contents of the job’s standard output and error streams are copied to files in the working directory, named in the format <jobname>.o<jobid> and <jobname>.e<jobid>.
For example, when the job 12345678 finishes, two files named job.sh.o12345678 and job.sh.e12345678 are created as the record of its STDOUT and STDERR, respectively; both log files are written to the folder from which the job was submitted. (STDOUT is the normal print-out, STDERR is an error stream that shows whether your job ran into any issues.) We recommend users check these two log files before proceeding with the post-processing of any output/result from the corresponding job.
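For example, using the file names above, you could check the error stream first and then read the output (adjust the job name and jobID for your own job):
$ cat job.sh.e12345678
$ less job.sh.o12345678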
Interactive jobs allow users to run jobs that can be monitored and adjusted at certain points during their lifespan. Users can submit jobs in an interactive shell on the head compute node, which allows them to test and debug code before running the entire job. NCI recommends that users utilise this resource to debug large parallel jobs or to install applications that have to be built with GPUs available.
Instead of writing a PBS script, interactive jobs are started with a command run on the login nodes. This is done with the command
$ qsub -I
followed by the parameters you wish to use, for example:
$ qsub -I -q gpuvolta -P a00 -l walltime=00:05:00,ncpus=48,ngpus=4,mem=380GB,jobfs=200GB,storage=gdata/a00,wd
qsub: waiting for job 11029947.gadi-pbs to start
qsub: job 11029947.gadi-pbs ready
Here we have a command submitting a job to the gpuvolta queue, through project a00, requesting 5 minutes of walltime. It asks for 48 CPUs, 4 GPUs, 380 GiB of memory, and 200 GiB of local disk space. Once this job begins, it will mount /g/data/a00 to the job and enter the job's working directory.
When you have submitted an interactive job, you will notice that your ssh prompt has changed from the login node prompt to something similar to this:
[aaa777@gadi-gpu-v100-0079 ~]
This means that you are now logged into a compute node; you can see that change in the example below.
[aaa777@gadi-login-03 ~]$ qsub -I -l walltime=00:05:00,ncpus=48,ngpus=4,mem=380GB,jobfs=200GB,wd -q gpuvolta
qsub: waiting for job 11029947.gadi-pbs to start
qsub: job 11029947.gadi-pbs ready
[aaa777@gadi-gpu-v100-0079 ~]$ module list
No Modulefiles Currently Loaded.
[aaa777@gadi-gpu-v100-0079 ~]$ exit
logout
qsub: job 11029947.gadi-pbs completed
This is a very minimalistic example of an interactive job. By default, the shell doesn't have any modules loaded; if you need to load modules repeatedly inside interactive jobs, you can edit your ~/.bashrc file to load them automatically (see the sketch below). Once you are finished with the job, run the command
$ exit
to terminate the job.
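As an illustration, a minimal addition to ~/.bashrc could look like the lines below; the module names are placeholders, and the PBS_JOBID test is optional and simply keeps the modules from loading on the login nodes:
# Load frequently used modules automatically inside PBS jobs.
# Module names below are placeholders; replace with the ones you need.
if [ -n "$PBS_JOBID" ]; then
    module load intel-compiler
    module load openmpi
fi
Dropping the if-test would load the modules in every new shell, including on the login nodes.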
The login nodes are a shared space; at any time you could be sharing them with hundreds of other users. To make sure that everyone has fair access to these nodes, any job that runs for more than 30 minutes, or exceeds 4 GiB of memory, will be terminated. If you need to transfer a larger amount of data than the login nodes allow, NCI recommends that you submit the transfer as a job in the copyq queue.
Copyq jobs have to be used for anything that requires external network access (for example, downloading data) or data transfers too large to run on the login nodes, such as archiving data to massdata as in the example below.
An example of a PBS script that uses the copyq queue to collate files from a directory and then send them to massdata could look like:
#!/bin/bash
#PBS -l ncpus=1
#PBS -l mem=2GB
#PBS -l jobfs=2GB
#PBS -q copyq
#PBS -lother=mdss
#PBS -P a00
#PBS -l walltime=02:00:00
#PBS -l storage=gdata/a00+massdata/a00
#PBS -l wd

tar -cvf my_archive.tar /g/data/a00/aaa777/work1
mdss -P a00 mkdir -p aaa777/test/
mdss -P a00 put my_archive.tar aaa777/test/work1.tar
mdss -P a00 dmls -ltrh aaa777/test
In this script, the following is specified: the job is submitted to the copyq queue (#PBS -q copyq) and requests access to the massdata tape system (#PBS -lother=mdss). Then the actual commands collate the directory /g/data/a00/aaa777/work1 into a tar archive and copy it to aaa777/test/ on the massdata file system.
To compile code inside a copyq job, it may be necessary to load modules such as intel-compiler, and to request more jobfs to allow enough disk space to host data written to $TMPDIR.