Each job submitted to Gadi is given a jobID. It is shown to you as soon as the job has been accepted and is a string of eight digits, e.g. 12345678. NCI encourages users to monitor their jobs at every stage, to check their health and to help detect errors and failures. However, please refrain from checking your jobs excessively. Repeated queries, especially in quick succession, will be considered attacks. We recommend querying your job's status at most once every 10 minutes, which should be more than enough.
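One way to stay within the recommended query rate is to wrap the status command in a small helper that refuses to run more often than once every 10 minutes. The following is a minimal sketch, not an NCI-provided tool: the function name `throttled_qstat`, the `MIN_INTERVAL` variable and the stamp file location are all hypothetical, and `qstat -swx <jobid>` is taken from the command table on this page.

```shell
# Minimum number of seconds between two status queries (10 minutes).
MIN_INTERVAL="${MIN_INTERVAL:-600}"

# Run `qstat -swx <jobid>` unless this job was already queried less than
# MIN_INTERVAL seconds ago. The stamp file path is a hypothetical choice.
throttled_qstat() {
    local jobid="$1"
    local stamp="${TMPDIR:-/tmp}/qstat_last_${USER:-user}_${jobid}"
    local now last
    now=$(date +%s)
    last=0
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    last="${last:-0}"
    if [ $(( $now - $last )) -lt "$MIN_INTERVAL" ]; then
        echo "Job $jobid was queried $(( $now - $last )) s ago; skipping." >&2
        return 1
    fi
    # Record the time of this query, then ask PBS for the job status.
    echo "$now" > "$stamp"
    qstat -swx "$jobid"
}
```

After sourcing the function, `throttled_qstat 12345678` prints the job status the first time and returns a non-zero exit code if it is called again within 10 minutes.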
Queue Status
To query job status, users run the command
The command -
The screenshot below relates to job
You can go even further with this and run the command
In this case, if you would like to see a list of
CPU and Memory Utilisation
Users should continue to monitor their jobs, especially the utilisation rate. If a job runs into errors, this will be evident as a drop in the utilisation rate. While a low utilisation rate is a helpful sign that compute time is being underused, a 100% utilisation rate doesn't necessarily indicate the most efficient use of the requested resources, and it can still be worth checking whether performance can be improved. To see how much CPU and memory your job has actually been using, run the command
This shows that the CPUs ran at only 23% of the compute capacity of the 48 requested cores and that 36:47 has elapsed. It also shows the peak memory usage in the columns RSS and MEM. Depending on the tasks running within the job, the percentage may increase as the job progresses. NCI recommends that users aim for an overall CPU utilisation rate of at least 80%.
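The utilisation percentage can be understood as CPU time divided by the product of elapsed walltime and the number of requested cores. The sketch below works through that arithmetic. The function name `cpu_utilisation` is hypothetical, and the CPU time in the example is an illustrative value chosen to be consistent with 36:47 elapsed on 48 cores, not a figure from a real job.

```shell
# Print CPU utilisation as a whole-number percentage:
#   100 * CPU time / (walltime * number of CPUs)
# Usage: cpu_utilisation <cput HH:MM:SS> <walltime HH:MM:SS> <ncpus>
cpu_utilisation() {
    awk -v cput="$1" -v wall="$2" -v ncpus="$3" '
        # Convert HH:MM:SS to seconds.
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        BEGIN {
            denom = secs(wall) * ncpus
            if (denom <= 0) { print 0; exit }
            printf "%d\n", int(100 * secs(cput) / denom)
        }'
}

# Hypothetical values: 06:46:06 of CPU time over 00:36:47 on 48 cores.
pct=$(cpu_utilisation 06:46:06 00:36:47 48)
if [ "$pct" -lt 80 ]; then
    echo "CPU utilisation is ${pct}%, below the recommended 80%"
fi
```

With these illustrative inputs the helper reports 23%, since 24366 CPU-seconds out of 2207 s × 48 cores = 105936 available core-seconds is roughly 0.23.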
Process Status in Job
To monitor the status of processes inside a job, you can take a snapshot of the job's process status by running
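If you want to keep process snapshots for later comparison, the output can be written to a timestamped file. This is a minimal sketch that assumes the `qps <jobid>` usage listed in the command table on this page. The function name `save_qps_snapshot` and the file naming scheme are hypothetical.

```shell
# Save a process snapshot of a running job to a timestamped file and
# print the name of the file that was written.
# Usage: save_qps_snapshot <jobid> [output_dir]
save_qps_snapshot() {
    local jobid="$1"
    local outdir="${2:-.}"
    local outfile
    outfile="$outdir/qps_${jobid}_$(date +%Y%m%d_%H%M%S).txt"
    mkdir -p "$outdir" || return 1
    # qps takes a snapshot of all current processes in the running job.
    if qps "$jobid" > "$outfile"; then
        echo "$outfile"
    else
        # Do not leave an empty or partial file behind if qps failed.
        rm -f "$outfile"
        return 1
    fi
}
```

For example, `save_qps_snapshot 12345678 ./snapshots` would write one file per call into `./snapshots`. Keep calls well spaced, in line with the query-rate guidance on this page.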
Files in Folder $PBS_JOBFS
To list the files in the folder $PBS_JOBFS on a compute node, run the following command from the login node:
```
$ qls 12345678
```
To copy a file from $PBS_JOBFS into your current folder, you can use the command qcp, such as

```
$ qcp -n 0 12345678/testjob_outdir/job.timing ./job.timing.bk1
```

Commands to help monitor your jobs | |
---|---|
man qstat | View the manual for qstat and a range of helpful commands |
qdel <jobid> | Delete the job with jobID <jobid> |
qstat -swx <jobid> | Display the job status in the queue with comment |
qstat -fx <jobid> | Display full job status information |
qps <jobid> | Take a snapshot of the process status of all current processes in the running job |
qcat [-s/-o/-e] <jobid> | Display [submission script/STDOUT/STDERR] of the running job |
qls <jobid> | List contents in the folder $PBS_JOBFS |
qcp <jobid> <dst> | Copy files and directories from the folder $PBS_JOBFS to the destination folder <dst> |
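The qcp example on this page can be wrapped in a small helper that gives each copy a timestamped name, so repeated backups do not overwrite each other. This is only a sketch: the function name `backup_jobfs_file` and the naming scheme are hypothetical, and the `-n 0` option is carried over unchanged from the example above rather than explained here.

```shell
# Copy one file from a job's $PBS_JOBFS into the current folder under a
# timestamped name, and print the name of the local copy.
# Usage: backup_jobfs_file <jobid> <path_inside_jobfs>
backup_jobfs_file() {
    local jobid="$1"
    local src="$2"
    local dst
    dst="./$(basename "$src").$(date +%Y%m%d_%H%M%S)"
    # Same call shape as the qcp example on this page.
    qcp -n 0 "$jobid/$src" "$dst" && echo "$dst"
}
```

For example, `backup_jobfs_file 12345678 testjob_outdir/job.timing` would produce a local file such as `./job.timing.<timestamp>`.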