Jobs get stuck in the queue for various reasons. To look up the reason for yours, please run
$ qstat -u $USER -Esw
The comment printed under the line that starts with the job ID gives a very good hint about the reason.
For example, the job 1234567 below is on hold because the project xy11 does not have sufficient allocation: it needs at least 14.4 kSU available in xy11's compute grant before it can be considered to run.
$ qstat -sw 1234567

gadi-pbs:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS   TSK Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
1234567.gadi-pbs               abc111          normal-exec     test_2x2              --   30  1440  400gb 05:00 H    --
   Project xy11 does not have sufficient allocation to run job (14.40 KSU required)
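The required SU figure reflects the job's full resource request. As a rough check, assuming the normal queue's charge rate of 2 SU per core-hour, the 14.40 kSU above corresponds to

1440 cores x 5 hours walltime x 2 SU per core-hour = 14,400 SU = 14.40 kSU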
To confirm the project's grant position, please run `nci_account -P <project_code>`. For the example above, where `<project_code>` is xy11, run
$ nci_account -P xy11

Usage Report: Project=xy11 Period=2020.q3
=============================================================
    Grant:   75.00 KSU
     Used:   64.22 KSU
 Reserved:    0.00 SU
    Avail:   10.78 KSU
It shows the project xy11 has only 10.78 kSU available, which is not enough to run the job 1234567. In this case, the project lead CI needs to contact the scheme manager to ask for more SUs. NCI will top up the compute grant according to the scheme manager's approval.
There are also other common reasons for jobs not running. Please see below for more information and possible solutions.
This comment suggests there are not enough CPU cores available to start this job. Nothing is wrong. The job is simply waiting for its turn to run.
This comment suggests there is not enough memory available to start this job. Nothing is wrong. The job is simply waiting for its turn to run.
This comment suggests there are not enough nodes of the right specification available to start this job. Nothing is wrong. The job is simply waiting for its turn to run.
The job is waiting for another job to finish before the resources it needs become available. Nothing is wrong. The job is simply waiting for its turn to run.
This comment can appear temporarily while the job scheduler reconsiders the job. When it is not transient, as in the example shown above, it indicates that the project does not have enough SUs in `Avail`. Jobs with this comment may become able to run when `Reserved` SUs return to `Avail`, depending on the actual usage of jobs that have just finished.
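For illustration only (the figures below are hypothetical), a project with jobs currently queued or running shows a non-zero `Reserved` figure; those SUs move back to `Avail` once the jobs finish and are charged only for the resources actually used:

$ nci_account -P xy11

Usage Report: Project=xy11 Period=2020.q3
=============================================================
    Grant:   75.00 KSU
     Used:   50.00 KSU
 Reserved:   14.22 KSU
    Avail:   10.78 KSU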
Jobs whose walltime limit would extend into a scheduled downtime will not be started until the scheduled maintenance finishes. If you know the job will not use all of the requested walltime, please request a walltime as close to the actual usage as possible, as sketched below. If the walltime limit cannot be reduced enough to avoid any overlap between the job's potential execution time and the scheduled downtime, the job has to wait until after the downtime window to start.
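As a sketch only, the walltime request is set in the job script with the `-l walltime` directive; the queue, resource values and program name below are placeholders for illustration:

#!/bin/bash
# Hypothetical job script: request a walltime close to the job's actual
# expected run time so it can be scheduled before a maintenance window.
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=190GB
#PBS -l walltime=02:30:00
#PBS -l wd

./my_program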
Jobs submitted with the PBS directive `-lsoftware=<software_string>` may run into this when the LSD record shows there are not enough license seats available for the job to run.
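For reference, the license request can be made either in the job script or on the `qsub` command line; `<software_string>` is a placeholder for the license feature name:

#PBS -l software=<software_string>

$ qsub -l software=<software_string> job.sh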
Most of the time, it is just a matter of waiting a bit longer. Once the license seats are released by other jobs, they will serve the next waiting job. To look up how many jobs are waiting ahead of yours, please search for the `<software_string>` on the license status page https://usersupport.nci.org.au/license-status.html.
If you believe the LSD record is wrong and there are actually enough license seats for your job to check out, please lodge a ticket and provide us with the job IDs. We will fix the problem.
This comment should be transient. It suggests the job was just scheduled to run, but the compute node assigned to it had issues and was unable to run the job at that time.
If you see it on your job for more than 10 minutes, please lodge a ticket and provide us with the job IDs. We will fix the problem.
This comment appears when a job has had too many failed attempts to start, as described in the error message `Execution server rejected request` above. PBS tried to start the job several times but failed every time. This indicates that either something is seriously wrong in the job submission script, or every start attempt sent the job to the same failed but not-yet-detected node(s).
The job is put on hold to allow our HPC team to investigate. You cannot release the job yourself, but you can certainly submit it again after making sure that the script is OK. If the new job is also put on hold, please lodge a ticket and tell us the job IDs.
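A minimal sketch of resubmitting and then checking the new job's comment; the job script name and job ID below are hypothetical:

$ qsub job.sh
2345678.gadi-pbs
$ qstat -Esw 2345678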