Why are my jobs not running?

There are several reasons that a job could get stuck in the queue.

Tip

To look up the reason for yours, please run

$ qstat -u $USER -Esw

The comment printed under each line that starts with a job ID gives a good hint about the reason.
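When many jobs are queued, it can help to pull the comments out of the `qstat` output programmatically. The following is a minimal Python sketch, not an NCI tool. It assumes that each job line starts in the first column with `<digits>.<server>` and that the comment is the indented line (or lines) directly below it, as in the insufficient-allocation example on this page. The function name `job_comments` and the abbreviated `SAMPLE` text are illustrative.

```python
import re

# Sample text abbreviated from the `qstat -sw` example on this page.
SAMPLE = """\
Job ID                         Username        Queue           Jobname
------------------------------ --------------- --------------- ---------------
1234567.gadi-pbs               abc111          normal-exec     test_2x2   --  30  1440  400gb 05:00 H  --
   Project xy11 does not have sufficient allocation to run job (14.40 KSU required)
"""

# Assumption: a job line begins with "<digits>.<server>" in the first column.
JOB_LINE = re.compile(r"^(\d+)\.\S+\s")


def job_comments(qstat_text):
    """Map each job ID to the indented comment text printed below its line."""
    comments = {}
    current = None
    for line in qstat_text.splitlines():
        match = JOB_LINE.match(line)
        if match:
            # Start collecting the comment for this job.
            current = match.group(1)
            comments[current] = ""
        elif current is not None and line.startswith(" ") and line.strip():
            # Indented, non-empty line: part of the current job's comment.
            comments[current] = (comments[current] + " " + line.strip()).strip()
        else:
            # Header, separator, or blank line: no job is being collected.
            current = None
    return comments


for job_id, comment in job_comments(SAMPLE).items():
    print(job_id, "->", comment)
```

To use it on live output, the text of `qstat -u $USER -Esw` could be captured (for example with `subprocess.run`) and passed to `job_comments` in place of `SAMPLE`.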

Please check this page and see if your reason is listed.

Note
If your issue is not listed, please contact the NCI Helpdesk or email help@nci.org.au, and we will endeavour to assist you.

The most common reason: Insufficient Project Allocation

In the example below, job 1234567 is on hold because project xy11 does not have sufficient allocation. At least 14.4 KSU must be available in xy11's compute grant before the job is considered to run.

$ qstat -sw 1234567
 
gadi-pbs:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
1234567.gadi-pbs               abc111          normal-exec     test_2x2           --      30  1440  400gb 05:00 H    --
   Project xy11 does not have sufficient allocation to run job (14.40 KSU required)

To confirm the project's grant position, run `nci_account -P <project_code>`. For the example project xy11, run:

$ nci_account -P xy11
 
Usage Report: Project=xy11 Period=2020.q3
=============================================================
    Grant:    75.00 KSU
     Used:    64.22 KSU
 Reserved:     0.00 SU
    Avail:    10.78 KSU

The report shows that project xy11 has only 10.78 KSU available, which is less than the 14.40 KSU that job 1234567 requires.
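The arithmetic behind this comparison can be sketched in Python. This is an illustration under stated assumptions, not NCI's accounting code. The charge rate of 2 SU per core-hour is an assumption chosen because it reproduces the 14.40 KSU in the example from 1440 tasks and a 05:00 walltime; actual rates depend on the queue. Treating the `TSK` column as the number of CPU cores, the unit table, and the names `parse_avail_su` and `job_cost_su` are also assumptions.

```python
import re

# Assumed unit multipliers for the values printed by nci_account.
UNIT_TO_SU = {"SU": 1.0, "KSU": 1_000.0, "MSU": 1_000_000.0}


def parse_avail_su(report):
    """Return the 'Avail' value of an nci_account usage report, in SU."""
    match = re.search(r"^\s*Avail:\s*([\d.]+)\s*([KM]?SU)\s*$", report, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Avail:' line found in report")
    return float(match.group(1)) * UNIT_TO_SU[match.group(2)]


def job_cost_su(ncpus, walltime_hours, su_per_core_hour=2.0):
    """Estimate the SU a job must reserve: cores x hours x charge rate (assumed)."""
    return ncpus * walltime_hours * su_per_core_hour


# Report text copied from the example on this page.
REPORT = """\
Usage Report: Project=xy11 Period=2020.q3
=============================================================
    Grant:    75.00 KSU
     Used:    64.22 KSU
 Reserved:     0.00 SU
    Avail:    10.78 KSU
"""

avail = parse_avail_su(REPORT)
cost = job_cost_su(ncpus=1440, walltime_hours=5)
shortfall = max(0.0, cost - avail)
print(f"required {cost / 1000:.2f} KSU, available {avail / 1000:.2f} KSU, "
      f"short by {shortfall / 1000:.2f} KSU")
```

Under these assumptions the job needs 1440 x 5 x 2 = 14400 SU, so the project is about 3.62 KSU short.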

Tip
In this case, the project's lead CI needs to contact the scheme manager to ask for more SUs. NCI will top up the compute grant according to the scheme manager's approval.

Other Common Reasons

There are also other common reasons for jobs not running. Please see below for more information and possible solutions.

Job held, project <prj> is over storage allocation on <filesystem: scratch/gdata>: storage

Your job is held because the project's storage space on scratch or gdata is over its disk quota.

Note

Take action

Run `nci_account` to find out which limit has been exceeded: Allocation or Inode Allocation (number of files/dirs).
Project team members will have to clean up that space and bring the storage usage sufficiently under Allocation/iAllocation.

Only after this can jobs be released.
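To decide where to clean up, it helps to know which subdirectories hold the most files and bytes. The sketch below is a generic Python walk, not an NCI utility. The function name `usage_by_subdir` is hypothetical. Walking a large tree on a shared filesystem can be slow, so site-provided reporting tools, where available, are likely a better first choice.

```python
import os


def usage_by_subdir(root):
    """Return {subdir_name: (entry_count, total_bytes)} for each child directory of root.

    entry_count is the number of files and directories below the child
    (the child itself is not counted); total_bytes sums file sizes only.
    """
    report = {}
    with os.scandir(root) as entries:
        children = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    for child in children:
        count, size = 0, 0
        # os.walk does not follow symbolic links to directories by default.
        for dirpath, dirnames, filenames in os.walk(child):
            count += len(dirnames) + len(filenames)
            for name in filenames:
                try:
                    size += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        report[os.path.basename(child)] = (count, size)
    return report
```

A call such as `usage_by_subdir("/path/to/project/dir")` (the path is a placeholder) returns a dictionary that can be sorted by entry count when the inode limit is the problem, or by bytes when the space limit is.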

Not Running: Insufficient amount of resource: ncpus

Currently, there are not enough CPU cores available to start this job.

Tip
Nothing is wrong. The job is simply waiting for its turn to run.

Not Running: Insufficient amount of resource: mem

Currently, there is not enough memory available to start this job.

Tip
Nothing is wrong. The job is simply waiting for its turn to run.

Not Running: Insufficient amount of resource: job_tags

This comment suggests there are not enough nodes of the right specification available to start this job.

Tip
Nothing is wrong. The job is simply waiting for its turn to run.

Not Running: Job would conflict with reservation or top job

Starting this job now would conflict with a reservation or with a higher-priority (top) job, so it is waiting for that other job before resources become available for it to run.

Tip
Nothing is wrong. The job is simply waiting for its turn to run.

Not Running: PBS Error: Could not reserve allocation from project “<prj>” to run job

This comment can appear temporarily while the job scheduler reconsiders whether to run the job. When it persists, it suggests, as in the insufficient-allocation example above, that the project does not have enough SU in its `Avail` balance.

Tip
Jobs with this comment may be able to run once `Reserved` SUs return to `Avail`, which depends on the actual usage of jobs that have just finished.

Not Running: Job would cross dedicated time boundary

Jobs with a walltime limit that would extend into a scheduled downtime will not be started until the scheduled maintenance finishes. If you know the job won't use all the requested walltime, please request a walltime as close to the actual usage as possible.

Tip
If it is not possible to reduce the walltime limit to allow zero overlap between the job's potential execution time and the scheduled downtime, the job has to wait to be started after the downtime window.
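The overlap rule can be illustrated with a short Python sketch. It is a simplified model, assuming the scheduler compares the earliest possible start time plus the requested walltime against the start of the downtime; the real scheduler may apply extra margins. The function names and all times below are hypothetical.

```python
from datetime import datetime, timedelta


def crosses_boundary(start, walltime, downtime_start):
    """True if a job starting at `start` could still be running at `downtime_start`."""
    return start + walltime > downtime_start


def longest_walltime_before(start, downtime_start):
    """Largest walltime that leaves zero overlap with the downtime (never negative)."""
    return max(timedelta(0), downtime_start - start)


# Hypothetical times, for illustration only.
now = datetime(2020, 9, 1, 8, 0)
downtime_start = datetime(2020, 9, 1, 12, 0)

print(crosses_boundary(now, timedelta(hours=5), downtime_start))  # True
print(crosses_boundary(now, timedelta(hours=3), downtime_start))  # False
print(longest_walltime_before(now, downtime_start))               # 4:00:00
```

In this model a 5-hour request made 4 hours before the downtime has to wait, while a 3-hour request could still be started.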

Not Running: PBS Error: Waiting for software licences

Jobs submitted with the PBS directive `-lsoftware=<software_string>` may show this comment when the LSD record shows there are not enough licence seats available for the job to run.

Most of the time, it is just a matter of waiting a bit longer. Once licence seats are released by other jobs, they will serve the next waiting job. To see how many jobs are waiting ahead of yours, search for `<software_string>` on the licence status page.

Note
If you believe the LSD record is wrong and there are actually enough licence seats for your job to check out, please launch a ticket and provide us with the job IDs.

Not Running: PBS Error: Execution server rejected request

This comment should be transient. It suggests the job was scheduled to run, but the compute node assigned to it had a problem and could not start the job at that time.

Tip
If the comment stays on your job for more than 10 minutes, please launch a ticket and provide us with the job IDs, and we will endeavour to fix the problem.

Job held, too many failed attempts to run

This comment appears when a job has had too many failed attempts to start, similar to the `Execution server rejected request` comment above. PBS tried to start the job several times and failed every time. This indicates that either something is seriously wrong in the job submission script, or every attempted start sent the job to the same failed but not yet detected node(s).

Note
The job is put on hold to allow our HPC team to investigate. You can't release the job yourself, but you can submit it again after making sure that the script is OK. If the job is put on hold again, please launch a ticket with the job IDs, and we will look into the issue for you.

Can't find your error here? Try looking in our PBS FAQ to see if your error is listed there.

If not, please contact the NCI Helpdesk or email help@nci.org.au, and we will endeavour to assist you.

Authors: Yue Sun, Andrew Wellington, Mohsin Ali
