
As its last step, each job sets up the scripts and data for the next job in the sequence, then issues a qsub command to submit that next job.

The biggest problem with self-submitting scripts arises when the jobs themselves are failing and therefore finishing immediately. If there is no limiter in the self-submission process, the constant flood of job submissions can overload the batch system and, if the jobs are parallel, needlessly tie up a large number of CPUs.
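
A minimal sketch of such a limiter in bash is shown here. It is not taken from any particular script: the names RUN_COUNT, MAX_RUNS, SUBMIT_CMD, should_resubmit and submit_next are hypothetical, chosen only for illustration. The idea is that each job receives a run counter through qsub's -v option and refuses to resubmit once the counter reaches the limit.

```shell
#!/bin/bash
# Illustrative sketch only. RUN_COUNT, MAX_RUNS, SUBMIT_CMD and both
# function names are hypothetical, not part of PBS.

# Succeeds (returns 0) while the run counter is still below the limit.
should_resubmit() {
    local run_count=$1 max_runs=$2
    [ "$run_count" -lt "$max_runs" ]
}

# Submits the next job with an incremented counter, or refuses once the
# limit has been reached. SUBMIT_CMD defaults to qsub and can be
# overridden to dry-run the logic without a batch system.
submit_next() {
    local run_count=$1 max_runs=$2 job_script=$3
    local next=$((run_count + 1))
    if should_resubmit "$run_count" "$max_runs"; then
        # qsub -v passes a comma-separated list of variables to the job.
        ${SUBMIT_CMD:-qsub} -v RUN_COUNT="$next",MAX_RUNS="$max_runs" "$job_script"
    else
        echo "Reached limit of $max_runs runs; not resubmitting." >&2
        return 1
    fi
}

# Possible use as the last step of a job script (assumed file name):
#   submit_next "${RUN_COUNT:-1}" "${MAX_RUNS:-5}" myjob.bash
```

Because the counter travels with each submission, a job that fails immediately can generate at most MAX_RUNS submissions instead of an unbounded stream.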

The simple examples linked below illustrate how to build in a limit on the number of times a job will resubmit itself. The examples allow the execution of a sequence of checkpoint-restart jobs. These scripts are only templates; you will need to modify them for your application. In particular, the PBS qsub options (#PBS lines) will almost certainly need modifying, the executable line and any extra processing needed for restarting will have to be added, and you may want to modify or eliminate the verbose output the scripts produce.

  1. undertime.bash and undertime.tcsh assume that the compute phase of the job will complete within the walltime of the PBS job. Most likely the compute executable will finish by writing a checkpoint file and doing whatever cleanup is necessary for the follow-on job to start.
  2. overtime.bash and overtime.tcsh assume that the compute phase of the job may be terminated by PBS for exceeding the job's walltime limit. The assumption is that the executable is regularly writing checkpoint files and can handle being terminated at an arbitrary point in execution.
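
The practical difference between the two cases is when the follow-on job gets submitted. The sketch below shows one possible ordering for each case and is not necessarily how the linked scripts do it. The function run_step and the variable SUBMIT_CMD are hypothetical. PBS_JOBID is the job identifier that PBS sets in the job's environment, and -W depend=afterany is the qsub dependency option that holds a job until the named job has ended.

```shell
#!/bin/bash
# Illustrative sketch only. run_step and SUBMIT_CMD are hypothetical names.
# Usage: run_step MODE NEXT_JOB_SCRIPT COMMAND [ARGS...]

run_step() {
    local mode=$1 job_script=$2
    shift 2   # the remaining arguments are the compute command
    case "$mode" in
        undertime)
            # The compute phase is expected to finish within the walltime,
            # so run it first and submit the next job only if it succeeded.
            "$@" || return 1
            ${SUBMIT_CMD:-qsub} "$job_script"
            ;;
        overtime)
            # The compute phase may be killed at the walltime limit, so
            # queue the next job first, held until this job has ended.
            ${SUBMIT_CMD:-qsub} -W depend=afterany:"${PBS_JOBID}" "$job_script"
            "$@"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 2
            ;;
    esac
}
```

In the overtime ordering the next job is queued regardless of how the compute phase ends, so a limit on the number of resubmissions matters even more there.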