If your job exceeds the memory request you gave in your PBS job script, you may receive a message like the following:

...

For example, if you request 128 CPUs in the normal queue, and 128GB of RAM, your job will be using 8 nodes (there are 16 CPU cores in each node in the normal queue on Raijin) and 16GB of RAM will be allocated on each of the 8 nodes.
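The arithmetic in this example can be sketched in a few lines of Python. This is an illustrative sketch only, not an NCI tool: the function name `memory_per_node` is hypothetical, the default of 16 cores per node is taken from the normal queue figure quoted above, and rounding up to whole nodes for requests that are not a multiple of 16 is an assumption.

```python
import math
from typing import Tuple


def memory_per_node(ncpus: int, mem_gb: float, cores_per_node: int = 16) -> Tuple[int, float]:
    """Return (node_count, memory_per_node_gb) for a job request.

    Assumes the requested memory is divided evenly across the nodes,
    as in the 128 CPU / 128GB example above. Rounding up to whole
    nodes is an assumption made for this sketch.
    """
    if ncpus <= 0 or cores_per_node <= 0:
        raise ValueError("ncpus and cores_per_node must be positive")
    nodes = math.ceil(ncpus / cores_per_node)
    return nodes, mem_gb / nodes


nodes, per_node_gb = memory_per_node(ncpus=128, mem_gb=128)
print(f"{nodes} nodes, {per_node_gb:g}GB per node")  # 8 nodes, 16GB per node
```

The per-node figure is the one to compare against the memory your processes use on any single node, since the job can exceed its allocation on one node even when the total across all nodes is within the request.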

...

The first line of the message tells you which job exceeded the memory allocation, and on which node the allocation was exceeded. This may help you determine which node in a multi-node job is exceeding its memory if there is a memory imbalance between nodes.

The subsequent lines list every process running as part of your job on that node. The process name and process ID are indicated and help identify which process in your job is using significant memory or is not balanced with other processes. The RSS value is the "resident set size", or actively used memory, for that process in bytes. The vmem value is the virtual memory address space of the process; this may be significantly larger than your memory usage, as virtual memory address space may be allocated by some processes but never actually used.
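As a rough illustration of how the per-process RSS values relate to the per-node allocation, the sketch below totals RSS for one node and picks out the largest process. It does not parse the actual message. The process names, process IDs and RSS values are made up, the function name `summarise_node` is hypothetical, and treating 1GB as 1024**3 bytes is an assumption.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary units


def summarise_node(processes, allocated_gb):
    """Summarise memory use on one node.

    processes: list of (name, pid, rss_bytes) tuples, with RSS in bytes
    allocated_gb: memory allocated to the job on this node, in GB
    """
    if not processes:
        raise ValueError("no processes given")
    total_bytes = sum(rss for _name, _pid, rss in processes)
    largest = max(processes, key=lambda p: p[2])  # process with the highest RSS
    return {
        "total_gb": total_bytes / BYTES_PER_GB,
        "largest": largest[:2],  # (name, pid)
        "over_allocation": total_bytes > allocated_gb * BYTES_PER_GB,
    }


# Made-up values for illustration only.
example = [
    ("solver", 1001, 9 * BYTES_PER_GB),
    ("solver", 1002, 9 * BYTES_PER_GB),
    ("bash", 1000, 4 * 1024 ** 2),
]
summary = summarise_node(example, allocated_gb=16)
print(summary["over_allocation"], summary["largest"])
```

Only RSS is summed here. The vmem figures are left out because address space that is allocated but never used does not count towards the memory the job actually occupies.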

...

If your memory usage looks higher than expected for your jobs, you may wish to look into ways to reduce the memory usage of the job, either through configuration or input file options, or by changing the code if you are developing your own. Note that for applications that allow you to tweak their memory usage through configuration or input file options, using more memory often does not actually result in improved performance. You can experiment with different values to determine the most efficient options for your jobs.


If you have questions or need help, please submit a help request or email help@nci.org.au.  


...