Gadi is Australia’s most powerful supercomputer, a highly parallel cluster comprising more than 150,000 processor cores on ten different types of compute nodes. Gadi accommodates a wide range of tasks, from running climate models to genome sequencing, from designing molecules to astrophysical modelling. To start using Gadi, read this page first: it covers most of the basics you need to know before submitting your first job.
...
Resource Name | Owner | Accessible from | Size Limit | Allocation Valid Until | Resource Specific Comments
---|---|---|---|---|---
Compute Hours | project | n.a. | amount set by scheme manager | end of quarter |
$HOME | user | PBS jobs / login nodes | 10 GiB, with no possible extension | user account deactivation |
/scratch/$PROJECT | project | PBS jobs† / login nodes | 72 GiB by default, more on jobs' demand | project retirement |
/g/data/$PROJECT | project | PBS jobs† / login nodes | amount set by scheme manager | project retirement |
massdata (mdss) | project | PBS copyq jobs† / login nodes | amount set by scheme manager | project retirement | tape-based archival data storage
$PBS_JOBFS | user | PBS jobs* | disk space available on the job's hosting node(s) | job termination |
software applications | NCI | PBS jobs / login nodes | n.a. | n.a. |
license | software group owner | PBS jobs / login nodes | available seats on the licensing server | license expiry date |
...
This shows the project's total, used, reserved, and available compute grant in the current billing period at the time of the query. SU is short for Service Unit, the unit in which Gadi compute hours are measured. Jobs run in the Gadi normal queue are charged 2 SUs per hour on a single core with a memory allocation of up to 4 GiB. Jobs on Gadi are charged for the proportion of a node’s total memory that they request: see more examples of job cost here. In addition, different Gadi queues have different charge rates: see the breakdown of charge rates here.
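As a rough sketch of the charging rule just described, the cost of a normal-queue job can be estimated as below. The function name is illustrative, and the max-of-cores-and-memory rule is an assumption based on the proportional memory charge described above (2 SUs per core-hour, 4 GiB of memory per core in the normal queue); this is not an official calculator.

```python
def job_cost_su(ncpus, mem_gib, walltime_hours,
                charge_rate=2.0, mem_per_core_gib=4.0):
    """Estimate a job's SU cost: charged on the requested cores or the
    memory-equivalent number of cores, whichever is larger."""
    effective_cores = max(ncpus, mem_gib / mem_per_core_gib)
    return charge_rate * effective_cores * walltime_hours

# One core with 4 GiB for one hour costs 2 SUs, as stated above.
print(job_cost_su(1, 4, 1))      # 2.0
# 48 cores with 190 GiB for 2 hours: 190/4 = 47.5 < 48, so charged on cores.
print(job_cost_su(48, 190, 2))   # 192.0
```

Note how a 1-core job that requests 8 GiB is charged as two cores' worth of memory, which is why over-requesting memory raises the bill even when CPU usage is low.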
The Grant amount listed is always equal to the sum of Used, Reserved and Avail. Every time a job is submitted under a project, its potential cost, calculated from the requested walltime, is reserved from the total Grant, reducing the Avail amount. The actual cost, based on the walltime used, is determined only after the job completes. If the job finishes within its requested walltime limit, the over-reserved allocation is returned to Avail and the actual cost is added to Used. When the project has no jobs running or waiting to run, Reserved returns to zero.
If there are not enough SUs available to cover a job's requested resources, the job will wait in the queue indefinitely. The project lead CI should contact their allocation scheme manager to apply for more. If you are not sure which scheme manager to contact, run `nci_account` with the `-v` flag: the verbose output provides more granular information on a per-user and per-stakeholder basis.
Each user has a project-independent home directory. The storage limit of the home folder is fixed at 10 GiB. We recommend using it to host scripts that you want to keep to yourself; to share data with other users, host it elsewhere, see our Data Sharing Suggestions. All data on /home is backed up. In the case of accidental deletion of data inside the home folder, you can retrieve it by following the example here.
...
The first column in the output shows the permissions set on the folder/file. For more information on Unix file permissions, see this page.
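As a quick illustration of what that first column encodes, Python's standard `stat` module can render the same permission string for any file. This is a generic Unix example, not specific to Gadi:

```python
import os
import stat
import tempfile

# Create a throwaway file and set its mode to 0o640:
# read/write for the owner, read-only for the group, nothing for others.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
os.chmod(path, 0o640)

# stat.filemode() produces the same string that `ls -l` shows
# in its first column for this file.
print(stat.filemode(os.stat(path).st_mode))   # -rw-r-----

os.remove(path)
```

The leading character marks the file type (`-` for a regular file, `d` for a directory), followed by three triplets of read/write/execute bits for owner, group, and others.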
To look up how much storage you have access to, and through which projects, run the command ‘lquota’ on the login node. It prints the storage allocation info together with live usage data. For example, the return message
...
The second section, the lines that start with ‘#PBS’, specifies how much of each resource the job needs. It requests exclusive access to 48 CPU cores, 190 GiB of memory, and 200 GiB of local disk on a compute node in the normal queue for 2 hours. It also asks the system to mount the a00 project folders on both the /scratch and /g/data filesystems inside the job, and to enter the working directory once the job starts. Please see more PBS directives explained here. Note that a ‘-lstorage’ directive must be included if you need access to /g/data, otherwise these folders will not be accessible while the job is running.
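The directives just described might look like the following in a submission script. This is a sketch only: project a00 and the resource amounts are taken from the example above and should be replaced with your own project code and requirements.

```shell
#!/bin/bash
# Second section: resource requests for the PBS scheduler.
#PBS -q normal
#PBS -P a00
#PBS -l ncpus=48
#PBS -l mem=190GB
#PBS -l jobfs=200GB
#PBS -l walltime=02:00:00
#PBS -l storage=scratch/a00+gdata/a00
#PBS -l wd

# Third section: the commands that define the compute task go here.
```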
To find the right queue for your jobs, please browse the Gadi Queue Structure and Gadi Queue Limit pages.
Info |
---|
Users are encouraged to request resources so that the task(s) run around the ‘sweet spot’, where the job benefits from parallelism and achieves a shorter execution time while utilising at least 80% of the requested compute capacity. While searching for the sweet spot, be aware that it is common for a task to contain components that run only on a single core and cannot be parallelised. These sequential parts drastically limit the parallel performance. For example, a 1% sequential part in a workload limits the overall CPU utilisation rate of the job, when running in parallel on 48 cores, to less than 70%. Moreover, parallelism adds overhead which in general grows with increasing core count and, beyond the ‘sweet spot’, results in time wasted on unnecessary task coordination. |
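The 48-core figure above follows from Amdahl's law, which the short sketch below evaluates. It assumes an idealised task with no communication overhead, so real utilisation will be somewhat lower still:

```python
def speedup(serial_fraction, cores):
    """Ideal parallel speedup under Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

def utilisation(serial_fraction, cores):
    """Fraction of the requested cores doing useful work."""
    return speedup(serial_fraction, cores) / cores

# A workload with a 1% sequential part on 48 cores:
print(round(utilisation(0.01, 48), 2))   # 0.68, i.e. below 70%
```

Evaluating `utilisation` over a range of core counts for your own workload's sequential fraction is a simple way to estimate where the sweet spot lies before running scaling tests.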
...
Info |
---|
Users are encouraged to run their tasks in bigger jobs where possible, to take advantage of the massive parallelism that can be achieved on Gadi. However, depending on the application, it may not be possible for a job to run on more than a single core/node. For applications that do run on multiple cores/nodes, the commands and scripts/binaries called in the third section determine whether the job can utilise the requested amount of resources. Users need to edit the script/input files that define the compute task to allow it, for example, to run on multiple cores/nodes. It may take several iterations to find the ideal settings for sections two and three of the submission script when exploring around the job's sweet spot. |
...