...
Panel |
---|
Gadi is Australia’s most powerful supercomputer, a highly parallel cluster comprising more than 200,000 processor cores on ten different types of compute nodes. Gadi accommodates a wide range of tasks, from running climate models to genome sequencing, from designing molecules to astrophysical modelling. To start using Gadi, you should read this page, which covers most of the basics you need to know before submitting your first job. |
...
Info |
---|
HOW TO USE THIS PAGE: This page should take new users around an hour to read through. Don't worry if some of the sections don't mean anything to you yet. If you think a section is not relevant to your workflow, simply go on to the next one. If you later run into an error, you can always come back and read the skipped section then. |
Note |
---|
To use Gadi, you need to be registered as an NCI user first. Read our notes here on User Account Registration. To register and get a username, please follow the instructions at the MyNCI portal. |
...
Code Block |
---|
$ echo $DISPLAY
localhost:76.0
$ xclock    # A window with an analogue clock will be displayed on the local system |
When logging in, the first thing you see under the welcome text is the Message of the Day (motd); you can see an example in the block above. Please treat it as a noticeboard and read it on every login: news and status updates relevant to Gadi users are posted there.
Users are encouraged to submit and monitor jobs, prepare scripts for job submissions, compile small applications, and transfer small amounts of data on the login nodes. To ensure fair usage of login node resources amongst all users, any process that cumulatively uses more than 30 minutes of CPU time and/or instantaneously uses more than 4 GiB of memory on a login node will be killed immediately.
...
At login, your landing point is your home directory, whose path is set in the environment variable HOME in the login shell. Your default project is also set, in the variable PROJECT, and your default shell in SHELL, according to the file $HOME/.config/gadi-login.conf.
...
To add more default functions, aliases, or modules to your login shell and/or the PBS job shell, edit the corresponding `if` clause(s) in the file $HOME/.bashrc: `in_interactive_shell` applies to both interactive PBS jobs and login shells, `in_login_shell` to login shells only, and `in_pbs_job` to non-interactive PBS jobs.
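The structure described above can be sketched as follows; the aliases, variables, and module names here are purely illustrative, and the clause layout should match what your default $HOME/.bashrc already contains:

```shell
# Sketch of the relevant clauses in $HOME/.bashrc (contents illustrative)
if in_interactive_shell; then
    # runs in login shells AND interactive PBS jobs
    alias lt='ls -ltr'            # hypothetical alias
fi
if in_login_shell; then
    # runs in login shells only
    export EDITOR=vim             # hypothetical default editor
fi
if in_pbs_job; then
    # runs in non-interactive PBS jobs only
    module load openmpi           # hypothetical module
fi
```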
...
Users can perform various tasks on Gadi: running computationally intensive jobs on the compute nodes, building their own applications on the login and data-mover nodes, and managing storage on the different filesystems. A summary of all the resources available on Gadi is shown below.
...
†† The module file and the PBS `-lsoftware` directive are used to control access to licenses [example link]
To run jobs on Gadi, users need to have sufficient allocated compute hours available. Importantly, compute allocations are granted to projects instead of directly to users. Only members of a project can look up and use its compute allocation. To look up how much compute allocation is available in your project, run the command ‘nci_account’. For example, to check the grant/usage of the project a00, user aaa777 would run
...
If there are not enough SUs available for a job to run with its requested resource amounts, the job will wait in the queue indefinitely; the project lead CI should then contact their allocation scheme manager to apply for more. If you are not sure which scheme manager to contact, check the verbose output of the command `nci_account`: when given the `-v` flag, it provides more granular information on a per-user and per-stakeholder basis.
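As a rough illustration of how SU usage accrues, a job's cost can be estimated as charge rate × ncpus × walltime; 2 SU per core-hour is the rate for the `normal` queue, other queues charge differently, and memory-based charging (when a job's memory share of a node exceeds its core share) is ignored here:

```shell
# Estimate the SU cost of a job: SUs = charge_rate * ncpus * walltime_hours
charge_rate=2         # SU per core-hour on the normal queue; other queues differ
ncpus=48
walltime_hours=2
su_cost=$(( charge_rate * ncpus * walltime_hours ))
echo "${su_cost} SUs"    # 192 SUs debited from the project's allocation
```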
Each user has a project-independent home directory. The storage limit of the home folder is fixed at 10 GiB. We recommend using it to host any scripts you want to keep to yourself; to share data with others, use another location, see our Data Sharing Suggestions. All data on /home is backed up, so in the case of accidental deletion of any data inside the home folder, you can retrieve it by following the example here.
Users get access to storage space on the /scratch filesystem, and on the /g/data filesystems if the project folder exists, through project memberships. For example, the user jjj777 is a member of the projects a00 and b11, therefore, jjj777 has permission to read and create files/folders in the folders /scratch/a00/jjj777 and /scratch/b11/jjj777.
...
Code Block |
---|
------------------------------------------------------------------------------
    project        user      space used      file size        count
------------------------------------------------------------------------------
       xy11      jjj777          14.3GB         13.2GB       112086
...
------------------------------------------------------------------------------ |
All Gadi jobs by default have 100 MB of storage space allocated on the hosting compute node(s). The path to this storage is set in the environment variable PBS_JOBFS in the job shell. We encourage users to use the folder $PBS_JOBFS in jobs that generate large numbers of small IO operations: it not only boosts the job's performance by avoiding the cost of those frequent, small IO operations, but also saves a lot of your project's inode usage on shared filesystems like /scratch and /g/data. Note that the folder $PBS_JOBFS is physically deleted upon job completion/failure; it is therefore crucial to copy any data you need from $PBS_JOBFS back to a directory on the shared filesystems while the job is still running.
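A minimal sketch of this pattern inside a job script follows; the project code, paths, and application name are illustrative, and the script assumes it is running inside a PBS job where $PBS_JOBFS is set:

```shell
#!/bin/bash
# Run small-file I/O on the node-local disk, then copy results back
# before the job ends ($PBS_JOBFS is deleted at job completion).
cd "$PBS_JOBFS"
tar -xf /scratch/a00/aaa777/inputs.tar             # unpack many small files locally
/scratch/a00/aaa777/my_app -i inputs/ -o results/  # hypothetical application
tar -cf /scratch/a00/aaa777/results.tar results/   # archive results back to /scratch
```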
...
If the job runs on multiple compute nodes, this request of 100 GiB of space is distributed equally among all the nodes. If the job requests more than the available disk space on the compute node(s), the submission will fail. Please browse the Gadi Queue Structure and Gadi Queue Limit pages to look up how much local disk is available on each type of compute node.
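For example, assuming 48-core nodes (node size depends on the queue), a request like the following sketch gives each of the two nodes 50 GiB of local disk:

```shell
#PBS -l ncpus=96      # two 48-core nodes (node size is queue-dependent)
#PBS -l jobfs=100GB   # split evenly: 50 GiB of node-local disk per node
```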
NCI operates a tape filesystem called massdata to provide an archive service for projects that need to back up their data. Massdata has two large tape libraries in separate machine rooms in two separate buildings. Projects have their data in their own path massdata/<project> on massdata, but the path is not directly accessible from Gadi: data requests must be launched from Gadi login nodes or from within copyq jobs. Here is a video clip showing the tape robot at work.
...
If you need access to massdata or more space on it, the project Lead CI should contact their allocation scheme manager to apply for it.
Gadi has many software applications installed in the directory /apps and uses `Environment Modules` to manage them. To run any of them, load the corresponding module first. If the application requires licenses, join the corresponding software group through my.nci.org.au like you would join other projects.
If the applications or packages you need are not centrally installed on Gadi, please contact help@nci.org.au to discuss whether it is suitable to install them centrally. We do not install all requested software applications, but we are happy to assist with installations inside your own project folder.
The command `module avail` prints out the complete list of software applications centrally available on Gadi. To look for a specific application, please run `module avail <app_name>`. For example, `module avail open` prints out all the available versions for applications whose name starts with `open`.
...
To read more about `module` commands, please read the page Environment Modules.
Software group membership enables access to a particular licensed application. To join a software group, log in to my.nci.org.au, navigate to ‘Projects and Groups’, and search for the software name.
Once you've identified the corresponding software group, read all the content under the ‘Join’ tab carefully before sending the membership request. It may take a while for the software group lead CI to approve the request; if you need approval urgently, send an email to help@nci.org.au or submit a ticket at help.nci.org.au. Once the membership is approved, access to the application will be enabled.
...
To reserve enough license seats for your PBS jobs, add the PBS directive `-lsoftware=<license_name>` to your submission script. To look up the right <license_name> for the license owned by the software group you joined, see the live license usage page, where we publish how many of which licenses are in use by, and reserved for, which jobs. Use it whenever you need to know the status of the licenses your jobs are requesting.
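For example, the directives section of a submission script reserving license seats might look like the following sketch; the license name here is hypothetical, so look up the real one on the live license usage page:

```shell
#PBS -q normal
#PBS -l ncpus=48,mem=190GB,walltime=02:00:00
#PBS -l software=abaqus    # hypothetical <license_name>; check the live usage page
```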
...
...
Gadi has six designated data-mover nodes behind the domain name `gadi-dm.nci.org.au`. Please use this hostname when transferring files to and from Gadi.
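For example, from a terminal on your local computer, files can be pushed through the data-mover hostname with standard tools; the username, project code, and paths below are illustrative:

```shell
# Copy a single file to your /scratch folder via the data-mover nodes
scp data.tar aaa777@gadi-dm.nci.org.au:/scratch/a00/aaa777/
# Or synchronise a whole folder, with progress display and resume support
rsync -avP results/ aaa777@gadi-dm.nci.org.au:/g/data/a00/aaa777/results/
```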
...
To transfer files/folders using MobaXterm on a Windows system, log in to Gadi via MobaXterm and then drag and drop files/folders between your local computer and Gadi.
...
...
To run compute tasks such as simulations, weather models, and sequence assemblies on Gadi, users need to submit them as ‘jobs’ to ‘queues’. Job submission lets users specify the queue, duration, and resource needs of their jobs. Gadi uses PBSPro to schedule all submitted jobs and keeps nodes with different hardware in different queues; see details about the hardware available in each queue on the Gadi Queue Structure page. To run jobs on a particular type of node, submit them to the corresponding queue.
...
We recommend users check these two log files before proceeding with the post-processing of any output/result from the corresponding job.
To submit a job defined in a submission script, called for example ‘job.sh’, run
...
The submission script consists of three sections. The first line specifies which shell to use. The second section includes all the PBS directives that define the resources the job needs and the last section contains all the command lines that you would use in an interactive shell to run the compute task.
Here is an example job submission script to run the python script ‘main.py’ which is assumed to be located inside the same folder where you run ‘qsub job.sh’.
...
Info |
---|
Users are encouraged to run their tasks in bigger jobs where possible, to take advantage of the massive parallelism achievable on Gadi. However, depending on the application, it may not be possible for a job to run on more than a single core/node. For applications that do run on multiple cores/nodes, the commands and scripts/binaries called in the third section determine whether the job can actually utilise the requested amount of resources. Users need to edit the scripts/input files that define the compute task to allow it, for example, to run on multiple cores/nodes. It may take several iterations of adjusting sections two and three of the submission script to find the job's sweet spot. |
We recommend users try out their workflow in an interactive job before submitting tasks with a prepared submission script. When an interactive job starts, the user gets an interactive shell on the head compute node, which allows quick testing of possible solutions. For example, this can be used to debug large jobs that run many parallel processes on many nodes, or to install applications that have to be built on a node where GPUs are available.
...
Once the job starts, the folder /g/data/a00 is mounted into the job and the shell enters the working directory from which the job was submitted.
Here is a minimal example of an interactive job. You can tell the interactive job has started when the prompt changes from something like [aaa777@gadi-login-03 ~] to [aaa777@gadi-gpu-v100-0079 ~]. The job shell doesn't have any modules loaded by default; if you repeatedly need certain modules inside interactive jobs, edit the file ~/.bashrc to load them automatically. Once you're done with the interactive job, run ‘exit’ to terminate it.
Code Block |
---|
[aaa777@gadi-login-03 ~]$ qsub -I -lwalltime=00:05:00,ncpus=48,ngpus=4,mem=380GB,jobfs=200GB,wd -qgpuvolta
qsub: waiting for job 11029947.gadi-pbs to start
qsub: job 11029947.gadi-pbs ready

[aaa777@gadi-gpu-v100-0079 ~]$ module list
No Modulefiles Currently Loaded.
[aaa777@gadi-gpu-v100-0079 ~]$ exit
logout

qsub: job 11029947.gadi-pbs completed |
For bulk data transfers (say, more than 500 GiB), we recommend doing them in a job submitted to the queue ‘copyq’, because long data transfer processes running on a login node will be terminated once they reach the 30-minute cumulative CPU time limit. Long software installations that require an internet connection should also be run inside copyq jobs.
...
Code Block |
---|
$ qsub job.sh |
on a login node.
An example copyq job submission script job.sh could be
...
To compile code inside a copyq job, you may need to load modules such as intel-compiler and request more jobfs to provide enough disk space for the data written to $TMPDIR.
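A sketch of the directives and setup for such a copyq compilation job follows; the walltime, jobfs size, and build command are illustrative:

```shell
#PBS -q copyq
#PBS -l ncpus=1,mem=4GB,walltime=02:00:00
#PBS -l jobfs=50GB            # extra local disk for compiler temporaries
#PBS -l wd
module load intel-compiler    # version left to the default here
export TMPDIR="$PBS_JOBFS"    # point temporary files at the node-local disk
make                          # hypothetical build step with internet access available
```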
...
Once a job submission is accepted, its jobID is shown in the return message and can be used to monitor the job's status. Users are encouraged to keep monitoring their own jobs at every stage of their lifespan on Gadi. Note, however, that excessive polling of the PBS servers for monitoring purposes may be treated as an attack; a frequency of one monitoring query every 10 minutes is more than enough.
Note |
---|
The PBS server which manages job submissions and scheduling answers all submissions, queries, and requests related to jobs. To ensure its quick response to essential requests, don't launch frequent job monitoring queries. |
To look up the status of a job in the queue, run the command ‘qstat’. For example, to look up the job 12345678 in the queue, run
...
It shows that the job was submitted by the user aaa777 to the normal queue, requested 48 cores and 190 GiB memory for 2 hours and has been running (S = "R") for 35 minutes and 21 seconds. The line at the bottom says the job started at 10:38 am on 4 Sept on the compute node gadi-cpu-clx-2697 where it has exclusive access to 48 cores, 190 GiB memory and 200 GiB jobfs local disk.
We encourage users to keep monitoring their own jobs' utilisation rate at every stage because, if a job runs into errors and fails to exit, this shows up clearly as a drop in the utilisation rate.
...
Code Block |
---|
                                 %CPU  WallTime  Time Lim     RSS     mem   memlim  cpus
12345678 R aaa777  a00   job.sh    23  00:36:47   2:00:00  5093MB  5093MB  190.0GB    48 |
To monitor the status of processes inside a job, run the command ‘qps’. For example, to take a snapshot of the process status of the job 12345678, run
...