
Overview

...

Gadi is Australia’s most powerful supercomputer, a highly parallel cluster comprising more than 200,000 processor cores on ten different types of compute nodes. Gadi accommodates a wide range of tasks, from running climate models to genome sequencing, from designing molecules to astrophysical modelling. To start using Gadi, read this page, which covers most of the basics you need to know before submitting your first job.

...

Info

HOW TO USE THIS PAGE

This page should take new users around an hour to read through. 

  • The first section, Overview, gives an overview of Gadi.
  • The second and third sections, Logging In & Login Nodes and Login Environment, explain how to log in to Gadi and describe some important environment variables set at login.
  • The fourth section, Gadi Resources, summarises the various resources available on Gadi and briefly covers the basics of each type of resource.
  • The fifth section, File Transfer to/from Gadi, provides suggestions about moving data in and out of Gadi.
  • The sixth section, Gadi Jobs, explains how jobs work on Gadi and includes three job submission examples.
  • The last section, Job Monitoring, shows the commands that can be used to monitor jobs on Gadi, with example return messages explained.

Don't worry if some of the sections don't mean anything to you, yet. If you think they are not relevant to your workflow, simply go on to the next section. If later you run into an error, you can always come back and read the skipped section then.

Logging In & Login Nodes

Note

To use Gadi, you need to be registered as an NCI user first. Read our notes on User Account Registration.

To register and get a username, please follow the instructions at the MyNCI portal.


Info

With MobaXterm on Windows, create a new SSH session by clicking the `Session` tab or the `Sessions` menu item in the top-left corner. Enter `gadi.nci.org.au` as `Remote host` and your NCI username under `Specify username`, then log in with the new SSH session.

To run jobs on Gadi, you need to first log in to the system. Users on Mac/Linux can use the built-in terminal. For Windows users, we recommend using MobaXterm as the local terminal. Logging in to Gadi happens through a Gadi login node.

For example, user aaa777 would run

Code Block
languagebash
$ ssh aaa777@gadi.nci.org.au

...

Code Block
$ ssh aaa777@gadi.nci.org.au
aaa777@gadi.nci.org.au's password: 
###############################################################################
#                  Welcome to the NCI National Facility!                      #
#      This service is for authorised clients only. It is a criminal          #
#      offence to:                                                            #
#                - Obtain access to data without permission                   #
#                - Damage, delete, alter or insert data without permission    #
#      Use of this system requires acceptance of the Conditions of Use        #
#      published at http://nci.org.au/users/nci-terms-and-conditions-access   #
###############################################################################
|         gadi.nci.org.au - 185,032 processor InfiniBand x86_64 cluster       |
===============================================================================

Mar 26 2021 New Cascade Lake Megamem Now Available
   We have added 4 new Cascade Lake megamem (3TB) nodes to Gadi. These nodes
   serve the "megamem" queue. Note that specialised queues should only be used
   by jobs that require the specialised resources of that queue.

   For more information about queues see: https://opus.nci.org.au/x/2ABiBQ

Jul 1 2021 08:00 - 18:00 Scheduled Downtime
   Gadi will be down for scheduled maintenance from 8:00am 1st July, 2021 to
   6:00pm 1st July, 2021. Jobs that run into this time period will not be
   launched until after the scheduled downtime. Access to Gadi will be restored
   as quickly as possible following the downtime.

   Update 18:45: Gadi is online and processing jobs.

===============================================================================

[aaa777@gadi-login-02 ~]$ 


To run graphical tools, enable the X Window System on your local system before connecting with ssh. This can be done by running an X server such as XQuartz (Mac), MobaXterm (Windows), or startx or similar (Linux). Then ssh to Gadi:

Code Block
languagebash
$ ssh -Y aaa777@gadi.nci.org.au

The -Y option enables forwarding of trusted X protocol messages between the X server on your local system and X programs on Gadi.

Once logged in, check that the environment variable DISPLAY is set and test with a simple graphical utility such as xclock. For example:

Code Block
$ echo $DISPLAY
localhost:76.0

$ xclock                # A window with analogue clock will be displayed on local system

Message of the Day

When logging in, the first thing you see under the welcome text is the Message of the Day (motd). You can see an example in the above block. Please consider it a noticeboard and read it on every login. News and status updates relevant to Gadi users will be posted to it.

Usage Limit on Login Nodes

Users are encouraged to submit and monitor jobs, prepare scripts for job submissions, compile small applications, and transfer a small amount of data on the login nodes. To ensure fair usage of login node resources amongst all users, any processes that cumulatively use more than 30 minutes CPU time and/or instantaneously use more than 4 GiB memory on any login node will be killed immediately.

...

...

Login Environment

At login, your landing point is your home directory, whose path is set in the environment variable HOME in the login shell. The login process also sets your default project in PROJECT and your default shell in SHELL, according to the file $HOME/.config/gadi-login.conf.
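
As a quick check after logging in, the variables named above can be printed from the shell. The following is only a minimal sketch: the helper name `show_login_env` and the `unset` placeholder are illustrative and are not part of Gadi.

```shell
# Print the three variables described above, one per line.
# show_login_env is a hypothetical helper name used only in this sketch.
show_login_env() {
    for name in HOME PROJECT SHELL; do
        # Indirect lookup; "unset" is just a placeholder for a missing value.
        eval "value=\${$name:-unset}"
        printf '%s=%s\n' "$name" "$value"
    done
}

show_login_env
```

On a machine other than Gadi, PROJECT will most likely be reported as `unset`.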

...

To make more functions, aliases, or modules available by default in your login shell and/or the PBS job shell, edit the file $HOME/.bashrc and place them in the corresponding `if` clause(s): `in_interactive_shell` for both interactive PBS jobs and login shells, `in_login_shell` for login shells only, and `in_pbs_job` for non-interactive PBS jobs.

...


Gadi Resources

Users can perform various tasks on Gadi, for example, to run computationally intensive jobs on compute nodes, build their own applications on login nodes/data-mover nodes and manage storage on different filesystems. A summary of the available resources of all kinds on Gadi is shown below.

Resource Name


Owner

Accessible from 

Size Limit

Allocation Valid Until

Resource Specific Comments

Compute Hours

project

n.a.

amount set by scheme manager

end of quarter

storage

$HOME

user

PBS jobs / login nodes

10 GiB  with no possible extension 

user account deactivation

  • with backups in $HOME/.snapshot

/scratch/$PROJECT

project

PBS jobs† / login nodes

1 TiB by default, more on jobs' demand

project retirement/job demand changes
  • designed for jobs with large data IO
  • no backups
  • files not accessed for more than 100 days are automatically moved from project directories on /scratch into a quarantine space
  • any files remaining in quarantine at the end of the 14-day quarantine period will be automatically deleted
  • number-of-files limit applied

/g/data/$PROJECT

project

PBS jobs† / login nodes

amount set by scheme manager

project retirement 

  • designed for hosting persistent data
  • no backups
  • number-of-files limit applied
  • also accessible from other NCI services, like cloud

massdata

project

PBS copyq jobs† / login nodes

amount set by scheme manager

project retirement

  • two copies created in two different buildings
  • tape-based archival data storage

$PBS_JOBFS

user

PBS jobs *

disk space available on the job's hosting node(s)

job termination

  • no backups
  • designed for jobs with frequent and small IO
software applications

NCI

PBS jobs / login nodes

n.a.

n.a.

  • built from source on Gadi when possible
  • more can be added on request ‡
license

software group owner

PBS jobs / login nodes

available seats on the licensing server

license expiry date

  • access controlled by software group membership ††
  • NCI owned licenses are for academic use only
  • projects, institutions and universities can bring in their own licenses 

...

†† module file and PBS `-lsoftware` directive are used when controlling access to license [example link] 

Compute Hours

To run jobs on Gadi, users need to have sufficient allocated compute hours available. Importantly, compute allocations are granted to projects instead of directly to users. Only members of a project can look up and use its compute allocation. To look up how much compute allocation is available in your project, run the command ‘nci_account’. For example, to check the grant/usage of the project a00, user aaa777 would run

...

If there are not enough SUs available for a job to run with its requested resources, the job will wait in the queue indefinitely. The project lead CI should contact their allocation scheme manager to apply for more. If you are not sure which scheme manager to contact, see the verbose output of the command `nci_account`. With the `-v` flag, it provides more granular information on a per-user and per-stakeholder basis.

The Home Folder $HOME

Each user has a project-independent home directory. The storage limit of the home folder is fixed at 10 GiB. We recommend using it to host any scripts you want to keep to yourself. Users are encouraged to share their data elsewhere, see our Data Sharing Suggestions. All data on /home is backed up. In the case of ‘accidental’ deletion of any data inside the home folder, you can retrieve the data by following the example here.

Project Folder on Lustre Filesystems /scratch and /g/data

Users get access to storage space on the /scratch filesystem, and on the /g/data filesystems if the project folder exists, through project memberships. For example, the user jjj777 is a member of the projects a00 and b11, therefore, jjj777 has permission to read and create files/folders in the folders /scratch/a00/jjj777 and /scratch/b11/jjj777.

...

The first column in the output shows the permissions set for the folder/file. For more information on unix file permissions, see this page.

To look up how much storage you have access to through which projects, run the command ‘lquota’ on the login node. It prints out the storage allocation info together with its live usage data. For example, the return message 

...

For example, to take a look at the snapshot of the usage by the project xy11 on the filesystem /scratch, run

Code Block
languagebash
$ nci-files-report -f scratch -g xy11

...

Code Block
------------------------------------------------------------------------------
         project             user     space used      file size          count
------------------------------------------------------------------------------
            xy11           jjj777         14.3GB         13.2GB         112086
             ...
------------------------------------------------------------------------------

Job Folder $PBS_JOBFS

All Gadi jobs by default have 100MB of storage space allocated on the hosting compute node(s). The path to the storage space is set in the environment variable PBS_JOBFS in the job shell. We encourage users to use the folder $PBS_JOBFS in jobs that generate large numbers of small IO operations. It not only boosts the job's performance by saving the time spent in those frequent and small IO operations but also saves a lot of inode usage for your project on shared filesystems like /scratch and /g/data. Note that the folder $PBS_JOBFS is physically deleted upon job completion or failure. It is therefore crucial to copy the data in $PBS_JOBFS back to a directory on the shared filesystem while the job is still running.
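
The copy-back pattern described above could look roughly like the following inside a job script. This is a sketch under stated assumptions, not an NCI-provided recipe: the `mktemp` fallbacks exist only so that the snippet also runs outside a PBS job, and `RESULTS_DIR`, `work` and the `part_*.txt` files are hypothetical names.

```shell
# Inside a job, PBS sets PBS_JOBFS. The fallback is only for trying this off Gadi.
jobfs="${PBS_JOBFS:-$(mktemp -d)}"
# Hypothetical destination on a shared filesystem, e.g. a folder under /scratch.
results_dir="${RESULTS_DIR:-$(mktemp -d)}"

# Do the small, frequent writes on the node-local disk.
mkdir -p "$jobfs/work"
for i in 1 2 3; do
    echo "chunk $i" > "$jobfs/work/part_$i.txt"
done

# Before the job ends, bundle the small files into one archive on the shared
# filesystem. A single archive also uses one inode instead of many.
tar -C "$jobfs" -czf "$results_dir/work.tar.gz" work
echo "saved $results_dir/work.tar.gz"
```

Packing the output into one archive is a design choice of this sketch. Copying the files individually also works, but uses more inodes on the shared filesystem.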

...

If the job runs on multiple compute nodes, this request of 100 GiB of space will be equally distributed among all nodes. If the job requests more than the available disk space on the compute node(s), the submission will fail. Please browse the Gadi Queue Structure and Gadi Queue Limit pages to look up how much local disk is available on each type of compute node.

Tape Filesystem massdata 

NCI operates a tape filesystem called massdata to provide an archive service for projects that need to back up their data. Massdata has two large tape libraries in separate machine rooms in two separate buildings. Projects have their data in their own path massdata/<project> on massdata, but the path is not directly accessible from Gadi: data requests must be launched from Gadi login nodes or from within copyq jobs. Here is a video clip showing the tape robot at work.

...

If you need access to massdata or more space on it, the project Lead CI should contact their allocation scheme manager to apply for it.

Software Applications and Licenses

Gadi has many software applications installed in the directory /apps and uses `Environment Modules` to manage them. To run any of them, load the corresponding module first. If the application requires licenses, join the corresponding software group through my.nci.org.au like you would join other projects.

If the applications or packages you need are not centrally installed on Gadi, please contact help@nci.org.au to discuss whether it is suitable to install the missing ones centrally on Gadi. We do not install all requested software applications but we are happy to assist with the installation inside of your own project folder. 

Applications on /apps 

The command `module avail` prints out the complete list of software applications centrally available on Gadi. To look for a specific application, please run `module avail <app_name>`. For example, `module avail open` prints out all the available versions for applications whose name starts with `open`. 

...

To read more about `module` commands, please read the page Environment Modules.

Access to Software Licenses

Software group membership enables access to a particular licensed application. To join a software group, log in to my.nci.org.au, navigate to ‘Projects and Groups’, and search for the software name.
Once you've identified the corresponding software group, read all the content under the ‘Join’ tab carefully before sending the membership request. It may take a while for the software group lead CI to approve the request. If you need the approval immediately, send an email to help@nci.org.au or submit a ticket at help.nci.org.au. Once the membership is approved, access to the application will be enabled.

...

To reserve enough license seats for your PBS jobs, please add the PBS directive `-lsoftware=<license_name>` in your submission script. To look up the right <license_name> for the license owned by the software group you joined, have a look at the live license usage page. This is where we publish the details about how many of which licenses are used in, and reserved for, which job. Please use it when you need to know the status of licenses your jobs are requesting. 

...


File Transfer to/from Gadi

Gadi has six designated data-mover nodes with the domain name `gadi-dm.nci.org.au`. Please use them when transferring files to and from Gadi.

For example, aaa777 runs the following command line in the local terminal 

Code Block
languagebash
$ scp input.dat aaa777@gadi-dm.nci.org.au:/home/777/aaa777

...

If the transfer is going to take a long time, there is a possibility that it could be interrupted by network instability. For that reason, it is better to start the transfer in a resumable way. For example, the following command line allows user aaa777 to download data in the folder /scratch/a00/aaa777/test_dir on Gadi onto the current directory on their local machine using ‘rsync’.

Code Block
languagebash
$ rsync -avPS aaa777@gadi-dm.nci.org.au:/scratch/a00/aaa777/test_dir ./

If the download is interrupted, run the same command again to resume the download from where it left off.
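
Re-running the command by hand can also be automated with a small retry wrapper on the local machine. The sketch below is only an illustration: `retry` is a hypothetical helper, not an NCI tool, and the attempt count and delay are arbitrary.

```shell
# retry MAX DELAY CMD...: rerun CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Hypothetical helper for this sketch.
retry() {
    max="$1"; delay="$2"; shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Stand-in demo so the sketch runs anywhere: 'true' succeeds on the first try.
retry 3 0 true && echo "transfer finished"
```

For the download above, a call might look like `retry 5 60 rsync -avPS aaa777@gadi-dm.nci.org.au:/scratch/a00/aaa777/test_dir ./`. Each new attempt resumes from where the previous one stopped, in the same way as re-running the command manually.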

...


To transfer files/folders using MobaXterm on Windows, log in to Gadi via MobaXterm, then drag and drop the files/folders between your local computer and Gadi.

Gadi Jobs

To run compute tasks such as simulations, weather models, and sequence assemblies on Gadi, users need to submit them as ‘jobs’ to ‘queues’. Job submission enables users to specify the queue, duration and resource needs of their jobs. Gadi uses PBSPro to schedule all submitted jobs and keeps nodes that have different hardware in different queues. See details about the hardware available in the different queues on the Gadi Queue Structure page. Users submit jobs to a specific queue to run jobs on the corresponding type of node.

...

We recommend users check these two log files before proceeding with the post-processing of any output/result from the corresponding job.

Job Submission

To submit a job defined in a submission script, called for example ‘job.sh’, run

Code Block
languagebash
$ qsub job.sh

on the login node. Once the submission is accepted, the above command returns the jobID, for example, 12345678.gadi-pbs, which can be used for tracking its status.

...

The submission script consists of three sections. The first line specifies which shell to use. The second section includes all the PBS directives that define the resources the job needs and the last section contains all the command lines that you would use in an interactive shell to run the compute task.

Submission Script Example 

Here is an example job submission script to run the python script ‘main.py’ which is assumed to be located inside the same folder where you run ‘qsub job.sh’.

...

The second section, with all the lines that start with ‘#PBS’, specifies how much of each resource the job will need. It requests an allocation of 48 CPU cores, 190 GiB memory, and 200 GiB local disk on a compute node from the normal queue for its exclusive access for 2 hours. It also requests the system to mount the a00 project folders on both the /scratch and /g/data filesystems inside the job, and to enter the working directory once the job is started. Please see more PBS directives explained in here. Note that a ‘-lstorage’ directive must be included if you need access to /g/data; otherwise these folders will not be accessible when the job is running.

To find the right queue for your jobs, please browse the Gadi Queue Structure and Gadi Queue Limit pages. 

Info

Users are encouraged to request resources to allow the task(s) to run around the ‘sweet spot’ where the job benefits from parallelism and achieves shorter execution time while utilising at least 80% of the requested compute capacity.

While searching for the sweet spot, please be aware that it is common to see components in a task that run only on a single core and cannot be parallelised. These sequential parts drastically limit the parallel performance. For example, a workload that is 1% sequential is limited to an overall CPU utilisation rate of less than 70% when running in parallel on 48 cores. Moreover, parallelism adds overhead which in general scales up with increasing core count and, beyond the ‘sweet spot’, results in time wasted on unnecessary task coordination.
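
The utilisation limit quoted above follows from Amdahl's law: with a sequential fraction s on N cores, the speedup is 1 / (s + (1 - s) / N) and the utilisation is that speedup divided by N. The sketch below evaluates this with awk. The function name `parallel_efficiency` is illustrative, and the model ignores the coordination overhead mentioned above, so real jobs will usually do somewhat worse.

```shell
# parallel_efficiency S N: upper bound on CPU utilisation (in percent) for a
# workload with sequential fraction S running on N cores, per Amdahl's law.
parallel_efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN {
        speedup = 1 / (s + (1 - s) / n)
        printf "%.1f\n", 100 * speedup / n
    }'
}

parallel_efficiency 0.01 48   # prints 68.0
```

Trying a few core counts with your own estimate of the sequential fraction gives a first guess at where the sweet spot might lie, before confirming it with real test runs.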

...

Info

Users are encouraged to run their tasks, if possible, in bigger jobs to take advantage of the massive potential parallelism that can be achieved on Gadi. However, depending on the application, it may not be possible for the job to run on more than a single core/node. For applications that do run on multiple cores/nodes, the commands and scripts/binaries called in the third section determine whether the particular job can utilise the requested amount of resources or not. Users need to edit the script/input files which define the compute task to allow it, for example, to run on multiple cores/nodes. It may take several iterations to find the ideal details for sections two and three of the submission script when exploring around the job's sweet spot.

Interactive Jobs 

We recommend users try their workflow on Gadi in an interactive job before submitting the tasks using the prepared submission script. This is because, when the interactive job starts, the user runs commands in an interactive shell on the head compute node, which allows one to quickly test out possible solutions. For example, this can be used to debug large jobs that run many parallel processes on many nodes or install applications that have to be built when GPUs are available.

To submit an interactive job, run ‘qsub -I’ on the login node. For example, to start an interactive job on Gadi’s gpuvolta queue through project a00 with a request of 48 CPU cores, 4 GPUs, 380 GiB memory, and 200 GiB local disk for 5 minutes on one GPU compute node, run

Code Block
languagebash
$ qsub -I -qgpuvolta  -Pa00 -lwalltime=00:05:00,ncpus=48,ngpus=4,mem=380GB,jobfs=200GB,storage=gdata/a00,wd

Once the job starts, it mounts the folder /g/data/a00 to the job and enters the job's working directory in which the job was submitted.
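
To make the structure of that command line easier to see, the sketch below assembles the same string from its parts: the queue, the project, and a comma-separated resource list after a single `-l`. The helper `interactive_qsub_cmd` is hypothetical and only prints the command. It does not submit anything.

```shell
# interactive_qsub_cmd QUEUE PROJECT RESOURCE...: print a 'qsub -I' command line
# with all resources joined by commas after one -l. Hypothetical helper.
interactive_qsub_cmd() {
    queue="$1"; project="$2"; shift 2
    resources=""
    for item in "$@"; do
        # Append with a comma separator, except before the first item.
        resources="${resources:+$resources,}$item"
    done
    printf 'qsub -I -q%s -P%s -l%s\n' "$queue" "$project" "$resources"
}

interactive_qsub_cmd gpuvolta a00 walltime=00:05:00 ncpus=48 ngpus=4 \
    mem=380GB jobfs=200GB storage=gdata/a00 wd
```

Reviewing the printed line before pasting it on a login node is a simple way to check that every resource, including the storage request, is present.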

Example

Here is a minimal example of an interactive job. You can tell the interactive job has started when the prompt changes from something like [aaa777@gadi-login-03 ~] to [aaa777@gadi-gpu-v100-0079 ~]. The job shell doesn't have any modules loaded by default. If you need to repeatedly load modules inside interactive jobs, please edit the file ~/.bashrc to load them automatically. Once you're done with the interactive job, run ‘exit’ to terminate the job.

Code Block
[aaa777@gadi-login-03 ~]$ qsub -I -lwalltime=00:05:00,ncpus=48,ngpus=4,mem=380GB,jobfs=200GB,wd -qgpuvolta
qsub: waiting for job 11029947.gadi-pbs to start
qsub: job 11029947.gadi-pbs ready

[aaa777@gadi-gpu-v100-0079 ~]$ module list
No Modulefiles Currently Loaded.
[aaa777@gadi-gpu-v100-0079 ~]$ exit
logout

qsub: job 11029947.gadi-pbs completed

Copyq Jobs

To transfer bulk data (say, more than 500 GiB), we recommend using a job submitted to the queue ‘copyq’, because long data transfer processes running on a login node are terminated when they reach the 30-minute cumulative CPU time limit. We also recommend running long software installations that require an internet connection inside copyq jobs.

To submit a copyq job, define the tasks in a submission script, called for example job.sh, specify the queue as copyq, and run

Code Block
languagebash
$ qsub job.sh

on a login node.

Example

An example copyq job submission script job.sh could be

...

To compile code inside a copyq job, it may be necessary to load modules such as intel-compiler and request more jobfs to allow enough disk space to host data written to $TMPDIR.  

...

Job Monitoring

Once a job submission is accepted, its jobID is shown in the return message and can be used to monitor the job’s status. Users are encouraged to keep monitoring their own jobs at every stage of their lifespan on Gadi. Note, however, that excessive polling of the PBS servers for monitoring purposes will be treated as an attack. One monitoring query every 10 minutes is more than enough.

Note

The PBS server which manages job submissions and scheduling answers all submissions, queries, and requests related to jobs. To ensure its quick response to essential requests, don't launch frequent job monitoring queries.
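
One way to stay within that guideline in scripts is to wrap monitoring commands in a throttle that refuses to run them too often. The following is a minimal sketch, assuming a timestamp file is an acceptable way to remember the last query. `throttled` and the stamp file are hypothetical and not part of PBS or Gadi.

```shell
# throttled STAMP_FILE MIN_GAP CMD...: run CMD only if at least MIN_GAP seconds
# have passed since the last run recorded in STAMP_FILE. Hypothetical helper.
throttled() {
    stamp="$1"; min_gap="$2"; shift 2
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null)
    last="${last:-0}"
    if [ $((now - last)) -lt "$min_gap" ]; then
        echo "skipped: last query ran $((now - last))s ago (minimum gap ${min_gap}s)" >&2
        return 1
    fi
    echo "$now" > "$stamp"
    "$@"
}

# Demo with 'echo' standing in for a real monitoring query.
stamp_file=$(mktemp)
throttled "$stamp_file" 600 echo "query sent"
throttled "$stamp_file" 600 echo "query sent" || echo "second call was throttled"
```

On Gadi, a monitoring command such as qstat would take the place of `echo`, with 600 seconds matching the 10-minute guideline above.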

Queue Status

To look up the status of a job in the queue, run the command ‘qstat’. For example, to look up the job 12345678 in the queue, run

Code Block
languagebash
$ qstat -swx 12345678

If the job is running, you would see something like

...

It shows that the job was submitted by the user aaa777 to the normal queue, requested 48 cores and 190 GiB memory for 2 hours and has been running (S = "R") for 35 minutes and 21 seconds. The line at the bottom says the job started at 10:38 am on 4 Sept on the compute node gadi-cpu-clx-2697 where it has exclusive access to 48 cores, 190 GiB memory and 200 GiB jobfs local disk.

CPU and Memory Utilisation

We encourage users to keep monitoring their jobs' utilisation rate at every stage because, if a job runs into errors and fails to exit, this shows up clearly as a drop in its utilisation rate.

...

To see how much CPU and memory the job actually has been using, run the command ‘nqstat_anu’. For example, to look up the info of the job 12345678, run

Code Block
languagebash
$ nqstat_anu 12345678

The output shows the CPU utilisation rate in the column ‘%CPU’ and the peak memory usage in the columns ‘RSS’ and ‘mem’. The example output below says the job used only 23% of the compute capacity of the requested 48 cores in the elapsed 36 minutes and 47 seconds. Depending on the tasks running inside the job, this percentage may increase as the job proceeds. We normally recommend users aim for an overall CPU utilisation rate of at least 80%.

Code Block
                                %CPU  WallTime  Time Lim     RSS    mem  memlim  cpus
 12345678 R aaa777  a00 job.sh  23    00:36:47   2:00:00  5093MB 5093MB 190.0GB    48
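
The 80% check can be scripted against output in this format. The sketch below runs on the sample line above rather than on live output. It assumes the column layout shown (job state in the second field, %CPU in the sixth, cpus in the twelfth) and a job name without spaces. `flag_low_cpu` is a hypothetical helper.

```shell
# Sample line copied from the example output above.
sample=' 12345678 R aaa777  a00 job.sh  23    00:36:47   2:00:00  5093MB 5093MB 190.0GB    48'

# flag_low_cpu MIN: read status lines on stdin and report running jobs whose
# %CPU (field 6) is below MIN. Field positions are assumed from the sample.
flag_low_cpu() {
    awk -v min="$1" '$2 == "R" && $6 + 0 < min {
        printf "%s: %s%% CPU across %s cpus is below the %s%% target\n", $1, $6, $12, min
    }'
}

printf '%s\n' "$sample" | flag_low_cpu 80
```

For the sample line this prints `12345678: 23% CPU across 48 cpus is below the 80% target`.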

Process Status and Files in Folder $PBS_JOBFS 

To monitor the status of processes inside a job, run the command ‘qps’. For example, to take a snapshot of the process status of the job 12345678, run

...