
Gadi Resources

Within Gadi there are 10 login nodes, 6 data-mover nodes, over 4000 compute nodes, and NCI's massdata tape storage.

Below is a closer look at each of these: what they are, what they do, and what they connect to.

The 10 login nodes are where you land when logging into Gadi. These act as remote hosts to interact with Gadi.

The login node you are placed on is chosen by a round-robin process, which allocates you a new node with each login.

You can use login nodes to test small pieces of your code or to gather data on your job's efficiency. However, no full jobs or anything with high compute demands should be run on a login node. These nodes are a shared resource, and running jobs on them will impact other users.

Any process on a login node that runs for longer than 30 minutes, or that uses more than 4 GiB of memory, will be terminated.

Data-mover nodes are used for exactly that: moving data. You can use these nodes to transfer files to and from Gadi at high speed, following the steps outlined here.

Compute nodes are the workhorses of Gadi. You can think of them as thousands of high-powered PCs, all designed to work together.

You can find a detailed breakdown of their specifications and types here.

If you look at the chart to the right, you will notice that the compute nodes don't have access to the external internet. If any task within a submitted job needs internet access at any stage, it should be packaged into a separate job that runs on a copyq node, which does have internet access. We will go over how to do this in the job submission guide.

/apps is where all of Gadi's software applications are stored. To see a list of the software available to your project, run the command `module avail`. Please take a look at the software applications guide for more information.

/home is your personal directory. It has a 10 GiB storage limit that cannot be expanded. /home is a great place to host any scripts that you want to keep to yourself.

All data on /home is backed up and accidentally deleted files can be retrieved via $HOME/.snapshot. Please see the data storage FAQ for more information.

/scratch is your playground, where all of your high-speed computing takes place. There are limits on /scratch, including a cap on the number of files and the automatic removal of files that have not been accessed for a long time. Please see the table below for all resource capacities.

massdata is NCI's tape storage system. Not every project has access to massdata, only those with a storage allocation.

/g/data, or global data, is a storage area intended for long-term storage of research data. As you can see in the diagram, /g/data is also available outside the Gadi system, meaning that users from AARNet can access it without needing to use Gadi. However, /scratch and the rest of the directories will not be available to them.

Some projects are run outside of the NCI system, meaning that they don't need to be allocated compute time. As you can see on the right, these projects still have access to data services, VDI/cloud, NFS servers, and /g/data.




Navigating through Gadi 

Navigating through the directories on login nodes is simple if you remember the path formats outlined below. Keep them in mind as you use Gadi and you will have no problems finding the directories that you need.

Your home directory will always be located at /home/institution/username

Scratch will always be at /scratch/project/username

/g/data follows the same format as /scratch: /g/data/project/username

All software applications are found at /apps/software/version
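
As a rough illustration of the formats above, the sketch below assembles the three per-user paths from their components. The institution, project, and username values are made-up placeholders, not real Gadi accounts; substitute your own.

```shell
# Placeholder values for illustration only (assumptions, not real accounts).
institution="abc"
project="ab12"
username="jd1234"

# Build each path from the formats listed above.
home_dir="/home/${institution}/${username}"
scratch_dir="/scratch/${project}/${username}"
gdata_dir="/g/data/${project}/${username}"

echo "home:    ${home_dir}"
echo "scratch: ${scratch_dir}"
echo "g/data:  ${gdata_dir}"
```

Once you know your own values, `cd` to any of these paths from a login node to move between storage areas.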


Resource capacity and limitations 


$HOME

Owner> User

Accessible from> PBS jobs and login nodes

Size limit> 10 GiB with zero extensions available

Allocation valid until> User's account is deactivated

Resource specific attributes:


  • Backups located in $HOME/.snapshot

/scratch

Owner> Project

Accessible from> PBS Jobs† and login nodes

Size limit> 1 TiB by default with more available on request 

Allocation valid until> Project completion or job demand changes

Resource specific attributes:


  • Designed for jobs with large data IO
  • No backups
  • Files not accessed for 100 days will be moved from /scratch and placed into a quarantined location
  • Any files quarantined for longer than 14 days will be automatically deleted
  • Number of files limit applies to /scratch

† Needs to be explicitly mounted using the PBS directive `-lstorage`. Please see our PBS directives listing for more information.

/g/data

Owner> Project

Accessible from> PBS Jobs† and login nodes

Size limit> Amount is set by the scheme manager

Allocation valid until> Project completion 

Resource specific attributes:


  • Designed to store persistent data
  • No backups
  • Number of files limit applies
  • /g/data is also accessible from other NCI services, e.g. the Nirin cloud

† Needs to be explicitly mounted using the PBS directive `-lstorage`. Please see the jobs submission page for more information.

massdata

Owner> Project

Accessible from> PBS copyq Jobs and login nodes

Size limit> Amount set by scheme manager

Allocation valid until> Project completion

Resource specific attributes:


  • Backup files are stored in two different buildings
  • Tape-based archival data storage

Read more about massdata here.

$PBS_JOBFS

Owner> User

Accessible from> PBS Jobs*

Size limit> Disk space available on the SSD of the job's hosting node(s), 100 MB by default

Allocation valid until> Job termination 

Resource specific attributes:


  • No backups
  • Designed for jobs with frequent and small IO

* Job owner can access the folder through commands like `qls` and `qcp` on the login node during the job.

Read more about $PBS_JOBFS here.

I/O Intensive

Owner> User

Accessible from> PBS Jobs

Size limit> All-flash NetApp EF600 storage, volumes available on request

Allocation valid until> Job termination 

Resource specific attributes:


  • No backups
  • Designed for jobs with frequent and small IO
  • Does not currently work in normalsr and expresssr queues

Please refer to our I/O Intensive page for more information about this system.

Software applications

Owner> NCI

Accessible from> PBS jobs and login nodes

Size limit> N.A

Allocation valid until> N.A

Resource specific attributes:


  • Built from source on Gadi where possible
  • More applications can be added according to demand, dependencies and scalability. You can request that an application be added to the Gadi /apps repository.

Licences

Owner> Software group owner

Accessible from> PBS jobs and login nodes

Size limit> Available seats on the licensing server

Allocation valid until> Licence expiry date

Resource specific attributes:


  • Access is controlled by the software group owner. The module file and the PBS `-lsoftware` directive are used to control access to the licence
  • NCI owned licences are for academic use only
  • Projects, institutions and universities can bring their own licences 
  • See our live licence status page for more information

There is also a quota called iQuota that is applied to /scratch and /g/data. It limits the maximum number of files and folders allowed within a project. You can see your project's iQuota by running the command

$ lquota

Please try to keep the number of files as low as possible, as a high file count can affect the I/O performance of your job. Gadi is efficient at handling large-scale parallel I/O, but performance becomes significantly worse when doing frequent small operations.

A common culprit for creating a large number of files is the Python packaging system conda. Please use pip and the available modules that are already tuned for Gadi to keep your file and folder count as low as possible.
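
If you want to find out which of your own directories contributes most to the file count, one possible approach is to count the entries beneath it with standard tools. The helper name `count_entries` is hypothetical and not an NCI utility; the demonstration runs on a throwaway directory so that it can be tried anywhere.

```shell
# count_entries DIR: print how many files and folders exist under DIR.
# (Hypothetical helper built on the standard find and wc utilities.)
count_entries() {
    find "$1" -mindepth 1 | wc -l | tr -d ' '
}

# Demonstration on a temporary directory containing 2 files and 1 folder.
demo=$(mktemp -d)
mkdir "${demo}/logs"
touch "${demo}/logs/run.log" "${demo}/notes.txt"
echo "entries: $(count_entries "${demo}")"
rm -rf "${demo}"
```

Walking a very large directory tree is itself a burst of small operations, so it is best to point this at one suspect subdirectory rather than an entire project folder.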

$PBS_JOBFS

Any job submitted to Gadi is allocated a default 100 MB of storage space on the hosting node's SSD. NCI encourages users to use the folder $PBS_JOBFS in jobs that generate a large number of small I/O operations. This boosts your job's performance by saving the time that would otherwise be spent on those small operations, and frees up space for your project in /scratch and /g/data.

You can request more space, including for jobs that span multiple compute nodes, by adding the directive -l jobfs to your job script. For example,

#PBS -l jobfs=100GB

would request 100 GiB on the nodes. If this job were to run on multiple nodes, the 100 GiB would be distributed equally among them. Jobs that request more disk space than is available on the nodes will fail. Please check the queue structure and queue limits pages for information on how much local disk is available.
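
To make the equal split concrete, the sketch below works out each node's share of a jobfs request. The helper name `jobfs_per_node` is hypothetical, and the four-node job is an assumed example. It uses integer arithmetic, so it is only exact when the request divides evenly across the nodes.

```shell
# jobfs_per_node TOTAL_GIB NODES: print each node's share in GiB.
# (Hypothetical helper; integer division, so pick values that divide evenly.)
jobfs_per_node() {
    local total_gib=$1 nodes=$2
    echo $(( total_gib / nodes ))
}

# An assumed 4-node job requesting jobfs=100GB gets 25 GiB on each node.
echo "per node: $(jobfs_per_node 100 4) GiB"
```

It is this per-node share that needs to fit within the local disk available on each node in your chosen queue.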

The limit on $PBS_JOBFS is 400 GiB.

Note that the folder $PBS_JOBFS is physically deleted upon job completion or failure. It is therefore crucial to copy any data you need from $PBS_JOBFS back to a directory on the shared filesystem while the job is still running.
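
A minimal sketch of that pattern is shown below: do the small, frequent writes in $PBS_JOBFS, then bundle them into a single archive on the shared filesystem before the script exits. The project code `ab12`, the `RESULTS_DIR` variable, and the file names are all assumptions made for illustration, and the form of the storage directive should be checked against the PBS directives listing. The fallbacks to temporary directories exist only so that the sketch can be tried outside a PBS job.

```shell
#!/bin/bash
#PBS -l jobfs=1GB
#PBS -l storage=scratch/ab12    # hypothetical project code; form assumed

# Inside a job, PBS sets $PBS_JOBFS. Elsewhere, fall back to a temporary
# directory so the sketch still runs.
workdir="${PBS_JOBFS:-$(mktemp -d)}"

# Hypothetical destination on the shared filesystem, e.g. a folder under
# /scratch. Falls back to a temporary directory if RESULTS_DIR is unset.
results_dir="${RESULTS_DIR:-$(mktemp -d)}"

# Do the small, frequent I/O on the node-local disk.
for i in 1 2 3; do
    echo "result ${i}" > "${workdir}/part_${i}.txt"
done

# Bundle the small files and write one archive back before the job ends.
tar -czf "${results_dir}/parts.tar.gz" -C "${workdir}" .
echo "saved ${results_dir}/parts.tar.gz"
```

Writing a single archive rather than many small files also helps keep your project's file count down on /scratch and /g/data.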

Tape Filesystem - massdata

NCI operates a tape filesystem called massdata to provide a reliable archive for projects to back up their data. The data is held on magnetic tape, which is stored in separate machine rooms in two separate buildings. The tapes are accessed and transported by a small robot that works tirelessly for NCI.

While projects do have their own path on massdata, i.e. massdata/<projectcode>, there is no direct access to it via Gadi. Data requests to the tape library must be launched from the login nodes or via a copyq job. You can read our job submission page to learn how to submit copyq jobs.

NCI provides the `mdss` utility for users to manage the migration and retrieval of files between multiple levels of a storage hierarchy, from online disk cache to offline tape archive. It connects to massdata and launches the corresponding requests. For example, `mdss get` first launches requests to stage the remote files from the massdata repository into the disk cache. Once the data is online, it transfers the data back to your local directory, for example a project folder on /scratch or /g/data.

Below are some simple commands that can help while navigating massdata. These commands can be run from the login nodes and begin with the prefix `mdss`, for example

$ mdss get

You can read the manual for mdss by running the command

$ man mdss

which describes several more ways to interact with the storage library.

  • put: copy files to the MDSS
  • get: copy files from the MDSS
  • mkdir/rmdir: create/delete directories in your massdata directory
  • ls: list directories
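
Put together, a simple archive-and-retrieve round trip might look like the sketch below. The archive name and remote directory are made up, and the argument order shown (local path then remote path for `put`, remote path then local path for `get`) is an assumption that should be confirmed with `man mdss`. Because `mdss` only exists on Gadi, the sketch defines a stand-in that just prints each command when the real utility is not found.

```shell
# Off Gadi there is no mdss command, so define a stand-in that only prints
# what would be run. On a Gadi login node the real utility is used instead.
if ! command -v mdss >/dev/null 2>&1; then
    mdss() { echo "[dry run] mdss $*"; }
fi

archive="results.tar"    # hypothetical local archive file
remote_dir="run01"       # hypothetical directory in your massdata area

mdss mkdir "${remote_dir}"               # create the target directory
mdss put "${archive}" "${remote_dir}"    # copy the archive to massdata
mdss ls "${remote_dir}"                  # check that it arrived
mdss get "${remote_dir}/${archive}" .    # later: stage from tape and copy back
```

Archiving one tar file rather than many small files generally suits tape storage better, since each retrieval has to be staged from tape into the disk cache first.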


Authors: Yue Sun, Andrew Wellington, Andrey Bliznyuk, Ben Menadue, Mohsin Ali, Andrew Johnston