Using Gadi with Consideration

When using a system like Gadi, you need to be aware that there are potentially hundreds of other users accessing the system at the same time as you. For Gadi to remain efficient and usable, everyone needs to be courteous and use the system with consideration for others.

This basic guide of do's and don'ts should help ensure that every user has access to the systems and speeds they need to complete their research.

Do

Large data transfers, roughly 500 GiB and above, should be put through the copyq queues, to prevent overworking the login nodes and impacting other users.

Scratch is intended to be the main working space for your compute jobs. However, once your jobs have run, please move your data to a different directory. Files left in scratch for 100 days will be quarantined and may be permanently deleted.
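To see which files are approaching the 100-day limit, a short script can list files by age. The following is a minimal sketch, not an official tool: the function name `files_older_than` is hypothetical, and it assumes age is measured from the last modification time, which this guide does not specify (swap `st_mtime` for `st_atime` if access time is what matters).

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def files_older_than(root, days, now=None):
    """Yield (path, age_in_days) for files under `root` whose last
    modification is more than `days` days in the past.

    Assumption: age is judged by modification time (st_mtime).
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.lstat().st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            age_days = (now - mtime) / SECONDS_PER_DAY
            if age_days > days:
                yield path, age_days
```

For example, `for path, age in files_older_than("/scratch/ab12/abc123", 80): print(f"{age:6.1f} days  {path}")` would list files older than 80 days, where the project code `ab12` and username `abc123` are placeholders. Run a scan like this inside a job rather than on a login node if the directory tree is large.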

Run tests on your submission scripts. Aim to run your tasks at the ‘sweet spot’: the core count where the job still benefits from parallelism and achieves a shorter execution time, while utilising at least 80% of the requested compute capacity.

While searching for the sweet spot, please be aware that it is common to see components in a task that run only on a single core and cannot be parallelised. These sequential parts drastically limit the parallel performance.

For example, a workload that is just 1% sequential is limited to an overall CPU utilisation rate of less than 70% when run in parallel on 48 cores. Moreover, parallelism adds overhead that generally grows with the core count and, beyond the ‘sweet spot’, wastes time on unnecessary task coordination.
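The 48-core figure above is consistent with Amdahl's law, where the ideal speedup on n cores for a sequential fraction s is 1 / (s + (1 - s) / n), and utilisation is that speedup divided by n. The sketch below assumes this model and ignores coordination overhead, so real jobs will do somewhat worse; the function names are illustrative only.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law, ignoring coordination overhead."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def parallel_efficiency(serial_fraction: float, cores: int) -> float:
    """Fraction of the requested cores doing useful work (speedup / cores)."""
    return amdahl_speedup(serial_fraction, cores) / cores


if __name__ == "__main__":
    serial = 0.01  # 1% of the work cannot be parallelised
    for cores in (1, 2, 4, 12, 24, 48, 96):
        speedup = amdahl_speedup(serial, cores)
        eff = parallel_efficiency(serial, cores)
        note = "" if eff >= 0.80 else "  <- below the 80% target"
        print(f"{cores:3d} cores: speedup {speedup:5.1f}x, "
              f"utilisation {eff:5.1%}{note}")
```

With a 1% sequential fraction, this model puts 24 cores just above the 80% target and 48 cores at roughly 68%. Measuring the actual sequential fraction of your own job is the harder part; timing short test runs at a few different core counts gives a practical estimate.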

A way to test this would be to limit the wall time of your job to a very low value while in the testing phase. This allows you to test without changing parameters that would affect the job's final results.

If your job needs access to the internet at any stage of its life, it will need to be submitted to the copyq queues, as these are the only queues with external internet access.

Don't

The login nodes aren't intended to be used to run jobs. Doing so will impact the efficiency of these nodes and take resources away from other users.

Don't request resources that you won't need; doing so only holds up your job and other users' jobs. The PBS scheduler will find time for 2 CPUs faster than for 4 CPUs, so think carefully about how many resources you are requesting.

Repeatedly checking the status of your job will be detected as a malicious attack. Checking now and then is fine, but please limit the number of times you query the job, especially in quick succession. Please wait at least 10 minutes before querying your job again.
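If you want a script to wait for a job rather than checking by hand, build the 10-minute gap into it. The following is a minimal sketch under stated assumptions: `wait_for_job` and `job_is_listed` are hypothetical helper names, and the sketch assumes that `qstat <job_id>` exits with a non-zero status once the job is no longer in the queue. Check how your scheduler behaves before relying on it.

```python
import subprocess
import time

# The guide asks for at least 10 minutes between status queries.
MIN_INTERVAL_SECONDS = 10 * 60


def job_is_listed(job_id: str) -> bool:
    """Return True while `qstat <job_id>` exits with status 0.

    Assumption: a non-zero exit status means the job has left the queue.
    """
    result = subprocess.run(["qstat", job_id], capture_output=True, text=True)
    return result.returncode == 0


def wait_for_job(job_id, check=job_is_listed,
                 interval=MIN_INTERVAL_SECONDS, sleep=time.sleep):
    """Poll until `check(job_id)` is False and return the number of queries.

    Any interval shorter than 10 minutes is raised to 10 minutes.
    """
    interval = max(interval, MIN_INTERVAL_SECONDS)
    queries = 1
    while check(job_id):
        sleep(interval)
        queries += 1
    return queries
```

The `check` and `sleep` parameters exist so the loop can be tried out with stand-in functions, without contacting the scheduler at all.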

Authors: Andrew Johnston
