Last updated  (3:00pm AEST)


This page provides an overview of what users can expect as they make the transition from Raijin to Gadi in 2019 Q4. There is a lot of information here, so please take the time to read this page carefully. NCI will regularly update this page and provide more detailed information here as it becomes available.

If you have questions or special concerns about how your work may be impacted by the transition from Raijin to Gadi, please let us know as soon as possible by contacting NCI user support at help@nci.org.au, and we will endeavour to help you.


Gadi Timeline

Updated  12:00pm AEST

Date(s)   Events

NOW       NCI data centre preparation and Gadi installation in progress. Users preparing for Gadi.

          Gadi and Raijin available to users. Gadi allocations will match Raijin Q4 allocations. Jobs can be run (independently) on both systems.

          Raijin job submission ends.

          Raijin nodes go offline. Sandy Bridge nodes decommissioned. Broadwell, Skylake and K80 GPU nodes go offline for migration to Gadi; the expected downtime for these nodes is three days. Raijin /short file system available on Gadi login nodes (alternative path) for user transfers.

          Broadwell, Skylake and K80 GPU nodes available on Gadi.

          Gadi resource allocations for 2020 Q1 installed.

          Gadi operational at full production specification.

          Raijin /short file system decommissioned.

Please note that this timeline will be updated as often as necessary to reflect progress in data centre preparations, installation activities, and dependencies.

It is important to note that the Gadi pre-production period is expected to be brief - two weeks - because of complex dependencies in delivery and service schedules. Gadi cannot be brought to full production before Raijin is decommissioned because of power, cooling and data centre requirements.

User Environment

  1. The user's default shell and default project will be controlled by the file gadi-login.conf in each user's $HOME/.config directory (an illustrative example is given after this list).
  2. Gadi /home quotas are applied on a per-user basis, as on Raijin.
  3. Gadi /home quotas will be 10 GB.
  4. Gadi login and compute nodes will run the CentOS 8 operating system.
  5. NCI is currently considering the best mechanism to deliver files from Raijin /home directories to Gadi. The solution is likely to be either (a) user can copy files from a read-only archive copy of Raijin /home, or (b) a bulk copy of user Raijin /home files to Gadi /home directories. More information will be available on this as soon as possible.
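
As a rough illustration of item 1, a gadi-login.conf might contain the two lines below. The exact keywords and layout are assumptions until NCI publishes the final format, and az99 is a hypothetical project code; the first line sets the default project used for job charging, the second the default login shell.

    PROJECT az99
    SHELL /bin/bash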

Processor Comparison: Raijin vs Gadi

Raijin                                    Gadi
Intel Xeon E5-2670 (Sandy Bridge)         Intel Xeon Platinum 8274 (Cascade Lake)
Two physical processors per node          Two physical processors per node
2.6 GHz clock speed                       3.2 GHz clock speed
16 cores per node                         48 cores per node
332 GFLOPs per node (theoretical peak)    4915 GFLOPs per node (theoretical peak)

Resources

  1. The computing charge rate on Gadi is 2.0 service units (SU) per cpu-hour. This rate broadly reflects Gadi's performance relative to Raijin.
  2. All NCI allocations for 2020, including NCMAS, will be on Gadi only.
  3. Compute allocations on Gadi are managed by stakeholder scheme managers, as on Raijin. Check this page (link) to identify your scheme manager. 
  4. Compute allocations on Gadi will apply to projects, as on Raijin.
  5. In 2019 Q4, all active projects will be given Gadi compute quotas which match (pro-rata) their 2019 Q3 or Q4 Raijin allocations. 
  6. During the Gadi pre-production period, compute (job) accounting on Raijin and Gadi will be independent.

File Systems - /home

  1. Gadi /home is a new, independent file system.
  2. The quota on Gadi home directories will be 10 GB, as compared to a 2 GB quota on Raijin. Home directories are intended for irreproducible files, e.g. source code and configuration files, and users are expected to utilise /scratch, /g/data and JOBFS file systems for working data.
  3. Users are strongly encouraged to copy only essential files from their Raijin home directories to Gadi.
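
As a hedged sketch of how essential files could be copied across, the command below could be run from a Raijin login node once Gadi logins are open. It assumes the standard rsync/ssh tools and the gadi.nci.org.au login hostname; the source paths are placeholders for your own essential files.

    # Copy only essential source code and configuration from Raijin $HOME to Gadi $HOME
    rsync -av --progress ~/src ~/.bashrc gadi.nci.org.au:~/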

File Systems - /scratch

  1. The temporary file system for Gadi users is /scratch. Note that the path '/short', as used on Raijin, will not exist on Gadi. 
  2. The contents of Raijin /short will not be migrated to Gadi /scratch. 
  3. Raijin /short will be available on Gadi via a temporary, read-only path on login and data mover nodes only until . Users are strongly encouraged to copy only essential files to Gadi /scratch.
  4. Data transfer rates from Raijin /short to Gadi /scratch are expected to be approximately 1 TB per hour. Please plan your transfers accordingly, and do not wait until the last minute (a sketch of a transfer command is given after this list).
  5. Gadi /scratch will be subject to an automated file purging policy: files will be removed 90 days after the time of last modification. In the interest of fairness and transparency, exceptions to this policy are not permitted.
  6. NCI is developing tools and notifications to help users track the status of their files in /scratch. 
  7. Any attempts to circumvent the 90-day scratch purge policy by using the touch command or other strategies will result in account deactivation.
  8. The Gadi /scratch purging policy is expected to be activated in 2020 Q2, at/near . Users will have ~3 months to clean up and organise files in /scratch directories before activation of the purging regime.
  9. All compute projects will be provided with a default /g/data directory for storage of persistent data. The default quota for /g/data project directories remains to be finalised. Note that allocations for projects which already have /g/data access will not change in 2020 unless such changes are defined in a contract or agreement.
  10. Plan to modify your workflow(s) to place temporary files on /scratch, and persistent files on /g/data.
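
To illustrate items 3 to 7 above, the sketch below shows a transfer from the temporary read-only copy of Raijin /short and a check for files that would fall under the 90-day purge policy. The /raijin/short mount point is purely a placeholder (the actual alternative path has not yet been announced) and ab12 is a hypothetical project code.

    # Copy essential files from the read-only Raijin /short path (placeholder) to Gadi /scratch
    rsync -av /raijin/short/ab12/$USER/essential_data/ /scratch/ab12/$USER/essential_data/

    # List files not modified for more than 90 days - candidates for removal once purging is active
    find /scratch/ab12/$USER -type f -mtime +90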

File Systems - /g/data

  1. The /g/data file systems will be available on Gadi and Raijin during the Gadi pre-production phase. Infrastructure and deployment work may temporarily impact file system performance during pre-production.
  2. Current data services and access to data collections are not expected to change, although brief interruptions are possible as Gadi is brought into production.
  3. Projects should exercise caution in running workflows on Gadi and Raijin simultaneously during the pre-production period. Jobs which are in flight at the same time on Raijin and Gadi, and which access files on the /g/data file systems, for example, could fail due to file contention. 

Jobs

  1. Gadi Cascade Lake nodes have 48 CPUs and 192 GB memory.
  2. Users will need to adjust PBS job scripts and workflows from Raijin to suit Gadi: 48 CPUs/node (Cascade Lake, significantly faster than Raijin), 192 GB RAM/node, 400 GB PBS_JOBFS/node, and so on.
  3. Gadi will run PBS Pro version 19.
  4. Job scheduling will be determined at the project level, as was the case on Raijin. It is not possible to schedule jobs on a per-user basis on Gadi.
  5. Gadi queues... (link to queue page)
  6. Gadi resource limits... (link to queue limits page). 
  7. The PBS_JOBFS size on Gadi will be limited to 400 GB per node. Jobs that require more than 400 GB/node are expected to use /scratch.
  8. Jobs on Gadi must explicitly declare, via PBS directives, which file systems are to be accessed during the job. For example, a job which will read or write data in the /scratch/<project> and /g/data/<project> directories must include the directive "-lstorage=scratch/<project>+gdata/<project>". A job that attempts to access a /g/data or /scratch directory without this directive will fail at run time. (A complete example job script is given after this list.)
  9. Jobs which use less than a full node (Cascade Lake = 48 cpus) will be charged according to the fractional utilisation of node resources, that is, by number of CPUs or amount of node memory requested, whichever is larger. Note that charging on Raijin was based on cpu-hours only, without consideration of memory used. 
  10. Projects which use memory-intensive, low-compute workflows may consume SUs more rapidly than expected on Gadi. 
  11. Project job resource exemptions (for example, wall time extensions) established on Raijin will not be carried across to Gadi. Most user jobs on Gadi will require less wall time than on Raijin, and job resource exemptions on Gadi will need to be compellingly justified.
  12. Broadwell, Skylake and K80 nodes are expected to be offline for three working days in November when they are migrated to Gadi. Users who rely on these nodes should prepare for approximately three days of downtime in November, without a testing/pre-production period.
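
To tie items 2, 7 and 8 together, below is a hedged example of a Gadi job script. The project code ab12, the application name and the module name are placeholders, and the resource values are illustrative only; mem is requested slightly below the 192 GB per node on the assumption that some memory is reserved for the operating system.

    #!/bin/bash
    #PBS -P ab12
    #PBS -q normal
    #PBS -l ncpus=48
    #PBS -l mem=190GB
    #PBS -l jobfs=400GB
    #PBS -l walltime=02:00:00
    #PBS -l storage=scratch/ab12+gdata/ab12

    # Run from the directory the job was submitted from
    cd $PBS_O_WORKDIR

    # Load the application environment (module name is a placeholder) and run
    module load openmpi
    mpirun -np 48 ./my_app > output.log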

Job Charging - Examples

Gadi Cascade Lake node = 48 CPUs, 192 GB memory

1 hour of wall time on 1 CPU = 2 service units (SU).

Queue     CPUs   Memory    Walltime   Charge                   Comments
Normal    4      16 GB     5 hours    4 x 5 x 2 = 40 SU        4 cpus = 8.3% of node compute; 16 GB = 8.3% of node memory.
Normal    8      16 GB     5 hours    8 x 5 x 2 = 80 SU        CPU request dominates: 8 cpus = 17% of node compute; 16 GB = 8% of node memory.
Normal    8      192 GB    5 hours    48 x 5 x 2 = 480 SU      Memory request dominates: 192 GB = 100% of node memory.
Express   8      16 GB     5 hours    8 x 5 x 2 x 3 = 240 SU   CPU request dominates (as above); Express multiplier is x3.
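
The examples above all follow one rule: the charge is the larger of the number of CPUs requested and the memory request expressed as an equivalent number of CPUs on a 48-core, 192 GB node, multiplied by the wall time, the 2 SU per cpu-hour rate and any queue multiplier. A small shell sketch of that calculation, using the values from the 192 GB row above:

    # Estimate the SU charge for a job request (values from the 192 GB example above)
    ncpus=8; mem_gb=192; hours=5; rate=2; queue_multiplier=1
    # Express the memory request as the equivalent number of CPUs on a 48-core, 192 GB node
    mem_as_cpus=$(( mem_gb * 48 / 192 ))
    effective_cpus=$(( ncpus > mem_as_cpus ? ncpus : mem_as_cpus ))
    echo "$(( effective_cpus * hours * rate * queue_multiplier )) SU"   # prints: 480 SU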

Software

  1. NCI strongly recommends that all users recompile their applications to obtain optimum performance and compatibility with the Gadi run-time environment. 
  2. Binary executables from Raijin will be compatible with Gadi if the required dependencies and run-time libraries are available. Users are strongly advised to recompile on Gadi if possible.
  3. Work is currently in progress to port third-party application software to Gadi. See Gadi Software Catalogue - DRAFT for more information.
  4. NCI can assist with local builds of third-party software for individual research groups on Gadi, as on Raijin. Please note that during the transition to Gadi staff time may be limited and software assistance may be deferred until Gadi is fully operational. 
  5. The environment modules command will be available on Gadi, and will work in the same manner as on Raijin (a short example follows this list).
  6. All Python users are encouraged to move to Python 3 as soon as possible. Python 2.7.16 will be provided on Gadi; however, this will be the final version of Python 2 installed on the system. Development of Python 2 will officially cease on .
  7. Containers will be available on Gadi; however, NCI staff will need to build the container image to ensure it satisfies security and compatibility criteria. Singularity is the preferred container type at this time. Users who wish to use containers should contact NCI user support at help@nci.org.au.
  8. Work is in progress on a container environment to support Raijin backward compatibility on Gadi. This is intended to be a stop-gap solution for projects which require more time to adapt to Gadi. This "Raijin in a container" is expected to be available to users in Q4 and 2020 Q1 for a limited time only - details to be confirmed. Users are again strongly encouraged to rebuild all applications on Gadi for best stability and performance. 
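
As a simple illustration of item 5, the familiar module workflow is expected to carry across unchanged; the module name below is a placeholder, so check the Gadi Software Catalogue for what is actually installed.

    module avail                # list software installed on Gadi
    module load intel-compiler  # load the default version of a module (name is a placeholder)
    module list                 # show the modules currently loaded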

Virtual Desktop Interface (VDI)

  1. NCI's VDI service will be available to users as Gadi enters service in 2019 Q4 and 2020 Q1.
  2. The VDI application software stack on Gadi will continue to be compatible with that on Raijin through 2020 Q1.
  3. Gadi home directories will be available on VDI.
  4. VDI support will eventually move to CentOS 8 (schedule TBD).


Data Collections

  1. Gadi users who require access to NCI data collections should ensure they are members of the required data collection projects. 
  2. PBS jobs which read from files in data collections will need to use the requisite job directives to flag collection access, for example, "-lstorage=gdata/<project>". A data collection path on the file system will not be available to a job unless the PBS directive is provided. 
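
As a brief sketch, assuming a collection published under a hypothetical project code xy00, a user could confirm membership from a login node and flag access in a job script as follows:

    groups | grep -w xy00      # confirm membership of the data collection project
    ls /g/data/xy00            # browse the collection from a login node

    # In a PBS job script, flag access to the collection explicitly:
    #PBS -l storage=gdata/xy00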

Questions?

If you have further questions or concerns about the transition from Raijin to Gadi please contact NCI user support at help@nci.org.au.




