Last updated (2:00pm AEDT)
This page provides information to help users prepare for the transition from Raijin to Gadi in 2019 Q4.
NCI will regularly update this page and provide more detailed information as it becomes available. Note that the format of this page (and child pages) may change.
If you have questions or special concerns about how your work may be impacted by the transition from Raijin to Gadi, please contact NCI user support at email@example.com as soon as possible, and we will endeavour to help you promptly.
There are a number of things users can do now to prepare for Gadi. All users are strongly encouraged to take action as soon as possible.
The ANU has closed its Acton, Mount Stromlo and Kioloa campuses from midnight Thursday (TODAY) through 9:00am AEDT Tuesday due to hazardous smoke conditions in the ACT and the region. As a result of this closure, Gadi is expected to be available to users starting Tuesday. NCI will update this notice on Monday to reflect advice from the ANU and any local developments with respect to smoke conditions and regional bushfires.
Updated (1:45pm AEDT)
Raijin Sandy Bridge nodes will be decommissioned earlier than originally planned to accommodate electrical power support for Gadi. Please note that this revised schedule is still subject to change.
|NCI data centre preparation and Gadi installation phase one - COMPLETED.|
|Gadi stability and acceptance testing underway. Users preparing for Gadi.|
|Raijin user home directories will be copied to Gadi home directories ($HOME/raijin_home).|
|Transition Phase One|
Gadi and Raijin available to users. Gadi pre-production configuration is expected to include one rack of V100 GPU nodes.
Gadi allocations will match Raijin Q4 pro-rata allocations.
Jobs can be run (independently) on both systems.
Raijin /short available read-only on Gadi login and data mover nodes for user file transfers.
|50% of Raijin Sandy Bridge nodes decommissioned to allow power work for Gadi - DONE|
|Raijin Broadwell nodes offline for power reconfiguration work - DONE|
|Raijin operational with Broadwell and Skylake nodes only - DONE|
All Raijin Sandy Bridge nodes decommissioned; "normal" and "express" queues no longer available - DONE
|Raijin run-time compatibility environment available on Gadi - DONE|
Scheduled Downtime - Gadi
|Raijin operational with Broadwell and Skylake compute nodes.|
Production Phase Two
|Raijin /short filesystem decommissioned. Raijin end of service.|
|Jan 2020 - TBD||Broadwell and Skylake nodes migrated to Gadi.|
|2020 Q2 Scheduled Maintenance Downtime - Gadi|
This scheduled quarterly maintenance downtime will be extended to accommodate configuration tuning for Gadi. Details will be provided at a later date.
Please note that this timeline will be updated as often as necessary to reflect progress in data centre preparations, installation activities, and dependencies. NCI must decommission Raijin before Gadi can be configured in its full production capacity.
The message of the day on Raijin and Gadi login nodes will always contain the most up-to-date information about the availability of those systems.
|Intel MPI||2018.3.222, 2019.3.199|
|OpenMPI||2.1.6, 3.0.4, 3.1.4, 4.0.1|
NCI plans to provide OpenMPI version 4.0.2 at the time of Gadi pre-production, subject to testing and validation.
|Raijin||Gadi|
|Intel Xeon E5-2670 (Sandy Bridge)||Intel Xeon Platinum 8274 (Cascade Lake)|
|Two physical processors per node||Two physical processors per node|
|2.6 GHz clock speed||3.2 GHz clock speed|
|16 cores per node||48 cores per node|
|332 GFLOPs per node||4915 GFLOPs per node|
The computing charge rate on Gadi is 2.0 service units (SU) per cpu-hour. This rate broadly reflects Gadi's performance relative to Raijin.
All NCI allocations for 2020, including NCMAS, will be on Gadi only.
Compute allocations on Gadi are managed by stakeholder scheme managers, as on Raijin. Check this page - https://nci.org.au/scheme-managers - to identify your scheme manager.
Compute allocations on Gadi will apply to projects, as on Raijin.
In 2019 Q4, all active projects will be given Gadi compute quotas which match (pro-rata) their 2019 Q4 Raijin allocations.
During the Gadi pre-production period, compute (job) accounting on Raijin and Gadi will be independent.
To log in from your local desktop or another NCI computer, run ssh, using your own username in place of abc123. Your ssh connection will be to one of ten possible login nodes. As usual, for security reasons we ask that you do not set up passwordless ssh to Gadi. Entering your password each time you log in is more secure; alternatively, use a specialised ssh agent.
Gadi /home is a new, independent file system.
The quota on Gadi home directories will be 10 GB, as compared to a 2 GB quota on Raijin. Home directories are intended for irreproducible files, e.g. source code and configuration files. Users are expected to utilise /scratch, /g/data and JOBFS file systems for working data.
NCI will copy the contents of each user's Raijin home directory to the user's home directory on Gadi. The copy destination will be a subdirectory on Gadi, $HOME/raijin_home. This will be a one-time copy. Users will be responsible for migrating any further home directory files from Raijin to Gadi after this one-time copy.
Users are strongly encouraged to retain only essential files from their Raijin home directories on Gadi.
The temporary file system for Gadi users is /scratch. Note that the path '/short', as used on Raijin, will not exist on Gadi.
Raijin /short will be available on Gadi via a temporary, read-only path on login and data mover nodes only. Users are strongly encouraged to copy only essential files to Gadi /scratch.
The contents of Raijin /short will not be migrated to Gadi /scratch. It is the responsibility of each user or project to transfer any files they need from Raijin /short to Gadi.
Data transfer rates from Raijin /short to Gadi /scratch are expected to be approximately 1 TB per hour. Please plan your transfers accordingly, and do not wait until the last minute.
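As a rough planning aid, the estimate above can be turned into a small calculation. This is a minimal sketch that assumes the approximate 1 TB per hour rate is sustained for the whole transfer; the function name and the 36 TB example are illustrative only and are not NCI tools or measured figures.

```python
# Rough estimate of transfer time from Raijin /short to Gadi /scratch.
# Assumption: the ~1 TB/hour rate quoted on this page is sustained throughout;
# real rates will vary with load on both systems.

APPROX_RATE_TB_PER_HOUR = 1.0  # approximate figure from this page


def estimate_transfer_hours(data_tb, rate_tb_per_hour=APPROX_RATE_TB_PER_HOUR):
    """Return the estimated number of hours needed to copy data_tb terabytes."""
    if data_tb < 0:
        raise ValueError("data size must be non-negative")
    if rate_tb_per_hour <= 0:
        raise ValueError("transfer rate must be positive")
    return data_tb / rate_tb_per_hour


if __name__ == "__main__":
    # Hypothetical example: a project holding 36 TB on /short.
    hours = estimate_transfer_hours(36)
    print(f"Estimated transfer time: {hours:.0f} hours ({hours / 24:.1f} days)")
```

Because the rate is only approximate, it is sensible to treat the result as a lower bound and leave a generous margin.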
Gadi /scratch will be subject to an automated file purging policy: files will be removed 90 days after the time of last modification (mtime). In the interest of fairness and transparency, exceptions to this policy are not permitted.
Access time (atime) will not be considered in the /scratch purging policy. Persistent files should be stored in home directories or in project directories on the /g/data file systems.
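To get a feel for which files the policy would affect, a user could scan a directory for files whose mtime is more than 90 days old. The following is a minimal sketch, not an NCI tool: the function names are hypothetical, and treating "90 days" as exactly 90 x 24 hours with a strict "older than" comparison is an assumption about how the threshold is applied.

```python
import os
import time

PURGE_AGE_DAYS = 90  # purge threshold stated in the policy above
SECONDS_PER_DAY = 24 * 60 * 60


def is_purge_candidate(mtime, now, max_age_days=PURGE_AGE_DAYS):
    """True if a file last modified at `mtime` is older than the threshold at `now`.

    Both arguments are POSIX timestamps in seconds. Only mtime is considered,
    mirroring the policy; atime is ignored.
    """
    return (now - mtime) > max_age_days * SECONDS_PER_DAY


def find_purge_candidates(top, now=None):
    """Walk `top` and return paths of files whose mtime exceeds the threshold."""
    if now is None:
        now = time.time()
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_purge_candidate(mtime, now):
                candidates.append(path)
    return candidates
```

A call such as `find_purge_candidates("/scratch/<project>")`, with a real project code substituted, would list files to review and either delete or move to persistent storage.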
Safety quotas, to prevent accidental overpopulation of the file system, will be applied to projects on /scratch. NCI is currently developing the safety quota implementation.
NCI is developing tools and notifications to help users track the status of their files in /scratch.
Any attempts to circumvent the 90-day scratch purge policy by using the touch command or other strategies will result in account deactivation.
The Gadi /scratch purging policy is expected to be activated in 2020 Q2. Users will have approximately three (3) months to clean up and organise files in /scratch directories before activation of the purging regime.
All compute projects will be provided with a default /g/data directory for storage of persistent data. The default quota for /g/data project directories remains to be finalised. Note that allocations for projects which already have /g/data access will not change in 2020 unless such changes are defined in a contract or agreement.
Plan to modify your workflow(s) to place temporary files on /scratch, and persistent files on /g/data.
The /g/data file systems will continue to be available on Gadi and Raijin during the Gadi transition phase. Infrastructure work may temporarily impact file system performance during pre-production. Please also note that during transition, while Raijin and Gadi systems are both connected to the /g/data file systems, the file system performance may be impacted, as bandwidth is shared across both systems.
Jobs on Gadi must explicitly declare, via PBS directives, which file systems are to be accessed during the job. As an example, a job which will read or write data in the /scratch/<project> and /g/data/<project> directories must include the directive "-lstorage=scratch/<project>+gdata/<project>". A job that attempts to access a /g/data or /scratch directory without this directive will fail at run time. Refer to the Data Collections section (below) to ensure that your access to data collections projects will not be affected by this change.
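For workflows that generate job scripts programmatically, the directive string can be assembled from a list of project codes. This is a minimal sketch: the helper name is hypothetical, `ab12` is a made-up project code, and joining more than two entries with "+" is an assumption that extends the pattern shown in the example above.

```python
def storage_directive(scratch_projects=(), gdata_projects=()):
    """Build a '-lstorage=...' PBS directive string from project codes.

    Each scratch project contributes 'scratch/<project>' and each /g/data
    project contributes 'gdata/<project>'; entries are joined with '+'.
    """
    entries = [f"scratch/{p}" for p in scratch_projects]
    entries += [f"gdata/{p}" for p in gdata_projects]
    if not entries:
        raise ValueError("at least one project is required")
    return "-lstorage=" + "+".join(entries)


if __name__ == "__main__":
    # 'ab12' is a hypothetical project code used only for illustration.
    print(storage_directive(scratch_projects=["ab12"], gdata_projects=["ab12"]))
```

Generating the directive from a single list of projects helps keep it in step with the paths a job actually touches.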
A user shell on a Gadi login node will not have access to /g/data file system directories of projects of which the user is not a member.
Projects should exercise caution in running workflows on Gadi and Raijin simultaneously during the pre-production period. Jobs which are in flight at the same time on Raijin and Gadi, and which access files on the /g/data file systems, for example, could fail due to file contention.
Project data on the /g/data2 file system was recently migrated to a new file system, /g/data4. A symbolic link /g/data2→/g/data4 has been provided for backward compatibility on Raijin. This /g/data2 symbolic link will not be provided on Gadi. All Gadi users are expected to update scripts and workflows to include the new /g/data4 path where needed.
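Updating scripts for the new path is largely a text substitution. The sketch below shows one way to do it in Python; the function name is hypothetical, and the pattern assumes that `/g/data2` should only be rewritten when it appears as a complete path prefix, so please review the result before overwriting any script.

```python
import re

# Match '/g/data2' only when it stands alone as a path prefix: it must not be
# preceded or followed by a word character, '.' or '-'. This leaves a
# hypothetical path such as '/g/data21' or '/foo/g/data2' untouched.
_OLD_PATH = re.compile(r"(?<![\w.-])/g/data2(?![\w.-])")


def update_gdata_paths(text):
    """Return `text` with every '/g/data2' path prefix replaced by '/g/data4'."""
    return _OLD_PATH.sub("/g/data4", text)


if __name__ == "__main__":
    # 'ab12' is a made-up project code used only for illustration.
    print(update_gdata_paths("cd /g/data2/ab12/run"))
```

The function works on a string, so it can be applied to the contents of a job script read from disk and the output compared with the original before saving.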
Gadi Cascade Lake nodes have 48 CPUs and 192 GB memory.
Users will need to adjust PBS job scripts and workflows from Raijin to suit Gadi: 48 CPUs/node (Cascade Lake, significantly faster than Raijin), 192 GB RAM/node, 400 GB PBS_JOBFS/node, and so on.
Gadi will run PBS Pro version 19.
Job scheduling will be determined at the project level, as on Raijin. It is not possible to schedule jobs on a per-user basis on Gadi.
Gadi will have Normal and Express queues, as on Raijin. Gadi's Broadwell and Skylake queues will conform to Raijin specifications. Queue details are available on the page Gadi Job Queues.
Gadi resource limits (including defaults) will be provided in a child page.
The PBS_JOBFS size on Gadi normal/express/copyq nodes will be limited to 400 GB per node. Jobs that require more than 400 GB/node are expected to use /scratch disk.
Jobs on Gadi must explicitly declare, via PBS directives, which file systems are to be accessed during the job. As an example, a job which will read or write data in the /scratch/<project> and /g/data/<project> directories must include the directive "-lstorage=scratch/<project>+gdata/<project>". A job that attempts to access a /g/data or /scratch directory without this directive will fail at run time.
Jobs which use less than a full node (Cascade Lake = 48 cpus) will be charged according to the fractional utilisation of node resources, that is, by number of CPUs or amount of node memory requested, whichever is larger. Note that charging on Raijin was based on cpu-hours only, without consideration of the memory requested or used.
Projects which use memory-intensive, low-compute workflows may consume SUs more rapidly than expected on Gadi.
Project job resource exceptions (for example, wall time extensions) established on Raijin will not be carried across to Gadi. Most user jobs on Gadi will require less wall time than on Raijin. Job resource exceptions on Gadi will need to be compellingly justified.
Broadwell and Skylake nodes are expected to be offline for approximately three (3) working days in late November while they are migrated to Gadi, and users who rely on these nodes should plan for this downtime. Unfortunately, a testing/pre-production period will not be available for Broadwell and Skylake workflows.
Raijin will continue to operate with Broadwell and Skylake compute nodes until .
Gadi Cascade Lake node = 48 CPUs, 192 GB memory
1 cpu-hour = 2 service units (SU)
|Queue||CPUs requested||Memory requested||Wall time||Charge||Notes|
|Normal||4||16 GB||5 hours||4 x 5 x 2 = 40 SU||Satisfies 1 CPU <= 4 GB memory.|
|Normal||8||16 GB||5 hours||8 x 5 x 2 = 80 SU||CPU request dominates.|
|Normal||8||128 GB||5 hours||32 x 5 x 2 = 320 SU||Memory request dominates. 128 GB is two-thirds of node memory, equivalent to 32 CPUs.|
|Normal||8||192 GB||5 hours||48 x 5 x 2 = 480 SU||Memory request dominates. 192 GB = 100% of node memory, equivalent to 48 CPUs.|
|Express||8||16 GB||5 hours||8 x 5 x 2 x 3 = 240 SU||CPU request dominates (as above). Express multiplier is x3.|
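The worked examples above can be reproduced with a short calculation. This is a minimal sketch of the charging rule as described on this page, not NCI's accounting code: the function and constant names are hypothetical, it covers only the Normal and Express queues on a single Cascade Lake node, and whether wall time is taken as requested or as used is not specified here.

```python
NODE_CPUS = 48          # CPUs per Cascade Lake node
NODE_MEM_GB = 192       # memory per Cascade Lake node
SU_PER_CPU_HOUR = 2.0   # Gadi charge rate from this page
QUEUE_MULTIPLIER = {"normal": 1, "express": 3}  # express is charged at x3


def job_charge_su(queue, ncpus, mem_gb, walltime_hours):
    """Return the SU charge for a job using at most one Cascade Lake node.

    The job is charged for whichever is larger: the CPUs requested, or the
    number of CPUs that corresponds to the requested share of node memory
    (192 GB / 48 CPUs = 4 GB per CPU).
    """
    mem_gb_per_cpu = NODE_MEM_GB / NODE_CPUS
    cpus_by_memory = mem_gb / mem_gb_per_cpu
    effective_cpus = max(ncpus, cpus_by_memory)
    return effective_cpus * walltime_hours * SU_PER_CPU_HOUR * QUEUE_MULTIPLIER[queue]


if __name__ == "__main__":
    # The five rows of the table above: (queue, CPUs, memory in GB, hours).
    examples = [
        ("normal", 4, 16, 5),
        ("normal", 8, 16, 5),
        ("normal", 8, 128, 5),
        ("normal", 8, 192, 5),
        ("express", 8, 16, 5),
    ]
    for queue, ncpus, mem_gb, hours in examples:
        su = job_charge_su(queue, ncpus, mem_gb, hours)
        print(f"{queue:8s} {ncpus:2d} CPUs {mem_gb:3d} GB {hours} h -> {su:.0f} SU")
```

Running the examples shows how quickly a large memory request raises the charge for a job that uses only a few CPUs.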
NCI strongly recommends that all users recompile their applications to obtain optimum performance and compatibility with the Gadi run-time environment.
Binary executables from Raijin are expected to be compatible with Gadi if required dependencies and run-time libraries are available. Applications which rely on old dependencies are particularly at risk. Users are strongly encouraged to recompile on Gadi if possible.
Work is currently in progress porting third-party application software to Gadi. More information is available on the following page: Gadi Software Catalogue.
NCI can assist with local builds of third-party software for individual research groups on Gadi, as on Raijin. Please note that during the transition to Gadi staff time may be limited and software assistance may be deferred until Gadi is fully operational.
The environment modules command will be available on Gadi, and will work in the same manner as on Raijin.
All Python users are encouraged to move to Python 3 as soon as possible. Python 2.7.16 will be provided on Gadi; however, this will be the final version of Python 2 installed on the system. Development of Python 2 is officially coming to an end.
Containers will be available on Gadi; however, NCI staff will need to build the container image to ensure it satisfies security and compatibility criteria. Singularity is the preferred container type at this time. Users who require containers on Gadi should contact NCI user support at firstname.lastname@example.org.
Work is in progress on a container environment to support Raijin backward compatibility on Gadi. This is intended to be a stop-gap solution for projects which require more time to adapt to Gadi. This "Raijin in a container" is expected to be available to users in 2019 Q4 and 2020 Q1 for a limited time only; details are to be confirmed. Users are again strongly encouraged to rebuild all applications on Gadi for long-term stability and performance.
Gadi provides a containerised environment which duplicates the run-time environment available on Raijin. This capability is provided to maintain operational continuity for projects which may require more time to port workflows and tools to the native environment on Gadi.
To use the Raijin compatibility image:
Add the following flag to your PBS Pro job script: -limage=raijin.
The Raijin compatibility image has several limitations:
All projects are encouraged to migrate their applications and workflows to Gadi as soon as possible.
NCI's VDI service will continue to be available to users as Gadi enters service in 2019 Q4 and 2020 Q1. Overall VDI functionality is expected to remain unchanged.
The current VDI application software stack will continue to be available. Email email@example.com if you have any questions or requests for new software packages on the VDI.
As is the case now on Raijin, user home directories on VDI will continue to be separate from home directories on Gadi.
VDI-to-Gadi job submission functionality is now available. Please note that this will be implemented as the default option overnight Monday 9 December. For more information about how to use this feature during the transition period see the VDI User Guide https://opus.nci.org.au/display/Help/VDI+User+Guide#VDIUserGuide-4.2.PBS.
Gadi users who require access to NCI data collections should ensure they are members of the required data collection projects.
PBS jobs which read from files in data collections will need to use the requisite job directives to flag collection access, for example, "-lstorage=gdata/<project>". A data collection path on the file system will not be available to a job unless the appropriate PBS directive is provided.
More comprehensive updates on VDI will be provided to users in January-February 2020. If you have specific questions or concerns about VDI please contact NCI User Support - firstname.lastname@example.org.
Transition to Gadi - as presented at the ALCS 2019 Training Day.
If you have further questions or concerns about the transition from Raijin to Gadi please contact NCI user support at email@example.com.