Page tree
Skip to end of metadata
Go to start of metadata

Last updated  (10:30am AEDT)


This page provides information to help users prepare for the transition from Raijin to Gadi in 2019 Q4. 

NCI will regularly update this page and provide more detailed information as it becomes available. Note that the format of this page (and child pages) may change.

If you have questions or special concerns about how your work may be impacted by the transition from Raijin to Gadi please let us know as soon as possible - contact NCI user support at help@nci.org.au - and we will endeavour to help you as soon as possible.

There are a number of things users can do now to prepare for Gadi. All users are strongly encouraged to take action as soon as possible.

Contents

.

Gadi Timeline

Updated  (10:30am AEDT)

The Gadi pre-production period has been extended to provide additional time for users to transition from Raijin to Gadi. The revised schedule is still subject to change. 

Date(s)

Events

 

NCI data centre preparation and Gadi installation phase one - COMPLETED.

 

Gadi stability and acceptance testing underway. Users preparing for Gadi.

 

Raijin user home directories will be copied to Gadi home directories ($HOME/raijin_home). 

 

Transition Phase One
Gadi and Raijin available to users. Gadi pre-production configuration is expected to include one rack of V100 GPU nodes.
Gadi allocations will match Raijin Q4 pro-rata allocations.
Jobs can be run (independently) on both systems.

 -  

Raijin /short available read-only on Gadi login and data mover nodes for user file transfers.
Progressive deployment of Gadi nodes to full specification, and phased retirement of Raijin Sandy Bridge nodes.

 

50% of Raijin Sandy Bridge nodes decommissioned to allow power work for Gadi.

 -  

Raijin job submission ends (reduced configuration). 
Raijin Sandy Bridge decommissioned. 
Scheduled Gadi downtime to allow acceptance testing on Phase Two. 
Broadwell and Skylake nodes migrated to Gadi. 
K80 GPUs decomissioned.

 

Production
Gadi operational at full production specification. Broadwell and Skylake nodes available on Gadi.

 

Raijin /short filesystem decommissioned.

Please note that this timeline will be updated as often as necessary to reflect progress in data centre preparations, installation activities, and dependencies. NCI must decommission Raijin before Gadi can be configured in its full production capacity.

User Environment

  1. The user's default shell and project will be controlled by the file gadi-login.conf in each user's $HOME/.config directory.
  2. Gadi /home quotas are applied on a per-user basis, as on Raijin.
  3. Gadi /home quotas will be 10 GB.
  4. Gadi login and compute nodes will run the CentOS 8 operating system.
  5. NCI is currently considering the best mechanism to deliver files from Raijin /home directories to Gadi. The solution is likely to be either (a) user can copy files from a read-only archive copy of Raijin /home, or (b) a bulk copy of user Raijin /home files to Gadi /home directories. More information will be available on this as soon as possible.

User Environment: Compilers and MPI

PackageVersions
Intel Compilers2019.3.199
Intel MPI2018.3.222, 2019.3.199
OpenMPI2.1.6, 3.0.4, 3.1.4, 4.0.1

NCI plans to provide version OpenMPI 4.0.2 at the time of Gadi pre-production, subject to testing and validation.

Processor Comparison: Raijin vs Gadi

RaijinGadi
Intel Xeon E5-2670 (Sandy Bridge)Intel Xeon Platinum 8274 (Cascade Lake)
Two physical processors per node

Two physical processors per node

2.6 GHz clock speed3.2 GHz clock speed
16 cores per node48 cores per node

332 GFLOPs per node
(theoretical peak)

4915 GFLOPs per node
(theoretical peak).

Resources

The computing charge rate on Gadi is 2.0 service units (SU) per cpu-hour. This rate broadly reflects Gadi's performance relative to Raijin.

All NCI allocations for 2020, including NCMAS, will be on Gadi only.

Compute allocations on Gadi are managed by stakeholder scheme managers, as on Raijin. Check this page - https://nci.org.au/scheme-managers - to identify your scheme manager. 

Compute allocations on Gadi will apply to projects, as on Raijin.

In 2019 Q4, all active projects will be given Gadi compute quotas which match (pro-rata) their 2019 Q4 Raijin allocations. 

During the Gadi pre-production period, compute (job) accounting on Raijin and Gadi will be independent.

Logging in

To login from your local desktop or other NCI computer run ssh:


where abc123 is your own username. Your ssh connection will be to one of ten possible login nodes. As usual, for security reasons we ask that you do not set up passwordless ssh to Gadi. Entering your password every time you login is more secure, or use specialised ssh secure agents.

File Systems - /home

Gadi /home is a new, independent file system.

The quota on Gadi home directories will be 10 GB, as compared to a 2 GB quota on Raijin. Home directories are intended for irreproducible files, e.g. source code and configuration files. Users are expected to utilise /scratch, /g/data and JOBFS file systems for working data.

On  NCI will copy the contents of each user's Raijin home directory to the user's home directory on Gadi. The copy destination will be a subdirectory on Gadi, $HOME/raijin_home. This will be a one-time copy. Users will be responsible for migrating any further home directory files from Raijin to Gadi after .

Users are strongly encouraged to retain only essential files from their Raijin home directories on Gadi. 

File Systems - /scratch

The temporary file system for Gadi users is /scratch. Note that the path '/short', as used on Raijin, will not exist on Gadi. 

Raijin /short will be available on Gadi via a temporary, read-only path on login and data mover nodes only until . Users are strongly encouraged to copy only essential files to Gadi /scratch.

The contents of Raijin /short will not be migrated to Gadi /scratch. It is the responsibility of each user or project to transfer any files he/she needs from Raijin /short to Gadi.

Data transfer rates from Raijin /short to Gadi /scratch are expected to be approximately 1 TB per hour. Please plan your transfers accordingly, and do not wait until the last minute.

Gadi /scratch will be subject to an automated file purging policy: files will be removed 90 days after the time of last modification (mtime). In the interest of fairness and transparency, exceptions to this policy are not permitted.

Access time (atime) will not be considered in the /scratch purging policy. Persistent files should be stored in home directories or in project directories on the /g/data file systems.

Safety quotas, to prevent accidental overpopulation of the file system, will be applied to projects on /scratch. NCI is currently developing the safety quota implementation. 

NCI is developing tools and notifications to help users track the status of their files in /scratch. 

Any attempts to circumvent the 90-day scratch purge policy by using the touch command or other strategies will result in account deactivation.

The Gadi /scratch purging policy is expected to be activated in 2020 Q2, . Users will have approximately three (3) months to clean up and organise files in /scratch directories before activation of the purging regime.

All compute projects will be provided with a default /g/data directory for storage of persistent data. The default quota for /g/data project directories remains to be finalised. Note that allocations for projects which already have /g/data access will not change in 2020 unless such changes are defined in a contract or agreement.

Plan to modify your workflow(s) to place temporary files on /scratch, and persistent files on /g/data.

File Systems - /g/data

The /g/data file systems will continue to be available on Gadi and Raijin during the Gadi transition phase. Infrastructure work may temporarily impact file system performance during pre-production. Please also note that during transition, while Raijin and Gadi systems are both connected to the /g/data file systems, the file system performance may be impacted, as bandwidth is shared across both systems.

Jobs on Gadi must explicitly declare, via PBS directives, which file systems are to be accessed during the job. As an example, a job which will read or write data in the /scratch/<project> and /g/data/<project> directories must include the directive  "-lstorage=scratch/<project>+gdata/<project>".  A job that attempts to access a /g/data or scratch directory without this directive will fail during run time. Refer to the Data Collections section (below) to ensure that your access to data collections projects will not be affected by this change.

A user shell on a Gadi login node will not have access to /g/data file system directories of projects for which the user is not a member.

Projects should exercise caution in running workflows on Gadi and Raijin simultaneously during the pre-production period. Jobs which are in flight at the same time on Raijin and Gadi, and which access files on the /g/data file systems, for example, could fail due to file contention.  

Project data on the /g/data2 file system was recently migrated to a new file system, /g/data4. A symbolic link /g/data2→/g/data4 has been provided for backward compatibility on Raijin. This /g/data2 symbolic link will not be provided on Gadi. All Gadi users are expected to update scripts and workflows to include  the new /g/data4 path where needed. 

Jobs

Gadi Cascade Lake nodes have 48 CPUs and 192 GB memory.

Users will need to adjust PBS job scripts and workflows from Raijin to suit Gadi: 48 CPUs/node (Cascade Lake, significantly faster than Raijin), 192 GB RAM/node, 400 GB PBS_JOBFS/node, and so on.

Gadi will run PBS Pro version 19.

Job scheduling will be determined at the project level, as on Raijin. It is not possible to schedule jobs on a per-user basis on Gadi.

Gadi will have Normal and Express queues as on Raijin. Gadi's Broadwell and SkyLake queues will conform to Raijin specifications. Queue details are available on the page Gadi Job Queues

Gadi resource limits (including defaults) will be provided in a child page, linked here.

The PBS_JOBFS size on Gadi normal/express/copyq nodes will be limited to 400 GB per node. Jobs that require more than 400 GB/node are expected to use /scratch disk.

Jobs on Gadi must explicitly declare, via PBS directives, which file systems are to be accessed during the job. As an example, a job which will read or write data in the /scratch/<project> and /g/data/<project> directories must include the directive  "-lstorage=scratch/<project>+gdata/<project>".  A job that attempts to access a /g/data or scratch directory without this directive will fail during run time.

Jobs which use less than a full node (Cascade Lake = 48 cpus) will be charged according to the fractional utilisation of node resources, that is, by number of CPUs or amount of node memory requested, whichever is larger. Note that charging on Raijin was based on cpu-hours only, without consideration of the memory requested or used. 

Projects which use memory-intensive, low-compute workflows may consume SUs more rapidly than expected on Gadi. 

Project job resource exemption (for example, wall time extensions) established on Raijin will not be carried across to Gadi. Most user jobs on Gadi will require less wall time than on Raijin. Job resource exceptions on Gadi will need to be compellingly justified.

Broadwell and SkyLake nodes are expected to be offline for three working days in November when they are migrated to Gadi. Users who rely on Broadwell or  SkyLake nodes should prepare for approximately three (3) days of downtime in late November. Unfortunately a testing/pre-production period will not be available to Broadwell and SkyLake workflows.

Job Charging - Examples

Gadi Cascade Lake node = 48 CPUs, 192 GB memory

1 cpu-hour = 2 service units (SU)

QueueCPUsMemory (GB)WalltimeChargeComments
Normal416 GB5 hours4 x 5 x 2 = 40 SUSatisfies 1 CPU <= 4 GB memory.
Normal816 GB5 hours8 x 5 x 2 = 80 SUCPU request dominates.

Normal8128 GB5 hours32 x 5 x 2 = 320 SUMemory request dominates.
32 cpus is proportion of node resources.

Normal8192 GB5 hours48 x 5 x 2 = 480 SUMemory request dominates.
192GB = 100% of node memory.
Express816 GB5 hours8 x 5 x 2 x 3 = 240 SUCPU request dominates (as above).
Express multiplier is x3.

Software

NCI strongly recommends that all users recompile their applications to obtain optimum performance and compatibility with the Gadi run-time environment. 

Binary executables from Raijin are expected to be compatible with Gadi if required dependencies and run-time libraries are available. Applications which rely on old dependencies are particularly at risk. Users are strongly encouraged to recompile on Gadi if possible.

Work is currently in progress porting third-party application software to Gadi. More information is available on the following page: Gadi Software Catalogue.

NCI can assist with local builds of third-party software for individual research groups on Gadi, as on Raijin. Please note that during the transition to Gadi staff time may be limited and software assistance may be deferred until Gadi is fully operational. 

The environment modules command will be available on Gadi, and will work in the same manner as on Raijin.

All Python users are encouraged to move to Python 3 as soon as possible. Python 2.7.16 will be provided on Gadi, however this will be the final version of Python 2 installed on the system. Development of Python 2 will officially ceases on .  

Containers will be available on Gadi, however, NCI staff will need to build the container image to ensure it satisfies security and compatibility criteria. Singularity is the preferred container type at this time. Users who require containers on Gadi should contact NCI user support at help@nci.org.au.

Work is in progress on a container environment to support Raijin backward compatibility on Gadi. This is intended to be a stop-gap solution for projects which require more time to adapt to Gadi. This "Raijin in a container" is expected to be available to users in Q4 and 2020 Q1 for a limited time only - details to be confirmed. Users are again strongly encouraged to rebuild all applications on Gadi for long-term stability and performance. 

Virtual Desktop Infrastructure (VDI)

NCI's VDI service will continue to be available to users as Gadi enters service in 2019 Q4 and 2020 Q1. Overall VDI functionality is expected to remain unchanged.

The current VDI application software stack will continue to be available. Email help@nci.org.au if you have any questions or requests for new software packages on the VDI.

As is the case now on Raijin, user home directories on VDI will continue to be separate from home directories on Gadi.

VDI-to-Gadi job submission functionality is now available. Please note that this will be implemented as the default option overnight Monday 9 December. For more information about how to use this feature during the transition period see the VDI User Guide https://opus.nci.org.au/display/Help/VDI+User+Guide#VDIUserGuide-4.2.PBS.

Data Collections

Gadi users who require access to NCI data collections should ensure they are members of the required data collection projects. 

PBS jobs which read from files in data collections will need to use the requisite job directives to flag collection access, for example, "-lstorage=gdata/<project>". A data collection path on the file system will not be available to a job unless the appropriate PBS directive is provided. 

More comprehensive updates on VDI will be provided to users in January-February 2020. If you have specific questions or concerns about VDI please contact NCI User Support - help@nci.org.au.

Training

Transition to Gadi - as presented at the ALCS 2019 Training Day.

Questions?

If you have further questions or concerns about the transition from Raijin to Gadi please contact NCI user support at help@nci.org.au.





  • No labels