Last updated 03 Jan

This page provides information to help users prepare for the transition from Raijin to Gadi in 2019 Q4. 

Note

 COVID-19 Update

NCI staff are currently working from home in response to the COVID-19 pandemic. NCI systems remain fully operational and will be supported remotely. NCI user support will continue to operate as normal, with email (help@nci.org.au) the preferred mode of support interaction.


NCI will regularly update this page and provide more detailed information as it becomes available. Note that the format of this page (and child pages) may change.

If you have questions or special concerns about how your work may be impacted by the transition from Raijin to Gadi, please contact NCI user support at help@nci.org.au as soon as possible and we will endeavour to help you promptly.

There are a number of things users can do now to prepare for Gadi. All users are strongly encouraged to take action as soon as possible.

UPDATE  

The ANU has closed its Acton, Mount Stromlo and Kioloa campuses from midnight Thursday  (TODAY) through 9:00am AEDT Tuesday  due to hazardous smoke conditions in the ACT and the region. As a result of this closure Gadi is expected to be available to users starting Tuesday . NCI will update this notice on Monday  to reflect advice from the ANU, and any local developments with respect to smoke conditions and regional bushfires.

Updated  (1:45pm AEDT)

Gadi Status Summary

UPDATED  

  • Cascade Lake hugemem nodes are now available on Gadi. 
  • On Gadi, use the PBS directive "-lstorage=<path>" if your job accesses a /g/data directory or the /scratch directory of another project. (Note that POSIX permissions still apply.) Jobs that omit this directive will fail with a run-time error. See the section "Filesystems - /g/data" below for more information.
  • On Gadi, user workflows should reference /g/data directory paths using the form "/g/data/projectcode", i.e. without the alphanumeric filesystem descriptors 1a, 1b, 2, 3, or 4. 
  • A Raijin run-time compatibility image is provided on Gadi. To use it, add the "-limage=raijin" directive to your PBS job script and modify your CPU request to be Gadi compliant, i.e. a multiple of 48 CPUs if you are using more than one full node. Use of the Raijin compatibility image will incur a performance penalty. All users are advised to recompile their applications on Gadi as soon as possible.
  • Raijin /short and /home file systems are now offline. These file systems will be decommissioned on Tuesday

Details on these items can be found in the following sections.

Gadi Timeline

Raijin Sandy Bridge nodes will be decommissioned earlier than originally planned to accommodate electrical power support for Gadi. Please note that this revised schedule is still subject to change. 

Date(s)

Events

 

NCI data centre preparation and Gadi installation phase one - COMPLETED.

 

Gadi stability and acceptance testing underway. Users preparing for Gadi.

 

Raijin user home directories will be copied to Gadi home directories ($HOME/raijin_home). 

 

Transition Phase One
Gadi and Raijin available to users. Gadi pre-production configuration is expected to include one rack of V100 GPU nodes.
Gadi allocations will match Raijin Q4 pro-rata allocations.
Jobs can be run (independently) on both systems.

 -  

Raijin /short available read-only on Gadi login and data mover nodes for user file transfers.
Progressive deployment of Gadi nodes to full specification, and phased retirement of Raijin Sandy Bridge nodes.

 

50% of Raijin Sandy Bridge nodes decommissioned to allow power work for Gadi - DONE

 

Raijin Broadwell nodes offline for power reconfiguration work - DONE

 

Raijin operational with Broadwell and Skylake nodes only - DONE
All Raijin Sandy Bridge nodes decommissioned; "normal" and "express" queues no longer available - DONE

 

Raijin run-time compatibility environment available on Gadi - DONE

 -  

Scheduled Downtime - Gadi
Scheduled downtime for final Gadi configuration and pre-production acceptance testing. 

 -  

Raijin operational with Broadwell and Skylake compute nodes. 

07  

Production Phase Two
Gadi operational at full production specification. Broadwell and Skylake nodes available on Gadi.
UPDATE: The ANU has stood down all non-essential staff from midnight Thursday to 9:00am AEDT Tuesday due to hazardous smoke conditions. Gadi is expected to be available to users on Tuesday.
Gadi available to users. Broadwell and Skylake nodes offline for Gadi integration.
UPDATE: The ANU has re-opened its main campus following a multi-day shutdown due to hazardous smoke conditions in the ACT.

 -  

ANU campus closure due to severe hailstorm. NCI systems available to users. 

 

Raijin /short and /home file systems decommissioned. Raijin end of service.

Jan 2020 - TBD

Broadwell and Skylake nodes migrated to Gadi.

 -  

2020 Q2 Scheduled Maintenance Downtime - Gadi
This Q2 scheduled quarterly maintenance downtime will be extended to accommodate configuration tuning for Gadi. Details will be provided at a later date.

...

The message of the day on Raijin and Gadi login nodes will always contain the most up to date information about system status and availability of those systems.

User Environment

  1. The user's default shell and project will be controlled by the file gadi-login.conf in each user's $HOME/.config directory.
  2. Gadi /home quotas are applied on a per-user basis, as on Raijin.
  3. Gadi /home quotas will be 10 GB.
  4. Gadi login and compute nodes will run the CentOS 8 operating system. NCI is currently considering the best mechanism to deliver files from Raijin /home directories to Gadi. The solution is likely to be either (a) users can copy files from a read-only archive copy of Raijin /home, or (b) a bulk copy of user Raijin /home files to Gadi /home directories. More information will be available on this as soon as possible.
  5. Raijin /short and /home directories are available on Gadi via the paths /raijin/short and /raijin/home, respectively, until . These paths can be accessed read-only on login nodes and copyq/datamover nodes only.

User Environment: Compilers and MPI

...

Users are strongly encouraged to retain only essential files from their Raijin home directories on Gadi. 

UPDATE – Raijin /short and /home directories are available on Gadi via the paths /raijin/short and /raijin/home, respectively, until . These paths can be accessed read-only on login nodes and copyq/datamover nodes only.

File Systems - /scratch

The temporary file system for Gadi users is /scratch. Note that the path '/short', as used on Raijin, will not exist on Gadi. 

...

Project data on the /g/data2 file system was recently migrated to a new file system, /g/data4. A symbolic link /g/data2 → /g/data4 has been provided for backward compatibility on Raijin. This /g/data2 symbolic link will not be provided on Gadi. All Gadi users are expected to update scripts and workflows to include the new /g/data4 path where needed.

On Gadi, user workflows should reference /g/data directory paths of the form "/g/data/projectcode". A numeric file system descriptor in /g/data paths, for example /g/data1a/ab12, should no longer be used.
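As a rough illustration of the path change described above, the following Python sketch rewrites legacy /g/data prefixes in a string (for example, a line of a job script). The function name normalise_gdata_path and the sample file name are hypothetical; the list of legacy descriptors (1a, 1b, 2, 3, 4) is taken from this page.

```python
import re

# Legacy file system descriptors named on this page: 1a, 1b, 2, 3 and 4.
# The lookahead requires a following "/", so "/g/data/..." is left untouched.
_LEGACY_GDATA = re.compile(r"/g/data(?:1a|1b|2|3|4)(?=/)")


def normalise_gdata_path(text: str) -> str:
    """Rewrite legacy /g/dataN/<project> prefixes to /g/data/<project>."""
    return _LEGACY_GDATA.sub("/g/data", text)


# Hypothetical example path under the project code used on this page.
print(normalise_gdata_path("/g/data1a/ab12/run01/input.nc"))
```

Paths that already use the /g/data/projectcode form pass through unchanged, so the function can be applied to a whole script without special-casing.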

Jobs

Gadi Cascade Lake normal/express/copyq nodes have 48 CPUs and 192 GB memory.

Users will need to adjust PBS job scripts and workflows from Raijin to suit Gadi: 48 CPUs/node (Cascade Lake, significantly faster than Raijin), 192 GB RAM/node, 400 GB PBS_JOBFS/node, and so on.
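When adjusting CPU requests carried over from Raijin, a small helper can round a multi-node request up to a whole number of 48-CPU nodes. This is a minimal sketch, assuming that requests of up to one node may keep any CPU count while larger requests must be a multiple of 48; the function name gadi_ncpus is hypothetical.

```python
import math

CPUS_PER_NODE = 48  # Gadi Cascade Lake normal/express/copyq nodes


def gadi_ncpus(requested: int) -> int:
    """Return a CPU count that is valid under the assumption stated above."""
    if requested < 1:
        raise ValueError("CPU request must be at least 1")
    if requested <= CPUS_PER_NODE:
        # Sub-node and single-node requests are left as they are.
        return requested
    # Multi-node requests are rounded up to whole nodes.
    return math.ceil(requested / CPUS_PER_NODE) * CPUS_PER_NODE


print(gadi_ncpus(64))  # a 64-CPU request becomes 96 (two full nodes)
```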

Gadi runs PBS Pro version 19.

Job scheduling will be determined at the project level, as on Raijin. It is not possible to schedule jobs on a per-user basis on Gadi.

Gadi will have Normal and Express queues as on Raijin. Gadi's Broadwell and Skylake queues will conform to Raijin specifications. Queue details are available on the page Gadi Job Queues.

Gadi resource limits (including defaults) will be provided in a child page, linked here.

The PBS_JOBFS size on Gadi normal/express/copyq nodes will be limited to 400 GB per node. Jobs that require more than 400 GB/node are expected to use /scratch disk.

Jobs on Gadi must explicitly declare, via PBS directives, which file systems are to be accessed during the job. As an example, a job which will read or write data in the /scratch/<project> and /g/data/<project> directories must include the directive "-lstorage=scratch/<project>+gdata/<project>". A job that attempts to access a /g/data or /scratch directory without this directive will fail at run time.
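The directive format can be derived mechanically from the paths a job uses. The Python sketch below is illustrative only: the function name storage_directive is hypothetical, and it assumes every path begins with /scratch/<project> or /g/data/<project>, following the "scratch/<project>+gdata/<project>" form quoted above.

```python
def storage_directive(*paths: str) -> str:
    """Build a -lstorage directive from /scratch and /g/data paths."""
    if not paths:
        raise ValueError("at least one path is required")
    items = []
    for path in paths:
        parts = [p for p in path.split("/") if p]
        if len(parts) >= 2 and parts[0] == "scratch":
            item = "scratch/" + parts[1]
        elif len(parts) >= 3 and parts[:2] == ["g", "data"]:
            item = "gdata/" + parts[2]
        else:
            raise ValueError("not a /scratch or /g/data project path: " + path)
        if item not in items:  # list each project directory only once
            items.append(item)
    return "-lstorage=" + "+".join(items)


# Hypothetical paths under the example project code ab12.
print(storage_directive("/scratch/ab12/tmp", "/g/data/ab12/input.nc"))
# prints: -lstorage=scratch/ab12+gdata/ab12
```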

Exercise caution if you use symbolic links in your workflows. The -lstorage directive tells PBS Pro which directories to mount for the execution of a job, and therefore must refer to an actual project directory on /scratch or /g/data. The best practice is to always use actual target directories in -lstorage directives. Symbolic links which cross file systems, for example, will fail at run time.
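One way to find the actual target directory behind a path is to resolve it before writing the -lstorage directive. The sketch below uses only the Python standard library; the function name storage_target is hypothetical, and it should be run on a node where the file systems in question are mounted.

```python
import os
from typing import Tuple


def storage_target(path: str) -> Tuple[str, bool]:
    """Return (resolved_path, is_indirect) for a path used by a job.

    is_indirect is True when the path passes through at least one
    symbolic link, i.e. the resolved path differs from the literal one.
    """
    absolute = os.path.abspath(path)
    resolved = os.path.realpath(path)
    return resolved, resolved != absolute
```

If is_indirect is True, base the -lstorage directive on the resolved path rather than on the path as written.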

Jobs which use less than a full node (Cascade Lake = 48 CPUs) will be charged according to the fractional utilisation of node resources, that is, by number of CPUs or amount of node memory requested, whichever is larger. Note that charging on Raijin was based on CPU-hours only, without consideration of the memory requested or used.

Projects which use memory-intensive, low-compute workflows may consume SUs more rapidly than expected on Gadi. 
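To see why memory-heavy jobs cost more, the following sketch estimates the number of CPUs a sub-node job is effectively charged for on a 48-CPU, 192 GB Cascade Lake node. It is an illustration of the "whichever is larger" rule only: the function name charged_cpus is hypothetical, and per-queue SU rates are not modelled.

```python
NODE_CPUS = 48      # CPUs per Cascade Lake node
NODE_MEM_GB = 192   # memory per Cascade Lake node


def charged_cpus(ncpus: int, mem_gb: float) -> float:
    """CPU-equivalents for a sub-node job: the larger of the CPU request
    and the memory request expressed in CPUs (192 GB / 48 CPUs = 4 GB each)."""
    mem_per_cpu = NODE_MEM_GB / NODE_CPUS
    return max(float(ncpus), mem_gb / mem_per_cpu)


print(charged_cpus(1, 48))   # 12.0: one CPU with 48 GB is charged as 12 CPUs
print(charged_cpus(8, 16))   # 8.0: the CPU request dominates
```

In the first case a single-CPU job requesting a quarter of the node's memory is charged as a quarter of the node, which is the behaviour that can surprise memory-intensive, low-compute workflows.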

Project job resource exemptions (for example, wall time extensions) established on Raijin will not be carried across to Gadi. Most user jobs on Gadi will require less wall time than on Raijin. Job resource exemptions on Gadi will need to be compellingly justified.

Broadwell and Skylake nodes are expected to be offline for three working days in November when they are migrated to Gadi. Users who rely on Broadwell or Skylake nodes should prepare for approximately three (3) days of downtime in late November. Unfortunately, a testing/pre-production period will not be available to Broadwell and Skylake workflows.

Raijin will continue to operate with Broadwell and Skylake compute nodes until . Broadwell and Skylake nodes will be offline for relocation and integration with Gadi during January 2020. NCI will advise users when they are available.

Job Charging - Examples

Gadi Cascade Lake node = 48 CPUs, 192 GB memory

...

All Python users are encouraged to move to Python 3 as soon as possible. Python 2.7.16 will be provided on Gadi; however, this will be the final version of Python 2 installed on the system. Development of Python 2 officially ceased on .

Containers will be available on Gadi, however, NCI staff will need to build the container image to ensure it satisfies security and compatibility criteria. Singularity is the preferred container type at this time. Users who require containers on Gadi should contact NCI user support at help@nci.org.au.

...