Gadi will be NCI's newest and fastest Tier-1 high-performance computing cluster. The upgrade has been made possible with $70 million in Australian Government funding under the National Collaborative Research Infrastructure Strategy (NCRIS).

Gadi is named after the words ‘to search for’ in the language of the Ngunnawal – the traditional owners of the Canberra region. It is pronounced 'gar dee'. 

This page covers important details and specifications for the installation of the system. Additional pages will cover software environments and allocations.

This page was last updated on the 16th of September 2019.

Gadi Specifications

Compute Hardware

The exact size of the system will be announced closer to the launch. The major components are as follows:

  • Intel Cascade Lake CPUs - CPU specifications for the Gadi processors, and a comparison to Raijin, are as follows:

| Specification | Cascade Lake on Gadi | Sandy Bridge on Raijin |
| Processor | Intel Xeon Platinum 8274 | Intel Xeon E5-2670 |
| Physical processors per node | Two | Two |
| Clock speed | 3.2GHz | 2.6GHz |
| Cores per node | 48 | 16 |
| Theoretical performance per node | 4915 Gflops | 332 Gflops |

  • NVIDIA V100 GPUs. As with the Raijin GPU nodes, the Gadi GPU nodes will be dual-CPU systems with 4 GPUs per node.
  • A new /scratch filesystem: a DDN Lustre filesystem built on NetApp enterprise-class storage arrays.
  • Mellanox Technologies' latest-generation HDR InfiniBand technology in a Dragonfly+ topology, capable of transferring data at 200 Gb/s.
  • A new /home filesystem.
  • Existing /g/data filesystems will be available on Gadi.
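
The theoretical per-node performance figures in the CPU comparison above can be reproduced from the cores per node, the clock speed, and the number of floating-point operations each core can complete per cycle. The sketch below is illustrative only: the flops-per-cycle values are assumptions based on the processors' vector units rather than figures stated on this page, and the function name is hypothetical.

```python
def peak_gflops(cores_per_node: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical double-precision peak for one node, in Gflops."""
    return cores_per_node * clock_ghz * flops_per_cycle

# Assumed double-precision flops per core per cycle (not stated on this page):
#   Cascade Lake: 2 AVX-512 FMA units x 8 doubles x 2 ops (multiply + add) = 32
#   Sandy Bridge: 1 AVX add + 1 AVX multiply per cycle, 4 doubles each = 8
gadi = peak_gflops(cores_per_node=48, clock_ghz=3.2, flops_per_cycle=32)
raijin = peak_gflops(cores_per_node=16, clock_ghz=2.6, flops_per_cycle=8)

print(f"Gadi node:   {gadi:.1f} Gflops")
print(f"Raijin node: {raijin:.1f} Gflops")
print(f"Ratio:       {gadi / raijin:.1f}x")
```

Under these assumptions the results are about 4915.2 and 332.8 Gflops per node, which agree with the 4915 and 332 Gflops quoted above once the fractional part is dropped.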

Other hardware:

  • Gadi will use innovative warm-water direct liquid cooling technology from both Fujitsu and Lenovo, allowing for high-density computing.
  • The underlying storage sub-systems will be provided by NetApp enterprise-class storage arrays, linked together in a DDN Lustre parallel file system that delivers the high throughput needed for computing on big data challenges.

More details of the system components will be made available here: https://nci.org.au/our-systems/hpc-systems.

Software

  • Gadi will use the latest version of the CentOS 8 operating system.
  • Altair’s PBSPro software will optimise job scheduling and workload management. Some of the queueing system configuration will change; the changes will be described elsewhere.
  • The compilers will be an update to the current Intel suite. Older versions of the compilers will not be available on Gadi due to licensing.
  • The default MPI will be OpenMPI. Not all existing versions will be available.
  • Since the operating system, network, and all underpinning libraries will be updated, all codes will need to be recompiled on Gadi.

Because the software environment is complex, more details about the software and applications for Gadi will be provided on a separate web page.
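
Because all codes will need to be recompiled, it can help to take an inventory of the compiled binaries in a project area before the transition. The following is a minimal sketch, not an NCI-provided tool: it walks a directory and reports files that begin with the four-byte ELF magic number used by Linux executables and shared libraries. The function name is hypothetical.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF executable or shared library

def find_elf_files(root):
    """Return sorted paths of regular files under root that start with the ELF magic."""
    found = []
    for path in Path(root).rglob("*"):
        # Skip directories and symlinks so each binary is reported once.
        if path.is_symlink() or not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                if handle.read(4) == ELF_MAGIC:
                    found.append(path)
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
    return sorted(found)
```

Calling `find_elf_files(".")` from the top of a source or install tree lists the files that were built for the old environment and are candidates for rebuilding.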

Gadi Installation and Raijin Decommission Timeline

Overview

Gadi will be installed as a phased implementation, with installation and formal acceptance testing of each phase before it can be available to users.

  • Gadi Compute Install:
    • Physical installation of Cascade Lake compute system nodes, new /home and /scratch
  • Gadi Phase 1 Acceptance Testing:
    • Testing of nodes to ensure they meet contracted deliverables
  • Gadi Phase 1 User Test:
    • Commencement of user testing on Gadi Phase 1
    • Parallel running of Raijin
  • Gadi Phase 1 Production:
    • Commencement of production services on Gadi Phase 1
    • Raijin Decommission
  • Gadi GPU Phase 2 Install:
    • Installation of NVIDIA V100 GPUs
    • Installation of large-memory Cascade Lake nodes
    • Installation of the remainder of Cascade Lake compute nodes
  • Gadi Full Production:
    • Full test of the system and access for users

Raijin will also be decommissioned in phases:

  • Raijin Decommission:
    • Decommission of Sandy Bridge nodes
    • Migration and reimage of Raijin’s Broadwell and Skylake nodes to Gadi
    • Migration and reimage of Raijin’s GPU nodes
    • /short mounted read-only on Gadi
  • Raijin /short Decommission:
    • Decommission /short and all remaining Raijin hardware not migrated.

During Gadi Phase 1 User Test, the systems will run side-by-side for a short period to transition projects. The /g/data filesystems will be available on both Gadi and Raijin and will continue into Gadi’s production system after Raijin is decommissioned.

More specific details on each stage of this process are included below.

Commencing in August and continuing during the installation, there will be significant data centre infrastructure works. While we have planned to minimise disruption to NCI services, including Raijin, some short outages are anticipated.

During the installation, there may be some unexpected issues. Any updates relevant to the general community will be posted on this page.

Schedule of Works

| Task | Status | Details | Estimated Commencement date |
| Procurement | COMPLETED | | |
| Gadi System Design | COMPLETED | | |
| Data Centre Floor Removal | COMPLETED | Removal of floor to allow building works for sub floor power and cooling infrastructure | |
| Gadi Storage Install | UNDERWAY | Physical installation of /scratch disk arrays and Lustre servers | August 2019 |
| Data Centre Power and Cooling Infrastructure | UNDERWAY | Installation of additional power and cooling | August/September 2019 |
| Data Centre Floor Install | COMPLETED | Install new floor | 9th September 2019 |
| Data Centre Cooling Infrastructure | COMPLETED | Join cooling loops (Full Raijin system downtime) | 12th September 2019 |
| Gadi Compute Install | UNDERWAY | Physical install of all of Gadi’s new Cascade Lake compute nodes | 16th September 2019 |
| Commission New Cooling Loops | SCHEDULED | Connect cooling loops and adjust water quality | 15th October 2019 |
| Gadi Phase 1 Acceptance Testing | SCHEDULED | System Acceptance testing commencing | 28th October 2019 |
| Gadi Phase 1 User Testing | SCHEDULED | Access for users to Gadi, Raijin running in parallel | 11th November 2019 |
| Raijin Jobs Submission Disabled | SCHEDULED | Jobs submission disabled on Raijin | 23rd November 2019 |
| Gadi Phase 1 Production | SCHEDULED | Gadi Phase 1 running; Raijin’s old /short mounted read-only on Gadi | 25th November 2019 |
| Raijin Decommission | SCHEDULED | Decommission of Sandy Bridge nodes; removal of Raijin’s Broadwell, Skylake and GPU nodes | 25th November 2019 |
| Gadi Broadwell and Skylake | SCHEDULED | Raijin's Broadwell, Skylake and GPU nodes installed on Gadi | 28th November 2019 |
| Gadi GPU Phase 2 Install | SCHEDULED | Physical install and system testing | 11th December 2019 |
| Data Centre Power/Cooling works | SCHEDULED | Final power reticulation - Scheduled Downtime | 1-5 January 2020 |
| Gadi Full Production | SCHEDULED | User access to the full system | 6 January 2020 |
| Raijin /short Decommission | SCHEDULED | /short decommissioned | 20 January 2020 |
| Data Centre Cooling Tower Replacement | SCHEDULED | New free cooling loop installed and tested | 7-20 January 2020 |
| Data Centre Free-Cooling Loop Commission | SCHEDULED | Disk and Tape free cooling loop commissioned and goes into production | 22 January 2020 |

Note that these dates are subject to change.
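
To make the transition windows concrete, the short sketch below works out the number of days between key milestones using the estimated dates from the schedule above. The dates are copied from the schedule and remain subject to change; the variable names are illustrative only.

```python
from datetime import date

# Estimated dates copied from the schedule above; all are subject to change.
user_testing_starts = date(2019, 11, 11)   # Gadi Phase 1 User Testing
raijin_jobs_disabled = date(2019, 11, 23)  # Raijin Jobs Submission Disabled
raijin_decommission = date(2019, 11, 25)   # Raijin Decommission / Gadi Phase 1 Production
short_decommission = date(2020, 1, 20)     # Raijin /short Decommission

# Days during which jobs can be submitted on both systems.
parallel_days = (raijin_jobs_disabled - user_testing_starts).days
# Days for running jobs to drain after submission is disabled on Raijin.
drain_days = (raijin_decommission - raijin_jobs_disabled).days
# Days during which /short is mounted read-only on Gadi before it is removed.
short_readonly_days = (short_decommission - raijin_decommission).days

print(f"Parallel running with job submission on Raijin: {parallel_days} days")
print(f"Drain period on Raijin: {drain_days} days")
print(f"/short available read-only on Gadi: {short_readonly_days} days")
```

With the dates as currently scheduled, this gives 12 days of parallel running, a 2-day drain period, and 56 days of read-only access to /short.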

Planning Your Transition

Gadi will be commissioned as a stand-alone system before any transition begins. The critical phases are Gadi Phase 1 User Testing and Raijin Decommission.

Preparatory work

  • Users should clean up their data files on Raijin, as Gadi has new storage hardware. In particular, raijin:/short needs to be cleaned up, and there is no system-wide process for moving users' /short files. Please get in touch with the NCI Help Desk if you need special assistance.
  • We also encourage users to clean up their /home directories.
  • This is a good time to retest your compilation processes: existing binaries are not likely to work on Gadi without recompilation, due to the major updates to the underlying software environment to support the new hardware.
  • Because gadi:/scratch replaces raijin:/short in naming, this is the time to review the data locations in your configurations.
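
One way to review data locations is to search job scripts and configuration files for paths that start with /short. The sketch below is a hedged example rather than an NCI tool: the function name and the default list of file suffixes are assumptions, and the pattern matches /short only as a top-level path, so names such as /shortcut or /g/data/short are ignored.

```python
import re
from pathlib import Path

# Match "/short" only as a top-level path component:
# not preceded by another path character, not followed by more name characters.
SHORT_PATH = re.compile(r"(?<![\w/.-])/short(?![\w.-])")

def find_short_references(root, suffixes=(".sh", ".pbs", ".py", ".cfg")):
    """Yield (path, line_number, line) for each line under root that mentions /short.

    The suffix list is an illustrative assumption; adjust it to your own file types.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file: skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if SHORT_PATH.search(line):
                yield path, number, line.strip()
```

Iterating over `find_short_references(".")` in a directory of job scripts gives a list of lines to review and, where appropriate, update to the corresponding /scratch location.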

Gadi Phase 1 User Test

On completion of Gadi Phase 1 Acceptance Testing, users will gain early pre-production access to Gadi to assist in testing. A copy of raijin:/home will be taken just prior to users getting access to the Gadi system for early testing purposes.

There may be some outages on Gadi that affect jobs with minimal or no notice, particularly as NCI prepares the system environment.

Gadi Phase 1 Production and Raijin Decommission

Due to power limitations within the data hall, Raijin needs to be decommissioned and turned off before power can be reticulated to the remainder of the Gadi system. 

Jobs submission will be disabled on Raijin two days before the nodes are decommissioned. This will allow any remaining jobs to complete and give users time to migrate the last of their data from raijin:/short to gadi:/scratch. Any jobs remaining in the Raijin queues will be deleted.

A final backup of raijin:/home will take place before Raijin is decommissioned.

Existing Broadwell, Skylake and GPU nodes will be disconnected from Raijin, connected to the new interconnect and then re-imaged on Gadi. This will require a downtime of approximately 2 days for these nodes while they are reconnected.

The new production NCI environment will be available at this time, including NCI’s software environment (/apps) and project accounting.

The raijin:/short filesystem will continue to be available in read-only mode on the login and data mover nodes to allow users to migrate any data they need from raijin:/short to the new gadi:/scratch.
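
As an illustration of migrating data from a read-only source, the sketch below copies a directory tree and compares SHA-256 checksums of each source and destination file. It is a minimal example under stated assumptions, not a recommended NCI procedure: the function names are hypothetical, the actual mount points on Gadi are not given on this page, and for large datasets a dedicated transfer tool may be more appropriate. The code only reads from the source and never modifies or deletes it.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(source_root, target_root):
    """Copy regular files from source_root to target_root.

    Returns the list of source files whose copies failed checksum verification.
    Symlinks and empty directories are not copied in this sketch.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    mismatched = []
    for source in sorted(source_root.rglob("*")):
        if source.is_symlink() or not source.is_file():
            continue
        target = target_root / source.relative_to(source_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps and permissions
        if sha256_of(source) != sha256_of(target):
            mismatched.append(source)
    return mismatched
```

A call would look like `copy_and_verify("<path under /short>", "<path under /scratch>")`, where both arguments are placeholders for your own project directories. An empty return value means every copied file matched its source.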

Gadi GPU Phase 2 Install

With the power to Raijin turned off, the remainder of Gadi equipment will be installed and turned on progressively as power becomes available. NCI will then perform the required system acceptance testing on these new nodes before they are made available to users.

Gadi Full Production

The full complement of compute and GPU nodes will be operational and available to all users.

Raijin /short Decommission

The /short file system will be unmounted from Gadi and any files left on Raijin /short will be deleted. There will be no backup.

Other Questions?

We are preparing separate pages about the software environment and allocation systems.

If you still have questions or concerns about the transition of production services from Raijin to Gadi, or special circumstances that we need to be aware of, please contact the NCI Help Desk.
