Data Centre
Systems Impacted: Compute nodes in normal, express, normalsr, expresssr, and gpuvolta queues
Dear NCI Users,
A cooling fault in the NCI Data Centre has led to the nodes backing the normal, express, normalsr, expresssr, and gpuvolta queues being powered off to protect the hardware. Cooling has now been restored and NCI staff are bringing these nodes back online.
Any jobs that were running on these nodes will have been lost.
Regards,
NCI User Services
/g/data
Systems Impacted: gdata1a
Filesystem Fault
Dear NCI Users,
NCI system admins have reported problems with the gdata1a filesystem. If your session to Gadi is hanging, your projects may be on that filesystem. System admins are working to find and fix the issue. This page will be updated as soon as there is more information.
Update: The filesystem has been operating normally since 1:19pm.
Regards,
NCI User Services
/g/data
Systems Impacted: gdata6
Filesystem Fault
Dear NCI Users,
NCI system admins have reported Lustre problems with the gdata6 filesystem. If your session to Gadi is hanging, your projects may be on that filesystem. System admins are working to find and fix the issue. This page will be updated as soon as there is more information.
Update: The filesystem has been operating normally since 3:01pm.
Regards,
NCI User Services
/g/data
Systems Impacted: gdata1a
Filesystem Fault
Dear NCI Users,
NCI system admins have reported a problem with one of the storage servers in gdata1a. If your session to Gadi is hanging, your projects may be on that filesystem. The problem is currently being investigated. This page will be updated as soon as there is more information.
Update: The filesystem has resumed normal operations and has been stable since 12:30pm.
Regards,
NCI User Services
Core cloud infrastructure
Systems Impacted: ARE, Nirin VM, accessdev and other services relying on cloud infrastructure
Hardware Faults
Dear NCI Users,
Hardware issues on the core cloud infrastructure have made the impacted systems unresponsive.
We have identified a fix for this issue and we are implementing it now.
If you require further assistance, please contact NCI User Services via the Helpdesk at https://help.nci.org.au or help@nci.org.au.
Update 4 Jan 12:48pm: All the impacted compute nodes are back up. Users should verify that their services are functioning properly. Any users with instances that went down can restart them now.
Regards,
NCI User Services