In general, NCI Nirin Cloud instances continue to run until a dashboard user shuts them down. However, instances may be administratively shut down in some circumstances, for example during scheduled or emergency maintenance, or because of an unexpected outage.
It is NCI policy that instances shut down administratively must be restarted by the instance owner. Unfortunately, this can cause some systems running on the Nirin Cloud to experience inconveniently long service outages. To address this, NCI has developed a tool that lets Nirin Cloud projects request that their instances be restarted automatically after an outage.
This document assumes that the user is familiar with the basics of using the NCI Nirin Cloud, as documented in the Nirin - Quick Start Guide. The user is also assumed to be familiar with using the OpenStack client tools for direct Nirin API Access.
If you have an instance with unique identifier $uuid, and you would like that instance to be restarted automatically following an outage, you can flag this by adding metadata to the instance as follows:
$ openstack server set --property nci_restart=true $uuid
This will cause our system to identify and restart the instance automatically after an outage.
This can be repeated for any number of instances.
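As a minimal sketch of flagging several instances at once, the command above can be wrapped in a loop. The function name flag_for_restart is hypothetical and not part of any NCI or OpenStack tooling; it simply runs the documented command once per UUID.

```shell
# Hypothetical helper (not an NCI tool): flag every UUID given as an
# argument for automated restart by setting the nci_restart key on it.
flag_for_restart() {
    for uuid in "$@"; do
        openstack server set --property nci_restart=true "$uuid"
    done
}

# Example usage, with your own instance UUIDs in place of the variables:
# flag_for_restart "$uuid_a" "$uuid_b"
```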
There is another metadata key, nci_restart_after, which can be used to configure more complex ordering dependencies between instances. This is explained, along with additional detail, below.
The automated restart tool makes use of metadata that the user sets on their project's instances. Instance metadata is simply a set of arbitrary key=value pairs that are associated with the instance, and which can be set by the user. The automated restart tool reads a set of NCI-specified metadata keys and restarts instances based on the metadata values.
Two keys are recognised by the automated restart tool:
nci_restart
nci_restart_after
Setting the nci_restart key on an instance will cause the instance to be restarted automatically. Note that the value is ignored: setting nci_restart=false will still result in the instance being restarted. Please see the section on setting and clearing metadata below for details on how to remove the key from an instance, which will disable automated restart for that instance.
The nci_restart_after key can be used to specify a restart dependency graph for a set of instances: the metadata value for an instance must be set to a comma-separated list of instance UUIDs, all owned by the same project, which must be running before the instance can be restarted. The automated restart tool will use this information to calculate a topological sort of the dependency graph, and restart the instances in order.
For nci_restart_after to function, three requirements must be met: firstly, every instance in the dependency graph must have either the nci_restart or the nci_restart_after key set; secondly, the dependency graph must not have any cycles; and finally, all the instances in a given dependency graph must be from the same project. If any of these requirements is not met, none of the project's instances will be restarted.
An instance can have both keys set; the behavior will be exactly the same as if only nci_restart_after was set for that instance.
When no dependencies exist, all of a project's instances with the restart metadata will be started in parallel. Where dependencies exist, instances are restarted in an order determined by a topological sort of the dependency graph. If any instance in the sequence fails to start, or takes too long to start, no more instances from that project will be restarted.
An instance that takes longer than a pre-determined period (currently 60 seconds) to reach an active state is treated as a failed restart.
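To get a feel for the ordering a topological sort produces, the standard tsort utility can be run locally on a list of dependencies. This is only an illustrative sketch: the function name restart_order, the input format, and the instance names db, app and web (standing in for UUIDs) are all assumptions, and the result is not necessarily the exact order NCI's tool would choose.

```shell
# Hypothetical helper: read "instance dependency" pairs on stdin and print
# one valid start order. An instance with no dependencies is written as a
# pair with itself ("db db"), which tsort treats as a standalone node.
restart_order() {
    # tsort expects "earlier later" pairs, so swap the two columns first.
    awk '{ print $2, $1 }' | tsort
}

# Example: app must start after db, and web must start after app.
printf 'db db\napp db\nweb app\n' | restart_order
```

For this chain the only valid order is db, app, web. Where several instances have no ordering between them, tsort may list them in any order. GNU tsort also reports a loop on standard error if the input contains a cycle, which can help catch a configuration that the restart tool would reject.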
The NCI Cloud Team will attempt to monitor for issues while restarting a project's instances, and where possible will attempt to contact the project. However, this will be a best effort attempt, and no guarantees are made.
In the current version there is no tool available to users to verify that their configuration is correct. Please take care while defining dependencies and setting metadata keys on your project, and if you have questions about your configuration feel free to contact the NCI help desk at help@nci.org.au for assistance.
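As a rough local sanity check for the first requirement (every instance named in a dependency list must itself be flagged), something like the following could be run against a hand-written list of your configuration. This is a sketch under stated assumptions, not an NCI tool: the function name missing_restart_flags and the input format are hypothetical, and it does not query OpenStack or check for cycles.

```shell
# Hypothetical helper: stdin has one line per flagged instance, in the form
#   <instance> [<dep1>,<dep2>,...]
# Prints every dependency that is not itself listed as a flagged instance.
missing_restart_flags() {
    awk '
        { flagged[$1] = 1 }                      # instance has a restart key
        NF > 1 {
            n = split($2, deps, ",")             # its nci_restart_after list
            for (i = 1; i <= n; i++) wanted[deps[i]] = 1
        }
        END { for (d in wanted) if (!(d in flagged)) print d }
    ' | sort
}

# Example: test3 is named as a dependency but has no line of its own.
printf 'test1\ntest2 test1\ntest4 test2,test3\n' | missing_restart_flags
```

An empty result means every referenced instance also appears in the list.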
The OpenStack command line tools are the recommended way to set instance metadata. Please see the Nirin API Access page for information about installing and running the OpenStack command line tools. This document assumes you are using the Unified OpenStack Client; other OpenStack tools may be used to manipulate instance metadata, but the details of their use are outside the scope of this document.
To set a metadata key on an instance use the following command:
$ openstack server set --property key=value instance_uuid
And to unset a metadata key:
$ openstack server unset --property key instance_uuid
To set the nci_restart metadata for an instance:
$ openstack server set --property nci_restart=true instance_uuid
And to set the nci_restart_after metadata for an instance:
$ openstack server set --property nci_restart_after=uuid1,uuid2 instance_uuid
Remember that instances uuid1 and uuid2 must also have either nci_restart or nci_restart_after set for the automated restart to work.
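When an instance has several dependencies, building the comma-separated value by hand is error-prone. The sketch below joins the UUIDs for you; the function name set_restart_after is hypothetical, and it only wraps the documented openstack server set command.

```shell
# Hypothetical helper: the first argument is the instance to configure,
# the remaining arguments are the instances it must be restarted after.
set_restart_after() {
    local instance=$1
    shift
    local deps
    # Join the remaining arguments with commas to form the metadata value.
    deps=$(IFS=,; printf '%s' "$*")
    openstack server set --property "nci_restart_after=$deps" "$instance"
}

# Example usage, equivalent to the command shown above:
# set_restart_after instance_uuid uuid1 uuid2
```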
Finally, to disable restart on an instance:
$ openstack server unset --property nci_restart --property nci_restart_after instance_uuid
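After setting or clearing keys, it can be useful to confirm what is actually stored on each instance. A minimal sketch, assuming the Unified OpenStack Client, is to display only the properties field of each server; the function name show_restart_metadata is hypothetical, and the exact output layout depends on your client version.

```shell
# Hypothetical helper: print the metadata (properties) currently set on
# each instance UUID given as an argument.
show_restart_metadata() {
    for uuid in "$@"; do
        openstack server show --column properties "$uuid"
    done
}

# Example usage:
# show_restart_metadata instance_uuid
```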
A project has the following instances:
$ openstack server list --column ID --column Name
+--------------------------------------+-----------+
| ID                                   | Name      |
+--------------------------------------+-----------+
| f10b24e4-da9a-4280-9cdf-cf7fe03af3df | test4     |
| c32dbc05-55b6-491e-9e2b-585d76289483 | test3     |
| 1933f3bd-3b69-408d-a0ef-a657890df955 | test2     |
| 402e5c04-ea09-43c7-8bb4-f34e26b0c637 | test1     |
+--------------------------------------+-----------+
Instances test1 and test2 need to be restarted, but test3 and test4 do not:
# set the nci_restart key on each instance
# test1
$ openstack server set --property nci_restart=true 402e5c04-ea09-43c7-8bb4-f34e26b0c637
# test2
$ openstack server set --property nci_restart=true 1933f3bd-3b69-408d-a0ef-a657890df955
In this configuration test1 and test2 will be restarted in parallel, and test3 and test4 will not be restarted.
Instance test4 requires services running on test3 and test2, and both test3 and test2 depend on a service running on test1. This creates a diamond-shaped dependency graph: test1 must be running before test2 and test3, and both of those must be running before test4.
To capture these dependencies in the automated restart system, the following commands can be used:
# test1 must be restarted, but doesn't depend on anything else
$ openstack server set --property nci_restart=true 402e5c04-ea09-43c7-8bb4-f34e26b0c637
# test2 needs to be restarted after test1
$ openstack server set --property nci_restart_after=402e5c04-ea09-43c7-8bb4-f34e26b0c637 1933f3bd-3b69-408d-a0ef-a657890df955
# test3 also needs to be restarted after test1
$ openstack server set --property nci_restart_after=402e5c04-ea09-43c7-8bb4-f34e26b0c637 c32dbc05-55b6-491e-9e2b-585d76289483
# finally, test4 needs to be restarted after test2 and test3
$ openstack server set --property nci_restart_after=1933f3bd-3b69-408d-a0ef-a657890df955,c32dbc05-55b6-491e-9e2b-585d76289483 f10b24e4-da9a-4280-9cdf-cf7fe03af3df
In this configuration test1 will be restarted, test2 and test3 will be restarted in parallel once test1 is active, and test4 will be restarted after test2 and test3 are both active.
In this case all four instances need to be restarted, but one is independent of the others: test4 depends on test2, which depends on test1, but test3 does not depend on any others:
# test1 must be restarted, but doesn't depend on anything else
$ openstack server set --property nci_restart=true 402e5c04-ea09-43c7-8bb4-f34e26b0c637
# test2 needs to be restarted after test1
$ openstack server set --property nci_restart_after=402e5c04-ea09-43c7-8bb4-f34e26b0c637 1933f3bd-3b69-408d-a0ef-a657890df955
# test4 needs to be restarted after test2
$ openstack server set --property nci_restart_after=1933f3bd-3b69-408d-a0ef-a657890df955 f10b24e4-da9a-4280-9cdf-cf7fe03af3df
# test3 also needs to be restarted
$ openstack server set --property nci_restart=true c32dbc05-55b6-491e-9e2b-585d76289483
In this configuration test1 and test3 will be restarted in parallel, test2 will be restarted once test1 is active, and test4 will be restarted after test2 is active.
If you have further questions about NCI's automated restart system, please contact the NCI help desk at help@nci.org.au.