Overview

In general, NCI Nirin Cloud instances continue to run until a dashboard user decides to shut them down. However, there are circumstances in which instances may be administratively shut down, for example during scheduled maintenance, during emergency maintenance, or due to an unexpected outage.

It is NCI policy that instances shut down administratively must be restarted by the instance owner. Unfortunately, this can cause some systems running on Nirin Cloud to experience inconveniently long service outages. To address this, NCI has developed a tool that lets Nirin Cloud projects request that their instances be restarted automatically after an outage.

Prerequisites

This document assumes that the user is familiar with the basics of using the NCI Nirin Cloud, as documented in the Nirin - Quick Start Guide. The user is also assumed to be familiar with using the OpenStack client tools for direct Nirin API Access.

Quick start

If you have an instance with unique identifier $uuid, and you would like that instance to be restarted automatically following an outage, you can flag this by adding metadata to the instance as follows:

$ openstack server set --property nci_restart=true $uuid

which will cause our system to identify and restart the instance automatically after an outage.

This can be repeated for any number of instances.
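When several instances need the flag, the command can be wrapped in a small loop. The following is a minimal sketch, assuming the OpenStack client is installed and your credentials are already loaded; flag_for_restart is a hypothetical helper name, not part of any NCI tooling.

```shell
# Flag every UUID given as an argument for automated restart.
# Stops at the first failure so that a bad UUID is not silently skipped.
flag_for_restart() {
    for uuid in "$@"; do
        openstack server set --property nci_restart=true "$uuid" || return 1
    done
}

# Example call, using two UUIDs of your own:
# flag_for_restart "$uuid_a" "$uuid_b"
```

Stopping at the first error is a design choice for this sketch; a loop that continues and reports all failures at the end would work equally well.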

A second metadata key, nci_restart_after, can be used to configure more complex ordering dependencies between instances. It is explained below, along with additional detail.

Details

The automated restart tool makes use of metadata that the user sets on their project's instances. Instance metadata is simply a set of arbitrary key=value pairs that are associated with the instance, and which can be set by the user. The automated restart tool reads a set of NCI specified metadata keys and restarts instances based on the metadata values.

Metadata Keys

Two keys are recognised by the automated restart tool:

  • nci_restart
  • nci_restart_after

nci_restart

Setting the nci_restart key on an instance will cause the instance to be restarted automatically. Note that the value is ignored: setting nci_restart=false will still result in the instance being restarted. To disable automated restart for an instance, remove the key as described under Setting Instance Metadata below.

nci_restart_after

The nci_restart_after key can be used to specify a restart dependency graph for a set of instances. The metadata value for an instance must be a comma-separated list of instance UUIDs, all owned by the same project, that must be running before the instance can be restarted. The automated restart tool uses this information to compute a topological sort of the dependency graph and restarts the instances in that order.

Notes on behavior

For nci_restart_after to function, three requirements must be met: first, every instance in the dependency graph must have nci_restart, nci_restart_after, or both set; second, the dependency graph must not contain any cycles; and finally, all the instances in a given dependency graph must belong to the same project. If any of these requirements is not met, none of the project's instances will be restarted.

An instance can have both keys set; the behavior will be exactly the same as if only nci_restart_after was set for that instance.

When no dependencies exist, all of a project's instances with the restart metadata will be started in parallel. Where dependencies exist, instances are restarted in an order determined by a topological sort of the dependency graph. If any instance in the sequence fails to start, or takes too long to start, no more instances from that project will be restarted.

An instance that takes longer than a predetermined period (currently 60 seconds) to reach an active state is treated as a failed restart, and no more instances from that project will be restarted.

The NCI Cloud Team will attempt to monitor for issues while restarting a project's instances, and where possible will attempt to contact the project. However, this is done on a best-effort basis, and no guarantees are made.

Important Note

In the current version there is no tool available to users to verify that their configuration is correct. Please take care while defining dependencies and setting metadata keys on your project, and if you have questions about your configuration feel free to contact the NCI help desk at help@nci.org.au for assistance.
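One of the requirements, the absence of cycles, can be checked locally before setting any metadata. The sketch below uses the standard tsort utility on a hand-written list of dependencies. It is only an illustration: has_cycle is a hypothetical helper, the names a, b and c are placeholders for instance UUIDs, and the check does not query Nirin or verify the other two requirements.

```shell
# Reads "prerequisite dependent" pairs on stdin, one pair per line.
# Returns 0 (true) if tsort reports a problem such as a loop, otherwise 1.
# Assumption: tsort writes its diagnostics to stderr, as GNU and BSD tsort do.
has_cycle() {
    # Keep only stderr; the sorted output itself is discarded.
    err=$(tsort 2>&1 >/dev/null) || true
    [ -n "$err" ]
}

# A valid graph: a must be running before b, and b before c.
if printf 'a b\nb c\n' | has_cycle; then echo "problem found"; else echo "no cycle"; fi

# An invalid graph: a and b each wait for the other.
if printf 'a b\nb a\n' | has_cycle; then echo "problem found"; else echo "no cycle"; fi
```

The helper inspects stderr rather than the exit status because exit codes for cyclic input are not consistent across tsort implementations.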

Setting Instance Metadata

The OpenStack command line tools are the recommended way to set instance metadata. Please see the Nirin API Access page for information about installing and running them. This document assumes you are using the Unified OpenStack Client; other OpenStack tools may be used to manipulate instance metadata, but the details of their use are outside the scope of this document.

To set a metadata key on an instance use the following command:

$ openstack server set --property key=value instance_uuid

And to unset a metadata key:

$ openstack server unset --property key instance_uuid

To set the nci_restart metadata for an instance:

$ openstack server set --property nci_restart=true instance_uuid

And to set the nci_restart_after metadata for an instance:

$ openstack server set --property nci_restart_after=uuid1,uuid2 instance_uuid

Remember that instances uuid1 and uuid2 must also have either nci_restart or nci_restart_after set for the automated restart to work.

Finally, to disable restart on an instance:

$ openstack server unset --property nci_restart --property nci_restart_after instance_uuid
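To confirm which restart keys are currently set on an instance, its properties can be read back with the client. The following is a minimal sketch, assuming the metadata appears in the properties column of openstack server show; the exact output format differs between client versions, so the helper only searches for the key name. restart_enabled is a hypothetical helper name.

```shell
# Succeeds if the instance's properties mention nci_restart or
# nci_restart_after (the pattern matches both key names).
restart_enabled() {
    openstack server show -f value -c properties "$1" | grep -q 'nci_restart'
}

# Example call:
# restart_enabled "$uuid" && echo "automated restart is requested"
```

Because the value of nci_restart is ignored by the restart tool, the sketch deliberately checks only for the presence of the key.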

Worked Examples

A project has the following instances:

$ openstack server list --column ID --column Name
+--------------------------------------+-----------+
| ID                                   | Name      |
+--------------------------------------+-----------+
| f10b24e4-da9a-4280-9cdf-cf7fe03af3df | test4     |
| c32dbc05-55b6-491e-9e2b-585d76289483 | test3     |
| 1933f3bd-3b69-408d-a0ef-a657890df955 | test2     |
| 402e5c04-ea09-43c7-8bb4-f34e26b0c637 | test1     |
+--------------------------------------+-----------+

Instance restart with no dependencies

Instances test1 and test2 need to be restarted, but test3 and test4 do not:

# set the nci_restart key on each instance
# test1
$ openstack server set --property nci_restart=true 402e5c04-ea09-43c7-8bb4-f34e26b0c637
# test2
$ openstack server set --property nci_restart=true 1933f3bd-3b69-408d-a0ef-a657890df955

In this configuration test1 and test2 will be restarted in parallel, and test3 and test4 will not be restarted.

Instance restart with dependencies

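Instance test4 requires services running on test3 and test2, and both test3 and test2 depend on a service running on test1. This creates the following dependency graph, where each arrow points from an instance to one that must be restarted after it:

    test1 -> test2 -> test4
    test1 -> test3 -> test4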

To capture these dependencies in the automated restart system, the following commands can be used:

# test1 must be restarted, but doesn't depend on anything else
$ openstack server set --property nci_restart=true 402e5c04-ea09-43c7-8bb4-f34e26b0c637
# test2 needs to be restarted after test1
$ openstack server set --property nci_restart_after=402e5c04-ea09-43c7-8bb4-f34e26b0c637 1933f3bd-3b69-408d-a0ef-a657890df955
# test3 also needs to be restarted after test1
$ openstack server set --property nci_restart_after=402e5c04-ea09-43c7-8bb4-f34e26b0c637 c32dbc05-55b6-491e-9e2b-585d76289483
# finally, test4 needs to be restarted after test2 and test3
$ openstack server set --property nci_restart_after=1933f3bd-3b69-408d-a0ef-a657890df955,c32dbc05-55b6-491e-9e2b-585d76289483 f10b24e4-da9a-4280-9cdf-cf7fe03af3df

In this configuration test1 will be restarted, test2 and test3 will be restarted in parallel once test1 is active, and test4 will be restarted after test2 and test3 are both active.
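The ordering for this example can be reproduced locally with the standard tsort utility. This is an illustration of topological sorting only, not the NCI restart tool: restart_order is a hypothetical helper, and instance names are used in place of UUIDs for readability.

```shell
# Each input line is "prerequisite dependent": the first instance must be
# running before the second one is restarted.
restart_order() {
    tsort <<'EOF'
test1 test2
test1 test3
test2 test4
test3 test4
EOF
}

# Prints one valid linear order, one instance per line.
restart_order
```

Any valid order lists test1 first and test4 last. The relative position of test2 and test3 depends on the tsort implementation, which reflects the fact that they do not depend on each other and can be restarted in parallel.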

Instance restart with a mix of dependent and independent instances

In this case all four instances need to be restarted, but one is independent of the others: test4 depends on test2, which depends on test1, while test3 does not depend on any other instance:

# test1 must be restarted, but doesn't depend on anything else
$ openstack server set --property nci_restart=true 402e5c04-ea09-43c7-8bb4-f34e26b0c637
# test2 needs to be restarted after test1
$ openstack server set --property nci_restart_after=402e5c04-ea09-43c7-8bb4-f34e26b0c637 1933f3bd-3b69-408d-a0ef-a657890df955
# test4 needs to be restarted after test2
$ openstack server set --property nci_restart_after=1933f3bd-3b69-408d-a0ef-a657890df955 f10b24e4-da9a-4280-9cdf-cf7fe03af3df
# test3 also needs to be restarted
$ openstack server set --property nci_restart=true c32dbc05-55b6-491e-9e2b-585d76289483

In this configuration test1 and test3 will be restarted in parallel, test2 will be restarted once test1 is active, and test4 will be restarted after test2 is active.

Further Assistance

If you have further questions about NCI's automated restart system, please contact the NCI help desk at help@nci.org.au.