
You can set up a Ray cluster to utilise the resources of a single node or multiple nodes.

Single Node

If you are working with single-node resources via an OOD, ARE or Gadi PBS job, you can simply call ray.init() in your Jupyter notebook or Python script to start a Ray runtime on the current host.

You can view the resources available to Ray via the function ray.cluster_resources().

Single node Ray
import ray
ray.init()
print(ray.cluster_resources())

The above commands will print a message like the one below, which indicates there are 16 CPU cores, about 29 GB of memory and 1 node available to the current local Ray cluster.

{'memory': 28808935835.0, 'object_store_memory': 14404467916.0, 'CPU': 16.0, 'node:10.0.128.152': 1.0}
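
As a quick check that the local Ray runtime can schedule work on those cores, you can submit a few remote tasks. The sketch below is illustrative only: square is a placeholder function defined here, not part of the Ray API.

import ray

ray.init()  # start a local Ray runtime on the current host

# Placeholder task used only to demonstrate parallel execution.
@ray.remote
def square(x):
    return x * x

# Launch 16 tasks (one per core in the example above) and collect the results.
futures = [square.remote(i) for i in range(16)]
print(ray.get(futures))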

Multiple Nodes

If you need a larger Ray cluster spanning multiple nodes, you can start a pre-defined Ray cluster and then connect to it from your Jupyter notebook or Python script.

An easy way to set up the pre-defined Ray cluster is to utilise the dk92 module "gadi_jupyterlab".

Launching a pre-defined Ray cluster

Gadi

In your PBS job script, load NCI-data-analysis/2022.06 together with the gadi_jupyterlab module, then run the script "jupyter.ini.sh -R" to set up the pre-defined Ray cluster. By default it starts one Ray worker on each CPU core of all compute nodes available to the job. You can also specify the number of Ray workers per node via the "-p" flag. For example, in a job requesting 96 cores (2 nodes) of the "normal" queue, you can set up a pre-defined Ray cluster with 12 Ray workers per node, and 24 Ray workers in total, via the following command

jupyter.ini.sh -R -p 12

An example of a full job script requesting 96 Ray workers is given below

#!/bin/bash
#PBS -P fp0
#PBS -q normal
#PBS -lwd,walltime=10:00:00,ncpus=96,mem=192GB,jobfs=400GB,storage=gdata/dk92+gdata/z00+scratch/fp0+gdata/fp0

module purge
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-data-analysis/2022.06 gadi_jupyterlab/22.06

jupyter.ini.sh -R # set up a Ray cluster with 48 Ray workers per node, 96 Ray workers in total, 1 thread per Ray worker.

# jupyter.ini.sh -R -p 12 # or set up a Ray cluster with 12 Ray workers per node, 24 Ray workers in total, 1 thread per Ray worker.

python script.py

In "script.py", you need to connect to the pre-defined Ray cluster by calling ray.init() and specify the address flag as "auto".

Connect to a pre-defined Ray cluster
import ray
ray.init(address="auto")
print(ray.cluster_resources())

The above script will print the following message

{'object_store_memory': 114296048025.0, 'CPU': 96.0, 'memory': 256690778727.0, 'node:10.6.48.66': 1.0, 'node:10.6.48.67': 1.0}
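
To confirm that work is actually spread across both nodes, you can run a quick check like the sketch below. It is illustrative only: get_host is a hypothetical helper defined here, not part of the Ray API.

import socket
from collections import Counter
import ray

ray.init(address="auto")  # attach to the pre-defined Ray cluster

# Hypothetical helper task that reports the IP address of the node it runs on.
@ray.remote
def get_host():
    return socket.gethostbyname(socket.gethostname())

# Run many short tasks and count how many landed on each node;
# both node IPs from the output above should appear.
hosts = ray.get([get_host.remote() for _ in range(200)])
print(Counter(hosts))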

ARE

First of all, you need to request multiple nodes and start a JupyterLab session as below.

Step 1:

Select resources spanning multiple nodes.

Step 2:

Load the NCI-data-analysis/2022.06 and gadi_jupyterlab/22.06 modules in the "Advanced options" area.

Put "jupyterlab.ini.sh -R" in "Pre-script" field.

Step 3:

Start the JupyterLab session by clicking the "Open JupyterLab" button.


In the Jupyter notebook, use the following lines to connect to the pre-defined Ray cluster and print the resource information.

import ray
ray.init(address="auto")
print(ray.cluster_resources())

You will see 96 CPU cores and two nodes are used by the cluster, as expected.
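
If you want to inspect the individual nodes rather than the aggregated resources, ray.nodes() lists each node known to the cluster. A minimal sketch, assuming ray.init(address="auto") has already been run in the notebook:

import ray

# List the nodes currently registered with the cluster and their CPU counts.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("CPU"))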

Monitoring Ray status

You can easily monitor the Ray status via the command "ray status". Open a CLI terminal in either the JupyterLab session or a Gadi PBS interactive job and type the following command

 $ watch ray status

The Ray status will keep updating every 2 seconds

Every 2.0s: ray status gadi-cpu-60021448.gadi.nci.org.au: Thu Jul 7 11:26:59 2022
======== Autoscaler status: 2022-07-07 11:27:26.480900 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_c166858d7a953a2050cd004f54239c0014cc9fcf34b640d7cac21de
1 node_e6d5c4a8797357ebc0a5d0b76a6e264df133f0cba2dc236f696d7c8
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
---------------------------------------------------------------

96.0/96.0 CPU
0.00/239.062 GiB memory
23.70/106.446 GiB object_store_memory

Demands:
{'CPU': 1.0}: 225+ pending tasks/actors
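
If you prefer to check utilisation from Python rather than from the CLI, you can poll ray.cluster_resources() and ray.available_resources(). A minimal sketch, assuming the script is attached to the running cluster:

import time
import ray

ray.init(address="auto")  # attach to the pre-defined Ray cluster

# Print total vs. busy CPUs a few times, similar to `watch ray status`.
for _ in range(5):
    total = ray.cluster_resources().get("CPU", 0.0)
    free = ray.available_resources().get("CPU", 0.0)
    print(f"CPU in use: {total - free:.0f}/{total:.0f}")
    time.sleep(5)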
