...

If you requested single-node resources via an OOD, ARE, or Gadi PBS job, you can simply call ray.init() within your Jupyter notebook or Python script to start a Ray runtime on the current host.

...

The easiest way to set up the pre-defined Ray cluster is to utilise the dk92 module "gadi_jupyterlab".

Launching a predefined Ray cluster

Gadi

In your PBS job script, load the gadi_jupyterlab module together with NCI-data-analysis/22.06. Then run the script "jupyter.ini.sh -R" to set up the predefined Ray cluster. By default it starts one Ray worker on each CPU core. You can also specify the number of Ray workers per node with the "-p" flag. For example, in a job requesting 96 cores (2 nodes), you can set up a pre-defined Ray cluster with 12 Ray workers per node, and 24 Ray workers in total, via the following command:

jupyter.ini.sh -R -p 12
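The worker counts above follow from simple arithmetic; a small sketch (the variable names are illustrative, not part of the script's interface):

```python
# Worker-count arithmetic for jupyter.ini.sh, using the example above.
ncpus = 96               # cores requested by the PBS job
nodes = 2                # 96 cores across 2 Gadi nodes (48 cores each)
workers_per_node = 12    # value passed via the -p flag

total_workers = workers_per_node * nodes
print(total_workers)     # 24

# Without -p, the default is one Ray worker per CPU core.
default_workers = ncpus
print(default_workers)   # 96
```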

An example of a full job script is given below

#!/bin/bash
#PBS -P fp0
#PBS -q normal
#PBS -lwd,walltime=10:00:00,ncpus=96,mem=192GB,jobfs=400GB,storage=gdata/dk92+gdata/z00+scratch/fp0+gdata/fp0

module purge
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-data-analysis/22.06 gadi_jupyterlab/22.05

jupyter.ini.sh -R # set up a Ray cluster with 48 Ray workers per node, 96 total Ray workers, 1 thread per Ray worker.

# jupyter.ini.sh -R -p 12 # or set up a Ray cluster with 12 Ray workers per node, 24 total Ray workers, 1 thread per Ray worker.

python script.py

In "script.py", you can connect to the predefined Ray cluster by calling ray.init() with the address argument set to "auto".

Code Block (py): connect to a predefined Ray cluster
import ray
ray.init(address="auto")
print(ray.cluster_resources())

The example output of the above commands, run in a Jupyter notebook connected to a Ray cluster spanning 2 Gadi nodes, is shown below:

{'object_store_memory': 114296048025.0, 'CPU': 96.0, 'memory': 256690778727.0, 'node:10.6.48.66': 1.0, 'node:10.6.48.67': 1.0}
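The dictionary returned by ray.cluster_resources() can also be inspected programmatically; a small sketch using the example output above (the IP addresses are just those from the sample run):

```python
# Example output copied from above; keys starting with "node:"
# identify the individual Gadi nodes in the cluster.
resources = {
    'object_store_memory': 114296048025.0,
    'CPU': 96.0,
    'memory': 256690778727.0,
    'node:10.6.48.66': 1.0,
    'node:10.6.48.67': 1.0,
}

total_cpus = int(resources['CPU'])
node_keys = [k for k in resources if k.startswith('node:')]
print(total_cpus, len(node_keys))  # 96 2
```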

...

First of all, you need to request multiple nodes in the ARE JupyterLab session interface and specify the proper storage projects.

Then click "Advanced options", put "/g/data/dk92/apps/Modules/modulefiles" in the "Module directories" field, and load both the NCI-data-analysis/22.06 and gadi_jupyterlab/22.06 modules in the "Modules" field. In the "Pre-script" field, fill in the command "jupyter.ini.sh -R" to set up the predefined Ray cluster.

Wait until the JupyterLab session starts, then click the "Open JupyterLab" button as soon as it is highlighted to open the JupyterLab session.

In the Jupyter notebook, run the same lines as in the code block above (ray.init(address="auto") followed by printing ray.cluster_resources()) to connect to the predefined Ray cluster and print the available resource information.

You should see 96 CPU cores and two nodes in use by the cluster, as expected.

Monitoring Ray status

You can easily monitor the Ray status via the command "ray status". Open a terminal in either a JupyterLab session or a Gadi PBS interactive job and type in the following command:

 $ watch ray status

The Ray status will keep updating every 2 seconds:

Every 2.0s: ray status gadi-cpu-clx-1146.gadi.nci.org.au: Mon May 23 15:27:27 2022

======== Autoscaler status: 2022-05-23 15:27:26.840051 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_ab2c1ef9316a7ee01bae8cd9d087f5e5dfbe3c8c254cf0e8752be0b1
1 node_b2f22752fc2b329e68649a81ba3c26f2d4f6080fa822dda0121d5514
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
---------------------------------------------------------------
Usage:
96.0/96.0 CPU
0.00/239.062 GiB memory
23.70/106.446 GiB object_store_memory

Demands:
{'CPU': 1.0}: 225+ pending tasks/actors

...