Introduction
As part of the Ray ecosystem, a number of supporting packages enable deep learning at scale. Ray Train is a lightweight library for distributed deep learning that speeds up training of your deep learning models by using multiple GPU-enabled nodes; you only need to change a couple of lines of code to scale your training code out to a Ray cluster. Ray Train also interoperates with Ray Tune and Ray Datasets: Ray Tune adds distributed hyperparameter tuning, while Ray Datasets lets your distributed model train on large amounts of curated data. We have configured this to allow your workflow to run within any of our NCI environments, including Gadi PBS jobs, with seamless Jupyter notebook support (ARE JupyterLab) for working interactively with your code.
Gadi PBS job
Modules
You need to load both the 'NCI-ai-ml' and 'gadi_jupyterlab' modules from NCI project dk92 in your PBS batch job script, or on the command line of your PBS interactive job, as below:
$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/22.08 gadi_jupyterlab/22.06
The 'gadi_jupyterlab' module is used to set up a Ray cluster.
Set up a Ray cluster
You need to set up a Ray cluster before running any Ray+PyTorch or Ray+TensorFlow script. The easiest way is to run the following command after loading the above modules:
$ jupyter.ini.sh -R -g
The flag "-R" sets up the Ray cluster within the current PBS job, and the flag "-g" makes the cluster use the available GPU resources.
For more details on the above command, please visit the gadi_jupyterlab module page.
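Putting the module and cluster setup steps together, a complete PBS batch script might look like the sketch below. The project code "fp0", the resource requests (sized here for 2 gpuvolta nodes), and the script name "my_train.py" are illustrative placeholders; substitute your own values.

```shell
#!/bin/bash
#PBS -q gpuvolta
#PBS -P fp0
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l walltime=01:00:00
#PBS -l storage=gdata/dk92
#PBS -l wd

# Load the dk92 modules described above
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08 gadi_jupyterlab/22.06

# Set up a Ray cluster across this job's GPU nodes
jupyter.ini.sh -R -g

# Run your Ray+PyTorch or Ray+TensorFlow script (placeholder name)
python3 my_train.py
```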
Verify Ray cluster resources
After setting up the Ray cluster, you can verify that it utilises the requested resources. To do this, first connect to the established Ray cluster as below:
import ray
ray.init(address="auto")  # attach to the running cluster rather than starting a new one
Then you can check whether the Ray cluster consists of the expected GPU resources as below
print(ray.cluster_resources())
Alternatively, you can run the command "run_ray_gpu.sh" directly after setting up the Ray cluster as above. For a PBS job requesting 2 GPU nodes, you should expect output similar to the following:
$ run_ray_gpu.sh
It shows that the Ray cluster consists of 96 CPU cores and 8 V100 GPU devices, exactly the resources provided by two Gadi gpuvolta nodes.
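If you prefer to verify the cluster programmatically rather than by eye, a small helper like the one below can check the dictionary returned by ray.cluster_resources() against the capacity two gpuvolta nodes should provide. This is a sketch: the function name and thresholds are illustrative, not part of the Ray API.

```python
def has_expected_resources(resources, min_cpus, min_gpus):
    """Return True if a ray.cluster_resources() dict provides at least
    the requested number of CPU cores and GPU devices."""
    return (resources.get("CPU", 0) >= min_cpus
            and resources.get("GPU", 0) >= min_gpus)

# Example: the resources reported for 2 Gadi gpuvolta nodes
two_node_resources = {"CPU": 96.0, "GPU": 8.0}
print(has_expected_resources(two_node_resources, min_cpus=96, min_gpus=8))  # True
```

In a live session you would pass ray.cluster_resources() instead of the example dictionary.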
ARE JupyterLab session
You can also interactively run your Ray+PyTorch or Ray+TensorFlow Jupyter notebooks in an NCI ARE JupyterLab session. For example, you can request an ARE JupyterLab session with 2 Gadi 'gpuvolta' nodes as below.
Note: change "fp0" to your own project ID and specify appropriate "Storage" directives, which must include "gdata/dk92".
In the "Advanced options", you need to specify "/g/data/dk92/apps/Modules/modulefiles" in the "Module directories" field and "NCI-ai-ml/22.08 gadi_jupyterlab/22.06" in the "Modules" field.
In the "Pre-script" field, you should put in the command "jupyter.ini.sh -R -g" to set up the Ray GPU cluster.
Click "Launch" to start a JupyterLab session.
After the JupyterLab session starts, you can verify the Ray cluster by executing the following lines in your Jupyter notebook:
import ray
ray.init(address="auto")
print(ray.cluster_resources())
Next Steps
Now you can start working on your TensorFlow or PyTorch workflows together with Ray Train. Here are some examples of TensorFlow+Ray and PyTorch+Ray running on Gadi.
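As a starting point, the skeleton below sketches the shape of a Ray Train + PyTorch script under the Ray 2.x API (the imports follow Ray 2.0, where ScalingConfig lives in ray.air.config; the train loop body, the gpu_workers helper, and the one-worker-per-GPU choice are illustrative assumptions, not prescribed by the NCI environment).

```python
def train_loop_per_worker(config):
    # Runs once on every Ray Train worker. A real workflow would build a
    # torch model here, wrap it with ray.train.torch.prepare_model, and
    # iterate over a (sharded) dataset.
    pass

def gpu_workers(resources):
    # Illustrative helper: launch one Ray Train worker per GPU in the cluster.
    return int(resources.get("GPU", 0))

def launch_training():
    """Submit the distributed job to the Ray cluster.

    Requires ray[train] and torch, and assumes the cluster started by
    "jupyter.ini.sh -R -g" is running in the current PBS job.
    """
    import ray
    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer

    ray.init(address="auto")  # attach to the existing cluster
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(
            num_workers=gpu_workers(ray.cluster_resources()),
            use_gpu=True,
        ),
    )
    return trainer.fit()
```

Calling launch_training() from your PBS job or notebook would distribute train_loop_per_worker across the GPUs that the cluster reports.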