Introduction

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The NCI-ai-ml environment adopts Horovod as the parallelism framework to enable running TensorFlow and PyTorch scripts across multiple nodes. You can run it within either Gadi PBS jobs or ARE JupyterLab sessions.

Module

First of all, you need to load the NCI-ai-ml environment in your batch or interactive PBS job.

$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/23.05

Test Script

You can check the current Horovod version, together with its available frameworks, controllers, and tensor operations, via the following command:

$ horovodrun --check-build
Horovod v0.25.0:

Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet

Available Controllers:
[X] MPI
[X] Gloo

Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo

You can also check whether Horovod can access the requested GPU resources by running a test Python script called "test_hvdgpu.py". The script, shown below, invokes Horovod and TensorFlow APIs to print information about the Horovod cluster.

test_hvdgpu.py
#Import Necessary libraries
import tensorflow as tf
import socket

# Import Horovod
import horovod.tensorflow.keras as hvd
# 1. Horovod: initialize Horovod.
hvd.init()

# 2. Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
if hvd.rank()==0:
    print("size=",hvd.size())
print("local_rank=",hvd.local_rank()," node=",socket.gethostname()," rank=",hvd.rank(),"GPU=",hvd.tf.config.list_physical_devices('GPU'))

In the above script, each Horovod rank prints its ID within its local host ('local_rank') and within the global cluster ('rank'), along with the physical GPU devices on its node. Rank 0 also prints the size of the Horovod cluster, i.e. the total number of processes, which equals the total number of GPUs since each process is pinned to one GPU.
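
If you mainly work with PyTorch, the same check can be written against the Horovod PyTorch API. Below is a minimal sketch under that assumption; the file name "test_hvdgpu_torch.py" is hypothetical and the script is not shipped with the module, but it can be launched with the same mpirun or horovodrun commands described in the following sections.

test_hvdgpu_torch.py
# A hypothetical PyTorch equivalent of test_hvdgpu.py (not provided by NCI-ai-ml).
import socket

import torch
# Import Horovod's PyTorch bindings
import horovod.torch as hvd

# 1. Horovod: initialize Horovod.
hvd.init()

# 2. Horovod: pin each process to one GPU using its local rank.
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

if hvd.rank() == 0:
    print("size=", hvd.size())
print("local_rank=", hvd.local_rank(), " node=", socket.gethostname(),
      " rank=", hvd.rank(), " GPUs on node=", torch.cuda.device_count())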

Running Test Script

You can choose either mpirun (MPI backend) or horovodrun (Gloo backend) to run the above 'test_hvdgpu.py' script.

MPI backend

Horovod with the MPI backend works on both single and multiple GPU nodes. A wrapper script called 'run_hvd_mpi_gpu.sh' is provided in the 'NCI-ai-ml' module path to verify the GPU resources via mpirun, as shown below.

run_hvd_mpi_gpu.sh
$ cat `which run_hvd_mpi_gpu.sh`
mpirun -np ${PBS_NGPUS} --map-by node  --bind-to socket  python3 ${NCI_AI_ML_ROOT}/tests/test_hvdgpu.py

Note that the flag "--map-by node" must be specified so that the ranks are spread across all available GPU nodes.

You can expect the following outputs if you are running "run_hvd_mpi_gpu.sh" in an interactive job requesting 2 GPU nodes:

# Submit an interactive job to "gpuvolta" queue which requests 2 GPU nodes (8 GPU v100 devices). 
$ qsub -I -q gpuvolta -lwd,walltime=2:00:00,ngpus=8,ncpus=96,mem=380GB,jobfs=400GB,storage=gdata/dk92
qsub: waiting for job 54866306.gadi-pbs to start
qsub: job 54866306.gadi-pbs ready

# Load the NCI-ai-ml module
$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/23.05

# Run the script in the command line.
$ run_hvd_mpi_gpu.sh
local_rank= 0  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 1 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 1  node= gadi-gpu-v100-0062.gadi.nci.org.au  rank= 2 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 1  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 3 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
size= 8
local_rank= 0  node= gadi-gpu-v100-0062.gadi.nci.org.au  rank= 0 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 2  node= gadi-gpu-v100-0062.gadi.nci.org.au  rank= 4 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 2  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 5 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 3  node= gadi-gpu-v100-0062.gadi.nci.org.au  rank= 6 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 3  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 7 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

In the above output, each node hosts 4 GPU devices, so the Horovod cluster spanning 2 GPU nodes contains 8 GPU devices in total.

Gloo backend

Single GPU node

You can run the 'horovodrun' command directly on the "test_hvdgpu.py" script to check the GPU resources within a single node, as below:

$ horovodrun -np ${PBS_NGPUS} python3 ${NCI_AI_ML_ROOT}/tests/test_hvdgpu.py

Multiple GPU nodes

In a PBS job requesting multiple GPU nodes, running 'horovodrun' takes two steps.

  • STEP 1: Initialise the Horovod cluster as below

    $ horovod.ini.sh

    It starts an sshd server within the container on each host.

  • STEP 2: Gather the host names and the number of GPU devices per host, then feed them to the Gloo-based 'horovodrun' command. A test script called 'run_hvd_gloo_gpu.sh' is provided to validate the 'horovodrun' command and is shown below. Note that it runs with the "--gloo" flag and passes "-p 1212" to match the SSH port (1212) started in STEP 1.
run_hvd_gloo_gpu.sh
#!/bin/bash
set -x
cur_host=`hostname`
# Number of GPU devices per node, derived from the job request.
node_gpu=$((PBS_NGPUS / $(uniq ${PBS_NODEFILE} | wc -l)))
for node in `cat $PBS_NODEFILE | uniq`
do
	if [[ ${node} == ${cur_host} ]]
	then
		host_flag="${node}:${node_gpu}"
	else
		host_flag="${host_flag},${node}:${node_gpu}"
	fi
done
horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 ${NCI_AI_ML_ROOT}/tests/test_hvdgpu.py

You should get output similar to the following when running it in a PBS job requesting 2 GPU nodes.

# Submit an interactive job to "gpuvolta" queue which requests 2 GPU nodes (8 GPU v100 devices). 
$ qsub -I -q gpuvolta -lwd,walltime=2:00:00,ngpus=8,ncpus=96,mem=380GB,jobfs=400GB,storage=gdata/dk92
qsub: waiting for job 54883361.gadi-pbs to start
qsub: job 54883361.gadi-pbs ready

$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/23.05
Loading NCI-ai-ml/23.05
  Loading requirement: singularity openmpi/4.1.5

# Initialise the Horovod cluster.
$ horovod.ini.sh
Starting sshd servers ...
gadi-gpu-v100-0042.gadi.nci.org.au ... Done!
gadi-gpu-v100-0079.gadi.nci.org.au ... Done!
Please specify ' -p 1212 ' in horovodrun.

$ run_hvd_gloo_gpu.sh
[5]<stdout>:local_rank= 1  node= gadi-gpu-v100-0141.gadi.nci.org.au  rank= 5 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[7]<stdout>:local_rank= 3  node= gadi-gpu-v100-0141.gadi.nci.org.au  rank= 7 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[6]<stdout>:local_rank= 2  node= gadi-gpu-v100-0141.gadi.nci.org.au  rank= 6 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[4]<stdout>:local_rank= 0  node= gadi-gpu-v100-0141.gadi.nci.org.au  rank= 4 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[0]<stdout>:size= 8
[0]<stdout>:local_rank= 0  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 0 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[3]<stdout>:local_rank= 3  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 3 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[2]<stdout>:local_rank= 2  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 2 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[1]<stdout>:local_rank= 1  node= gadi-gpu-v100-0069.gadi.nci.org.au  rank= 1 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

The output is similar to that of "run_hvd_mpi_gpu.sh", as both wrappers execute the same "test_hvdgpu.py" script.

ARE JupyterLab session

On the NCI ARE, you can request multiple GPU nodes in a single JupyterLab session. JupyterLab itself runs on the master node, and you can enable 'horovodrun' to run across all of the GPU nodes in the session.

For example, you can start a JupyterLab session with 2 GPU nodes in the gpuvolta queue as below.

After expanding "Advanced options", you can specify the module directory with "/g/data/dk92/apps/Modules/modulefiles" and load the module "NCI-ai-ml/23.05". You could also optionally run "horovod.ini.sh" in the Pre-script field. It will initialise the multiple node environment for horovodrun. 

After the JupyterLab session starts, you can run "run_hvd_gloo_gpu.sh" directly in either a Jupyter notebook or a terminal.

If you did not run the "horovod.ini.sh" script via the Pre-script field when requesting the JupyterLab session, you can still run the two commands together:

$ horovod.ini.sh
$ run_hvd_gloo_gpu.sh

Please note that only the Gloo-based 'horovodrun' works within an ARE JupyterLab session; the MPI-based horovodrun does not work there.

Next Steps

Now you are ready to use Horovod to run TensorFlow or PyTorch scripts across multiple nodes. For more details on developing code with Horovod, please refer to the official documentation.
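
As a starting point, the sketch below illustrates the standard Horovod additions to a Keras training script: initialise Horovod, pin one GPU per process, scale the learning rate by the number of workers, wrap the optimizer with hvd.DistributedOptimizer, broadcast the initial variables from rank 0, and write checkpoints from rank 0 only. It is a minimal, self-contained illustration using synthetic data; the file name, model, and hyperparameters are assumptions for demonstration, not a Gadi-specific recipe. Launch it with the same mpirun or horovodrun commands shown earlier, replacing test_hvdgpu.py with your own script.

train_hvd_keras_example.py
# A minimal Horovod Keras sketch with synthetic data (illustrative only).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1. Horovod: initialize Horovod.
hvd.init()

# 2. Horovod: pin one GPU per process via the local rank.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Synthetic data, just to keep the sketch self-contained.
features = tf.random.normal((1024, 32))
labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.repeat().shuffle(1024).batch(64)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# 3. Horovod: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())

# 4. Horovod: wrap the optimizer so gradients are averaged across workers.
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

callbacks = [
    # 5. Horovod: broadcast initial variables from rank 0 to all other ranks.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# 6. Only rank 0 writes checkpoints, to avoid the workers clashing.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

# Each worker processes a share of the steps; only rank 0 prints progress.
model.fit(dataset, steps_per_epoch=128 // hvd.size(), epochs=3,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)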

We have modified the official Horovod examples for use on Gadi below: