Introduction
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The NCI-ai-ml environment adopts Horovod as the parallelism framework to enable running TensorFlow and PyTorch scripts across multiple nodes. You can run it within either Gadi PBS jobs or ARE JupyterLab sessions.
Module
First of all, you need to load the NCI-ai-ml environment in your batch or interactive PBS job.
$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/23.05
Test Script
You can check the current Horovod version and its frameworks and controllers via the following command
$ horovodrun --check-build
You can also check whether Horovod can access the requested GPU resources by running a test Python script called "test_hvdgpu.py". The script, shown below, invokes Horovod and TensorFlow APIs to print the Horovod cluster information.
# Import necessary libraries
import tensorflow as tf
import socket

# Import Horovod
import horovod.tensorflow.keras as hvd

# 1. Horovod: initialize Horovod.
hvd.init()

# 2. Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

if hvd.rank() == 0:
    print("size=", hvd.size())
print("local_rank=", hvd.local_rank(), " node=", socket.gethostname(),
      " rank=", hvd.rank(), "GPU=", tf.config.list_physical_devices('GPU'))
In the above script, each Horovod rank prints its ID within the local host ('local_rank') and within the global cluster ('rank'). Rank 0 also prints the total number of Horovod processes ('size'), which equals the total number of GPU devices in the cluster since one GPU is used per process.
Running the Test Script
You can choose either mpirun (MPI backend) or horovodrun (Gloo backend) to run the above 'test_hvdgpu.py' script.
MPI backend
Horovod with the MPI backend works on both single and multiple GPU nodes. A wrapper script called 'run_hvd_mpi_gpu.sh' is provided in the 'NCI-ai-ml' module path to verify the GPU resources via mpirun, as shown below
$ cat `which run_hvd_mpi_gpu.sh`
mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/tests/test_hvdgpu.py
Note the flag "--map-by node", which is required so that the processes are spread over all available GPU nodes.
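The same command line can be reused to launch your own Horovod-enabled training script. Below is a minimal sketch, in which my_training.py is a hypothetical placeholder for your own script.

# Launch a hypothetical Horovod-enabled script with the MPI backend, one process per GPU.
mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 my_training.py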
You can expect output like the following when running "run_hvd_mpi_gpu.sh" in an interactive job requesting 2 GPU nodes:
# Submit an interactive job to "gpuvolta" queue which requests 2 GPU nodes (8 GPU v100 devices).
$ qsub -I -q gpuvolta -lwd,walltime=2:00:00,ngpus=8,ncpus=96,mem=380GB,jobfs=400GB,storage=gdata/dk92
qsub: waiting for job 54866306.gadi-pbs to start
qsub: job 54866306.gadi-pbs ready

# Load the NCI-ai-ml module
$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/23.05

# Run the script in the command line.
$ run_hvd_mpi_gpu.sh
local_rank= 0 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 1 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 1 node= gadi-gpu-v100-0062.gadi.nci.org.au rank= 2 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 1 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 3 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
size= 8
local_rank= 0 node= gadi-gpu-v100-0062.gadi.nci.org.au rank= 0 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 2 node= gadi-gpu-v100-0062.gadi.nci.org.au rank= 4 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 2 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 5 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 3 node= gadi-gpu-v100-0062.gadi.nci.org.au rank= 6 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
local_rank= 3 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 7 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
In the above output, each node has 4 GPU devices, so there are 8 GPU devices in total in the Horovod cluster spanning the 2 GPU nodes.
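If you prefer to run the same check non-interactively, you can put the module setup and the wrapper into a PBS job script. Below is a minimal sketch of such a script, mirroring the resource request used in the interactive example above; adjust the resources and add any project-specific options you normally use.

#!/bin/bash
#PBS -q gpuvolta
#PBS -l walltime=2:00:00
#PBS -l ngpus=8
#PBS -l ncpus=96
#PBS -l mem=380GB
#PBS -l jobfs=400GB
#PBS -l storage=gdata/dk92
#PBS -l wd

# Load the NCI-ai-ml environment and verify the GPU resources via the MPI backend.
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/23.05

run_hvd_mpi_gpu.sh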
Gloo backend
Single GPU node
You can use the 'horovodrun' command directly with the "test_hvdgpu.py" script to check the GPU resources within a single node as below
$ horovodrun -np ${PBS_NGPUS} python3 ${NCI_AI_ML_ROOT}/tests/test_hvdgpu.py
Multiple GPU nodes
In a PBS job requesting multiple GPU nodes, you need two steps to run 'horovodrun'.
STEP 1: Initialise the Horovod cluster as below
$ horovod.ini.sh
It starts an ssh server within the container on each host.
STEP 2: Gather the host names and the number of GPU devices per host, then feed them to the Gloo-based 'horovodrun' command. You can run a test script called 'run_hvd_gloo_gpu.sh' to validate the 'horovodrun' command; the script is shown below. Note that it runs with the "--gloo" protocol and uses the flag "-p 1212" to set the ssh port number to 1212.
#!/bin/bash
set -x
cur_host=`hostname`
for node in `cat $PBS_NODEFILE | uniq`
do
  if [[ ${node} == ${cur_host} ]]
  then
    host_flag="${node}:${node_gpu}"
  else
    host_flag="${host_flag},${node}:${node_gpu}"
  fi
done
horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 ${NCI_AI_ML_ROOT}/tests/test_hvdgpu.py
You should get output like the following when running it in a PBS job requesting 2 GPU nodes.
# Submit an interactive job to "gpuvolta" queue which requests 2 GPU nodes (8 GPU v100 devices).
$ qsub -I -q gpuvolta -lwd,walltime=2:00:00,ngpus=8,ncpus=96,mem=380GB,jobfs=400GB,storage=gdata/dk92
qsub: waiting for job 54883361.gadi-pbs to start
qsub: job 54883361.gadi-pbs ready

$ module use /g/data/dk92/apps/Modules/modulefiles
$ module load NCI-ai-ml/23.05
Loading NCI-ai-ml/23.05
  Loading requirement: singularity openmpi/4.1.5

# Initialise the Horovod cluster.
$ horovod.ini.sh
Starting sshd servers ...
gadi-gpu-v100-0042.gadi.nci.org.au ... Done!
gadi-gpu-v100-0079.gadi.nci.org.au ... Done!
Please specify ' -p 1212 ' in horovodrun.

$ run_hvd_gloo_gpu.sh
[5]<stdout>:local_rank= 1 node= gadi-gpu-v100-0141.gadi.nci.org.au rank= 5 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[7]<stdout>:local_rank= 3 node= gadi-gpu-v100-0141.gadi.nci.org.au rank= 7 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[6]<stdout>:local_rank= 2 node= gadi-gpu-v100-0141.gadi.nci.org.au rank= 6 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[4]<stdout>:local_rank= 0 node= gadi-gpu-v100-0141.gadi.nci.org.au rank= 4 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[0]<stdout>:size= 8
[0]<stdout>:local_rank= 0 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 0 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[3]<stdout>:local_rank= 3 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 3 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[2]<stdout>:local_rank= 2 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 2 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
[1]<stdout>:local_rank= 1 node= gadi-gpu-v100-0069.gadi.nci.org.au rank= 1 GPU= [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
The output is similar to that of the "run_hvd_mpi_gpu.sh" command, as both execute the same "test_hvdgpu.py" script.
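To launch your own training script with the Gloo backend across multiple nodes, you can follow the same pattern. Below is a minimal sketch in which my_training.py is a hypothetical placeholder for your own script, and the host list is built as a compact equivalent of the loop in run_hvd_gloo_gpu.sh, assuming the GPUs are spread evenly across the nodes.

# Start the sshd servers on every node of the job.
horovod.ini.sh

# Build the "host:gpus_per_host" list from the PBS node file, e.g. "node1:4,node2:4".
node_gpu=$(( PBS_NGPUS / $(cat ${PBS_NODEFILE} | uniq | wc -l) ))
host_flag=$(cat ${PBS_NODEFILE} | uniq | sed "s/$/:${node_gpu}/" | paste -sd,)

# Launch the script over all nodes via the Gloo controller on ssh port 1212.
horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 my_training.py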
ARE JupyterLab session
On the NCI ARE, you can request multiple GPU nodes in a single JupyterLab session. JupyterLab itself runs on the master node, and you can enable 'horovodrun' to run across all of the requested GPU nodes.
For example, you can start a JupyterLab session with 2 GPU nodes in the gpuvolta queue as below.
After expanding "Advanced options", specify the module directory "/g/data/dk92/apps/Modules/modulefiles" and load the module "NCI-ai-ml/23.05". You can also optionally run "horovod.ini.sh" in the Pre-script field; it will initialise the multi-node environment for horovodrun.
After the JupyterLab session starts, you can run "run_hvd_gloo_gpu.sh" directly in either a Jupyter Notebook or the CMD console.
If you didn't run the "horovod.ini.sh" script when requesting the JupyterLab session as shown above, you can still run the two commands together
$ horovod.ini.sh
$ run_hvd_gloo_gpu.sh
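In a Jupyter notebook, the same commands can be invoked from a code cell using IPython's "!" shell escape, for example:

# Run in a notebook cell: initialise the Horovod cluster, then run the Gloo-based GPU check.
!horovod.ini.sh
!run_hvd_gloo_gpu.sh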
Please note, you can only run the Gloo-based 'horovodrun' within an ARE JupyterLab session; the MPI-based horovodrun doesn't work within the JupyterLab session.
Next Steps
Now you are ready to use Horovod to run TensorFlow or PyTorch scripts across multiple nodes. For more details on developing code with Horovod, please refer to the official documentation.
We have modified their examples for use on Gadi below: