We are going to run a TensorFlow benchmark with the ResNet-101 model and the ImageNet dataset. The code for the ResNet-101 model can be found in the following folder:

${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks

tf_cnn_benchmarks.py is the main file of the benchmark package. 
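For a quick sanity check, the benchmark can also be launched directly on a single GPU, for example from an interactive gpuvolta job. The commands below are a minimal sketch; the flag values are illustrative, and the module versions are the same ones used in the job script later on this page.

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08

# Single-GPU run; with no --data_dir given, the benchmark generates synthetic data on the fly.
python3 ${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --model resnet101 --batch_size 64 --num_gpus 1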

Synthetic Data

First, we are going to run the model on synthetic data. There are some small synthetic datasets in the test_data folder that contain TensorFlow records for testing. This data can be used to quickly demonstrate the power of Horovod to train models on GPUs across multiple nodes.
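You can inspect these records before running the benchmark; the exact file names depend on the version of the benchmark package:

# List the bundled test data (file layout may vary between versions).
ls -R ${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks/test_data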

The PBS job scripts can be found in the following locations: 

${NCI_AI_ML_ROOT}/examples/horovod_gloo.pbs

and

${NCI_AI_ML_ROOT}/examples/horovod_mpi.pbs
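A typical workflow is to copy one of these scripts into your own working directory, edit it, and submit it with qsub. A minimal sketch (the file names are illustrative):

# Copy the Gloo-based example script into your working directory.
cp ${NCI_AI_ML_ROOT}/examples/horovod_gloo.pbs .
# Edit the project code, storage directive and resource requests, then submit it.
qsub horovod_gloo.pbs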

 
You can revise the horovod_gloo.pbs script above to run on 16 GPUs, as shown below. The benchmark itself runs for about one minute at close to 100% CUDA core utilization, and the whole job completes within the two-minute walltime, which makes this a quick demonstration of the usefulness of Horovod in distributed deep learning.

Note: please change "scratch/fp0" and "fp0" in the script below to your own NCI project ID.

Script-1, Synthetic Data
#!/bin/bash
#PBS -S /bin/bash
#PBS -q gpuvolta
#PBS -l ncpus=192
#PBS -l ngpus=16
#PBS -l jobfs=100GB
#PBS -P fp0
#PBS -l storage=gdata/dk92+gdata/wb00+scratch/fp0
#PBS -l mem=600GB
#PBS -l walltime=00:02:00
#PBS -N TF_Syn

cur_host=`hostname`
# Work out how many GPUs each node has and build the Horovod host list
# (-H node1:4,node2:4,...), with the current node listed first.
node_gpu=$((PBS_NGPUS / PBS_NNODES))
for node in `cat $PBS_NODEFILE | uniq`
do
    if [[ $node == $cur_host ]]
    then
       host_flag="${node}:${node_gpu}"
    else
       host_flag="${host_flag},${node}:${node_gpu}"
    fi
done

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08

# Start an SSH server on each allocated node so that Horovod can connect to them.
horovod.ini.sh

horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 \
	   ${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks/tf_cnn_benchmarks.py  \
	   --model resnet101 --batch_size 64 --variable_update horovod

The above script runs the TensorFlow synthetic benchmark on 16 GPUs in the gpuvolta queue. Each node in that queue has four GPUs, so four nodes will be allocated. On the V100 GPUs the benchmark takes about one minute to run on each GPU, which is why the walltime is set to two minutes. The horovod.ini.sh script is required to start the SSH server on each node so that Horovod can connect. In this script, Horovod uses port 1212 and the Gloo controller, the ResNet-101 model is selected, and the batch size is set to 64.
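To try it yourself, save the script above in your working directory and submit it with qsub; the throughput summary is written to the job's standard-output file once it completes. The script file name below is illustrative, and the exact wording of the summary line may differ between benchmark versions.

# Submit the job and check its status.
qsub tf_syn_16gpu.pbs
qstat -u $USER

# After the job completes, look for the throughput summary in the stdout file.
grep -i "images/sec" TF_Syn.o*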

GPU monitoring 

GPU monitoring on multiple nodes requires some effort because connections have to be established to all of the nodes. We have used the gpustat application for multi-node GPU monitoring: it connects to multiple servers, collects data, and shows the aggregated results on a web page. The following screen recording shows a run of the benchmark on four nodes, with the GPU utilization of all four nodes displayed; each node has four GPUs, and they are shown together.

Starting the gpustat application launches a local web server for your session; the web page is accessible only from your local machine via port forwarding. In this case, only one connection is made to Gadi, and the application shows data for all GPUs on all nodes: GPU temperature, CUDA core usage, GPU memory usage, and the user name. The data is updated at five-second intervals to save network resources.
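As a sketch, assuming the gpustat web page is served on port 8080 of the node where you start it (the actual host name and port are reported when the application starts), you can forward it to your local machine with SSH and open it in a local browser:

# Forward the remote web page to localhost:8080 (host name and port are illustrative).
ssh -N -L 8080:gadi-login-01.nci.org.au:8080 your_username@gadi.nci.org.au
# Then browse to http://localhost:8080 on your local machine.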