This section shows how to run the TensorFlow benchmark with real data (i.e., ImageNet). All data used in this section can be found in the folder:

/g/data/wb00/ImageNet/

You can use the following example PBS job script to run the ResNet-101 model on the ImageNet dataset:

${NCI_AI_ML_ROOT}/examples/horovod_gloo.pbs

or

${NCI_AI_ML_ROOT}/examples/horovod_mpi.pbs

Training on the ImageNet dataset can scale up to 256 GPUs. The script below revises the above job script to run on 16 GPUs, which takes approximately one hour. It can easily be modified to run on more GPUs, but a larger request will usually wait much longer in the queue.
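When scaling the request, the CPU, node and memory figures should grow in proportion to the GPU count. The sketch below is illustrative only, assuming gpuvolta nodes with 4 GPUs, 48 CPUs and about 350 GB of requestable memory each (the same per-node ratios as the 16-GPU request in the script below):

```shell
# Illustrative sketch: compute matching PBS requests for a given GPU count,
# assuming 4 GPUs, 48 CPUs (12 per GPU) and ~350GB requestable memory per node.
ngpus=32
nnodes=$((ngpus / 4))
ncpus=$((ngpus * 12))
mem=$((nnodes * 350))
echo "#PBS -l ncpus=${ncpus}"
echo "#PBS -l ngpus=${ngpus}"
echo "#PBS -l mem=${mem}GB"
```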

Note: please change the "scratch/fp0" and "fp0" in the script below to your own NCI project ID.

Script-2, Real data benchmark
#!/bin/bash
#PBS -S /bin/bash
#PBS -q gpuvolta
#PBS -l ncpus=192
#PBS -l ngpus=16
#PBS -l jobfs=200GB
#PBS -P fp0
#PBS -l storage=gdata/dk92+gdata/wb00+scratch/fp0
#PBS -l mem=1400GB
#PBS -l walltime=01:05:00
#PBS -N TF_RealData

cur_host=$(hostname)
node_gpu=$((PBS_NGPUS / PBS_NNODES))
host_flag=""
for node in $(uniq "${PBS_NODEFILE}")
do
    if [[ $node == $cur_host ]]
    then
       # List the current host first, keeping any entries already collected.
       host_flag="${node}:${node_gpu}${host_flag:+,${host_flag}}"
    else
       host_flag="${host_flag:+${host_flag},}${node}:${node_gpu}"
    fi
done

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08

horovod.ini.sh

horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 \
           ${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
           --model resnet101 \
           --batch_size 64 \
           --variable_update horovod \
           --data_dir /g/data/wb00/ImageNet/ILSVRC2012/data_dir \
           --data_name imagenet \
           --num_batches 10000
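The for loop in the script builds the Horovod host list, pairing each allocated node with its GPU count. A standalone sketch of what it produces, using made-up hostnames in place of the real contents of $PBS_NODEFILE:

```shell
# Hypothetical 2-node, 8-GPU example of the host-list construction above
# (hostnames are invented for illustration).
cur_host=gadi-gpu-v100-0001
node_gpu=4
host_flag=""
for node in gadi-gpu-v100-0001 gadi-gpu-v100-0002
do
    if [[ $node == $cur_host ]]
    then
        host_flag="${node}:${node_gpu}${host_flag:+,${host_flag}}"
    else
        host_flag="${host_flag:+${host_flag},}${node}:${node_gpu}"
    fi
done
echo "$host_flag"   # gadi-gpu-v100-0001:4,gadi-gpu-v100-0002:4
```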

The top half of the script handles PBS resource allocation; we request just over one hour of walltime, enough to train the model with real data. The rest of the script runs the benchmark. The --data_dir option sets the data directory containing the TensorFlow Record (TFRecord) files:

--data_dir /g/data/wb00/ImageNet/ILSVRC2012/data_dir

The TFRecord files located under "/g/data/wb00/ImageNet/ILSVRC2012" are processed from the raw ImageNet dataset, as discussed above. Another difference is that, now that a real dataset is being used, the number of batches must be specified with the --num_batches flag. We have measured the time required to run 10,000 batches on the V100 GPUs at approximately one hour; this is why the walltime is set to just over one hour, so that training can complete all 10,000 batches. If you increase the number of batches, adjust the walltime accordingly. For example, to complete 40,000 batches, the walltime should be set to four hours.
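That proportionality can be captured in a small helper. This is a sketch only: the rate of roughly 10,000 batches per hour is the one measured above on 16 V100 GPUs, and the buffer is an arbitrary safety margin for job startup and teardown.

```python
# Rough walltime estimate for this benchmark on 16 V100 GPUs, assuming
# ~10,000 batches per hour (the rate observed above) plus a small buffer.
def estimate_walltime_hours(num_batches, batches_per_hour=10_000,
                            buffer_hours=0.1):
    return num_batches / batches_per_hour + buffer_hours

print(estimate_walltime_hours(10_000))  # 1.1
print(estimate_walltime_hours(40_000))  # 4.1
```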

A100 vs V100 performance comparison 

The above benchmarks have been run on two generations of GPUs: A100 (Ampere) and V100 (Volta). The following table compares the performance of the two GPU models.

GPU    Class    GPUs   Nodes   Batch   Total     Run       Run time       Throughput           Throughput
                used   used    size    batches   time      comparison                          comparison
A100   Ampere   16     2       64      10000     30:59     44.64%         5681.31 images/sec   1.82x
                                                 min       less time                           faster
V100   Volta    16     4       64      10000     55:54     --             3122.71 images/sec   --
                                                 min
On Gadi, the A100s are available under the "dgxa100" queue and the V100s under the "analysis" queue.