This section shows how to run the TensorFlow benchmark with real data (i.e., the ImageNet dataset). The dataset used in this section can be found in the folder:
/g/data/wb00/ImageNet/
You can use either of the following example PBS job scripts to run the ResNet-101 model on the ImageNet dataset:

${NCI_AI_ML_ROOT}/examples/horovod_gloo.pbs

or

${NCI_AI_ML_ROOT}/examples/horovod_mpi.pbs
Training on the ImageNet dataset can scale up to 256 GPUs. You can revise the above job script to run on 16 GPUs, which takes approximately one hour. The revised script is shown below; it can easily be modified to run on more GPUs, although larger jobs typically wait longer in the queue.
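The resource requests in the 16-GPU script below scale linearly with the GPU count: 192 CPUs and 1400 GB of memory for 16 GPUs, i.e. 12 CPUs and 87.5 GB per GPU, with 4 V100s per gpuvolta node. As a rough sketch (these per-GPU ratios are inferred from the script and the comparison table below, so treat them as assumptions rather than scheduler policy), revised requests for another GPU count could be derived like this:

```shell
# Sketch: derive PBS resource requests for a chosen GPU count, assuming
# the same per-GPU ratios as the 16-GPU script (12 CPUs and 87.5 GB of
# memory per GPU; gpuvolta nodes host 4 V100s each).
ngpus=32                          # example target GPU count
ncpus=$((ngpus * 12))             # 192 CPUs for 16 GPUs -> 12 per GPU
nnodes=$((ngpus / 4))             # 4 GPUs per gpuvolta node
mem_gb=$((ngpus * 1400 / 16))     # scale the 1400GB request linearly

echo "# job spans ${nnodes} gpuvolta nodes"
echo "#PBS -l ncpus=${ncpus}"
echo "#PBS -l ngpus=${ngpus}"
echo "#PBS -l mem=${mem_gb}GB"
```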
Note: please replace "fp0" (and the corresponding "scratch/fp0" storage path) in the script below with your own NCI project ID.
#!/bin/bash
#PBS -S /bin/bash
#PBS -q gpuvolta
#PBS -l ncpus=192
#PBS -l ngpus=16
#PBS -l jobfs=200GB
#PBS -P fp0
#PBS -l storage=gdata/dk92+gdata/wb00+scratch/fp0
#PBS -l mem=1400GB
#PBS -l walltime=01:05:00
#PBS -N TF_RealData

cur_host=`hostname`
node_gpu=$((PBS_NGPUS / PBS_NNODES))
for node in `cat $PBS_NODEFILE | uniq`
do
  if [[ $node == $cur_host ]]
  then
    host_flag="${node}:${node_gpu}"
  else
    host_flag="${host_flag},${node}:${node_gpu}"
  fi
done

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08
horovod.ini.sh

horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 \
    ${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model resnet101 \
    --batch_size 64 \
    --variable_update horovod \
    --data_dir /g/data/wb00/ImageNet/ILSVRC2012/data_dir \
    --data_name imagenet \
    --num_batches=10000
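The for loop in the script builds the -H host list that horovodrun expects, with one hostname:gpus entry per node. As an illustration only, here is a simplified variant of that loop run against a mocked $PBS_NODEFILE with made-up hostnames; it appends entries in file order instead of special-casing the current host as the real script does:

```shell
# Sketch: build a Horovod -H host string from a mocked $PBS_NODEFILE.
# Hostnames below are invented for illustration.
PBS_NODEFILE=$(mktemp)
printf 'gadi-gpu-v100-0001\ngadi-gpu-v100-0001\ngadi-gpu-v100-0002\n' > "$PBS_NODEFILE"

node_gpu=4                       # GPUs per node (e.g. 16 GPUs / 4 nodes)
host_flag=""
for node in $(cat "$PBS_NODEFILE" | uniq); do   # uniq collapses repeated hosts
  if [ -z "$host_flag" ]; then
    host_flag="${node}:${node_gpu}"
  else
    host_flag="${host_flag},${node}:${node_gpu}"
  fi
done

echo "$host_flag"                # gadi-gpu-v100-0001:4,gadi-gpu-v100-0002:4
rm -f "$PBS_NODEFILE"
```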
The top half of the script handles the PBS resource allocation; we request just over one hour to allow enough time to train the model on real data. The rest of the script runs the benchmark. The --data_dir option sets the data directory containing the TensorFlow Record (TFRecord) files:

--data_dir /g/data/wb00/ImageNet/ILSVRC2012/data_dir

The TFRecord files under "/g/data/wb00/ImageNet/ILSVRC2012" were processed from the raw ImageNet dataset, as discussed above. Another difference is that, since a real dataset is being used, the number of batches must be specified via the --num_batches flag. Running 10,000 batches on V100 GPUs takes approximately one hour, which is why the wall time is set just above one hour: it allows the training to complete all 10,000 batches. If you increase the number of batches, adjust the wall time accordingly; for example, to complete 40,000 batches, set the wall time to four hours.
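The batches-to-walltime rule of thumb above (roughly 10,000 batches per hour on V100s) can be turned into a quick calculation. This is only a sketch: the 5-minute safety margin mirrors the 01:05:00 request in the script, and the integer arithmetic assumes --num_batches is a multiple of 10,000:

```shell
# Sketch: scale the walltime request with --num_batches, assuming the
# measured rate of ~10,000 batches per hour on V100 GPUs and keeping
# the same 5-minute safety margin as the 01:05:00 request above.
num_batches=40000
hours=$((num_batches / 10000))    # integer division: 4 hours for 40,000 batches
margin_min=5
printf '#PBS -l walltime=%02d:%02d:00\n' "$hours" "$margin_min"
```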
A100 vs V100 performance comparison
The above benchmarks have been run on two generations of GPUs: the A100 (Ampere) and the V100 (Volta). The following table compares the performance of the two GPU models.
GPU model | Architecture | GPUs used | Nodes used | Batch size | Total batches | Run time | Run-time comparison | Images/sec | Throughput comparison
---|---|---|---|---|---|---|---|---|---
A100 | Ampere | 16 | 2 | 64 | 10000 | 30:59 min | 44.64% less time | 5681.31 | 1.819352 times faster
V100 | Volta | 16 | 4 | 64 | 10000 | 55:54 min | -- | 3122.71 | --
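The throughput comparison in the last column follows directly from the images/sec figures in the table; as a quick sanity check (using awk for floating-point arithmetic, since plain shell arithmetic is integer-only):

```shell
# Sanity-check the throughput comparison from the table's images/sec
# figures: 5681.31 (A100) vs 3122.71 (V100).
awk 'BEGIN { printf "A100/V100 throughput ratio: %.6f\n", 5681.31 / 3122.71 }'
```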
On Gadi, the A100s are available under the queue "dgxa100" and the V100s under the queue "gpuvolta".