This section shows how to run the TensorFlow benchmark with real data (i.e., the ImageNet dataset). The dataset used in this section can be found in the folder:
/g/data/wb00/ImageNet/
You can use either of the following example PBS job scripts to run the ResNet-101 model on the ImageNet dataset:

${NCI_AI_ML_ROOT}/examples/horovod_gloo.pbs

or

${NCI_AI_ML_ROOT}/examples/horovod_mpi.pbs
Training on the ImageNet dataset can scale up to 256 GPUs. You can revise the above job script to run on 16 GPUs, which takes approximately one hour. The revised script is shown below; it can easily be modified to run on more GPUs, although larger jobs typically wait longer in the queue.
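The resource requests in the 16-GPU script below scale linearly with the GPU count: 192 CPUs and 1400 GB of memory for 16 GPUs, i.e. 12 CPUs and 87.5 GB per GPU, with 4 V100s per gpuvolta node. As a rough sketch (these per-GPU ratios are inferred from the script and the comparison table below, so treat them as assumptions rather than scheduler policy), revised requests for another GPU count could be derived like this:

```shell
# Sketch: derive PBS resource requests for a chosen GPU count, assuming
# the same per-GPU ratios as the 16-GPU script (12 CPUs and 87.5 GB of
# memory per GPU; gpuvolta nodes host 4 V100s each).
ngpus=32                          # example target GPU count
ncpus=$((ngpus * 12))             # 192 CPUs for 16 GPUs -> 12 per GPU
nnodes=$((ngpus / 4))             # 4 GPUs per gpuvolta node
mem_gb=$((ngpus * 1400 / 16))     # scale the 1400GB request linearly

echo "# job spans ${nnodes} gpuvolta nodes"
echo "#PBS -l ncpus=${ncpus}"
echo "#PBS -l ngpus=${ngpus}"
echo "#PBS -l mem=${mem_gb}GB"
```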
Note: please replace "fp0" (and the corresponding "scratch/fp0" storage path) in the script below with your own NCI project ID.
#!/bin/bash
#PBS -S /bin/bash
#PBS -q gpuvolta
#PBS -l ncpus=192
#PBS -l ngpus=16
#PBS -l jobfs=200GB
#PBS -P fp0
#PBS -l storage=gdata/dk92+gdata/wb00+scratch/fp0
#PBS -l mem=1400GB
#PBS -l walltime=01:05:00
#PBS -N TF_RealData

cur_host=`hostname`
node_gpu=$((PBS_NGPUS / PBS_NNODES))
for node in `cat $PBS_NODEFILE | uniq`
do
  if [[ $node == $cur_host ]]
  then
    host_flag="${node}:${node_gpu}"
  else
    host_flag="${host_flag},${node}:${node_gpu}"
  fi
done

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08
horovod.ini.sh

horovodrun -np ${PBS_NGPUS} --gloo -H ${host_flag} -p 1212 python3 \
    ${NCI_AI_ML_ROOT}/examples/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model resnet101 \
    --batch_size 64 \
    --variable_update horovod \
    --data_dir /g/data/wb00/ImageNet/ILSVRC2012/data_dir \
    --data_name imagenet \
    --num_batches=10000
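The for loop in the script builds the -H host list that horovodrun expects, with one hostname:gpus entry per node. As an illustration only, here is a simplified variant of that loop run against a mocked $PBS_NODEFILE with made-up hostnames; it appends entries in file order instead of special-casing the current host as the real script does:

```shell
# Sketch: build a Horovod -H host string from a mocked $PBS_NODEFILE.
# Hostnames below are invented for illustration.
PBS_NODEFILE=$(mktemp)
printf 'gadi-gpu-v100-0001\ngadi-gpu-v100-0001\ngadi-gpu-v100-0002\n' > "$PBS_NODEFILE"

node_gpu=4                       # GPUs per node (e.g. 16 GPUs / 4 nodes)
host_flag=""
for node in $(cat "$PBS_NODEFILE" | uniq); do   # uniq collapses repeated hosts
  if [ -z "$host_flag" ]; then
    host_flag="${node}:${node_gpu}"
  else
    host_flag="${host_flag},${node}:${node_gpu}"
  fi
done

echo "$host_flag"                # gadi-gpu-v100-0001:4,gadi-gpu-v100-0002:4
rm -f "$PBS_NODEFILE"
```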
The top half of the script handles the PBS resource allocation; we request just over one hour to allow enough time to train the model on real data. The rest of the script runs the benchmark. The --data_dir option sets the data directory containing the TensorFlow Record (TFRecord) files:

--data_dir /g/data/wb00/ImageNet/ILSVRC2012/data_dir

The TFRecord files under "/g/data/wb00/ImageNet/ILSVRC2012" were processed from the raw ImageNet dataset, as discussed above. Another difference is that, since a real dataset is being used, the number of batches must be specified via the --num_batches flag. Running 10,000 batches on V100 GPUs takes approximately one hour, which is why the wall time is set just above one hour: it allows the training to complete all 10,000 batches. If you increase the number of batches, adjust the wall time accordingly; for example, to complete 40,000 batches, set the wall time to four hours.
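The batches-to-walltime rule of thumb above (roughly 10,000 batches per hour on V100s) can be turned into a quick calculation. This is only a sketch: the 5-minute safety margin mirrors the 01:05:00 request in the script, and the integer arithmetic assumes --num_batches is a multiple of 10,000:

```shell
# Sketch: scale the walltime request with --num_batches, assuming the
# measured rate of ~10,000 batches per hour on V100 GPUs and keeping
# the same 5-minute safety margin as the 01:05:00 request above.
num_batches=40000
hours=$((num_batches / 10000))    # integer division: 4 hours for 40,000 batches
margin_min=5
printf '#PBS -l walltime=%02d:%02d:00\n' "$hours" "$margin_min"
```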
A100 vs V100 performance comparison
The above benchmarks have been run on two generations of GPUs: the A100 (Ampere) and the V100 (Volta). The following table compares the performance of the two GPU models.
GPU model | Architecture | GPUs used | Nodes used | Batch size | Total batches | Run time | Run-time comparison | Images/sec | Throughput comparison
---|---|---|---|---|---|---|---|---|---
A100 | Ampere | 16 | 2 | 64 | 10000 | 30:59 min | 44.64% less time | 5681.31 | 1.819352 times faster
V100 | Volta | 16 | 4 | 64 | 10000 | 55:54 min | -- | 3122.71 | --
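The throughput comparison in the last column follows directly from the images/sec figures in the table; as a quick sanity check (using awk for floating-point arithmetic, since plain shell arithmetic is integer-only):

```shell
# Sanity-check the throughput comparison from the table's images/sec
# figures: 5681.31 (A100) vs 3122.71 (V100).
awk 'BEGIN { printf "A100/V100 throughput ratio: %.6f\n", 5681.31 / 3122.71 }'
```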
On Gadi, the A100s are available under the queue "dgxa100" and the V100s under the queue "gpuvolta".