Environments 

Load the NCI-ai-ml module as follows:

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08

Prepare your dataset

Please note that Gadi GPU nodes have no internet access, so datasets cannot be downloaded automatically from within a PBS job. Instead, download your input dataset on a Gadi login node and specify the data location in the script that the PBS job executes.

For example, you can download the MNIST dataset on a Gadi login node with the following TensorFlow API call:

import tensorflow as tf
tf.keras.datasets.mnist.load_data()  # by default the download is cached as ~/.keras/datasets/mnist.npz

A copy of the MNIST dataset is also available under project wb00 at "/g/data/wb00/MNIST/npz". NCI provides access to other AI/ML datasets on Gadi as well, such as ImageNet. Please join project wb00 if you would like to access them.
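Once the dataset is on the Gadi file system, a script can read it directly instead of downloading it. The following is a minimal sketch, assuming the copy is a NumPy .npz archive that uses the same keys as the Keras MNIST archive (x_train, y_train, x_test, y_test). The function name load_mnist_npz is hypothetical, and the layout of the wb00 directory should be checked before relying on it.

```python
import numpy as np


def load_mnist_npz(path):
    """Load MNIST arrays from a local .npz archive (no network access needed).

    Assumes the archive uses the Keras key layout:
    x_train, y_train, x_test, y_test.
    """
    with np.load(path) as data:
        # Indexing reads each array fully into memory before the file closes.
        x_train, y_train = data["x_train"], data["y_train"]
        x_test, y_test = data["x_test"], data["y_test"]
    return (x_train, y_train), (x_test, y_test)
```

Inside a PBS job this could be called with the path of the archive under "/g/data/wb00/MNIST/npz"; the file name within that directory is not given here, so it has to be looked up first.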

Benchmark and Examples

Some examples are taken from the Horovod repository. You can clone them on a Gadi login node using the reference link given for each example.

Revised versions of these examples, with the data directory pointing to the Gadi local file system, are also available in the current NCI-ai-ml module space under "${NCI_AI_ML_ROOT}/examples". The exact path is given with each example below.

You can monitor GPU utilisation at runtime with the gpustat tool.
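If you want utilisation figures inside a Python script rather than on the terminal, one option is to query the NVIDIA driver directly. The sketch below is not gpustat; it is an alternative that assumes the nvidia-smi command is available on the GPU node and that every queried field is reported as a number. The function names parse_gpu_csv and query_gpus are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Assumes every field is numeric (no "[N/A]" entries).
    """
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi on the current node and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling query_gpus() only reports the GPUs of the node it runs on, so a multi-node job would need it to run once per node.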

For more details on using Horovod with the NCI-ai-ml module, please see here.

Example 1: TensorFlow MNIST benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_mnist.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_tensorflow2_mnist.py

You can run the example as below. The output shows the job starting on 8 GPU devices across 2 GPU nodes, as expected.

$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_tensorflow2_mnist.py
2022-08-11 15:58:43.072136: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.074997: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.075790: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.080710: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.082116: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.086961: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.102582: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:43.132138: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-11 15:58:45.524960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
2022-08-11 15:58:45.542629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
2022-08-11 15:58:45.549475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
2022-08-11 15:58:45.626467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
2022-08-11 15:58:45.635604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
2022-08-11 15:58:45.638976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
2022-08-11 15:58:45.672981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
2022-08-11 15:58:45.690307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30943 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
2022-08-11 15:58:51.212808: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.258038: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.313556: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.367150: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.380030: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.456281: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.550435: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
2022-08-11 15:58:51.568029: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8401
Step #0 Loss: 2.300352
...
Step #1230 Loss: 0.046702
Step #1240 Loss: 0.037691
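In the log above every process reports its device as GPU:0, while the physical device indices 0 to 3 each appear twice. This is consistent with each process making a single physical GPU visible, chosen by its local rank, which is how the upstream Horovod example selects its device through hvd.local_rank(). The sketch below illustrates that placement in plain Python, assuming Open MPI's round-robin placement for --map-by node and the same number of GPUs on every node. The function name rank_placement is hypothetical and is not part of Horovod or the NCI module.

```python
def rank_placement(n_ranks, n_nodes):
    """Return (rank, node, local_rank) tuples for round-robin-by-node placement.

    Assumes n_ranks is a multiple of n_nodes, so every node hosts the
    same number of processes (one per GPU).
    """
    if n_ranks % n_nodes != 0:
        raise ValueError("n_ranks must be a multiple of n_nodes")
    placement = []
    for rank in range(n_ranks):
        node = rank % n_nodes          # consecutive ranks alternate between nodes
        local_rank = rank // n_nodes   # position of the rank within its node
        placement.append((rank, node, local_rank))
    return placement


if __name__ == "__main__":
    # 8 processes over 2 nodes, matching the run shown above.
    for rank, node, local_rank in rank_placement(8, 2):
        print(f"rank {rank}: node {node}, local rank {local_rank}")
```

Under these assumptions each node hosts local ranks 0 to 3, one per V100, which matches the two occurrences of each device index in the log.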

You can use the gpustat tool to monitor GPU utilisation at runtime, as below. It confirms that the example is running on 8 GPU devices with moderate utilisation.