Environments 

Load the NCI-ai-ml module as follows:

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08

Preparing the Dataset

Please note that the Gadi GPU nodes cannot connect to the internet, so you cannot automatically download datasets in a PBS job. Instead, download your input dataset on a Gadi login node and specify the data location in your job script.

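For example, you can download the MNIST dataset on the Gadi login node with the following script:

# Run on a Gadi login node, which has internet access.
from torchvision import datasets

data_dir = "./data"  # directory the dataset is downloaded into
datasets.MNIST(data_dir, download=True)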

A copy of the MNIST dataset is also available under the project wb00, i.e. "/g/data/wb00/MNIST".
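Inside a PBS job the dataset has to be read from disk rather than downloaded. The following is a minimal sketch of one way to do that. It assumes torchvision stores MNIST in a subdirectory named "MNIST" under the directory you pass in (which is why the examples below pass /g/data/wb00 rather than /g/data/wb00/MNIST). The helper name load_mnist_offline is hypothetical and not part of the NCI-ai-ml module.

```python
import os


def load_mnist_offline(data_dir, train=True):
    """Load MNIST from data_dir without attempting a download."""
    # Assumption: torchvision keeps the files under <data_dir>/MNIST.
    expected = os.path.join(data_dir, "MNIST")
    if not os.path.isdir(expected):
        raise FileNotFoundError(
            f"{expected} not found; download the dataset on a login node first"
        )
    # Imported lazily so the path check runs even where torchvision is absent.
    from torchvision import datasets

    # download=False makes torchvision fail instead of reaching for the network.
    return datasets.MNIST(data_dir, train=train, download=False)
```

In a job script you would call it as load_mnist_offline("/g/data/wb00"), or with the directory you downloaded into on the login node. Failing early on a missing directory gives a clearer error than a network timeout on a GPU node.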

NCI also provides access to some other AI/ML datasets on Gadi, such as ImageNet. Please join the project wb00 if you would like to access them.

Benchmark and Examples

Under the NCI-ai-ml module we provide some examples taken from the Horovod repository. You can clone the originals on the Gadi login node from the reference link given with each example.

You can also find revised copies of the examples, with the data directory pointed at the Gadi local file system, in the current NCI-ai-ml module space, i.e. "${NCI_AI_ML_ROOT}/examples". The exact path is given with each example below.

You can monitor GPU utilisation at runtime with the gpustat tool.

On this page we describe how to run these examples via horovod+mpi. You can also run them using horovod+gloo. For more details on using Horovod with the NCI-ai-ml module, please see here.

Example 1: PyTorch synthetic benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/resnet/horovod_pytorch_synthetic_benchmark.py

# Running on 2 gpuvolta GPU nodes.
$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/resnet/horovod_pytorch_synthetic_benchmark.py
Model: resnet50
Batch size: 32
Number of GPUs: 8
Running warmup...
Running benchmark...
Iter #0: 270.8 img/sec per GPU
Iter #1: 268.7 img/sec per GPU
Iter #2: 270.9 img/sec per GPU
Iter #3: 266.5 img/sec per GPU
Iter #4: 267.5 img/sec per GPU
Iter #5: 267.9 img/sec per GPU
Iter #6: 269.4 img/sec per GPU
Iter #7: 269.9 img/sec per GPU
Iter #8: 265.9 img/sec per GPU
Iter #9: 268.0 img/sec per GPU
Img/sec per GPU: 268.5 +-3.2
Total img/sec on 8 GPU(s): 2148.3 +-25.6

By using the gpustat monitoring tool, we can see all 8 GPUs across 2 nodes are almost fully utilised.
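The summary lines in the Example 1 output can be reproduced, approximately, from the ten per-iteration figures. The sketch below assumes that the "+-" value is 1.96 times the population standard deviation of the iterations (an assumption that happens to match the printed 3.2) and that the total is the per-GPU figure multiplied by the number of GPUs. The function name summarise is illustrative and is not taken from the benchmark script.

```python
from statistics import mean, pstdev

# Per-GPU throughput (img/sec) for iterations 0-9, copied from the output above.
per_gpu = [270.8, 268.7, 270.9, 266.5, 267.5, 267.9, 269.4, 269.9, 265.9, 268.0]
num_gpus = 8


def summarise(samples, gpus):
    """Return (mean, half_width, total_mean, total_half_width)."""
    avg = mean(samples)
    # Assumption: the reported spread is 1.96 * population standard deviation.
    half_width = 1.96 * pstdev(samples)
    return avg, half_width, gpus * avg, gpus * half_width


avg, half_width, total, total_half_width = summarise(per_gpu, num_gpus)
print(f"Img/sec per GPU: {avg:.1f} +-{half_width:.1f}")
print(f"Total img/sec on {num_gpus} GPU(s): {total:.1f} +-{total_half_width:.1f}")
```

Because the inputs here are already rounded to one decimal place, the last digit of the recomputed values can differ slightly from the figures printed by the benchmark itself.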

Example 2: PyTorch MNIST benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_mnist.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_mnist.py

$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_mnist.py --epoch 50 --data-dir /g/data/wb00
Train Epoch: 1 [0/7500 (0%)] Loss: 2.309901
Train Epoch: 1 [0/7500 (0%)] Loss: 2.332907
Train Epoch: 1 [0/7500 (0%)] Loss: 2.359319
Train Epoch: 1 [0/7500 (0%)] Loss: 2.345739
Train Epoch: 1 [0/7500 (0%)] Loss: 2.337670
Train Epoch: 1 [0/7500 (0%)] Loss: 2.315854
Train Epoch: 1 [0/7500 (0%)] Loss: 2.344241
Train Epoch: 1 [0/7500 (0%)] Loss: 2.341527
Train Epoch: 1 [640/7500 (8%)] Loss: 2.253967
Train Epoch: 1 [640/7500 (8%)] Loss: 2.251903
Train Epoch: 1 [640/7500 (8%)] Loss: 2.238988
Train Epoch: 1 [640/7500 (8%)] Loss: 2.286994
Train Epoch: 1 [640/7500 (8%)] Loss: 2.227635
Train Epoch: 1 [640/7500 (8%)] Loss: 2.278378
Train Epoch: 1 [640/7500 (8%)] Loss: 2.272306
Train Epoch: 1 [640/7500 (8%)] Loss: 2.264846
...
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.059110
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.060333
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.189334
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.057295
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.029991
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.096529
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.243934
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.134590
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.083785
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.173053
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.067444
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.035630
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.122231
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.134888
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.182437
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.033505
Test set: Average loss: 0.0355, Accuracy: 98.89%

We ran the benchmark with different numbers of GPUs. The walltime, results and GPU utilisation of each run are listed below. Walltime falls from 456s on 1 GPU to 95s on 8 GPUs across 2 nodes, a speedup of about 4.8x, while accuracy stays at about 99%.

Ngpus | Walltime | real      | user       | sys       | Average loss | Accuracy | GPU utilisation / Notes
1     | 456s     | 7m36.371s | 11m10.381s | 1m22.378s | 0.0308       | 99.04%   | Only 1 GPU is in use.
2     | 249s     | 4m9.104s  | 8m28.538s  | 3m15.816s | 0.0329       | 99.00%   | Runs on 1 GPU per node, 2 GPUs in total.
4     | 143s     | 2m22.943s | 9m5.666s   | 4m15.375s | 0.0322       | 99.04%   | Runs on 2 GPUs per node, 4 GPUs in total.
8     | 95s      | 1m35.263s | 9m57.525s  | 5m43.280s | 0.0335       | 98.92%   | Runs on all 8 GPUs across 2 nodes.
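To put numbers on the scaling of the runs above, speedup and parallel efficiency can be computed from the walltimes. This is a small sketch using the standard definitions (speedup is the 1-GPU time divided by the N-GPU time, efficiency is speedup divided by N). The function name scaling is illustrative.

```python
# Walltimes in seconds from the runs above, keyed by number of GPUs.
walltimes = {1: 456, 2: 249, 4: 143, 8: 95}


def scaling(times):
    """Return a list of (gpus, speedup, efficiency) relative to the 1-GPU run."""
    base = times[1]
    rows = []
    for gpus in sorted(times):
        speedup = base / times[gpus]
        efficiency = speedup / gpus
        rows.append((gpus, speedup, efficiency))
    return rows


for gpus, speedup, efficiency in scaling(walltimes):
    print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these figures the 8-GPU run is 4.8 times faster than the 1-GPU run, i.e. 60% parallel efficiency. For a model as small as this MNIST network, a growing share of each step is likely to be communication and start-up cost rather than computation.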

Example 3: PyTorch Lightning MNIST benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_lightning_mnist.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_lightning_mnist.py

# Running on 2 gpuvolta GPU nodes.
$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_lightning_mnist.py --data-dir /g/data/wb00/
Starting to init trainer!
Trainer is initialized.
Missing logger folder: /jobfs/50788088.gadi-pbs/tmplo32y2sb/logger/lightning_logs
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
-----------------------------------------
0 | conv1 | Conv2d | 260
1 | conv2 | Conv2d | 5.0 K
2 | conv2_drop | Dropout2d | 0
3 | fc1 | Linear | 16.1 K
4 | fc2 | Linear | 510
-----------------------------------------
21.8 K Trainable params
0 Non-trainable params
21.8 K Total params
0.087 Total estimated model params size (MB)
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:08<00:00, 14.75it/s, loss=0.526, v_num=0]
...
Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:02<00:00, 59.93it/s, loss=0.179, v_num=0]
Test set: Average loss: 0.0593, Accuracy: 98.10%

The monitoring information from gpustat shows this example doesn't utilise GPU resources heavily.
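The eight LOCAL_RANK lines in the log above show local ranks 0 to 3 twice each, which is what you would expect when 8 processes are spread over 2 nodes with 4 GPUs each. As a rough mental model of the --map-by node option, which places ranks round-robin across nodes, the sketch below assigns ranks to nodes and derives a local rank on each node. It is an illustration that assumes equally sized nodes, not Open MPI's or Horovod's actual placement code, and map_by_node is a hypothetical helper.

```python
def map_by_node(num_ranks, num_nodes):
    """Return (rank, node, local_rank) tuples for round-robin placement by node."""
    placement = []
    for rank in range(num_ranks):
        node = rank % num_nodes         # consecutive ranks go to consecutive nodes
        local_rank = rank // num_nodes  # position among the ranks on that node
        placement.append((rank, node, local_rank))
    return placement


for rank, node, local_rank in map_by_node(8, 2):
    print(f"rank {rank}: node {node}, local rank {local_rank}")
```

With 2 ranks on 2 nodes this model puts one rank on each node, which is consistent with the 2-GPU run in Example 2 using 1 GPU per node.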

Example 4: PyTorch ImageNet ResNet50 benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_imagenet_resnet50.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/imagenet/horovod_pytorch_imagenet_resnet50.py

$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/imagenet/horovod_pytorch_imagenet_resnet50.py --epochs 1 --train-dir /g/data/wb00/ImageNet/ILSVRC2012/raw-data/train --val-dir /g/data/wb00/ImageNet/ILSVRC2012/raw-data/validation

[0]<stderr>:Train Epoch #1: 0%| | 1/5005 [00:07<9:36:15, 6.91s/it, loss=7.11, accuracy=0]
[0]<stderr>:Train Epoch #1: 0%| | 3/5005 [00:07<2:16:49, 1.64s/it, loss=7.09, accuracy=0.09
[0]<stderr>:Train Epoch #1: 0%| | 4/5005 [00:07<1:26:47, 1.04s/it, loss=7.09, accuracy=0.09
[0]<stderr>:Train Epoch #1: 0%| | 4/5005 [00:07<1:26:47, 1.04s/it, loss=7.09, accuracy=0.07
[0]<stderr>:Train Epoch #1: 0%| | 5/5005 [00:07<59:07, 1.41it/s, loss=7.09, accuracy=0.0781
[0]<stderr>:Train Epoch #1: 0%| | 5/5005 [00:07<59:07, 1.41it/s, loss=7.09, accuracy=0.0651
[0]<stderr>:Train Epoch #1: 0%| | 6/5005 [00:07<42:26, 1.96it/s, loss=7.09, accuracy=0.0651
[0]<stderr>:Train Epoch #1: 0%| | 6/5005 [00:07<42:26, 1.96it/s, loss=7.08, accuracy=0.112]
[0]<stderr>:Train Epoch #1: 0%| | 7/5005 [00:07<31:54, 2.61it/s, loss=7.08, accuracy=0.112]
...
[0]<stderr>:Train Epoch #1: 100%|██████████| 5005/5005 [35:39<00:00, 2.34it/s, loss=5.65, accuracy=5.25]
[0]<stderr>:Validate Epoch #1: 1%| | 1/196 [00:02<08:28, 2.61s/it, loss=5.22, accuracy=11.1]
[0]<stderr>:Validate Epoch #1: 2%|▏ | 4/196 [00:04<02:19, 1.38it/s, loss=5.31, accuracy=11.2]
[0]<stderr>:Validate Epoch #1: 4%|▎ | 7/196 [00:05<02:49, 1.12it/s, loss=5.13, accuracy=11.8]
[0]<stderr>:Validate Epoch #1: 5%|▍ | 9/196 [00:07<02:16, 1.37it/s, loss=5.06, accuracy=11.9]
...
[0]<stderr>:Validate Epoch #1: 99%|█████████▉| 194/196 [01:32<00:00, 2.08it/s, loss=5.22, accuracy=13]
[0]<stderr>:Validate Epoch #1: 100%|██████████| 196/196 [01:32<00:00, 2.11it/s, loss=5.22, accuracy=13]

This benchmark heavily utilises all GPU devices.
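The step counts in the progress bars above (5005 training steps, 196 validation steps) can be sanity-checked with a little arithmetic. The sketch below rests on several assumptions that are not stated on this page: the job ran 8 ranks, each rank used a batch size of 32, the data is split evenly across ranks with the last shard padded, and ILSVRC2012 has 1,281,167 training and 50,000 validation images. The function name steps_per_epoch is illustrative.

```python
import math


def steps_per_epoch(num_samples, num_ranks, batch_size):
    """Steps each rank takes when samples are sharded evenly across ranks."""
    per_rank = math.ceil(num_samples / num_ranks)
    return math.ceil(per_rank / batch_size)


# Assumed values (see the paragraph above); not taken from the job itself.
num_ranks, batch_size = 8, 32
train_steps = steps_per_epoch(1_281_167, num_ranks, batch_size)
val_steps = steps_per_epoch(50_000, num_ranks, batch_size)

# The training bar reports 5005 steps in 35:39.
elapsed_seconds = 35 * 60 + 39
steps_per_second = 5005 / elapsed_seconds
images_per_second = steps_per_second * num_ranks * batch_size

print(train_steps, val_steps)
print(f"{steps_per_second:.2f} it/s, about {images_per_second:.0f} img/s in total")
```

Under these assumptions the computed step counts match the progress bars, and the rate agrees with the 2.34it/s shown at the end of the training epoch, which corresponds to roughly 600 images per second across all GPUs.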