Environments

Load the NCI-ai-ml module as follows:

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08
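
After loading the module, a quick check from Python can confirm that the packages a job needs are importable before the job is submitted. The sketch below is illustrative only: the helper name check_packages is hypothetical, and the package list (torch, torchvision, horovod) is an assumption based on the examples on this page, not a documented list of what the module provides.

```python
import importlib.util
import os


def check_packages(names):
    """Return {package name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # NCI_AI_ML_ROOT is the variable used in the example paths on this page.
    print("NCI_AI_ML_ROOT =", os.environ.get("NCI_AI_ML_ROOT", "<not set>"))
    # Assumed package list; adjust it to what your job actually imports.
    for name, found in check_packages(["torch", "torchvision", "horovod"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Running this on a login node after the module load gives an early warning if the environment is not what the job script expects.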

Preparing the Dataset

Note that the Gadi GPU nodes cannot connect to the internet, so datasets cannot be downloaded automatically inside a PBS job. Instead, download your input dataset on a Gadi login node and specify the data location in your job script.

For example, you can download the MNIST dataset on a Gadi login node with the following script:

from torchvision import datasets

# Run on a login node: the download needs internet access.
data_dir = "./data"
datasets.MNIST(data_dir, download=True)

A copy of the MNIST dataset has also been placed under the project wb00, i.e. "/g/data/wb00/MNIST".
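
Inside a PBS job the dataset has to be read from disk without any download attempt. The following is a minimal sketch of one way to do that, assuming the wb00 copy follows the directory layout torchvision expects (root/MNIST/...), which is consistent with the --data-dir /g/data/wb00 argument used in Example 2. The helper names pick_data_root and load_mnist are hypothetical.

```python
from pathlib import Path


def pick_data_root(candidates):
    """Return the first candidate directory that exists, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


def load_mnist(root):
    """Load the MNIST training set from files already on disk."""
    # Imported here so pick_data_root works without torchvision installed.
    from torchvision import datasets

    # download=False: GPU nodes have no internet access.
    return datasets.MNIST(str(root), train=True, download=False)


# Usage inside a job (paths taken from this page):
# root = pick_data_root(["/g/data/wb00", "./data"])
# train_set = load_mnist(root)
```

Listing the shared copy first avoids keeping a private duplicate of the data, while the local path still works for users who have not joined wb00.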

NCI also provides access to other AI/ML datasets on Gadi, such as ImageNet. Please join the project wb00 if you would like to access them.

Benchmark and Examples

Under the NCI-ai-ml module we have provided some examples which are taken from the Horovod repository. You can clone them on the Gadi login node from the reference link of each example case.

You can also find revised versions of the examples, in which the data directory points to the Gadi local file system, under the current NCI-ai-ml module space, i.e. "${NCI_AI_ML_ROOT}/examples". The exact path is given with each example below.

You can monitor GPU utilisation at runtime with the gpustat tool.
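
If you want utilisation figures in a script rather than on screen, one alternative to gpustat is to query nvidia-smi directly. This is a sketch of that swapped-in approach, not of gpustat itself; the function names are hypothetical, and the parser assumes every field is numeric (nvidia-smi can report non-numeric values such as [N/A] on some devices).

```python
import subprocess

# Fields requested from nvidia-smi, in output order.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse nvidia-smi 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem_used, mem_total = (int(f.strip()) for f in line.split(","))
        rows.append({"index": index, "util_pct": util,
                     "mem_used_mib": mem_used, "mem_total_mib": mem_total})
    return rows


def query_gpus():
    """Run nvidia-smi on the current node and return the parsed rows."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)
```

query_gpus only sees the GPUs of the node it runs on, so a multi-node job would need to call it on each node.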

On this page we describe how to run these examples via horovod+mpi. You can also run them using horovod+gloo. For more details on using Horovod with the NCI-ai-ml module, please see here.
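
All four examples on this page use the same horovod+mpi launch pattern: one process per GPU, mapped by node and bound to a socket. As a rough sketch, the command line can be assembled programmatically as follows. The function name build_mpirun_cmd and the script name train.py are hypothetical; the mpirun flags are the ones used in the examples.

```python
import os
import shlex


def build_mpirun_cmd(script, script_args=(), ngpus=None):
    """Build the mpirun argument list used by the examples on this page."""
    if ngpus is None:
        # PBS_NGPUS is the variable used in the example commands.
        ngpus = os.environ["PBS_NGPUS"]
    return ["mpirun", "-np", str(ngpus), "--map-by", "node", "--bind-to", "socket",
            "python3", script, *script_args]


# train.py is a placeholder script name, not a file shipped with the module.
cmd = build_mpirun_cmd("train.py", ["--data-dir", "/g/data/wb00"], ngpus=8)
print(shlex.join(cmd))
```

Passing ngpus explicitly makes it possible to inspect the command on a login node; inside a job the value is read from the environment instead.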

Example 1: PyTorch synthetic benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/resnet/horovod_pytorch_synthetic_benchmark.py

# Running on 2 gpuvolta GPU nodes.
$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/resnet/horovod_pytorch_synthetic_benchmark.py
Model: resnet50
Batch size: 32
Number of GPUs: 8
Running warmup...
Running benchmark...
Iter #0: 270.8 img/sec per GPU
Iter #1: 268.7 img/sec per GPU
Iter #2: 270.9 img/sec per GPU
Iter #3: 266.5 img/sec per GPU
Iter #4: 267.5 img/sec per GPU
Iter #5: 267.9 img/sec per GPU
Iter #6: 269.4 img/sec per GPU
Iter #7: 269.9 img/sec per GPU
Iter #8: 265.9 img/sec per GPU
Iter #9: 268.0 img/sec per GPU
Img/sec per GPU: 268.5 +-3.2
Total img/sec on 8 GPU(s): 2148.3 +-25.6

By using the gpustat monitoring tool, we can see all 8 GPUs across 2 nodes are almost fully utilised.
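
The summary lines of the benchmark output can be recomputed from the per-iteration figures. The sketch below assumes that the reported uncertainty is 1.96 times the population standard deviation of the per-iteration rates and that the total is the per-GPU figure multiplied by the number of GPUs. Both assumptions are consistent with the numbers printed above, but they are not stated on this page. The function name summarise is hypothetical.

```python
import re
import statistics

# Per-iteration lines copied from the output above.
LOG = """\
Iter #0: 270.8 img/sec per GPU
Iter #1: 268.7 img/sec per GPU
Iter #2: 270.9 img/sec per GPU
Iter #3: 266.5 img/sec per GPU
Iter #4: 267.5 img/sec per GPU
Iter #5: 267.9 img/sec per GPU
Iter #6: 269.4 img/sec per GPU
Iter #7: 269.9 img/sec per GPU
Iter #8: 265.9 img/sec per GPU
Iter #9: 268.0 img/sec per GPU
"""


def summarise(log_text, n_gpus):
    """Return (mean, conf, total mean, total conf) in img/sec."""
    rates = [float(x) for x in re.findall(r"Iter #\d+: ([\d.]+) img/sec", log_text)]
    mean = statistics.mean(rates)
    # Assumed definition of the +- figure: 1.96 * population std deviation.
    conf = 1.96 * statistics.pstdev(rates)
    return mean, conf, n_gpus * mean, n_gpus * conf


mean, conf, total, total_conf = summarise(LOG, n_gpus=8)
print(f"per GPU: {mean:.2f} +-{conf:.2f}, total: {total:.1f} +-{total_conf:.1f}")
```

The recomputed values agree with the reported 268.5 +-3.2 and 2148.3 +-25.6 to within the rounding of the printed per-iteration rates.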

Example 2: PyTorch MNIST benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_mnist.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_mnist.py

$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_mnist.py --epoch 50 --data-dir /g/data/wb00
Train Epoch: 1 [0/7500 (0%)] Loss: 2.309901
Train Epoch: 1 [0/7500 (0%)] Loss: 2.332907
Train Epoch: 1 [0/7500 (0%)] Loss: 2.359319
Train Epoch: 1 [0/7500 (0%)] Loss: 2.345739
Train Epoch: 1 [0/7500 (0%)] Loss: 2.337670
Train Epoch: 1 [0/7500 (0%)] Loss: 2.315854
Train Epoch: 1 [0/7500 (0%)] Loss: 2.344241
Train Epoch: 1 [0/7500 (0%)] Loss: 2.341527
Train Epoch: 1 [640/7500 (8%)] Loss: 2.253967
Train Epoch: 1 [640/7500 (8%)] Loss: 2.251903
Train Epoch: 1 [640/7500 (8%)] Loss: 2.238988
Train Epoch: 1 [640/7500 (8%)] Loss: 2.286994
Train Epoch: 1 [640/7500 (8%)] Loss: 2.227635
Train Epoch: 1 [640/7500 (8%)] Loss: 2.278378
Train Epoch: 1 [640/7500 (8%)] Loss: 2.272306
Train Epoch: 1 [640/7500 (8%)] Loss: 2.264846
...
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.059110
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.060333
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.189334
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.057295
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.029991
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.096529
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.243934
Train Epoch: 50 [6400/7500 (85%)] Loss: 0.134590
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.083785
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.173053
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.067444
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.035630
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.122231
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.134888
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.182437
Train Epoch: 50 [7040/7500 (93%)] Loss: 0.033505
Test set: Average loss: 0.0355, Accuracy: 98.89%
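
Each progress line in the output above appears eight times with a different loss, which is consistent with every rank logging progress on its own shard of the data. The counters can be reproduced with the sketch below, assuming the 60,000 MNIST training images are split evenly over 8 ranks and that the batch size (64) and logging interval (every 10 batches) are the defaults of the upstream Horovod script. The revised Gadi script may use different values, and the function name shard_progress is hypothetical.

```python
import math


def shard_progress(n_samples, n_ranks, batch_size, batch_idx):
    """Return (samples per rank, samples seen, percent of batches done)."""
    per_rank = math.ceil(n_samples / n_ranks)
    n_batches = math.ceil(per_rank / batch_size)
    seen = batch_idx * batch_size
    pct = 100.0 * batch_idx / n_batches
    return per_rank, seen, pct


# Assumed values: 60000 training images, 8 ranks, batch size 64.
for batch_idx in (0, 10, 100, 110):
    per_rank, seen, pct = shard_progress(60000, 8, 64, batch_idx)
    print(f"[{seen}/{per_rank} ({pct:.0f}%)]")
```

Under these assumptions each rank holds 7500 samples, and the logged positions 640, 6400 and 7040 correspond to batches 10, 100 and 110.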

We ran this benchmark on 1, 2, 4 and 8 GPUs. The walltime, test results and GPU usage of each run are listed below. The benchmark scales well up to 2 GPU nodes: walltime falls from 456 s on 1 GPU to 95 s on 8 GPUs, a speed-up of about 4.8 times, while test accuracy stays at about 99%.

GPUs	Walltime	Test results	Notes
1	456s (real 7m36.371s, user 11m10.381s, sys 1m22.378s)	Average loss: 0.0308, Accuracy: 99.04%	Only 1 GPU is used.
2	249s (real 4m9.104s, user 8m28.538s, sys 3m15.816s)	Average loss: 0.0329, Accuracy: 99.00%	The benchmark runs on 1 GPU per node, 2 GPUs in total.
4	143s (real 2m22.943s, user 9m5.666s, sys 4m15.375s)	Average loss: 0.0322, Accuracy: 99.04%	The benchmark runs on 2 GPUs per node, 4 GPUs in total.
8	95s (real 1m35.263s, user 9m57.525s, sys 5m43.280s)	Average loss: 0.0335, Accuracy: 98.92%	The benchmark runs on all 8 GPUs across 2 nodes.
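
The scaling behaviour can be quantified from the walltimes in the table. The sketch below computes the speed-up relative to the 1-GPU run and the parallel efficiency (speed-up divided by the number of GPUs). The function name scaling is hypothetical; the walltimes are copied from the table.

```python
# Walltime in seconds for each GPU count, from the table above.
walltimes = {1: 456, 2: 249, 4: 143, 8: 95}


def scaling(walltimes):
    """Return (gpus, speed-up, efficiency) rows relative to the smallest run."""
    base_n = min(walltimes)
    base_t = walltimes[base_n]
    rows = []
    for n in sorted(walltimes):
        speedup = base_t / walltimes[n]
        efficiency = speedup / (n / base_n)
        rows.append((n, speedup, efficiency))
    return rows


for n, speedup, efficiency in scaling(walltimes):
    print(f"{n} GPU(s): speed-up {speedup:.2f}x, efficiency {efficiency:.0%}")
```

This gives speed-ups of roughly 1.8x, 3.2x and 4.8x on 2, 4 and 8 GPUs, i.e. efficiencies of about 92%, 80% and 60%. The walltime covers the whole run, so fixed start-up costs are included in these figures.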

Example 3: PyTorch Lightning MNIST benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_lightning_mnist.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_lightning_mnist.py

# Running on 2 gpuvolta GPU nodes.
$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_lightning_mnist.py --data-dir /g/data/wb00/
Starting to init trainer!
Trainer is initialized.
Missing logger folder: /jobfs/50788088.gadi-pbs/tmplo32y2sb/logger/lightning_logs
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
-----------------------------------------
0 | conv1 | Conv2d | 260
1 | conv2 | Conv2d | 5.0 K
2 | conv2_drop | Dropout2d | 0
3 | fc1 | Linear | 16.1 K
4 | fc2 | Linear | 510
-----------------------------------------
21.8 K Trainable params
0 Non-trainable params
21.8 K Total params
0.087 Total estimated model params size (MB)
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:08<00:00, 14.75it/s, loss=0.526, v_num=0]
...
Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:02<00:00, 59.93it/s, loss=0.179, v_num=0]
Test set: Average loss: 0.0593, Accuracy: 98.10%

The monitoring information from gpustat shows this example doesn't utilise GPU resources heavily.
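
One plausible reason for the light GPU load is the size of the model, which the summary table in the output puts at only 21.8 K parameters. The parameter counts can be checked by hand. The sketch below assumes the layer shapes of the upstream Horovod MNIST example (5x5 convolutions with 1 to 10 and 10 to 20 channels, then linear layers of 320 to 50 and 50 to 10) and 4 bytes per parameter; these shapes are not shown on this page, but they reproduce the table.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel 2D convolution."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Assumed layer shapes (taken from the upstream example, not from this page).
layers = {
    "conv1": conv2d_params(1, 10, 5),
    "conv2": conv2d_params(10, 20, 5),
    "conv2_drop": 0,  # dropout has no parameters
    "fc1": linear_params(320, 50),
    "fc2": linear_params(50, 10),
}
total = sum(layers.values())
size_mb = total * 4 / 1e6  # assuming 32-bit floats

for name, count in layers.items():
    print(f"{name}: {count}")
print(f"total: {total} params, {size_mb:.3f} MB")
```

The total of 21840 parameters (about 0.087 MB) matches the summary printed by the trainer.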

Example 4: PyTorch ImageNet ResNet50 benchmark

Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_imagenet_resnet50.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/imagenet/horovod_pytorch_imagenet_resnet50.py

$ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/imagenet/horovod_pytorch_imagenet_resnet50.py --epochs 1 --train-dir /g/data/wb00/ImageNet/ILSVRC2012/raw-data/train --val-dir /g/data/wb00/ImageNet/ILSVRC2012/raw-data/validation

[0]<stderr>:Train Epoch #1: 0%| | 1/5005 [00:07<9:36:15, 6.91s/it, loss=7.11, accuracy=0]
[0]<stderr>:Train Epoch #1: 0%| | 3/5005 [00:07<2:16:49, 1.64s/it, loss=7.09, accuracy=0.09
[0]<stderr>:Train Epoch #1: 0%| | 4/5005 [00:07<1:26:47, 1.04s/it, loss=7.09, accuracy=0.09
[0]<stderr>:Train Epoch #1: 0%| | 4/5005 [00:07<1:26:47, 1.04s/it, loss=7.09, accuracy=0.07
[0]<stderr>:Train Epoch #1: 0%| | 5/5005 [00:07<59:07, 1.41it/s, loss=7.09, accuracy=0.0781
[0]<stderr>:Train Epoch #1: 0%| | 5/5005 [00:07<59:07, 1.41it/s, loss=7.09, accuracy=0.0651
[0]<stderr>:Train Epoch #1: 0%| | 6/5005 [00:07<42:26, 1.96it/s, loss=7.09, accuracy=0.0651
[0]<stderr>:Train Epoch #1: 0%| | 6/5005 [00:07<42:26, 1.96it/s, loss=7.08, accuracy=0.112]
[0]<stderr>:Train Epoch #1: 0%| | 7/5005 [00:07<31:54, 2.61it/s, loss=7.08, accuracy=0.112]
...
[0]<stderr>:Train Epoch #1: 100%|██████████| 5005/5005 [35:39<00:00, 2.34it/s, loss=5.65, accuracy=5.25]
[0]<stderr>:Validate Epoch #1: 1%| | 1/196 [00:02<08:28, 2.61s/it, loss=5.22, accuracy=11.1]
[0]<stderr>:Validate Epoch #1: 2%|▏ | 4/196 [00:04<02:19, 1.38it/s, loss=5.31, accuracy=11.2]
[0]<stderr>:Validate Epoch #1: 4%|▎ | 7/196 [00:05<02:49, 1.12it/s, loss=5.13, accuracy=11.8]
[0]<stderr>:Validate Epoch #1: 5%|▍ | 9/196 [00:07<02:16, 1.37it/s, loss=5.06, accuracy=11.9]
...
[0]<stderr>:Validate Epoch #1: 99%|█████████▉| 194/196 [01:32<00:00, 2.08it/s, loss=5.22, accuracy=13]
[0]<stderr>:Validate Epoch #1: 100%|██████████| 196/196 [01:32<00:00, 2.11it/s, loss=5.22, accuracy=13]

This benchmark heavily utilises all GPU devices.
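
The final training progress line (5005 iterations in 35:39) can be converted into an approximate image throughput. This sketch assumes a per-GPU batch size of 32 (the upstream default) and 8 GPUs. Neither value is stated for this run, although 5005 iterations is consistent with the 1,281,167 ImageNet training images divided into global batches of 8 x 32. The function names are hypothetical.

```python
def parse_elapsed(text):
    """Convert a progress-bar time such as 'MM:SS' or 'H:MM:SS' to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def throughput(iterations, elapsed, batch_per_gpu, n_gpus):
    """Return (iterations/sec, total images/sec, images/sec per GPU)."""
    secs = parse_elapsed(elapsed)
    it_per_sec = iterations / secs
    total = it_per_sec * batch_per_gpu * n_gpus
    return it_per_sec, total, total / n_gpus


# Assumed: batch size 32 per GPU and 8 GPUs (not stated for this run).
it_per_sec, total, per_gpu = throughput(5005, "35:39", 32, 8)
print(f"{it_per_sec:.2f} it/s, {total:.0f} img/s total, {per_gpu:.0f} img/s per GPU")
```

The computed rate of 2.34 it/s matches the figure in the progress bar; under the stated assumptions it corresponds to roughly 600 images per second in total.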
