Environments
Load the NCI-ai-ml module as follows:
    module use /g/data/dk92/apps/Modules/modulefiles
    module load NCI-ai-ml/22.08
Preparing the Dataset
Please note that the Gadi GPU nodes cannot connect to the internet, so you can't automatically download datasets in a PBS job. As an alternative, you can download your input dataset on the Gadi login node and specify the data location in your job script.
For example, you can download the MNIST dataset on the Gadi login node with the following script:
    from torchvision import datasets

    data_dir = "./data"
    datasets.MNIST(data_dir, download=True)
A copy of the MNIST dataset has also been placed under the project wb00, i.e. "/g/data/wb00/MNIST".
NCI also provides access to other AI/ML datasets on Gadi, such as ImageNet. Please join the project wb00 if you would like to access them.
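As a minimal sketch of "specify the data location in your job script", the snippet below picks a local MNIST copy and loads it with downloading disabled, so nothing tries to reach the internet from a GPU node. The helper names `find_mnist_root` and `load_mnist` and the search order are hypothetical. It assumes torchvision's usual layout, in which the dataset sits in an `MNIST` subdirectory of the root directory it is given.

```python
import os


def find_mnist_root(candidates):
    """Return the first directory that contains an 'MNIST' subdirectory, else None."""
    for root in candidates:
        if os.path.isdir(os.path.join(root, "MNIST")):
            return root
    return None


def load_mnist(root):
    """Load the MNIST training set from a local copy without downloading."""
    # Imported lazily so find_mnist_root works even without torchvision installed.
    from torchvision import datasets

    return datasets.MNIST(root, train=True, download=False)


# Hypothetical search order: the shared wb00 copy first, then a copy
# downloaded earlier on the login node into ./data.
data_root = find_mnist_root(["/g/data/wb00", "./data"])
if data_root is None:
    print("No local MNIST copy found; download it on the login node first.")
else:
    print(f"Using MNIST under {data_root}")
```

Inside a PBS job you would then call `load_mnist(data_root)`; with `download=False`, torchvision raises an error when the files are missing rather than attempting a download.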
Benchmark and Examples
Under the NCI-ai-ml module we have provided some examples which are taken from the Horovod repository. You can clone them on the Gadi login node from the reference link of each example case.
Revised versions of these examples, modified to read data from the Gadi local file system, are available under the current NCI-ai-ml module space, i.e. "${NCI_AI_ML_ROOT}/examples". The exact path is given with each example below.
You can monitor GPU utilisation at runtime with the gpustat tool.
This page describes how to run these examples with Horovod over MPI. You can also run them with Horovod over Gloo. For more details on using Horovod with the NCI-ai-ml module, please see here.
Example 1: PyTorch synthetic benchmark
Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py
Gadi location: ${NCI_AI_ML_ROOT}/examples/resnet/horovod_pytorch_synthetic_benchmark.py
    # Running on 2 gpuvolta GPU nodes.
    $ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/resnet/horovod_pytorch_synthetic_benchmark.py
    Model: resnet50
    Batch size: 32
    Number of GPUs: 8
    Running warmup...
    Running benchmark...
    Iter #0: 270.8 img/sec per GPU
    Iter #1: 268.7 img/sec per GPU
    Iter #2: 270.9 img/sec per GPU
    Iter #3: 266.5 img/sec per GPU
    Iter #4: 267.5 img/sec per GPU
    Iter #5: 267.9 img/sec per GPU
    Iter #6: 269.4 img/sec per GPU
    Iter #7: 269.9 img/sec per GPU
    Iter #8: 265.9 img/sec per GPU
    Iter #9: 268.0 img/sec per GPU
    Img/sec per GPU: 268.5 +-3.2
    Total img/sec on 8 GPU(s): 2148.3 +-25.6
By using the gpustat monitoring tool, we can see all 8 GPUs across 2 nodes are almost fully utilised.
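The two summary lines of the benchmark output can be approximately reproduced from the ten per-iteration figures. The sketch below assumes the summary is the mean plus or minus 1.96 times the population standard deviation, scaled by the number of GPUs for the total. That is an assumption about how the script computes it, not something stated on this page.

```python
from statistics import mean, pstdev

# Per-iteration throughput (img/sec per GPU) copied from the log above.
img_secs = [270.8, 268.7, 270.9, 266.5, 267.5,
            267.9, 269.4, 269.9, 265.9, 268.0]
num_gpus = 8  # "Number of GPUs: 8" in the log

per_gpu_mean = mean(img_secs)
# Assumed: a 95% interval taken as 1.96 x the population standard deviation.
per_gpu_conf = 1.96 * pstdev(img_secs)

print(f"Img/sec per GPU: {per_gpu_mean:.1f} +-{per_gpu_conf:.1f}")
print(f"Total img/sec on {num_gpus} GPU(s): "
      f"{num_gpus * per_gpu_mean:.1f} +-{num_gpus * per_gpu_conf:.1f}")
```

Because the logged values are already rounded to one decimal place, the recomputed numbers are close to the reported 268.5 and 2148.3 but need not match them to the last digit.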
Example 2: PyTorch MNIST benchmark
Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_mnist.py
Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_mnist.py
    $ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_mnist.py --epoch 50 --data-dir /g/data/wb00
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.309901
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.332907
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.359319
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.345739
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.337670
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.315854
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.344241
    Train Epoch: 1 [0/7500 (0%)] Loss: 2.341527
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.253967
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.251903
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.238988
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.286994
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.227635
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.278378
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.272306
    Train Epoch: 1 [640/7500 (8%)] Loss: 2.264846
    ...
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.059110
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.060333
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.189334
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.057295
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.029991
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.096529
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.243934
    Train Epoch: 50 [6400/7500 (85%)] Loss: 0.134590
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.083785
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.173053
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.067444
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.035630
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.122231
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.134888
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.182437
    Train Epoch: 50 [7040/7500 (93%)] Loss: 0.033505
    Test set: Average loss: 0.0355, Accuracy: 98.89%
We ran the benchmark with different numbers of GPUs. The walltime, result and GPU utilisation of each run are listed below. Walltime falls from 456s on 1 GPU to 95s on 8 GPUs, which shows good scalability up to 2 GPU nodes.
Number of GPUs | Walltime | Result | GPU utilisation |
---|---|---|---|
1 | 456s | Average loss: 0.0308 | Only 1 GPU is used. |
2 | 249s (real 4m9.104s) | Average loss: 0.0329 | The benchmark runs on 1 GPU per node, 2 GPUs in total. |
4 | 143s (real 2m22.943s) | Average loss: 0.0322 | The benchmark runs on 2 GPUs per node, 4 GPUs in total. |
8 | 95s | Average loss: 0.0335 | The benchmark runs on all 8 GPUs across 2 nodes. |
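To put a number on the scalability, the walltimes in the table can be turned into speedup and parallel efficiency relative to the single-GPU run. This is a small worked calculation rather than part of the benchmark; the function name `scaling` is illustrative.

```python
# Walltimes in seconds from the table above, keyed by number of GPUs.
walltime_s = {1: 456, 2: 249, 4: 143, 8: 95}


def scaling(walltimes):
    """Return (n_gpus, speedup, efficiency) tuples relative to the 1-GPU run."""
    base = walltimes[1]
    rows = []
    for n in sorted(walltimes):
        speedup = base / walltimes[n]
        rows.append((n, speedup, speedup / n))
    return rows


for n, speedup, efficiency in scaling(walltime_s):
    print(f"{n} GPU(s): speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```

On these figures the speedup is about 1.83x, 3.19x and 4.80x on 2, 4 and 8 GPUs, i.e. an efficiency of roughly 92%, 80% and 60%. The run keeps getting faster as GPUs are added, with diminishing returns for a model this small.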
Example 3: PyTorch Lightning MNIST benchmark
Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_lightning_mnist.py
Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_lightning_mnist.py
    # Running on 2 gpuvolta GPU nodes.
    $ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/mnist/horovod_pytorch_lightning_mnist.py --data-dir /g/data/wb00/
    Starting to init trainer!
    Trainer is initialized.
    Missing logger folder: /jobfs/50788088.gadi-pbs/tmplo32y2sb/logger/lightning_logs
    LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

      | Name       | Type      | Params
    -----------------------------------------
    0 | conv1      | Conv2d    | 260
    1 | conv2      | Conv2d    | 5.0 K
    2 | conv2_drop | Dropout2d | 0
    3 | fc1        | Linear    | 16.1 K
    4 | fc2        | Linear    | 510
    -----------------------------------------
    21.8 K    Trainable params
    0         Non-trainable params
    21.8 K    Total params
    0.087     Total estimated model params size (MB)
    Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:08<00:00, 14.75it/s, loss=0.526, v_num=0]
    ...
    Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [00:02<00:00, 59.93it/s, loss=0.179, v_num=0]
    Test set: Average loss: 0.0593, Accuracy: 98.10%
The monitoring information from gpustat shows this example doesn't utilise GPU resources heavily.
Example 4: PyTorch ImageNet ResNet50 benchmark
Reference: https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_imagenet_resnet50.py
Gadi location: ${NCI_AI_ML_ROOT}/examples/imagenet/horovod_pytorch_imagenet_resnet50.py
    $ mpirun -np ${PBS_NGPUS} --map-by node --bind-to socket python3 ${NCI_AI_ML_ROOT}/examples/imagenet/horovod_pytorch_imagenet_resnet50.py --epochs 1 --train-dir /g/data/wb00/ImageNet/ILSVRC2012/raw-data/train --val-dir /g/data/wb00/ImageNet/ILSVRC2012/raw-data/validation
    [0]<stderr>:Train Epoch #1: 0%| | 1/5005 [00:07<9:36:15, 6.91s/it, loss=7.11, accuracy=0]
    [0]<stderr>:Train Epoch #1: 0%| | 3/5005 [00:07<2:16:49, 1.64s/it, loss=7.09, accuracy=0.09
    [0]<stderr>:Train Epoch #1: 0%| | 4/5005 [00:07<1:26:47, 1.04s/it, loss=7.09, accuracy=0.09
    [0]<stderr>:Train Epoch #1: 0%| | 4/5005 [00:07<1:26:47, 1.04s/it, loss=7.09, accuracy=0.07
    [0]<stderr>:Train Epoch #1: 0%| | 5/5005 [00:07<59:07, 1.41it/s, loss=7.09, accuracy=0.0781
    [0]<stderr>:Train Epoch #1: 0%| | 5/5005 [00:07<59:07, 1.41it/s, loss=7.09, accuracy=0.0651
    [0]<stderr>:Train Epoch #1: 0%| | 6/5005 [00:07<42:26, 1.96it/s, loss=7.09, accuracy=0.0651
    [0]<stderr>:Train Epoch #1: 0%| | 6/5005 [00:07<42:26, 1.96it/s, loss=7.08, accuracy=0.112]
    [0]<stderr>:Train Epoch #1: 0%| | 7/5005 [00:07<31:54, 2.61it/s, loss=7.08, accuracy=0.112]
    ...
    [0]<stderr>:Train Epoch #1: 100%|██████████| 5005/5005 [35:39<00:00, 2.34it/s, loss=5.65, accuracy=5.25]
    [0]<stderr>:Validate Epoch #1: 1%| | 1/196 [00:02<08:28, 2.61s/it, loss=5.22, accuracy=11.1]
    [0]<stderr>:Validate Epoch #1: 2%|▏ | 4/196 [00:04<02:19, 1.38it/s, loss=5.31, accuracy=11.2]
    [0]<stderr>:Validate Epoch #1: 4%|▎ | 7/196 [00:05<02:49, 1.12it/s, loss=5.13, accuracy=11.8]
    [0]<stderr>:Validate Epoch #1: 5%|▍ | 9/196 [00:07<02:16, 1.37it/s, loss=5.06, accuracy=11.9]
    ...
    [0]<stderr>:Validate Epoch #1: 99%|█████████▉| 194/196 [01:32<00:00, 2.08it/s, loss=5.22, accuracy=13]
    [0]<stderr>:Validate Epoch #1: 100%|██████████| 196/196 [01:32<00:00, 2.11it/s, loss=5.22, accuracy=13]
This benchmark heavily utilises all GPU devices.
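The step counts in the progress bars (5005 training and 196 validation steps) can be checked with a little arithmetic. The sketch below assumes 8 GPUs and a per-GPU batch size of 32 for both training and validation. Neither value is stated for this run, so treat them as assumptions that happen to be consistent with the log. It also assumes each rank receives an equal, rounded-up share of the images.

```python
import math


def steps_per_epoch(num_samples, num_ranks, batch_size):
    """Steps each rank takes when samples are split evenly across ranks."""
    per_rank = math.ceil(num_samples / num_ranks)  # each rank's share, rounded up
    return math.ceil(per_rank / batch_size)        # last partial batch is kept


TRAIN_IMAGES = 1_281_167  # size of the ILSVRC2012 training set
VAL_IMAGES = 50_000       # size of the ILSVRC2012 validation set
NUM_GPUS = 8              # assumption: 2 gpuvolta nodes x 4 GPUs
BATCH_SIZE = 32           # assumption: per-GPU batch size

train_steps = steps_per_epoch(TRAIN_IMAGES, NUM_GPUS, BATCH_SIZE)
val_steps = steps_per_epoch(VAL_IMAGES, NUM_GPUS, BATCH_SIZE)
print(train_steps, val_steps)

# Throughput implied by the final training progress bar (35:39 elapsed).
elapsed_s = 35 * 60 + 39
steps_per_s = train_steps / elapsed_s
print(f"{steps_per_s:.2f} it/s, about {steps_per_s * NUM_GPUS * BATCH_SIZE:.0f} img/s in total")
```

Under these assumptions the computed step counts match the 5005 and 196 shown in the log, and the implied rate matches the reported 2.34it/s.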