Environments
You will need to load both the NCI-ai-ml and gadi_jupyterlab modules, as below:

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08 gadi_jupyterlab/22.06
Preparing the Dataset
Please note that the Gadi GPU nodes cannot connect to the internet, so datasets cannot be downloaded automatically within a PBS job. Instead, download your input dataset on a Gadi login node beforehand and specify its location in your job script.
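A minimal PBS job script along these lines might look like the sketch below. The resource requests are illustrative only, and `my_training_script.py` and `/scratch/<project>/<user>/data` are placeholders for your own script and pre-downloaded data location:

```shell
#!/bin/bash
#PBS -q gpuvolta
#PBS -l ngpus=8
#PBS -l ncpus=96
#PBS -l mem=382GB
#PBS -l walltime=01:00:00
#PBS -l wd

# Load the modules described above.
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08 gadi_jupyterlab/22.06

# Point the training script at data downloaded beforehand on a login node;
# the GPU nodes themselves cannot reach the internet.
python3 my_training_script.py --data-dir /scratch/<project>/<user>/data
```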
For example, you can download the MNIST dataset on the Gadi login node via the following script:

from torchvision import datasets
data_dir = "./data"
datasets.MNIST(data_dir, download=True)
A copy of the MNIST dataset has also been placed under the project wb00, i.e. "/g/data/wb00/MNIST".
NCI also provides access to some other AI/ML datasets on Gadi, such as ImageNet. Please join the project wb00 if you would like to access them.
Benchmark and Examples
Some examples are taken from the Ray repository. You can clone them on the Gadi login node using the reference link given in each example case.
You can also find the revised examples (with the data directory redirected to the Gadi local file system) under the current NCI-ai-ml module space, i.e. "${NCI_AI_ML_ROOT}/examples". The exact path is given in each example case as below.
You can monitor runtime GPU utilisation via the gpustat tool.
For details on using Ray with the NCI-ai-ml module, please see here.
Example 1: PyTorch MNIST benchmark
Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/ray_pytorch_mnist.py
Below is the output of running this example within a PBS job using 2 GPU nodes (8 GPU devices).
The start-up information shows 8 ranks set up across the two nodes, with the model then moved to each device.
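The mapping from ranks to nodes and CUDA devices visible in the log can be sketched as follows (a toy illustration of the layout, not part of the example script):

```python
# With 2 nodes of 4 GPUs each, the run starts world_size = 8 workers;
# each worker's local rank selects one CUDA device on its node, matching
# the "Moving model to device: cuda:N" lines in the log.
def rank_layout(num_nodes=2, gpus_per_node=4):
    world_size = num_nodes * gpus_per_node
    return {
        rank: {"node": rank // gpus_per_node,
               "device": f"cuda:{rank % gpus_per_node}"}
        for rank in range(world_size)
    }

layout = rank_layout()
```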
$ python $NCI_AI_ML_ROOT/examples/mnist/ray_pytorch_mnist.py -n ${PBS_NGPUS} --use-gpu
2022-08-10 16:13:27,113 INFO trainer.py:223 -- Trainer logs will be logged in: /home/900/rxy900/ray_results/train_2022-08-10_16-13-27
(BaseWorkerMixin pid=4166294) 2022-08-10 16:13:30,738 INFO torch.py:334 -- Setting up process group for: env:// [rank=7, world_size=8]
(BaseWorkerMixin pid=4166293) 2022-08-10 16:13:30,751 INFO torch.py:334 -- Setting up process group for: env:// [rank=6, world_size=8]
(BaseWorkerMixin pid=4166292) 2022-08-10 16:13:30,752 INFO torch.py:334 -- Setting up process group for: env:// [rank=5, world_size=8]
(BaseWorkerMixin pid=4166291) 2022-08-10 16:13:30,750 INFO torch.py:334 -- Setting up process group for: env:// [rank=4, world_size=8]
(BaseWorkerMixin pid=162379, ip=10.6.10.12) 2022-08-10 16:13:30,830 INFO torch.py:334 -- Setting up process group for: env:// [rank=0, world_size=8]
(BaseWorkerMixin pid=162382, ip=10.6.10.12) 2022-08-10 16:13:30,839 INFO torch.py:334 -- Setting up process group for: env:// [rank=3, world_size=8]
(BaseWorkerMixin pid=162381, ip=10.6.10.12) 2022-08-10 16:13:30,845 INFO torch.py:334 -- Setting up process group for: env:// [rank=2, world_size=8]
(BaseWorkerMixin pid=162380, ip=10.6.10.12) 2022-08-10 16:13:30,846 INFO torch.py:334 -- Setting up process group for: env:// [rank=1, world_size=8]
2022-08-10 16:13:31,826 INFO trainer.py:229 -- Run results will be logged in: /home/900/rxy900/ray_results/train_2022-08-10_16-13-27/run_001
(BaseWorkerMixin pid=162379, ip=10.6.10.12) 2022-08-10 16:13:37,432 INFO torch.py:92 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=162379, ip=10.6.10.12) 2022-08-10 16:13:37,461 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=162382, ip=10.6.10.12) 2022-08-10 16:13:37,427 INFO torch.py:92 -- Moving model to device: cuda:3
(BaseWorkerMixin pid=162380, ip=10.6.10.12) 2022-08-10 16:13:37,559 INFO torch.py:92 -- Moving model to device: cuda:1
(BaseWorkerMixin pid=162381, ip=10.6.10.12) 2022-08-10 16:13:37,617 INFO torch.py:92 -- Moving model to device: cuda:2
(BaseWorkerMixin pid=4166294) 2022-08-10 16:13:37,711 INFO torch.py:92 -- Moving model to device: cuda:3
(BaseWorkerMixin pid=4166292) 2022-08-10 16:13:37,639 INFO torch.py:92 -- Moving model to device: cuda:1
(BaseWorkerMixin pid=4166291) 2022-08-10 16:13:37,639 INFO torch.py:92 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=4166291) 2022-08-10 16:13:37,675 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=4166293) 2022-08-10 16:13:37,773 INFO torch.py:92 -- Moving model to device: cuda:2
(BaseWorkerMixin pid=162382, ip=10.6.10.12) 2022-08-10 16:13:38,422 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=162381, ip=10.6.10.12) 2022-08-10 16:13:38,535 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=162380, ip=10.6.10.12) 2022-08-10 16:13:38,522 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=4166294) 2022-08-10 16:13:38,631 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=4166292) 2022-08-10 16:13:38,628 INFO torch.py:126 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=4166293) 2022-08-10 16:13:38,774 INFO torch.py:126 -- Wrapping provided model in DDP
At the end of the output, each rank reports its own accuracy and average loss.
(BaseWorkerMixin pid=4166294) loss: 1.936992 [ 7200/ 7500]
(BaseWorkerMixin pid=4166293) loss: 2.004175 [ 7200/ 7500]
(BaseWorkerMixin pid=4166292) loss: 2.116247 [ 7200/ 7500]
(BaseWorkerMixin pid=4166291) loss: 1.992803 [ 7200/ 7500]
(BaseWorkerMixin pid=162379, ip=10.6.10.12) loss: 1.923531 [ 7200/ 7500]
(BaseWorkerMixin pid=162382, ip=10.6.10.12) loss: 2.598968 [ 7200/ 7500]
(BaseWorkerMixin pid=162381, ip=10.6.10.12) loss: 1.744774 [ 7200/ 7500]
(BaseWorkerMixin pid=162380, ip=10.6.10.12) loss: 1.906338 [ 7200/ 7500]
(BaseWorkerMixin pid=4166294) Test Error:
(BaseWorkerMixin pid=4166294) Accuracy: 48.4%, Avg loss: 1.859378
(BaseWorkerMixin pid=4166294)
(BaseWorkerMixin pid=4166293) Test Error:
(BaseWorkerMixin pid=4166293) Accuracy: 47.3%, Avg loss: 1.888940
(BaseWorkerMixin pid=4166293)
(BaseWorkerMixin pid=4166292) Test Error:
(BaseWorkerMixin pid=4166292) Accuracy: 46.5%, Avg loss: 1.911980
(BaseWorkerMixin pid=4166292)
(BaseWorkerMixin pid=4166291) Test Error:
(BaseWorkerMixin pid=4166291) Accuracy: 46.5%, Avg loss: 1.898387
(BaseWorkerMixin pid=4166291)
(BaseWorkerMixin pid=162379, ip=10.6.10.12) Test Error:
(BaseWorkerMixin pid=162379, ip=10.6.10.12) Accuracy: 47.4%, Avg loss: 1.884367
(BaseWorkerMixin pid=162379, ip=10.6.10.12)
(BaseWorkerMixin pid=162382, ip=10.6.10.12) Test Error:
(BaseWorkerMixin pid=162382, ip=10.6.10.12) Accuracy: 48.6%, Avg loss: 1.872148
(BaseWorkerMixin pid=162382, ip=10.6.10.12)
(BaseWorkerMixin pid=162381, ip=10.6.10.12) Test Error:
(BaseWorkerMixin pid=162381, ip=10.6.10.12) Accuracy: 45.8%, Avg loss: 1.892654
(BaseWorkerMixin pid=162381, ip=10.6.10.12)
(BaseWorkerMixin pid=162380, ip=10.6.10.12) Test Error:
(BaseWorkerMixin pid=162380, ip=10.6.10.12) Accuracy: 46.4%, Avg loss: 1.907122
(BaseWorkerMixin pid=162380, ip=10.6.10.12)
Loss results: [[2.224302826413683, 2.121929356246997, 1.994212176389755, 1.8843671979418226], [2.2261931486190503, 2.1300062024669284, 2.0105923748320076, 1.9071215406344955], [2.226132330621124, 2.126165986820391, 2.0006669721785624, 1.8926543301078165], [2.2204210834138713, 2.1165097412789704, 1.9868468387871032, 1.8721482882833784], [2.228307853079146, 2.1300130308054057, 2.005946327166952, 1.898387407800954], [2.228195699157229, 2.1333496631330746, 2.015632641543249, 1.9119798476528969], [2.2267418424035332, 2.1260054050737125, 1.999349110445399, 1.888940335838658], [2.219858216631944, 2.1121316457250314, 1.9763364594453459, 1.8593782960988914]]
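Each inner list in "Loss results" holds one rank's per-epoch losses (4 epochs here). To reduce this to a single training curve, one might average across ranks, e.g. with a small post-processing sketch like the following (not part of the example script):

```python
# Average the per-epoch losses across ranks: `results` is a list of
# per-rank lists, each containing one loss value per epoch.
def mean_epoch_losses(results):
    n_ranks = len(results)
    n_epochs = len(results[0])
    return [sum(rank[epoch] for rank in results) / n_ranks
            for epoch in range(n_epochs)]
```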
You can monitor GPU utilisation via the gpustat tool as below. It shows that this example heavily utilises the GPU devices.