Environments 

You will need to load both the 'NCI-ai-ml' and 'gadi_jupyterlab' modules as shown below:

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08 gadi_jupyterlab/22.06

The 'gadi_jupyterlab' module is used to set up the Ray cluster.
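For reference, a minimal PBS job script that loads these modules might look like the following sketch. The project code and the storage, memory and walltime requests are placeholders; adjust them to your own project and workload (on the gpuvolta queue, each GPU must be requested together with 12 CPU cores).

```shell
#!/bin/bash
#PBS -q gpuvolta
#PBS -P <project>              # your NCI project code (placeholder)
#PBS -l ngpus=8                # 8 GPUs = 2 gpuvolta nodes (4 GPUs per node)
#PBS -l ncpus=96               # gpuvolta requires 12 CPU cores per GPU
#PBS -l mem=760GB              # placeholder; size to your job
#PBS -l walltime=01:00:00
#PBS -l storage=gdata/dk92+gdata/wb00
#PBS -l wd

# Load the AI/ML environment and the JupyterLab/Ray helper module
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/22.08 gadi_jupyterlab/22.06
```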

Prepare the Dataset

Please note that Gadi GPU nodes cannot connect to the internet, so datasets cannot be downloaded automatically within a PBS job. Instead, download your input dataset on a Gadi login node and point the script executed in your PBS job to that location.

For example, you can download the MNIST dataset on a Gadi login node via the following TensorFlow API, which caches the file under ~/.keras/datasets:

tf.keras.datasets.mnist.load_data()

A copy of the MNIST dataset is also available under "/g/data/wb00/MNIST/npz". NCI also provides access to some AI/ML datasets, such as ImageNet, on Gadi. Please join project wb00 if you would like to access them.
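If you prefer not to rely on the Keras download cache, a local .npz copy like this can also be loaded directly with NumPy. The following is a minimal sketch; the exact file name under /g/data/wb00/MNIST/npz and the array key names ('x_train', 'y_train', 'x_test', 'y_test', as used in the Keras-distributed mnist.npz) are assumptions you should verify against the actual file.

```python
import numpy as np

def load_mnist_npz(path):
    """Load an MNIST .npz archive (Keras layout assumed) from a local path.

    Returns (x_train, y_train), (x_test, y_test), mirroring the return
    value of tf.keras.datasets.mnist.load_data().
    """
    with np.load(path) as data:
        # Key names follow the mnist.npz file shipped for Keras;
        # inspect them with data.files if your copy differs.
        return (data["x_train"], data["y_train"]), (data["x_test"], data["y_test"])

# Hypothetical usage on Gadi (the file name is an assumption):
# (x_train, y_train), (x_test, y_test) = load_mnist_npz("/g/data/wb00/MNIST/npz/mnist.npz")
```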

Benchmark and Examples

Some examples are taken from the Ray repository. You can clone them on a Gadi login node from the reference link given for each example case.

You can also find revised versions of the examples (with the data directory redirected to the Gadi local file system) under the current NCI-ai-ml module space, i.e. "${NCI_AI_ML_ROOT}/examples". The exact path is given for each example case below.

You can monitor the runtime GPU utilisation via the gpustat tool.

For more details on using Ray with the NCI-ai-ml module, please refer to the NCI documentation.

Example 1: TensorFlow MNIST benchmark

Reference: https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tensorflow_mnist_example.py

Gadi location: ${NCI_AI_ML_ROOT}/examples/mnist/ray_tensorflow_mnist.py

You can run the example as below with 8 GPUs (i.e. 2 gpuvolta GPU nodes). From its start-up information, we can see that 8 GPU devices are used in the Ray cluster.

$ python $NCI_AI_ML_ROOT/examples/mnist/ray_tensorflow_mnist.py -n ${PBS_NGPUS} --use-gpu
2022-08-10 17:07:42,327 INFO trainer.py:223 -- Trainer logs will be logged in: /home/900/rxy900/ray_results/train_2022-08-10_17-07-42
2022-08-10 17:07:47,625 INFO trainer.py:229 -- Run results will be logged in: /home/900/rxy900/ray_results/train_2022-08-10_17-07-42/run_001
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:47.956994: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164396, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:47.906882: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164394, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:47.977785: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170012) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:47.934390: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170013) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:47.884217: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170010) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.015297: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164395, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.151559: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164397, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.121281: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170011) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.826591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.832490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:5/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.842818: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.843293: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:60421
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.855935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.861014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:6/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.873149: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.873456: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:42355
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.842712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.849770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:7/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.865833: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.866318: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:57715
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.840350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.844701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:4/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.854210: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.854448: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:41421
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.813775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.818832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:2/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.828559: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.828828: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:53477
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.816269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.820912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:3/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.834523: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.834990: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:46449
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.837828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.842887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.856112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.856568: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:51865
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.824444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.829377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:1/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.839946: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.840232: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:56733
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:51.007341: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
...

At the end of the run, each rank prints out its loss and accuracy.

...
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
69/70 [============================>.] - ETA: 0s - loss: 2.1512 - accuracy: 0.5079
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
70/70 [==============================] - 5s 73ms/step - loss: 2.1506 - accuracy: 0.5089
Results: {'loss': [2.282585382461548, 2.22174334526062, 2.150625228881836], 'accuracy': [0.15200893580913544, 0.34743303060531616, 0.5088727474212646]}

The gpustat monitoring tool shows that the example runs on all 8 GPU devices, although it does not utilise them heavily.