PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the Modified BSD licence.


How to use 

You can check the versions installed on Gadi with a module query:

 $ module avail pytorch

We normally recommend using the latest available version, and always recommend specifying the version number in the module command:

 $ module load pytorch/1.10.0

For more details on using modules see our software applications guide.
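
Version strings do not sort correctly as plain text (`1.9.0` would rank above `1.10.0`), so picking the latest module with a naive sort can go wrong. The following is a minimal sketch of a numeric comparison. The list of module names is hypothetical and stands in for whatever `module avail pytorch` reports on your system, and the sketch assumes purely numeric version components.

```python
def version_key(module_name):
    """Turn 'pytorch/1.10.0' into (1, 10, 0) for numeric comparison."""
    version = module_name.split("/", 1)[1]
    return tuple(int(part) for part in version.split("."))


def latest_module(module_names):
    """Return the module name with the highest numeric version."""
    return max(module_names, key=version_key)


# Hypothetical listing; replace with the names reported by `module avail pytorch`.
available = ["pytorch/1.9.0", "pytorch/1.10.0"]
print(latest_module(available))  # pytorch/1.10.0
```

A plain `max(available)` would return `pytorch/1.9.0` here, which is why the version is split into integers first.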

Run PyTorch Elastic Distributed Data Parallel code

The following section describes how to run a demo PyTorch Elastic Distributed Data Parallel (DDP) code available at the end of the page:

NCCL as communication backend and torchrun as DDP initialiser 

An example PBS job submission script is provided below. This script runs the demo code, which uses NCCL as the communication backend and torchrun as the initialiser of DDP.

It requests 96 CPUs, 8 GPUs, 760 GiB of memory, and 800 GiB of local disk on 2 compute nodes from Gadi's gpuvolta queue for 30 minutes, charged against project a00. It also asks the system to enter the working directory once the job has started. Save this script in the working directory from which the analysis will be run.

#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l wd
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`

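# Set variables
# Number of nodes, rounded up so that a request smaller than one node gives 1
NNODES=$(( (PBS_NCPUS + PBS_NCI_NCPUS_PER_NODE - 1) / PBS_NCI_NCPUS_PER_NODE ))
PROC_PER_NODE=$((PBS_NGPUS / NNODES))
MASTER_ADDR=$(head -n 1 $PBS_NODEFILE)
# Launch script (placeholder path; set this to the launch script shown below)
LAUNCH_SCRIPT=/path/to/launch_script
# Set execute permission
chmod u+x ${LAUNCH_SCRIPT}
# Run PyTorch application
for inode in $(seq 1 $PBS_NCI_NCPUS_PER_NODE $PBS_NCPUS); do
  pbsdsh -n $inode ${LAUNCH_SCRIPT} ${NNODES} ${PROC_PER_NODE} ${MASTER_ADDR} &
done
# Wait for the background launches to finish before the job exits
wait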
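
The values passed to the launch script follow from the resource request by simple arithmetic. The sketch below works through it in Python, assuming one process per GPU and 48 CPUs per node (inferred from 96 CPUs spread over 2 nodes). The function name `job_layout` is hypothetical and is not part of PBS or PyTorch.

```python
def job_layout(ncpus, ncpus_per_node, ngpus):
    """Derive node count, processes per node, and the seq-style start indices."""
    # At least one node, even when fewer CPUs than a full node are requested.
    nnodes = max(ncpus // ncpus_per_node, 1)
    # Assumption: one process per GPU, with GPUs spread evenly over the nodes.
    proc_per_node = ngpus // nnodes
    # Mirrors `seq 1 $PBS_NCI_NCPUS_PER_NODE $PBS_NCPUS` in the job script.
    start_indices = list(range(1, ncpus + 1, ncpus_per_node))
    return nnodes, proc_per_node, start_indices


# Values from the request above; 48 CPUs per node is an inferred assumption.
nnodes, proc_per_node, start_indices = job_layout(ncpus=96, ncpus_per_node=48, ngpus=8)
print(nnodes, proc_per_node, start_indices)  # 2 4 [1, 49]
```

With these inputs the loop in the job script runs twice, once for each value in `start_indices`.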

The content of the launch script is as follows:

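#!/bin/bash
# Load shell environment variables
source ~/.bashrc
# Load module, always specify version number.
module load pytorch/1.10.0
# Application script (placeholder path; set this to the application script shown below)
APPLICATION_SCRIPT=/path/to/application_script.py
# Set execute permission
chmod u+x ${APPLICATION_SCRIPT}
# Run PyTorch application
# Arguments: ${1} = number of nodes, ${2} = processes per node, ${3} = master address
torchrun --nnodes=${1} --nproc_per_node=${2} --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${3}:29400 ${APPLICATION_SCRIPT}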
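
Each worker started by torchrun receives its position in the job through environment variables, including `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The following sketch shows one way a script might read them. The helper name `read_torchrun_env` and the single-process defaults are assumptions made for illustration, not part of PyTorch.

```python
import os


def read_torchrun_env(env=None):
    """Collect the rank information that torchrun exports to each worker."""
    env = os.environ if env is None else env
    return {
        # Defaults (assumed here) describe a single process run without torchrun.
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


# Hand-written environment standing in for what one of eight workers might see.
info = read_torchrun_env({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
print(info)
```

Passing a dictionary instead of relying on `os.environ` keeps the helper easy to try outside a real job.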

The content of the application script is as follows:

# This is a demo PyTorch Elastic Distributed Data Parallel (DDP)
# code which uses NCCL as the communication backend and torchrun
# as the initialiser of DDP.
# This example is available at the end of the following page:
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    rank = dist.get_rank()
    node = os.uname()[1]
    # Create the model and move it to the GPU whose ID is the rank
    # modulo the number of GPUs visible on this node
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])
    print(
        f"Start running basic DDP example with process rank {rank} "
        f"and GPU ID {device_id} on host {node}."
    )
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    # Run one training step on random data
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()


if __name__ == "__main__":
    # torchrun sets the environment variables that init_process_group reads
    dist.init_process_group("nccl")
    demo_basic()
    dist.destroy_process_group()
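
The demo selects a GPU with `rank % torch.cuda.device_count()`. The sketch below reproduces that arithmetic in plain Python so the resulting mapping can be inspected without a GPU. It assumes 8 processes and 4 GPUs visible on each node, matching the job request above, and the helper name `device_for_rank` is hypothetical.

```python
def device_for_rank(rank, gpus_per_node):
    """Same arithmetic as `rank % torch.cuda.device_count()` in the demo."""
    return rank % gpus_per_node


# Assumed layout: 8 processes in total, 4 GPUs visible on each node.
mapping = {rank: device_for_rank(rank, 4) for rank in range(8)}
print(mapping)
# {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
```

The modulo gives each process on a node its own GPU only if ranks are assigned to nodes in contiguous blocks, which is an assumption of this sketch.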

To run the job you would use the PBS command:

 $ qsub

Check the job's standard error file for any errors, and its standard output file for any output and the time consumed.

Running jobs in an interactive way is also possible. Please see the details in our job submission guide.

MPI as communication backend and DDP initialiser 

Change Line 29 of the application script so that it uses MPI as the communication backend, save the result under a new filename, and replace Lines 16 to 31 of the job submission script with:

# Load module, always specify version number.
module load pytorch/1.10.0

# Run PyTorch application
mpirun -np ${PBS_NGPUS} -map-by numa:SPAN -mca coll_hcoll_enable 0 -mca pml ^ucx python3

Save the modified job script under a new filename and submit it.
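
The `mpirun` call starts one process per GPU by passing `${PBS_NGPUS}` to `-np`. As a hedged illustration, the sketch below assembles the same argument list in Python, which makes each flag and its value easy to see. The function name `build_mpirun_command` and the filename `app_mpi.py` are hypothetical. The flags are copied from the command above.

```python
import shlex


def build_mpirun_command(ngpus, script):
    """Assemble the mpirun call from the job script as an argument list."""
    return [
        "mpirun", "-np", str(ngpus),       # one MPI process per GPU
        "-map-by", "numa:SPAN",
        "-mca", "coll_hcoll_enable", "0",
        "-mca", "pml", "^ucx",
        "python3", script,
    ]


# "app_mpi.py" is a hypothetical filename for the modified application script.
command = build_mpirun_command(8, "app_mpi.py")
print(shlex.join(command))
```

Keeping the command as a list avoids shell quoting problems with values such as `^ucx` if it is later passed to `subprocess.run`.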

Authors: Mohsin Ali
