Overview

PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing. Originally developed by Meta AI, it is now part of the Linux Foundation umbrella. It is free and open-source software released under the Modified BSD license.

More information: https://pytorch.org/

Usage

You can check the versions installed on Gadi with a module query:

$ module avail pytorch

We normally recommend using the latest version available, and we always recommend specifying the version number with the module command:

$ module load pytorch/1.10.0
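After loading the module, you can confirm that the expected version is active (the printed version should match the module you loaded):

$ python3 -c "import torch; print(torch.__version__)"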

For more details on using modules see our modules help guide at https://opus.nci.org.au/display/Help/Environment+Modules.

Run PyTorch Elastic Distributed Data Parallel code

The following section describes how to run a demo PyTorch Elastic Distributed Data Parallel (DDP) code available at the end of the page: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

NCCL as communication backend and torchrun as DDP initialiser

An example PBS job submission script named elastic_ddp_nccl_job.sh is provided below. It runs the demo code using NCCL as the communication backend and torchrun as the DDP initialiser. The script requests 96 CPU cores, 8 GPUs, 760 GiB of memory and 800 GiB of local disk across 2 compute nodes in Gadi's gpuvolta queue, for exclusive access, for 30 minutes, charged against the project a00. It also requests that the job start in the directory from which it was submitted (#PBS -l wd). Save this script in the working directory from which the analysis will be run.

#!/bin/bash

#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`. Details on:
# https://opus.nci.org.au/display/Help/PBS+Directives+Explained

# Set variables

# Number of nodes allocated to the job
if [[ $PBS_NCPUS -ge $PBS_NCI_NCPUS_PER_NODE ]]
then
  NNODES=$((PBS_NCPUS / PBS_NCI_NCPUS_PER_NODE))
else
  NNODES=1
fi

# Number of torchrun worker processes (one per GPU) on each node
PROC_PER_NODE=$((PBS_NGPUS / NNODES))

# The first node listed in the nodefile acts as the rendezvous host
MASTER_ADDR=$(head -n 1 $PBS_NODEFILE)

# Launch script
LAUNCH_SCRIPT=/path/to/launch_elastic_ddp_nccl.sh

# Set execute permission
chmod u+x ${LAUNCH_SCRIPT}

# Run PyTorch application: pbsdsh launches one copy of the launch
# script per node, each in the background
for inode in $(seq 1 $PBS_NCI_NCPUS_PER_NODE $PBS_NCPUS); do
  pbsdsh -n $inode ${LAUNCH_SCRIPT} ${NNODES} ${PROC_PER_NODE} ${MASTER_ADDR} &
done
wait
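With the resource request above (2 gpuvolta nodes, each providing 48 CPU cores and 4 GPUs), these variables evaluate to:

NNODES=2          # 96 CPU cores / 48 cores per node
PROC_PER_NODE=4   # 8 GPUs / 2 nodes
MASTER_ADDR=<hostname of the first node in $PBS_NODEFILE>

so the loop calls pbsdsh once per node, each call running the launch script in the background, and wait blocks until all of them have finished.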

The content of the launch_elastic_ddp_nccl.sh file is as follows:

#!/bin/bash

# Load shell environment variables
source ~/.bashrc

# Load module, always specify version number.
module load pytorch/1.10.0

# Application script
APPLICATION_SCRIPT=/path/to/elastic_ddp_nccl.py

# Set execute permission
chmod u+x ${APPLICATION_SCRIPT}

# Run PyTorch application with torchrun. ${1}, ${2} and ${3} are the
# NNODES, PROC_PER_NODE and MASTER_ADDR values passed in by the job script:
# torchrun starts ${2} worker processes on this node and uses the c10d
# rendezvous service on ${3}:29400 to form the process group across nodes.
torchrun --nnodes=${1} --nproc_per_node=${2} --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${3}:29400 ${APPLICATION_SCRIPT}

The content of the elastic_ddp_nccl.py file is as follows:

# This is a demo PyTorch Elastic Distributed Data Parallel (DDP)
# code which uses NCCL as the communication backend and torchrun
# as the initialiser of DDP.
#
# This example is available at the end of the following page:
# https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    node = os.uname()[1]

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    print(
        f"Start running basic DDP example with process rank {rank} "
        f"and GPU ID {device_id} on host {node}."
    )

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Tear down the process group cleanly before the process exits
    dist.destroy_process_group()

if __name__ == "__main__":
    demo_basic()

To run the job you would use the PBS command:

$ qsub elastic_ddp_nccl_job.sh

Check the file elastic_ddp_nccl_job.sh.e**** for any errors, and elastic_ddp_nccl_job.sh.o**** for the program output and the time consumed.
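For example, if PBS assigned the (hypothetical) job ID 12345678, the log files could be inspected with:

$ cat elastic_ddp_nccl_job.sh.e12345678
$ cat elastic_ddp_nccl_job.sh.o12345678

A successful run should print one "Start running basic DDP example ..." line per worker process (8 in this example), along with the resource and time usage reported at the end of the output file.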

Running jobs in an interactive way is also possible. Please see the details at https://opus.nci.org.au/display/Help/0.+Welcome+to+Gadi#id-0.WelcometoGadi-InteractiveJobs.
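As a rough sketch (assuming the same project and queue as above; see the linked page for the recommended options and any required storage directives), a single-node interactive session could be requested with something like:

$ qsub -I -P a00 -q gpuvolta -l ncpus=48 -l ngpus=4 -l mem=380GB -l jobfs=400GB -l walltime=00:30:00 -l wd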

MPI as communication backend and DDP initialiser

In the elastic_ddp_nccl.py file, replace the line dist.init_process_group("nccl") with

    dist.init_process_group("mpi")

and save the modified file as elastic_ddp_mpi.py. Then, in the elastic_ddp_nccl_job.sh job script, replace everything below the PBS directives and the storage comment (from the # Set variables line down to the final wait) with

# Load module, always specify version number.
module load pytorch/1.10.0

# Run PyTorch application
mpirun -np ${PBS_NGPUS} -map-by numa:SPAN -mca coll_hcoll_enable 0 -mca pml ^ucx python3 elastic_ddp_mpi.py

Save the modified job script as elastic_ddp_mpi_job.sh and submit it with qsub as before.
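For reference, the assembled elastic_ddp_mpi_job.sh would look as follows (same resource requests as before; adjust the project code and any storage directives to your own):

#!/bin/bash

#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Load module, always specify version number.
module load pytorch/1.10.0

# Run PyTorch application
mpirun -np ${PBS_NGPUS} -map-by numa:SPAN -mca coll_hcoll_enable 0 -mca pml ^ucx python3 elastic_ddp_mpi.py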