How to use
You can check the versions installed on Gadi with a module query:
Code Block
$ module avail pytorch
We normally recommend using the latest version available, and we always recommend specifying the version number with the module command:
Code Block
$ module load pytorch/1.10.0
For more details on using modules see our software applications guide.
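Once the module is loaded, you can confirm which PyTorch build is on your path, for example by printing the version from Python:
Code Block
$ module load pytorch/1.10.0
$ python3 -c "import torch; print(torch.__version__)"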
Run PyTorch Elastic Distributed Data Parallel code
This section describes how to run the demo PyTorch Elastic Distributed Data Parallel (DDP) code provided at the end of the following page: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
NCCL as the communication backend and torchrun as the DDP initialiser
An example PBS job submission script named elastic_ddp_nccl_job.sh is provided below. It runs the demo code using NCCL as the communication backend and torchrun as the initialiser of DDP.
The script requests 96 CPU cores, 8 GPUs, 760 GiB of memory and 800 GiB of local disk across 2 compute nodes in the gpuvolta queue on Gadi, for 30 minutes, charged against project a00. It also asks PBS to start the job in the directory from which it was submitted. Save this script in the working directory from which the analysis will be run.
Code Block
#!/bin/bash
#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l wd
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`
# Set variables
if [[ $PBS_NCPUS -ge $PBS_NCI_NCPUS_PER_NODE ]]
then
    NNODES=$((PBS_NCPUS / PBS_NCI_NCPUS_PER_NODE))
else
    NNODES=1
fi
PROC_PER_NODE=$((PBS_NGPUS / NNODES))
MASTER_ADDR=$(cat $PBS_NODEFILE | head -n 1)
# Launch script
LAUNCH_SCRIPT=/path/to/launch_elastic_ddp_nccl.sh
# Set execute permission
chmod u+x ${LAUNCH_SCRIPT}
# Run PyTorch application
for inode in $(seq 1 $PBS_NCI_NCPUS_PER_NODE $PBS_NCPUS); do
    pbsdsh -n $inode ${LAUNCH_SCRIPT} ${NNODES} ${PROC_PER_NODE} ${MASTER_ADDR} &
done
wait
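With the resource request above, and given that each gpuvolta node provides 48 CPU cores and 4 GPUs, the variables work out roughly as sketched below, so the pbsdsh loop starts one copy of the launch script on each of the two nodes:
Code Block
# NNODES        = 96 CPUs / 48 CPUs per node = 2
# PROC_PER_NODE = 8 GPUs  / 2 nodes          = 4
# MASTER_ADDR   = hostname of the first node listed in $PBS_NODEFILE
# seq 1 48 96 yields the indices 1 and 49, i.e. one pbsdsh call per node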
The content of the launch_elastic_ddp_nccl.sh
file is as follows:
Code Block
#!/bin/bash
# Load shell environment variables
source ~/.bashrc
# Load module, always specify version number.
module load pytorch/1.10.0
# Application script
APPLICATION_SCRIPT=/path/to/elastic_ddp_nccl.py
# Set execute permission
chmod u+x ${APPLICATION_SCRIPT}
# Run PyTorch application
# ${1} = NNODES, ${2} = PROC_PER_NODE, ${3} = MASTER_ADDR, as passed in by pbsdsh
torchrun --nnodes=${1} --nproc_per_node=${2} --rdzv_id=100 --rdzv_backend=c10d \
    --rdzv_endpoint=${3}:29400 ${APPLICATION_SCRIPT}
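With the two-node, eight-GPU request above, each node therefore effectively runs something along these lines (the hostname placeholder stands for the first node in $PBS_NODEFILE):
Code Block
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d \
    --rdzv_endpoint=<first-node-hostname>:29400 /path/to/elastic_ddp_nccl.py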
The content of the elastic_ddp_nccl.py
file is as follows:
Code Block
# This is a demo PyTorch Elastic Distributed Data Parallel (DDP)
# code which uses NCCL as the communication backend and torchrun
# as the initialiser of DDP.
#
# This example is available at the end of the following page:
# https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    node = os.uname()[1]

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])
    print(
        f"Start running basic DDP example with process rank {rank} "
        f"and GPU ID {device_id} on host {node}."
    )

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

if __name__ == "__main__":
    demo_basic()
To submit the job, use the PBS command:
Code Block
$ qsub elastic_ddp_nccl_job.sh
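While the job is queued or running, you can monitor it with the usual PBS commands, for example (the job ID placeholder stands for the ID returned by qsub):
Code Block
$ qstat -swx <jobid>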
Check the file elastic_ddp_nccl_job.sh.e**** for any errors, and elastic_ddp_nccl_job.sh.o**** for the output and a summary of the resources consumed.
Running jobs in an interactive way is also possible. Please see the details in our job submission guide.
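As a rough illustration only (see the job submission guide for the authoritative syntax), an interactive session with the same resources could be requested along these lines:
Code Block
$ qsub -I -P a00 -q gpuvolta -l ncpus=96,ngpus=8,mem=760GB,jobfs=800GB,walltime=00:30:00 -l wd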
MPI as the communication backend and DDP initialiser
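Note that the mpi backend can only be used if the installed PyTorch build was compiled with MPI support. One quick way to check is, for example:
Code Block
$ module load pytorch/1.10.0
$ python3 -c "import torch.distributed as dist; print(dist.is_mpi_available())"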
To switch to MPI, replace the dist.init_process_group("nccl") call in the elastic_ddp_nccl.py file with
Code Block
dist.init_process_group("mpi")
and save the modified file as elastic_ddp_mpi.py. Then, in the elastic_ddp_nccl_job.sh file, replace everything below the #PBS directives and the storage comment (the variable setup, the launch script handling and the pbsdsh loop) with
Code Block
# Load module, always specify version number.
module load pytorch/1.10.0
# Run PyTorch application
mpirun -np ${PBS_NGPUS} -map-by numa:SPAN -mca coll_hcoll_enable 0 -mca pml ^ucx \
    python3 elastic_ddp_mpi.py
Save the modified job script as elastic_ddp_mpi_job.sh and submit it with qsub in the same way as before.
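For reference, the resulting elastic_ddp_mpi_job.sh would look roughly as follows, assuming the same resource request and project as above and that elastic_ddp_mpi.py sits in the submission directory:
Code Block
#!/bin/bash
#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l wd
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`

# Load module, always specify version number.
module load pytorch/1.10.0

# Run PyTorch application
mpirun -np ${PBS_NGPUS} -map-by numa:SPAN -mca coll_hcoll_enable 0 -mca pml ^ucx \
    python3 elastic_ddp_mpi.py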