PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing. It was originally developed by Meta AI and is now a project under the Linux Foundation umbrella. It is free and open-source software released under the Modified BSD licence.
More information: https://pytorch.org/
You can check the versions installed on Gadi with a module query:
$ module avail pytorch
We normally recommend using the latest available version, and we always recommend specifying the version number in the module command:
$ module load pytorch/1.10.0
For more details on using modules see our software applications guide.
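Once a pytorch module is loaded, a quick way to confirm that the installation works and can see the GPUs is to run a short check from a GPU node (for example inside an interactive gpuvolta job). The snippet below is only an illustrative check and the filename is hypothetical:

# check_pytorch.py - illustrative sanity check (hypothetical filename):
# report the loaded PyTorch version and whether CUDA devices are visible.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available :", torch.cuda.is_available())
print("GPU count      :", torch.cuda.device_count())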
The following section describes how to run a demo PyTorch Elastic Distributed Data Parallel (DDP) code, available at the end of the PyTorch DDP tutorial page: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
An example PBS job submission script named elastic_ddp_nccl_job.sh is provided below. It runs the demo code, using NCCL as the communication backend and torchrun as the initialiser of DDP.
It requests 96 CPUs, 8 GPUs, 760 GiB of memory, and 800 GiB of local disk across 2 compute nodes in Gadi's gpuvolta queue for 30 minutes, charged against the project a00. It also asks PBS to start the job in the directory from which it was submitted (#PBS -l wd). Since each gpuvolta node provides 48 CPUs and 4 GPUs, the script below works out NNODES=2 and PROC_PER_NODE=4, that is, one worker process per GPU. The script should be saved in the working directory from which the analysis will be done.
#!/bin/bash

#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`

# Set variables
if [[ $PBS_NCPUS -ge $PBS_NCI_NCPUS_PER_NODE ]]
then
    NNODES=$((PBS_NCPUS / PBS_NCI_NCPUS_PER_NODE))
else
    NNODES=1
fi

PROC_PER_NODE=$((PBS_NGPUS / NNODES))

MASTER_ADDR=$(cat $PBS_NODEFILE | head -n 1)

# Launch script
LAUNCH_SCRIPT=/path/to/launch_elastic_ddp_nccl.sh

# Set execute permission
chmod u+x ${LAUNCH_SCRIPT}

# Run PyTorch application
for inode in $(seq 1 $PBS_NCI_NCPUS_PER_NODE $PBS_NCPUS); do
    pbsdsh -n $inode ${LAUNCH_SCRIPT} ${NNODES} ${PROC_PER_NODE} ${MASTER_ADDR} &
done
wait
The content of the launch_elastic_ddp_nccl.sh file is as follows:
#!/bin/bash

# Load shell environment variables
source ~/.bashrc

# Load module, always specify version number.
module load pytorch/1.10.0

# Application script
APPLICATION_SCRIPT=/path/to/elastic_ddp_nccl.py

# Set execute permission
chmod u+x ${APPLICATION_SCRIPT}

# Run PyTorch application
torchrun --nnodes=${1} --nproc_per_node=${2} --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${3}:29400 ${APPLICATION_SCRIPT}
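Here torchrun starts PROC_PER_NODE worker processes on each node and coordinates them through the c10d rendezvous at MASTER_ADDR on port 29400. Each worker is handed its identity through environment variables such as RANK, LOCAL_RANK and WORLD_SIZE, which dist.init_process_group() picks up when it is called without explicit arguments. As a rough illustration (the filename is hypothetical and not part of the demo), a worker could print them like this:

# rank_env.py - illustrative only (hypothetical filename): print the
# environment variables that torchrun sets for each worker process.
import os

for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(var, "=", os.environ.get(var))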
The content of the elastic_ddp_nccl.py file is as follows:
# This is a demo PyTorch Elastic Distributed Data Parallel (DDP)
# code which uses NCCL as the communication backend and torchrun
# as the initialiser of DDP.
#
# This example is available at the end of the following page:
# https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    node = os.uname()[1]

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])
    print(
        f"Start running basic DDP example with process rank {rank} "
        f"and GPU ID {device_id} on host {node}."
    )

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Tidy up the process group once the work is finished.
    dist.destroy_process_group()


if __name__ == "__main__":
    demo_basic()
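In a real training run you would normally also save checkpoints, and with DDP this is typically done from a single rank because every rank holds identical weights after each optimiser step. A minimal sketch of that pattern, assuming the ddp_model object from the demo above and a hypothetical output path, is:

# Sketch only: save a DDP checkpoint from rank 0 (the path is hypothetical).
import torch
import torch.distributed as dist

def save_checkpoint(ddp_model, path="/path/to/toy_model_checkpoint.pt"):
    if dist.get_rank() == 0:
        # ddp_model.module is the underlying, unwrapped ToyModel.
        torch.save(ddp_model.module.state_dict(), path)
    # Keep all ranks in step so none tries to read the file before it exists.
    dist.barrier()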
To submit the job, use the PBS command:
$ qsub elastic_ddp_nccl_job.sh
Check the file elastic_ddp_nccl_job.sh.e**** (where **** is the job ID) for any errors, and elastic_ddp_nccl_job.sh.o**** for the output and the resources consumed. A successful run prints one "Start running basic DDP example ..." line from each of the eight worker processes.
Running jobs interactively is also possible; please see the details in our job submission guide.
To use MPI as the communication backend instead, change the dist.init_process_group("nccl") call in the elastic_ddp_nccl.py file to
dist.init_process_group("mpi")
(let's say the new filename is elastic_ddp_mpi.py), and replace everything in the elastic_ddp_nccl_job.sh file after the PBS directives (the variable setup, launch-script handling and pbsdsh loop) with:
# Load module, always specify version number.
module load pytorch/1.10.0

# Run PyTorch application
mpirun -np ${PBS_NGPUS} -map-by numa:SPAN -mca coll_hcoll_enable 0 -mca pml ^ucx python3 elastic_ddp_mpi.py
(let's say the new job script is named elastic_ddp_mpi_job.sh), then submit elastic_ddp_mpi_job.sh with qsub in the same way.
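Note that the mpi backend only works if the PyTorch build itself was compiled with MPI support. If in doubt, you can check which torch.distributed backends the loaded module provides before submitting; the snippet below is illustrative only and the filename is hypothetical:

# check_backends.py - illustrative only (hypothetical filename): report which
# torch.distributed backends this PyTorch build supports.
import torch.distributed as dist

print("NCCL available:", dist.is_nccl_available())
print("Gloo available:", dist.is_gloo_available())
print("MPI available :", dist.is_mpi_available())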