torchrun: Multi-node Distributed Training

PyTorch provide the native API, i.e. torchrun, to enable multiple node distributed training based on DistributedDataParallel (DDP).

It is necessary to execute torchrun at each working node. You need to specify a batch of environment variables in the PBS job script and produce a wrapper script to run torchrun as described in the instruction page of /apps/pytorch module.

To simplify using torchrun on Gadi, the NCI-AI-ML environment provides a single wrapper script called "torchrun_nccl.sh" to wrap the whole process.

You can simply submit a job script as below

PBS job script

#!/bin/bash
 
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l storage=gdata/dk92+scratch/a00
#PBS -l wd
#PBS -N torchrun_test

# Must include `#PBS -l storage=gdata/dk92+scratch/ab12` if the job
# needs access to `/scratch/a00/` and using NCI-AI-ML module.
# Details on:
# https://opus.nci.org.au/display/Help/PBS+Directives+Explained
 
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/23.05

torchrun_nccl.sh  ${NCI_AI_ML_ROOT}/examples/pytorch/torchrun_example.py 50 10

In the above job script, the example PyTorch script "torchrun_example.py" slightly modified a PyTorch Tutorial script multinode.py. This script accepts two arguments (i.e. "50 10" in the notebook above) to indicate the total epochs to train the model and how often to save a snapshot.

The script "torchrun_example.py" contains the following lines to utilise DDP.

Import DDP and other PyTorch distributed libraries.
- DistributedDataParallel (DDP) is multi-process and works for both single- and multi- machine training
- DDP uses collective communications in the torch.distributed package to synchronize gradients and buffers.
- The distributed process group contains all the processes that can communicate and synchronize with each other.
```
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
```
Leverage PyTorch Elastic to simplify the DDP code and initialize the job by setting the process group communication protocol being "nccl".
```
def ddp_setup():
    init_process_group(backend="nccl")
```
DDP also works with multi-GPU models as below. DDP wrapping multi-GPU models is especially helpful when training large models with a huge amount of data.
```
self.model = DDP(self.model, device_ids=[self.local_rank])
```

Distribute training data around different processes with DistributedSampler.

return DataLoader(
  dataset,
  batch_size=batch_size,
  pin_memory=True,
  shuffle=False,
  sampler=DistributedSampler(dataset)
)

After running the above job script you could see two new output files produced with each per GPU node, i.e. output.SCRIPT_NAME.HOSTNAME.log

output.torchrun_example.py.gadi-gpu-v100-0031.log
output.torchrun_example.py.gadi-gpu-v100-0132.log

Each output file shows there are 4 GPUs at each local node together with their global rank IDs.

[rxy900@gadi-gpu-v100-0129 torchrun_examples]$ more output.torchrun_example.py.gadi-gpu-v100-0031.log
[GPU0] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU3] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU1] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU2] Epoch 0 | Batchsize: 32 | Steps: 8
...
[rxy900@gadi-gpu-v100-0129 torchrun_examples]$ more output.torchrun_example.py.gadi-gpu-v100-0060.log
[GPU4] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU6] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU7] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU5] Epoch 0 | Batchsize: 32 | Steps: 8
...

You could find both torchrun_nccl.sh and torchrun_example.py under the NCI-AI-ML module space:

$NCI_AI_ML_BASE/bin/torchrun_nccl.sh
$NCI_AI_ML_BASE/examples/pytorch/torchrun_example.py

Page tree

torchrun: Multi-node Distributed Training