PyTorch Lightning is a lightweight PyTorch wrapper that simplifies the training and research process for complex deep learning models. It provides a high-level interface for organizing PyTorch code and facilitates best practices in terms of reproducibility, scalability, and maintainability.

Distributed Data Parallel (DDP) is a distributed training strategy commonly used in deep learning frameworks like PyTorch to train neural networks across multiple GPUs or multiple machines. It is particularly useful when dealing with large datasets or complex models that require significant computational resources.

PyTorch Lightning's DDPStrategy (Distributed Data Parallel Strategy) makes it easy to enable distributed training across multiple GPUs or machines. Under the hood it leverages PyTorch's torch.nn.parallel.DistributedDataParallel (DDP) module to manage communication and synchronisation between processes. By abstracting away the complexities of setting up and managing distributed communication, it lets you scale your models to multiple GPUs or machines without significantly modifying your training code, making it easier to exploit parallelism and accelerate training on large datasets or complex models.

Example

NCI provides the following example to demonstrate how to run PyTorch Lightning with DDP across multiple GPU nodes. 

/g/data/dk92/apps/NCI-ai-ml/23.10/examples/lightning/pl-mnist-ddp.py

You can test it with the following NCI Specialised Environments or via your own software environment. 

Note

You must join project wb00 to access the MNIST dataset used by this example.

Environment

You will need to load the NCI-ai-ml module to run the above example script.

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/23.10

PBS job script

You must request GPU resources in a PBS job to run the example script. An example PBS job script is shown below; on Gadi it is located at "/g/data/dk92/apps/NCI-ai-ml/23.10/examples/lightning/plrun.pbs".

The script "plrun_nccl.sh" is a wrapper that simplifies the execution of PyTorch Lightning DDP scripts.

#!/bin/bash
  
#PBS -q gpuvolta
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=760GB
#PBS -l jobfs=800GB
#PBS -l walltime=00:30:00
#PBS -l storage=gdata/dk92+gdata/wb00+scratch/a00
#PBS -l wd
#PBS -N plrun_test
 
# Must include `#PBS -l storage=gdata/dk92+gdata/wb00+scratch/a00` if the job
# needs access to `/scratch/a00/` while using the NCI-ai-ml module.
# Details on:
# https://opus.nci.org.au/display/Help/PBS+Directives+Explained
  
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml/23.10
 
plrun_nccl.sh  ${NCI_AI_ML_ROOT}/examples/lightning/pl-mnist-ddp.py >& output.log


The default number of epochs is 10. You can override it with the '--epochs' flag, e.g.

plrun_nccl.sh  ${NCI_AI_ML_ROOT}/examples/lightning/pl-mnist-ddp.py --epochs 50
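The script's argument handling is not shown on this page; a flag like --epochs with a default of 10 is typically wired up with argparse and passed to the Trainer. The following is a hedged sketch under that assumption, not the actual pl-mnist-ddp.py source.

```python
# Hedged sketch of how an `--epochs` flag with a default of 10
# might be wired up; illustrative only, not the actual script.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="PyTorch Lightning MNIST DDP example")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of training epochs (default: 10)")
    return parser.parse_args(argv)

args = parse_args(["--epochs", "50"])
print(args.epochs)  # 50
# The parsed value would then be handed to the Trainer, e.g.
# trainer = pl.Trainer(max_epochs=args.epochs, ...)
```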

Outputs

The above PBS job requests 2 GPU nodes (8 V100 GPU devices in total). It produces the following outputs:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Detecting the available accelerators, i.e. GPUs on Gadi.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Lightning finds and initialises 8 GPU devices across 2 nodes. Each device is assigned a unique GLOBAL_RANK in the range 0 to 7.
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
Confirms that the distributed backend is NCCL and that all 8 distributed processes have registered.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
The LOCAL_RANK denotes each process's rank within a single node. Each V100 GPU node is equipped with 4 GPU devices, so the LOCAL_RANK lies in the range 0 to 3.
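The relationship between the two rank indices in the log above can be stated precisely: with 4 GPUs per node, the GLOBAL_RANK follows from the node index and the LOCAL_RANK. The small illustration below uses a helper function of our own, not a Lightning API.

```python
# Illustration of the rank layout in the log above: 2 nodes x 4 GPUs.
#   global_rank = node_rank * gpus_per_node + local_rank
# (this helper is illustrative, not a Lightning API)

GPUS_PER_NODE = 4
NUM_NODES = 2

def global_rank(node_rank: int, local_rank: int) -> int:
    return node_rank * GPUS_PER_NODE + local_rank

ranks = [(n, l, global_rank(n, l))
         for n in range(NUM_NODES)
         for l in range(GPUS_PER_NODE)]
for node, local, glob in ranks:
    print(f"node {node}  LOCAL_RANK {local}  GLOBAL_RANK {glob}")
# LOCAL_RANK repeats 0..3 on each node while GLOBAL_RANK covers 0..7,
# matching the log output above.
```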
| Name    | Type    | Params
-----------------------------
0 | encoder | Encoder | 108 K 
1 | decoder | Decoder | 109 K
-----------------------------
218 K     Trainable params
0         Non-trainable params
218 K     Total params
0.875     Total estimated model params size (MB)
Shows the total number of network parameters and the estimated model size.
The job then runs 10 training/validation epochs followed by a test run.
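The "0.875 MB" figure in the summary follows from simple arithmetic: Lightning estimates the FP32 model size as the parameter count times 4 bytes. The exact count is rounded to "218 K" in the log, so the value of 218,750 below is an assumption chosen to be consistent with the reported size.

```python
# Sanity check of the "Total estimated model params size (MB)" line.
# Lightning estimates FP32 size as: params * 4 bytes / 1e6 (MB).
# 218,750 is an assumed exact count, consistent with the rounded
# "218 K" and the reported 0.875 MB; the true count is not in the log.
BYTES_PER_FP32_PARAM = 4
total_params = 218_750  # assumption

size_mb = total_params * BYTES_PER_FP32_PARAM / 1e6
print(size_mb)  # 0.875
```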




