PyTorch Lightning is a lightweight PyTorch wrapper that simplifies the training and research process for complex deep learning models. It provides a high-level interface for organizing PyTorch code and facilitates best practices in terms of reproducibility, scalability, and maintainability.
Distributed Data Parallel (DDP) is a distributed training strategy commonly used in deep learning frameworks like PyTorch to train neural networks across multiple GPUs or multiple machines. It is particularly useful when dealing with large datasets or complex models that require significant computational resources.
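The core idea of DDP is that each process (one per GPU) sees a disjoint shard of the dataset and the gradients are averaged across processes. The sketch below illustrates the sharding scheme in plain Python, in the same spirit as PyTorch's DistributedSampler; the function name `shard_indices` is our own illustrative helper, not a PyTorch API.

```python
# Illustrative sketch (plain Python, no torch required) of how DDP-style
# training shards a dataset across ranks: rank r of world_size w sees
# samples r, r + w, r + 2w, ...

def shard_indices(dataset_len, world_size, rank):
    """Return the sample indices assigned to one rank."""
    return list(range(rank, dataset_len, world_size))

# With 8 samples and 4 GPUs (world_size = 4), each rank gets 2 samples
# and every sample is seen exactly once per epoch.
world_size = 4
shards = [shard_indices(8, world_size, r) for r in range(world_size)]
print(shards)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In real DDP training this sharding is handled automatically for you (PyTorch Lightning attaches a DistributedSampler to your dataloaders), so each GPU trains on its own slice of every batch.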
PyTorch Lightning's DDPStrategy (Distributed Data Parallel Strategy) makes it easy to enable distributed training across multiple GPUs or machines. Under the hood it uses PyTorch's torch.nn.parallel.DistributedDataParallel (DDP) module to manage communication and synchronisation between processes. By abstracting away the complexity of setting up and managing distributed communication, it lets you scale a model to multiple GPUs or machines without significant changes to your training code, making it easier to exploit parallelism and accelerate training on large datasets or complex models.
Example
NCI provides the following example to demonstrate how to run PyTorch Lightning with DDP across multiple GPU nodes.
/g/data/dk92/apps/NCI-ai-ml/23.10/examples/lightning/pl-mnist-ddp.py
You can test it with the following NCI Specialised Environments or via your own software environment.
Note
You must join the wb00 project to access the MNIST dataset used by this example.
Environment
You will need to load the NCI-ai-ml module to run the above example script.
module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml
PBS job script
You must request GPU resources in a PBS job to run the example script. An example PBS job script is located on Gadi at "/g/data/dk92/apps/NCI-ai-ml/23.10/examples/lightning/plrun.pbs".
The script "plrun_nccl.sh" is a wrapper that simplifies running a PyTorch Lightning DDP script.
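Please consult the actual plrun.pbs file on Gadi for the authoritative version; the following is only an illustrative sketch of what a 2-node (8 x V100) request on the gpuvolta queue typically looks like, with the project code and resource figures as placeholders:

```bash
#!/bin/bash
#PBS -q gpuvolta
#PBS -P <your_project>
#PBS -l ncpus=96          # 48 CPU cores per gpuvolta node x 2 nodes
#PBS -l ngpus=8           # 4 V100 GPUs per node x 2 nodes
#PBS -l mem=760GB
#PBS -l walltime=01:00:00
#PBS -l storage=gdata/dk92+gdata/wb00

module use /g/data/dk92/apps/Modules/modulefiles
module load NCI-ai-ml

plrun_nccl.sh ${NCI_AI_ML_ROOT}/examples/lightning/pl-mnist-ddp.py
```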
The default number of epochs is 10. You can override it with the '--epochs' flag, e.g.
plrun_nccl.sh ${NCI_AI_ML_ROOT}/examples/lightning/pl-mnist-ddp.py --epochs 50
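A flag like --epochs is typically exposed inside the training script with standard argparse; the snippet below is a hedged sketch of that pattern (the actual pl-mnist-ddp.py on Gadi may implement it differently):

```python
import argparse

# Sketch of how a training script might expose an --epochs flag
# with a default of 10, matching the behaviour described above.
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10,
                    help="number of training epochs (default: 10)")

# Parsing an explicit argument list here for illustration; a real
# script would call parser.parse_args() on the command line.
args = parser.parse_args(["--epochs", "50"])
print(args.epochs)  # 50
```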
Outputs
The above PBS job requests 2 GPU nodes with 8 V100 GPU devices in total. It produces the following outputs:
| Output | Description |
|---|---|
| `GPU available: True (cuda), used: True` | Detects the available accelerators, i.e. GPUs on Gadi. |
| `Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8` | Finds and initialises 8 GPU devices across 2 nodes. Each device is assigned a unique GLOBAL_RANK in the range 0~7. |
| `distributed_backend=nccl` | Confirms the distributed backend is NCCL, with 8 distributed processes registered. |
| `LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]` | LOCAL_RANK denotes a device's rank within a single node. Each V100 GPU node is equipped with 4 GPU devices, so LOCAL_RANK lies in the range 0~3. |
| The `Name`/`Type`/`Params` model summary table | Shows the model layers and the total number of network parameters. |
| Ten training/validation epochs followed by a test run | The training, validation, and test loops running to completion. |