Overview

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. It provides routines such as all-gather, all-reduce, broadcast, reduce and reduce-scatter, as well as point-to-point send and receive, all optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node, and over NVIDIA Mellanox networking across nodes.

More information: https://developer.nvidia.com/nccl

Usage

You can check the versions installed on Gadi with a module query:

$ module avail nccl

We normally recommend using the latest available version, and we always recommend specifying the version number with the module command:

$ module load nccl/2.10.3-cuda11.4

For more details on using modules see our modules help guide at https://opus.nci.org.au/display/Help/Environment+Modules.

Compile source code that uses MPI, CUDA and NCCL with the following commands:

# Load modules, always specify version number.
$ module load nccl/2.10.3-cuda11.4
$ module load openmpi/4.1.1

$ nvcc -o nccl_application.exe -g -lm -lstdc++ -lmpi -lcudart -lnccl <nccl application>.cu
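
As a concrete illustration of what a source file compiled this way could contain, the sketch below performs an in-place all-reduce across all GPUs visible to a single process. It is a minimal sketch only: the buffer contents are arbitrary, error checking is omitted, and the names used are illustrative rather than part of any NCI-provided example. Because this particular sketch does not use MPI, the -lmpi flag in the command above is not strictly needed for it.

// Minimal sketch: a single process using every GPU visible to it.
// Error checking is omitted for brevity; see the NCCL user guide for
// the error-checking wrappers used in the official examples.
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    const size_t count = 1 << 20;   // number of floats per GPU

    ncclComm_t   *comms   = (ncclComm_t *)  malloc(nDev * sizeof(ncclComm_t));
    float       **buffs   = (float **)      malloc(nDev * sizeof(float *));
    cudaStream_t *streams = (cudaStream_t *)malloc(nDev * sizeof(cudaStream_t));

    // One communicator per visible device, all driven by this process.
    ncclCommInitAll(comms, nDev, NULL);

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void **)&buffs[i], count * sizeof(float));
        cudaMemset(buffs[i], 0, count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // In-place sum across all devices. The calls are grouped because
    // they are issued for several devices from the same host thread.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buffs[i], buffs[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buffs[i]);
        ncclCommDestroy(comms[i]);
    }

    printf("All-reduce completed on %d GPU(s)\n", nDev);
    free(comms); free(buffs); free(streams);
    return 0;
}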

A complete working example with multiple MPI processes and multiple GPU devices per process is available at https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html.
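
The part of such a program that differs most from single-process code is communicator creation: rank 0 generates a unique NCCL identifier with ncclGetUniqueId, broadcasts it over MPI, and every process then calls ncclCommInitRank for each GPU it manages. The sketch below shows that initialisation step only, assuming one communicator per local GPU and an illustrative devsPerRank parameter that is not part of the NVIDIA example; refer to the link above for the complete, tested program.

// Sketch of NCCL communicator setup inside an MPI program.
// devsPerRank (GPUs managed per MPI process) is an illustrative parameter;
// error checking is omitted for brevity.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void init_nccl_comms(int devsPerRank, ncclComm_t *comms /* length devsPerRank */)
{
    int myRank, nRanks;
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    // Rank 0 creates the unique NCCL id; MPI distributes it to all ranks.
    ncclUniqueId id;
    if (myRank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    // Each MPI process manages devsPerRank GPUs; the global NCCL rank of
    // local device d is myRank * devsPerRank + d.
    ncclGroupStart();
    for (int d = 0; d < devsPerRank; ++d) {
        cudaSetDevice(d);
        ncclCommInitRank(&comms[d], nRanks * devsPerRank, id,
                         myRank * devsPerRank + d);
    }
    ncclGroupEnd();
}

Once the communicators exist, collectives such as ncclAllReduce are issued on each of them in the same way as in the single-process sketch above.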

An example PBS job submission script named nccl_job.sh is provided below. It requests 48 CPU cores, 4 GPUs, 350 GiB of memory, and 400 GiB of local disk on a compute node in the gpuvolta queue on Gadi, for exclusive access for 30 minutes, charged against project a00. It also asks PBS to enter the working directory once the job starts. Save this script in the working directory from which the analysis will be run. To change the number of CPU cores, memory, or jobfs required, modify the corresponding PBS resource requests at the top of the job script file according to the information available at https://opus.nci.org.au/display/Help/Queue+Structure. Note that if your application does not run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs requests accordingly to avoid wasting compute resources.

#!/bin/bash

#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=48
#PBS -l ngpus=4
#PBS -l mem=350GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Load modules, always specify version number.
module load nccl/2.10.3-cuda11.4
module load openmpi/4.1.1

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`. Details on:
# https://opus.nci.org.au/display/Help/PBS+Directives+Explained

# Run application
# The following will run 1 MPI process per node and each MPI process
# will use 4 GPUs as there are 4 GPUs in each GPU node. 
mpirun -np $PBS_NNODES --map-by ppr:1:node nccl_application.exe

To submit the job to the queue, use the PBS command:

$ qsub nccl_job.sh

See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html for detailed NCCL documentation.