The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking.

It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
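As a hedged sketch of what one of these collectives looks like from a single process driving all visible GPUs, the following performs a sum all-reduce with one communicator per device. Buffer sizes and variable names are illustrative, and error checking is omitted for brevity:

```cuda
// Minimal sketch: sum all-reduce across all visible GPUs from one process.
// Illustrative only; real code should check every NCCL/CUDA return value.
#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  ncclComm_t comms[8];
  cudaStream_t streams[8];
  float *sendbuf[8], *recvbuf[8];
  const size_t count = 1 << 20;            // elements per GPU (arbitrary)

  // One communicator per device, all within this process.
  ncclCommInitAll(comms, ndev, NULL);      // NULL => use devices 0..ndev-1

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
  }

  // Group the per-device calls so NCCL issues them as one collective.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```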


How to use 

You can check which versions are installed on Gadi with a module query:

$ module avail nccl

We normally recommend using the latest available version, and we always recommend specifying the version number in the module command:

$ module load nccl/2.10.3-cuda11.4

For more details on using modules see our software applications guide.

Compile source code that uses MPI, CUDA, and NCCL with the following commands:

# Load modules, always specify version number.
$ module load nccl/2.10.3-cuda11.4
$ module load openmpi/4.1.1
$ nvcc -o nccl_application.exe -g -lm -lstdc++ -lmpi -lcudart -lnccl <nccl application>.cu

A complete working example with multiple MPI processes and multiple GPU devices per process is available at
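As a hedged outline of how such a program typically bootstraps NCCL over MPI, one rank generates a unique ID, broadcasts it to all ranks, and each process then creates one communicator per local GPU. The `NGPUS_PER_PROC` value of 4 below matches a gpuvolta node but is otherwise an assumption, and the sketch is not the full example referenced above:

```cuda
// Sketch: initialise NCCL across MPI processes, each process driving
// NGPUS_PER_PROC local GPUs (4 assumed, as on a Gadi gpuvolta node).
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

#define NGPUS_PER_PROC 4

int main(int argc, char *argv[]) {
  int rank, nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Rank 0 creates the unique ID; every rank must receive the same one.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  // Global NCCL rank space: nprocs * NGPUS_PER_PROC ranks in total.
  ncclComm_t comms[NGPUS_PER_PROC];
  ncclGroupStart();
  for (int i = 0; i < NGPUS_PER_PROC; ++i) {
    cudaSetDevice(i);
    ncclCommInitRank(&comms[i], nprocs * NGPUS_PER_PROC, id,
                     rank * NGPUS_PER_PROC + i);
  }
  ncclGroupEnd();

  // ... collectives (ncclAllReduce, etc.) would go here ...

  for (int i = 0; i < NGPUS_PER_PROC; ++i) ncclCommDestroy(comms[i]);
  MPI_Finalize();
  return 0;
}
```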

An example PBS job submission script is provided below.

It requests 48 CPU cores, 4 GPUs, 350 GiB of memory, and 400 GiB of local disk on a compute node in the gpuvolta queue on Gadi for 30 minutes, charged against the project a00. It also requests that the job start in the directory from which it was submitted. This script should be saved in the working directory from which the analysis will be run.

To change the number of CPU cores, memory, or jobfs required, simply modify the corresponding PBS resource requests at the top of the job script according to the information in our queue structure guide.

Note that if your application cannot run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs requests accordingly, to avoid wasting compute resources.

#PBS -P a00
#PBS -q gpuvolta
#PBS -l ncpus=48
#PBS -l ngpus=4
#PBS -l mem=350GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` among the
# directives above if the job needs access to `/scratch/ab12/` and
# `/g/data/yz98/`

# Load modules, always specify version number.
module load nccl/2.10.3-cuda11.4
module load openmpi/4.1.1
# Run application
# The following will run 1 MPI process per node and each MPI process
# will use 4 GPUs as there are 4 GPUs in each GPU node.
mpirun -np $PBS_NNODES --map-by ppr:1:node nccl_application.exe

To submit the job, use the PBS qsub command with the job script as its argument:

$ qsub

See the NVIDIA NCCL documentation for detailed information on NCCL.

Authors: Mohsin Ali