MPI (Message Passing Interface) is a standard interface for explicitly passing messages between the processes of a parallel program. To take advantage of MPI, you must add message passing constructs to your program.
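
To give a flavour of these constructs, here is a minimal MPI program in C (a sketch only; the file name hello_mpi.c used below is purely illustrative). Each process reports its rank and the total number of processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start the MPI runtime          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank (id) of this process      */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of MPI processes  */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut down the MPI runtime      */
    return 0;
}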

Usage

Both OpenMPI and Intel MPI are supported on Raijin.

OpenMPI

OpenMPI is our preferred package. It is built specifically to run under Raijin's PBS and InfiniBand and is normally very fast. We recommend using the latest installed version, but if you need an older one, the following versions seem to be the most stable: 1.6.5, 1.8.8 and 1.10.2. The current OpenMPI module is loaded by default on login. To load a specific version, do

module load intel-fc/16.0.3.210
module load intel-cc/16.0.3.210
module load openmpi/1.10.2

This sets up the MPI compiler wrappers to use the Intel compilers, while

module unload intel-fc intel-cc
module load openmpi/1.10.2

sets them up to use the GNU gcc and gfortran compilers.

If in doubt, use the mpif90 -v or mpicc -v command to see which compiler is being used.
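
OpenMPI's wrapper compilers also accept a --showme option, which prints the full underlying compile command (compiler, include paths and MPI libraries) without running it, for example mpif90 --showme. Note this option is specific to the OpenMPI wrappers.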

Use mpif77, mpif90, mpicc or mpic++ to build and link an MPI program. This way you do not need to specify any additional MPI-specific libraries; the MPI compiler wrapper takes care of this itself. For example,

mpif90 x.F90

will compile the x.F90 program and create an MPI executable a.out.
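
Similarly, the C sketch from the introduction could be built with the C wrapper, using -o to name the executable (the file and executable names here are illustrative):

mpicc -o hello_mpi hello_mpi.c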

Running an MPI program

We recommend using the mpirun command to run an MPI program. For example, assuming that we have an MPI program a.out in the current directory, the following script will run it on 256 CPUs:

#!/bin/bash
#PBS -q normal
#PBS -l walltime=20:00:00
#PBS -l ncpus=256
#PBS -l mem=256GB
#PBS -l jobfs=1GB
#PBS -l wd

module load openmpi/1.10.2

mpirun ./a.out > output

Note that you do not need to give any special parameters to mpirun, as it is integrated with PBS and "knows" which nodes to use for the program. It is good practice to load the openmpi module that was used to build the program; however, on Raijin this is not strictly necessary. You can load any openmpi version module and it will automatically switch to the correct one at run time.
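
As with any batch job, save the script to a file and submit it with qsub in the usual way, e.g. qsub run_mpi.sh (the script name here is illustrative).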

Many modern programs can run in so-called mixed mode, i.e. MPI communication is used between nodes (or physical CPUs) and OpenMP threads are used within a node (or physical CPU). This may give better performance. If you have a program that supports such a hybrid mode, the following script shows how to run it:

#!/bin/bash
#PBS -q normal
#PBS -l walltime=20:00:00
#PBS -l ncpus=256
#PBS -l mem=256GB
#PBS -l jobfs=1GB
#PBS -l wd

module load openmpi/1.10.2
n=8

export OMP_NUM_THREADS=$n

mpirun -map-by ppr:$((8/$n)):socket:PE=$n ./a.out > output

This will run 8 OpenMP threads for each MPI process, binding each MPI process to a socket (a physical CPU).
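
As a concrete illustration of the arithmetic (assuming two 8-core sockets per 16-core Raijin node): with n=8 and ncpus=256, the mpirun line above expands to

mpirun -map-by ppr:1:socket:PE=8 ./a.out > output

i.e. one MPI process per socket with 8 cores (and hence 8 OpenMP threads) each, giving 256/8 = 32 MPI processes in total.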

Use the man mpirun command to see other variants of the map-by keyword.

Intel MPI

Intel MPI can be useful for testing and may be necessary for programs using MPI_THREAD_MULTIPLE. Intel MPI works with the Intel compilers, so it is very important to make sure you choose the compilers correctly. For example:

module load intel-fc/16.0.3.210
module load intel-cc/16.0.3.210
module load intel-mpi/5.1.3.210

This selects Intel MPI v5.1 together with the Intel 16.0 compilers.
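
As a quick sanity check, the Intel wrappers also accept -v, e.g. mpiifort -v, which should report the underlying Intel Fortran compiler being driven (analogous to the mpif90 -v check described above for OpenMPI).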

Use mpiifort, mpiicc or mpiicpc to build and link an MPI program. This way you do not need to specify any additional MPI-specific libraries; the MPI compiler wrapper takes care of this itself. For example,

mpiifort x.F90

will compile the x.F90 program and create an MPI executable a.out.

We recommend using the mpirun command to run an MPI program. For example, assuming that we have an Intel MPI program a.out in the current directory, the following script will run it on 256 CPUs:

#!/bin/bash
#PBS -q normal
#PBS -l walltime=20:00:00
#PBS -l ncpus=256
#PBS -l mem=256GB
#PBS -l jobfs=1GB
#PBS -l wd

module load intel-mpi/5.1.3.210

export I_MPI_HYDRA_BRANCH_COUNT=$(($PBS_NCPUS / 16))

mpirun ./a.out > output

Note that you do not need to give any special parameters to mpirun, as it is integrated with PBS and "knows" which nodes to use for the program. The I_MPI_HYDRA_BRANCH_COUNT variable sets the number of nodes for Intel MPI (hence ncpus / 16, as each Raijin node has 16 cores). It is only strictly needed for programs using more than 127 nodes (2032 ncpus); Intel MPI should run fine without it on smaller numbers of nodes.
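
For the 256-CPU example above this evaluates to I_MPI_HYDRA_BRANCH_COUNT=16 (256 ncpus / 16 cores per node = 16 nodes).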

Many modern programs can run in so-called mixed mode, i.e. MPI communication is used between nodes (or physical CPUs) and OpenMP threads are used within a node (or physical CPU). This may give better performance. If you have a program that supports such a hybrid mode, the following script shows how to run it:

#!/bin/bash
#PBS -q normal
#PBS -l walltime=20:00:00
#PBS -l ncpus=256
#PBS -l mem=256GB
#PBS -l jobfs=1GB
#PBS -l wd

module load intel-mpi/5.1.3.210

export I_MPI_HYDRA_BRANCH_COUNT=$(($PBS_NCPUS / 16))

n=4

export OMP_NUM_THREADS=$n

uniq $PBS_NODEFILE >myhosts

mpirun -perhost $((16/$n)) -hostfile myhosts -np $(($PBS_NCPUS/$n)) ./a.out > output

This script will run the a.out program so that each MPI process has 4 OpenMP threads.
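
For illustration, with n=4 and ncpus=256 the mpirun line above expands to

mpirun -perhost 4 -hostfile myhosts -np 64 ./a.out > output

i.e. 4 MPI processes per 16-core node and 64 MPI processes in total, each running 4 OpenMP threads.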

For more details on using MPI, see the user guide.

Additional Notes

OpenMPI is available for both the Intel and GNU compilers.