
Compiling


  1. Compilers and Options

    Several versions of the Intel compilers are installed. These are icc for the C compiler, ifort for the Fortran compiler and icpc for the C++ compiler. To access these compilers you will need to load the relevant module. Type

    Code Block
    languagebash
    themeRDark
    module avail

    to see what versions of the Intel compiler are available and which is the default. Note that the C and Fortran compilers are loaded as separate modules, for example,

    Code Block
    languagebash
    themeRDark
    module load intel-cc/12.1.9.293
    module load intel-fc/12.1.9.293

    to load a version of the Intel 12.1 compiler.
  2. User Guides and other documentation are available online for the Intel Fortran and C/C++ compilers. Local versions of the compiler documentation can be accessed via the software pages for Fortran and C/C++ by choosing a particular version.
  3. If required, the GNU compilers gfortran, gcc and g++ are available. We recommend the Intel compiler for best performance for Fortran code.
  4. Mixed language programming hints:
    • If your application contains both C and Fortran code you should link as follows:

      Code Block
      languagebash
      themeRDark
      icc -c cfunc.c
      ifort -o myprog myprog.for cfunc.o

    • Use the -cxxlib compiler option to tell the compiler to link using the C++ run-time libraries provided by gcc. By default, C++ libraries are not linked with Fortran applications. (A combined example is sketched at the end of this list.)
    • Use the -fexceptions compiler option to enable C++ exception handling table generation so that C++ programs can handle C++ exceptions when there are calls to Fortran routines on the call stack. This option adds information to the object file that is required during C++ exception handling. By default, mixed Fortran/C++ applications abort in the Fortran code if a C++ exception is thrown.
    • Use the -nofor_main compiler option if your C/C++ program calls an Intel Fortran subprogram, as shown:

      Code Block
      languagebash
      themeRDark
      icc -c cmain.c
      ifort -nofor_main cmain.o fsub.f90

      The handling of Fortran90 modules by ifort and gfortran is incompatible – the resultant .mod and object files are not interoperable. Otherwise gfortran and ifort are generally compatible.
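
      As a sketch combining the -cxxlib and -fexceptions options above (the file names here are hypothetical), a Fortran main program calling C++ code might be built as:

      Code Block
      languagebash
      themeRDark
      g++ -c cppfunc.cpp                                            # compile the C++ routine (hypothetical file)
      ifort -cxxlib -fexceptions -o myprog myprog.f90 cppfunc.o     # link with the gcc C++ run-time libraries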

  • The Intel C and C++ compilers are highly compatible and interoperable with GCC.
  • A full list of compiler options can be obtained from the man page for each command (i.e. man ifort or man gfortran). Some pointers to useful options for the Intel compiler are as follows:
    • We recommend that Intel Fortran users start with the options -O2 -ip -fpe0 (see the example at the end of this list).
    • The default -fpe setting for ifort is -fpe3 which means that all floating point exceptions produce exceptional values and execution continues. To be sure that you are not getting floating point exceptions use -fpe0. This means that floating underflows are set to zero and all other exceptions cause the code to abort. If you are certain that these errors can be ignored then you can recompile with the -fpe3 option.
    • -fast sets the options -O3 -ipo -static -no-prec-div -xHost on Linux* systems and maximises speed across the entire program.
    • However -fast cannot be used for MPI programs as the MPI libraries are shared, not static. To use with MPI programs use -O3 -ipo.
    • The -ipo option provides interprocedural optimization but should be used with care as it does not produce the standard .o files. Do not use this if you are linking to libraries.
    • -O0, -O1, -O2, -O3 give increasing levels of optimisation, from no optimisation to aggressive optimisation. The option -O is equivalent to -O2. Note that if -g is specified then the default optimisation level is -O0.
    • -parallel tells the auto-parallelizer to generate multithreaded code for loops that can safely be executed in parallel. This option requires that you also specify -O2 or -O3. Before using this option for production work make sure that it is resulting in a worthwhile increase in speed by timing the code on a single processor then multiple processors. This option rarely gives appreciable parallel speedup.
  • Environment variables can be used to affect the behaviour of various programs (particularly compilers and build systems), for example $PATH, $FC, $CFLAGS, $LD_LIBRARY_PATH and many others. Our Canonical User Environment Variables webpage has a detailed list of these variables, including information on which programs use what variables, how they are used, common misconceptions/gotchas, and so on.
  • Handling floating-point exceptions:
    • The standard Intel ifort option is -fpe3 by default. All floating-point exceptions are thus disabled. Floating-point underflow is gradual, unless you explicitly specify a compiler option that enables flush-to-zero. This is the default; it provides full IEEE support. (Also see -ftz.) The option -fpe0 will lead to the code aborting at errors such as divide-by-zero.
    • For C++ code using icpc the default behaviour is to replace arithmetic exceptions with NaNs and continue the program. If you rely on seeing the arithmetic exceptions and the code aborting, you will need to include the fenv.h header and raise signals using the feenableexcept() function. See man fenv for further details.
  • The Intel compiler provides an optimised math library, libimf, which is linked before the standard libm by default. If the -lm link option is used then this behaviour changes and libm is linked before libimf.
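
As a concrete starting point for the recommendations above (the file names are placeholders), an initial Intel Fortran build might look like:

Code Block
languagebash
themeRDark
module load intel-fc/12.1.9.293
ifort -O2 -ip -fpe0 -o prog.exe prog.f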

 



Using MPI

MPI is a parallel programming interface for explicitly passing messages between parallel processes – you must have added message passing constructs to your program yourself. To enable your programs to use MPI, include the MPI header file in your source and link to the MPI libraries when you compile.

  


Compiling and Linking

The preferred MPI library is Open MPI. To see what versions are available type

 

Code Block
languagebash
themeRDark
module avail openmpi

Loading the openmpi module (by typing module load openmpi for the default version, or module load openmpi/<version> for version <version>) sets a variety of environment variables which you can see from

Code Block
languagebash
themeRDark
module show openmpi # for default version
module show openmpi/<version> # for version <version>
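
For example, after loading the module you can check that the linker variables referred to below are set (a minimal sanity check only; the exact contents depend on the openmpi version):

Code Block
languagebash
themeRDark
module load openmpi
echo $OMPI_FLIBS
echo $OMPI_CLIBS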

For Fortran, compile with one of the following commands:

Code Block
languagebash
themeRDark
ifort myprog.f -o myprog.exe $OMPI_FLIBS
mpif77 myprog.f -o myprog.exe
mpif90 myprog.f90 -o myprog.exe

The environment variable $OMPI_FLIBS has been set up to insert the correct libraries for linking. These are the same as those used by the wrapper commands mpif77 and mpif90. Passing the -show option to the corresponding wrapper gives the full list of options that the wrapper passes on to the backend compiler (e.g. typing mpif77 -show for Intel MPI gives something like ifort -I/<path>/include -L/<path>/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /<path>/lib -Xlinker -rpath -Xlinker /<path>/mpi-rt/4.1 -lmpigf -lmpi -lmpigi -ldl -lrt -lpthread). If you are using the Fortran90 bindings for MPI (unlikely), then you need $OPENMPI_F90LIBS.

For C and C++, compile with one of:

Code Block
languagebash
themeRDark
icc -pthread myprog.c -o myprog.exe $OMPI_CLIBS
mpicc myprog.c -o myprog.exe
icpc -pthread myprog.C -o myprog.exe $OMPI_CXXLIBS
mpiCC myprog.C -o myprog.exe

Note that $OMPI_CXXLIBS is only relevant if you are actually using the C++ bindings for MPI. Most C++ MPI applications use the C bindings so linking with $OMPI_CLIBS is sufficient.

As mentioned above, do not use the -fast option as this sets the -static option, which conflicts with the MPI libraries being shared libraries. Instead, use -O3 -ipo (which is equivalent to -fast without -static).

If you do not have an Intel compiler module loaded, the MPI compiler wrappers will use the GNU compilers by default. In that case, the following pairs of commands are equivalent:

Code Block
languagebash
themeRDark
mpif90 myprog.F
gfortran myprog.F $OMPI_FLIBS
mpicc myprog.c
gcc -pthread myprog.c $OMPI_CLIBS
mpiCC myprog.C
g++ -pthread myprog.C $OMPI_CXXLIBS

Note that the appropriate include paths are placed in the CPATH and FPATH environment variables when you load the openmpi module.
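
If a build fails to find mpi.h or the MPI module files, a quick check is to confirm that these variables are populated (a hedged check only; the paths vary with the openmpi version):

Code Block
languagebash
themeRDark
module load openmpi
echo $CPATH
echo $FPATH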


Running MPI Jobs

To run an MPI application, you need to have an MPI module loaded in your environment. The modules of software packages requiring MPI will generally load the appropriate MPI module for you.

MPI programs are executed using the mpirun command. To run a small test with 4 processes (or tasks) where the MPI executable is called a.out, enter any of the following equivalent commands:

Code Block
languagebash
themeRDark
mpirun -n 4 ./a.out
mpirun -np 4 ./a.out

The argument to -n or -np is the number of a.out processes that will be run.

For larger jobs and production use, submit a job to the PBS batch system with the following (note that there is a carriage return after -l wd, module load openmpi and ./a.out):

Code Block
languagebash
themeRDark
qsub -q express -l ncpus=16,walltime=20:00,mem=400mb -l wd
module load openmpi
mpirun ./a.out
^D (that is control-D)

By not specifying the -np option with the batch job mpirun, mpirun will start as many MPI processes as there have been cpus requested with qsub. It is possible to specify the number of processes on the batch job mpirun command, as mpirun -np 4 ./a.out, or more generally mpirun -np $PBS_NCPUS ./a.out.
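
The same run can also be expressed as a job script rather than typed interactively. The sketch below reuses the resource requests shown above; the script name mpi_job.sh is hypothetical:

Code Block
languagebash
themeRDark
#!/bin/bash
#PBS -q express
#PBS -l ncpus=16
#PBS -l walltime=20:00
#PBS -l mem=400mb
#PBS -l wd
module load openmpi
mpirun -np $PBS_NCPUS ./a.out

Submit it with qsub mpi_job.sh.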

To improve performance on the NUMA nodes of Raijin, both cpu and memory binding of MPI processes is performed by default with the current default version of Open MPI. If your application requires a non-standard layout (e.g. the ranks are threaded or some ranks require very large memory) then it may require options to mpirun to avoid the default binding. In the extreme case, binding can be disabled with

Code Block
languagebash
themeRDark
mpirun --bind-to-none -np 4 ./a.out

however there may be more appropriate options.
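
Before disabling binding entirely, it can be worth checking what the default layout actually is. Open MPI's --report-bindings option (assumed to be available in the installed Open MPI versions) prints each rank's binding at startup:

Code Block
languagebash
themeRDark
mpirun --report-bindings -np 4 ./a.out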

 



Using OpenMP

OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelise loops and specify certain properties of variables. (Note that OpenMP and Open MPI are unrelated.)

  


Compiling and Linking

Fortran with OpenMP directives is compiled as:

Code Block
languagebash
themeRDark
ifort -openmp myprog.f -o myprog.exe
gfortran -fopenmp myprog.f -o myprog.exe

C code with OpenMP directives is compiled as:

Code Block
languagebash
themeRDark
icc -openmp myprog.c -o myprog.exe
gcc -fopenmp myprog.c -o myprog.exe

 


Running OpenMP Jobs

To run the OpenMP job interactively, first set the OMP_NUM_THREADS environment variable, then run the executable as shown below:

 

Code Block
languagebash
themeRDark
export OMP_NUM_THREADS=16 # for bash shell (default on Raijin)
./a.out

or

Code Block
languagebash
themeRDark
setenv OMP_NUM_THREADS 16 # for csh or tcsh shell
./a.out

For larger jobs and production use, submit a job to the PBS batch system with something like

Code Block
languagebash
themeRDark
qsub -q normal -l nodes=1:ppn=16,walltime=20:00,mem=400mb -l wd
#!/bin/csh
setenv OMP_NUM_THREADS $PBS_NCPUS
./a.out
^D (that is control-D)

OpenMP is a shared memory parallelism model – only one host (node) can be used to execute an OpenMP application. The clusters have nodes with 16 cpu cores. It makes no sense to try to run an OpenMP application on more than 16 processes. Note that in the above qsub example, the request specifies 1 node and the number of “processors per node” (ppn) required.

You should time your OpenMP code on a single processor then on increasing numbers of CPUs to find the optimal number of processors for running it. Keep in mind that your job is charged ncpus*walltime. 
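
A simple interactive scaling test is to loop over thread counts and time each run (a rough sketch; substitute your own executable and a representative input):

Code Block
languagebash
themeRDark
for t in 1 2 4 8 16
do
    export OMP_NUM_THREADS=$t
    echo "OMP_NUM_THREADS=$t"
    time ./a.out
done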

OpenMP Performance

Since each Raijin node has 2 processor sockets each with 8 cores (i.e. a total of 16 cores), it makes sense to run one MPI process per processor socket. For the probable best performance of an OpenMP application on Raijin, the configuration is: launch 1 MPI process per socket (2 per node), bind 8 cores to each MPI process (binding to socket) and run 8 OpenMP threads per MPI process. Execute something like the following (the openmpi module version may vary):

Code Block
languagebash
themeRDark
qsub -q normal omp_job_script.bash

where omp_job_script.bash is as follows:

Code Block
languagebash
themeRDark
#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=20:00
#PBS -l mem=400mb
#PBS -l wd
module load openmpi
# Raijin has 8 cores per socket; run one MPI process per socket
CORES_PER_SOCKET=8
NPROC=`expr $PBS_NCPUS / $CORES_PER_SOCKET`
# one OpenMP thread per core of the socket owned by each MPI process
export OMP_NUM_THREADS=$CORES_PER_SOCKET
export GOMP_CPU_AFFINITY=0-15
# bind each MPI process to a socket and export the OpenMP settings to all ranks
OMP_PARAM="-cpus-per-proc $OMP_NUM_THREADS -npersocket 1 -x OMP_NUM_THREADS -x GOMP_CPU_AFFINITY"
mpirun -np $NPROC $OMP_PARAM ./a.out

 



Parallel loop overheads


Common Problems



Code Development


Debugging

Intel debugger 

Read man idb for further information.
  1. To use it, first compile and link your program using the -g switch, e.g.

    Code Block
    languagebash
    themeRDark
    cc -g prog.c

  2. Start the debugger:

    Code Block
    languagebash
    themeRDark
    idbc ./a.out

  3. Enter commands such as

    Code Block
    languagebash
    themeRDark
    ...
    (idb) list
    (idb) stop at 10
    (idb) run
    (idb) print variable
    ...
    (idb) quit

  4. By starting idb up with the -gui option (i.e. idbc -gui) you get a useful graphical user interface.

Coredump files

 


By default your jobs will not produce coredump files when they crash. To generate corefiles you need:

Code Block
languagebash
themeRDark
limit coredumpsize unlimited   # for tcsh
ulimit -c unlimited            # for bash

Also, if you are using the Intel Fortran compiler, you will need

Code Block
languagebash
themeRDark
setenv decfort_dump_flag y    # for tcsh
export decfort_dump_flag=y    # for bash

To use the coredump file, enter

Code Block
languagebash
themeRDark
gdb /path/to/the/executable /path/to/the/coredumpfile
...
(gdb) where
(gdb) bt
(gdb) frame number
(gdb) list
(gdb) info locals
(gdb) print variable
...
(gdb) quit

Coredump files can take up a lot of disk space especially from a large parallel job – be careful not to generate them if you are not going to use them. 
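
If you no longer need them, old core files can be found and removed with something like the following (a sketch only; check the matches carefully before deleting anything):

Code Block
languagebash
themeRDark
find . -name 'core*' -ls        # list candidate core files under the current directory
find . -name 'core*' -delete    # remove them once you are sure they are core files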


Debugging Parallel Programs

Totalview can be used to debug parallel MPI or OpenMP programs. Introductory information and userguides on using Totalview are available from this site.

To use Totalview:

  1. Load the totalview module:

    Code Block
    languagebash
    themeRDark
    module load totalview

  2. Compile your code with the -g option. For example, for an MPI program,

    Code Block
    languagebash
    themeRDark
    mpif90 -g prog.f

  3. Start Totalview. For example, to debug an MPI program using 4 processors the usual command is

    Code Block
    languagebash
    themeRDark
    mpirun --debug -np 4 ./a.out

    as mpirun is a locally written wrapper.

Note that to ensure Totalview can obtain information on all variables, compile with no optimisation. This is the default if -g is used with no specific optimisation level.

Totalview shows source code for mpirun when it first starts an MPI job. A GUI like the following is generated. Click on GO and all the processes will start up and you will be asked if you want to stop the parallel job. At this point click YES if you want to insert breakpoints. The source code will be shown and you can click on any lines where you wish to stop.

(Screenshot of the Totalview GUI.)

If your source code is in a different directory from where you fired up Totalview you may need to add the path to Search Path under the File Menu. Right clicking on any subroutine name will “dive” into the source code for that routine and break points can be set there.

The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right click on the variable or array of interest in the stack frame window. If the variable cannot be displayed then choose “Add to expression list” and the variable will appear listed in a new window. If it is marked “Invalid compilation scope” in this new window, right click again on the variable name in this window and choose “Compilation scope”. Change this to “Floating” and the value of the variable or array should appear. Right clicking on it again and choosing “Dive” will give you values for arrays. In this window you can choose “Laminate” then “Process” under the View menu to see the values on different processors.

Under the Tools option on the top toolbar of the window displaying the variable values you can choose Visualise to display the data graphically, which can be useful for large arrays.

It is also possible to use Totalview for memory debugging, showing graphical representations of outstanding messages in MPI or setting action points based on evaluation of a user defined expression. See the Totalview User Guide for more information.

Alternatively, PADB, a lightweight parallel program inspection/debug tool, is also available on the NCI NF systems.

Notes on Benchmarking

Before running MPI jobs on many processors you should run some smaller benchmarking and timing tests to see how well your application scales. There is an art to this and the following points are things you should consider when setting up timing examples. For scaling tests you need a typically sized problem that does not take too long to run. So, for example, it may be possible to time just a few iterations of your application. Test runs should be replicated several times as there will be some variation in timings.
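
As an illustration before the detailed points below, a basic scaling test inside a single PBS job might look like the following (the process counts and executable name are placeholders):

Code Block
languagebash
themeRDark
module load openmpi
for n in 1 2 4 8 16
do
    echo "Timing run on $n processes"
    time mpirun -np $n ./a.out
done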

  • Open MPI 1.4.3 (now the default version of Open MPI) has memory binding set by default. Ask for help if you think you need this turned off. The good news is that the Open MPI developers will support memory binding in future releases. 
    Fix: Use Open MPI 1.4.3 or later.
  • Paging overhead: 
    Because of suspend/resume, there is a high probability that any reasonable size parallel job will have to (partially) page out at least some suspended jobs when starting up. In the worst case, this could take up to the order of a minute. In the case of a 3hr production job this is not an issue but it is significant for a 10 minute test job. There are various ways to avoid this:
    • just use the walltime of some later iteration/timestep as the benchmark measure and ignore the first couple of iterations
    • run the mpirun command 2 or 3 times in the one PBS job and use the last run for timing (see the sketch after this list). The first runs should be about a minute long.
    • run “mpirun memhog size” where size is the real memory required per MPI task (takes an “m” or “g” suffix). Beware of causing excessive paging or getting your job killed by PBS by trying to memhog too much memory.

    Fix: Clear memory for the job as above.

  • Network interactions and locality: 
    Generally small and hard to avoid except in special cases. The scheduler does try to impose some degree of locality but that could be strengthened. It should also be possible to add a qsub option to request best locality. But the job would probably queue longer and could still see issues. For example we regularly have InfiniBand links down and, with IB static routing, that means some links are doing double duty. 
    Fix: New PBS option to request best locality or live with it.
  • IO interactions with other jobs:
    This is almost impossible to control. There is usually lots of scope for tuning IO (both at the system level and in the job) so it certainly shouldn’t be ignored. But for benchmarking purposes: 
    Fix: do as little IO as possible and test the IO part separately.
  • Communication startup: 
    MPI communication is connection based meaning there is a fair amount of negotiation before the first message can be sent. There are ways to control when and how this happens (look for “mpi_preconnect_” in the output from “ompi_info -a”) but you cannot avoid this overhead. 
    Fix: try to discount or quantify the startup cost as discussed above.
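
Putting the paging-overhead advice above into practice, a timing job might run the application two or three times and keep only the last measurement (a sketch; the executable name is a placeholder):

Code Block
languagebash
themeRDark
module load openmpi
mpirun ./a.out          # warm-up run: absorbs paging and startup overhead
mpirun ./a.out          # second warm-up run
time mpirun ./a.out     # use this run for the benchmark timing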

Profiling

  1. The gprof profiler provides information on the most time-consuming subprograms in your code. Profiling the executable prog.exe will lead to profiling data being stored in gmon.out which can then be interpreted by gprof as follows:

    Code Block
    languagebash
    themeRDark
    ifort -p -o prog.exe prog.f
    ./prog.exe
    gprof ./prog.exe gmon.out
  2. For the GNU compilers do

    Code Block
    languagebash
    themeRDark
    gfortran -pg -o prog.exe prog.f
    gprof ./prog.exe gmon.out

Graphical Profiling of MPI Code

Two lightweight MPI profilers, IPM and mpiP, are available for Open MPI parallel codes. Minimal user instructions are given below. For a more detailed user guide, please refer to the MPI Performance Analysis User Guide.

  1. IPM
     Available versions:
      1. ipm/0.983-nci
        • works with openmpi/1.6.5 or less
        • gives a nice graph with the communication pattern
      2. ipm/0.983-cache
        • works with openmpi/1.6.5 or less
        • gives L1 L2 L3 cache misses
      3. ipm/2.0.2
        • only works with openmpi/1.7.* and 1.8.*
        • the only version that gives flops
      4. ipm/2.0.5
        • works with openmpi/1.10.2

     To profile, load the modules and run under mpirun:

     Code Block
     languagebash
     themeRDark
     module load openmpi
     module load ipm
     mpirun ./prog.exe

     To view the IPM profile results (usually in the format of username.xxxxxxxxxx.xxxxxx.0):

     Code Block
     languagebash
     themeRDark
     ssh -X abc123@raijin.nci.org.au    # to enable remote display
     ipm_view username.xxxxxxxxxx.xxxxxx.0
  2. mpiP
     Available versions:
      1. mpiP/3.2.1
      2. mpiP/3.4.1

     To profile:

     Code Block
     languagebash
     themeRDark
     mpicc -g -o prog.exe prog.c    # compiling with -g is optional
     module load openmpi
     module load mpiP
     mpirun -np $n ./prog.exe

     To view the mpiP profile results (usually in the format of prog.exe.${n}.xxxxx.x.mpiP):

     Code Block
     languagebash
     themeRDark
     ssh -X abc123@raijin.nci.org.au    # to enable remote display
     mpipview prog.exe.${n}.xxxxx.x.mpiP



 
Email problems, suggestions, questions to help@nci.org.au