The GNU Compiler Collection C compiler is called gcc, the C++ compiler is g++, and the Fortran compiler is gfortran.
GCC 8.4.1 (located at /opt/nci/bin/gcc) is the system built-in default version, so it does not require loading any module.
You can also check the other versions installed in Gadi with a module query:
$ module avail gcc
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load gcc/11.1.0
The Intel C compiler is called icc, the C++ compiler is icpc, and the Intel Fortran compiler is called ifort.
You can check the versions installed in Gadi with a module query:
$ module avail intel-compiler
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load intel-compiler/2021.2.0
For more details on using modules, see our modules help guide here.
If your application contains both C and Fortran code, you should compile and link as follows:
$ icc -O3 -c cfunc.c
$ ifort -O3 -o myprog myprog.for cfunc.o
Use the -cxxlib compiler option to tell the compiler to link using the C++ run-time libraries provided by gcc. By default, C++ libraries are not linked with Fortran applications.
Use the -fexceptions compiler option to enable generation of C++ exception handling tables, so that C++ programs can handle C++ exceptions when there are calls to Fortran routines on the call stack. This option adds information to the object file that is required during C++ exception handling. By default, mixed Fortran/C++ applications abort in the Fortran code if a C++ exception is thrown.
Use the -nofor_main compiler option if your C/C++ program calls an Intel Fortran subprogram, as shown:
$ icc -O3 -c cmain.c
$ ifort -O3 -nofor_main cmain.o fsub.f90
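To make the calling convention concrete, a minimal sketch of the C side of such a program could look like the following. It assumes the compilers' default Fortran name mangling (lower-case routine name with a trailing underscore) and that the Fortran routine takes its argument by reference; the file and routine names simply follow the placeholders used above.

/* cmain.c - sketch of a C main calling a Fortran subroutine.
   Assumes default ifort/gfortran name mangling (lower case, trailing
   underscore) and pass-by-reference Fortran arguments. */
#include <stdio.h>

/* Fortran side (fsub.f90): subroutine fsub(x) with real(8) :: x */
extern void fsub_(double *x);

int main(void)
{
    double x = 2.0;
    fsub_(&x);                          /* call the Fortran subprogram */
    printf("fsub returned %f\n", x);
    return 0;
}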
The handling of Fortran 90 modules by ifort and gfortran is incompatible – the resultant .mod and object files are not interoperable. Otherwise, gfortran and ifort are generally compatible.
Full details of the available compiler options are given in the man pages (man ifort or man gfortran). Some pointers to useful options for the Intel compiler are as follows:
- A good general starting point is -O2 -ip -fpe0.
- The default -fpe setting for ifort is -fpe3, which means that all floating point exceptions produce exceptional values and execution continues. To be sure that you are not getting floating point exceptions, use -fpe0. This means that floating-point underflows are set to zero and all other exceptions cause the code to abort. If you are certain that these errors can be ignored, you can recompile with the -fpe3 option.
- -fast sets the options -O3 -ipo -static -no-prec-div -xHost on Linux systems and maximises speed across the entire program. -fast cannot be used for MPI programs as the MPI libraries are shared, not static. For MPI programs use -O3 -ipo instead.
- The -ipo option provides interprocedural optimisation but should be used with care, as it does not produce standard .o files. Do not use it if you are linking to libraries.
- -O0, -O1, -O2 and -O3 give increasing levels of optimisation, from no optimisation to aggressive optimisation. The option -O is equivalent to -O2. Note that if -g is specified then the default optimisation level is -O0.
- -parallel tells the auto-parallelizer to generate multithreaded code for loops that can safely be executed in parallel. This option requires that you also specify -O2 or -O3. Before using it for production work, make sure that it gives a worthwhile increase in speed by timing the code on a single processor and then on multiple processors. This option rarely gives appreciable parallel speedup.
- Loading a compiler module sets environment variables such as $PATH, $FC, $CFLAGS, $LD_LIBRARY_PATH and many others.
For ifort the -fpe setting is -fpe3 by default. All floating-point exceptions are thus disabled. Floating-point underflow is gradual, unless you explicitly specify a compiler option that enables flush-to-zero. This is the default; it provides full IEEE support (also see -ftz). The option -fpe0 will lead to the code aborting at errors such as divide-by-zeros.
For icpc the default behaviour is to replace arithmetic exceptions with NaNs and continue the program. If you rely on seeing the arithmetic exceptions and the code aborting, you will need to include the fenv.h header and raise signals using the feenableexcept() function. See man fenv for further details.
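A minimal sketch of this approach (assuming the GNU extension feenableexcept() from glibc, which needs _GNU_SOURCE defined and linking against libm):

/* fpe_trap.c - abort on divide-by-zero instead of continuing with Inf/NaN.
   feenableexcept() is a GNU extension declared in fenv.h when _GNU_SOURCE
   is defined; compile with, e.g., "icc fpe_trap.c -lm" or "gcc fpe_trap.c -lm". */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* Turn selected floating-point exceptions into SIGFPE signals. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

    volatile double zero = 0.0;
    double x = 1.0 / zero;          /* raises SIGFPE and aborts the program */

    printf("never reached: %f\n", x);
    return 0;
}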
The Intel compilers supply their own maths library, libimf, which is linked before the standard libm by default. If the -lm link option is used then this behaviour changes and libm is linked before libimf.
OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelise loops and specify certain properties of variables. Note that OpenMP and Open MPI are unrelated.
Fortran code with OpenMP directives is compiled as:
$ ifort -O3 -qopenmp my_openmp_prog.f -o my_openmp_prog.exe
$ gfortran -O3 -fopenmp -lgomp my_openmp_prog.f -o my_openmp_prog.exe
C code with OpenMP directives is compiled as:
$ icc -O3 -qopenmp my_openmp_prog.c -o my_openmp_prog.exe
$ gcc -O3 -fopenmp -lgomp my_openmp_prog.c -o my_openmp_prog.exe
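For reference, a minimal C program using OpenMP directives might look like the following sketch (the file name my_openmp_prog.c above is only a placeholder):

/* A small OpenMP example: parallelise a loop over an array. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];       /* static so it is not on the thread stack */
    double sum = 0.0;

    /* Distribute the loop iterations across the available threads. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}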
OpenMP is a shared memory parallelism model, so only one host (node) can be used to execute an OpenMP application. Gadi's Cascade Lake nodes have 48 CPU cores, so it makes no sense to try to run an OpenMP application with more than 48 threads on these nodes.
You should time your OpenMP code on a single CPU core and then on increasing numbers of cores to find the optimal number of threads for running it.
There is an overhead in starting and ending any parallel work distribution construct – an empty parallel loop takes much longer than an empty serial loop, and that overhead grows with the number of threads used. Meanwhile, the time to do the real work has (hopefully) decreased by using more threads. So you can end up with timelines like the following for a parallel work distribution region:
            4 cpus      8 cpus
 time       -------     -------
   |        startup     startup
   |        -------     -------
   V        work        work
            _______     _______
            cleanup     cleanup
            -------     -------
Bottom line: the amount of work in a parallel loop (or section) has to be large compared with the startup time. The startup cost is on the order of tens of microseconds, or the time taken to do thousands of floating point operations. Add another order of magnitude because you are splitting the work over O(10) threads, and at least another because you want the work to dominate the startup cost, and you quickly find that a parallelised loop needs O(million) operations to scale well.
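One simple way to check this is to time the parallel region itself with omp_get_wtime() and compare runs at different thread counts; the following is only an illustrative sketch:

/* Time a parallel loop to see whether it scales with the thread count. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 10000000;
    static double a[10000000];      /* static to avoid thread-stack limits */
    double t0 = omp_get_wtime();    /* wall-clock time before the region */

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.5 * i + 1.0;

    double t1 = omp_get_wtime();    /* wall-clock time after the region */
    printf("%d threads: %.6f seconds\n", omp_get_max_threads(), t1 - t0);
    return 0;
}

Run it with OMP_NUM_THREADS set to 1, 2, 4 and so on, and compare the reported times.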
Segmentation violations can also be caused by the thread stack size being too small. Change this by setting the environment variable OMP_STACKSIZE, for example:
$ export OMP_STACKSIZE="10M"
MPI is a parallel programming interface for explicitly passing messages between parallel processes – you must have added message passing constructs to your program. To enable your program to use MPI, you must include the MPI header file in your source and link to the MPI libraries when you compile.
Both Open MPI and Intel MPI are supported on Gadi. You can check the versions installed in Gadi with a module query:
$ module avail openmpi
or
$ module avail intel-mpi
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load intel-mpi/2021.1.1
$ module load intel-compiler/2021.1.1    # To use mpiifort, mpiicc and mpiicpc
For more details on using modules, see our modules help guide here.
For Fortran, compile with one of the following commands:
$ mpif77 -O3 my_mpi_prog.f -o my_mpi_prog.exe
$ mpif90 -O3 my_mpi_prog.f90 -o my_mpi_prog.exe
$ mpiifort -O3 my_mpi_prog.f -o my_mpi_prog.exe
Passing the -show option to the corresponding wrapper command gives the full list of options that the wrapper passes on to the backend compiler. For example, typing mpif77 -show with Intel MPI gives something like ifort -I'/apps/intel-mpi/2021.1.1/include' -L'/apps/intel-mpi/2021.1.1/lib/release' -L'/apps/intel-mpi/2021.1.1/lib' -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker '/apps/intel-mpi/2021.1.1/lib/release' -Xlinker -rpath -Xlinker '/apps/intel-mpi/2021.1.1/lib' -lmpifort -lmpi -lrt -lpthread -Wl,-z,now -Wl,-z,relro -Wl,-z,noexecstack -Xlinker --enable-new-dtags -ldl.
For C and C++, compile with one of:
$ mpicc -O3 my_mpi_prog.c -o my_mpi_prog.exe
$ mpiicc -O3 my_mpi_prog.c -o my_mpi_prog.exe
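For reference, a minimal MPI program in C looks like the following sketch (my_mpi_prog.c above is only a placeholder name):

/* Minimal MPI program: each rank reports its rank and the communicator size. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down MPI cleanly */
    return 0;
}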
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries, which are shared.
For Fortran, compile with one of the following commands:
# Load modules, always specify version number.
$ module load openmpi/4.0.2
$ module load intel-mpi/2021.1.1
$ module load intel-compiler/2021.1.1    # To use mpiifort, mpiicc and mpiicpc
$ mpif77 -O3 -fopenmp -lgomp my_hybrid_prog.f -o my_hybrid_prog.exe      # Open MPI
$ mpif90 -O3 -fopenmp -lgomp my_hybrid_prog.f90 -o my_hybrid_prog.exe    # Open MPI
$ mpiifort -O3 -qopenmp my_hybrid_prog.f -o my_hybrid_prog.exe           # Intel MPI
For C and C++, compile with one of:
# Load modules, always specify version number.
$ module load openmpi/4.0.2
$ module load intel-mpi/2021.1.1
$ module load intel-compiler/2021.1.1    # To use mpiifort, mpiicc and mpiicpc
$ mpicc -O3 -fopenmp -lgomp my_hybrid_prog.c -o my_hybrid_prog.exe    # Open MPI
$ mpiicc -O3 -qopenmp my_hybrid_prog.c -o my_hybrid_prog.exe          # Intel MPI
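A hybrid code combines both models; the following C sketch assumes that the MPI_THREAD_FUNNELED threading level is sufficient (only the master thread makes MPI calls) and uses the placeholder file name from above:

/* Hybrid MPI + OpenMP sketch: every MPI rank runs an OpenMP parallel region. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request a threading level where only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}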
Passing the -show option to the corresponding wrapper command gives the full list of options that the wrapper passes on to the backend compiler.
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries, which are shared.
Details of MPI process binding to CPU cores for MPI and hybrid (MPI + OpenMP) applications are available here.
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). It enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelisable part of the computation. To enable your programs to use CUDA, you must include the CUDA header file in your source and link to the CUDA libraries when you compile.
You can check the CUDA versions installed in Gadi with a module query:
$ module avail cuda
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load cuda/11.2.2
For more details on using modules, see our modules help guide here.
Compile with the following command:
$ nvcc -O3 my_cuda_prog.cu -o my_cuda_prog.exe -lcudart
Use the nvcc --help command to see all the options of the nvcc command.
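For reference, a minimal CUDA source file of the kind compiled above might look like the following sketch (my_cuda_prog.cu is only a placeholder name):

/* Minimal CUDA example: add 1.0 to every element of an array on the GPU. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float h_x[1024] = {0.0f};       /* host array, initialised to zero  */
    float *d_x;                      /* device array                     */

    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    add_one<<<(n + 255) / 256, 256>>>(d_x, n);   /* launch the kernel */

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);

    printf("h_x[0] = %f\n", h_x[0]);   /* expect 1.0 */
    return 0;
}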
Compile with the following commands:
# Load modules, always specify version number.
$ module load openmpi/4.0.2
$ module load cuda/11.2.2
$ mpicc -O3 -lm -lstdc++ -c my_hybrid_mpi_cuda_main_prog.c
$ nvcc -O3 -lcudart -c my_hybrid_mpi_cuda_prog.cu
$ mpicc -o my_hybrid_mpi_cuda_main_prog my_hybrid_mpi_cuda_main_prog.o my_hybrid_mpi_cuda_prog.o -lm -lstdc++ -lcudart
Passing the -show option to the mpicc wrapper gives the full list of options that the wrapper passes on to the backend compiler.
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries, which are shared.
Before running MPI jobs on many processors you should run some smaller benchmarking and timing tests to see how well your application scales. There is an art to this and the following points are things you should consider when setting up timing examples. For scaling tests you need a typically sized problem that does not take too long to run. So, for example, it may be possible to time just a few iterations of your application. Test runs should be replicated several times as there will be some variation in timings.
One cause of timing variation is the memory state of the nodes. You can clear memory for the job by running mpirun memhog size, where size is the real memory required per MPI task (the size takes an "m" or "g" suffix). Beware of causing excessive paging, or of getting your job killed by PBS, by trying to memhog too much memory.
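For example (the memory size and process counts below are illustrative only, and my_mpi_prog.exe is the placeholder name used earlier):
$ mpirun memhog 2g                  # clear and touch 2 GiB per MPI task before timing
$ mpirun -np 4 ./my_mpi_prog.exe
$ mpirun -np 8 ./my_mpi_prog.exe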
There is also an overhead in starting up and shutting down the MPI environment itself; Open MPI exposes many tunable runtime parameters (see ompi_info -a), but you cannot avoid this overhead entirely.