The GNU Compiler Collection C compiler is called gcc, the C++ compiler is g++, and the Fortran compiler is gfortran.
GCC 8.4.1 (located at /opt/nci/bin/gcc) is the system built-in default version, so it does not require loading any module.
You can also check the other versions installed in Gadi with a module query:
$ module avail gcc
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load gcc/11.1.0
The Intel C compiler is called icc, the C++ compiler is icpc, and the Intel Fortran compiler is called ifort.
You can check the versions installed in Gadi with a module query:
$ module avail intel-compiler
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load intel-compiler/2021.2.0
For more details on using modules, see our modules help guide here.
If your application contains both C and Fortran code, you should compile and link as follows:
$ icc -O3 -c cfunc.c
$ ifort -O3 -o myprog myprog.for cfunc.o
Use the -cxxlib compiler option to tell the compiler to link using the C++ run-time libraries provided by gcc. By default, C++ libraries are not linked with Fortran applications.
Use the -fexceptions compiler option to enable generation of C++ exception handling tables, so that C++ programs can handle C++ exceptions when there are calls to Fortran routines on the call stack. This option adds information to the object file that is required during C++ exception handling. By default, mixed Fortran/C++ applications abort in the Fortran code if a C++ exception is thrown.
Use the -nofor_main compiler option if your C/C++ program calls an Intel Fortran subprogram, as shown:
$ icc -O3 -c cmain.c
$ ifort -O3 -nofor_main cmain.o fsub.f90
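To make the calling convention concrete, a minimal sketch of the C side of such a program could look like the following. It assumes the compilers' default Fortran name mangling (lower-case routine name with a trailing underscore) and that the Fortran routine takes its argument by reference; the file and routine names simply follow the placeholders used above.

/* cmain.c - sketch of a C main calling a Fortran subroutine.
   Assumes default ifort/gfortran name mangling (lower case, trailing
   underscore) and pass-by-reference Fortran arguments. */
#include <stdio.h>

/* Fortran side (fsub.f90): subroutine fsub(x) with real(8) :: x */
extern void fsub_(double *x);

int main(void)
{
    double x = 2.0;
    fsub_(&x);                          /* call the Fortran subprogram */
    printf("fsub returned %f\n", x);
    return 0;
}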
The handling of Fortran 90 modules by ifort and gfortran is incompatible – the resultant .mod and object files are not interoperable. Otherwise, gfortran and ifort are generally compatible.
Full details of the available compiler options are given in the man pages (man ifort or man gfortran). Some pointers to useful options for the Intel compiler are as follows:
- A good general starting point is -O2 -ip -fpe0.
- The default -fpe setting for ifort is -fpe3, which means that all floating point exceptions produce exceptional values and execution continues. To be sure that you are not getting floating point exceptions, use -fpe0. This means that floating-point underflows are set to zero and all other exceptions cause the code to abort. If you are certain that these errors can be ignored, you can recompile with the -fpe3 option.
- -fast sets the options -O3 -ipo -static -no-prec-div -xHost on Linux systems and maximises speed across the entire program. -fast cannot be used for MPI programs as the MPI libraries are shared, not static. For MPI programs use -O3 -ipo instead.
- The -ipo option provides interprocedural optimisation but should be used with care, as it does not produce standard .o files. Do not use it if you are linking to libraries.
- -O0, -O1, -O2 and -O3 give increasing levels of optimisation, from no optimisation to aggressive optimisation. The option -O is equivalent to -O2. Note that if -g is specified then the default optimisation level is -O0.
- -parallel tells the auto-parallelizer to generate multithreaded code for loops that can safely be executed in parallel. This option requires that you also specify -O2 or -O3. Before using it for production work, make sure that it gives a worthwhile increase in speed by timing the code on a single processor and then on multiple processors. This option rarely gives appreciable parallel speedup.
- Loading a compiler module sets environment variables such as $PATH, $FC, $CFLAGS, $LD_LIBRARY_PATH and many others.
For ifort the -fpe setting is -fpe3 by default. All floating-point exceptions are thus disabled. Floating-point underflow is gradual, unless you explicitly specify a compiler option that enables flush-to-zero. This is the default; it provides full IEEE support (also see -ftz). The option -fpe0 will lead to the code aborting at errors such as divide-by-zeros.
For icpc the default behaviour is to replace arithmetic exceptions with NaNs and continue the program. If you rely on seeing the arithmetic exceptions and the code aborting, you will need to include the fenv.h header and raise signals using the feenableexcept() function. See man fenv for further details.
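A minimal sketch of this approach (assuming the GNU extension feenableexcept() from glibc, which needs _GNU_SOURCE defined and linking against libm):

/* fpe_trap.c - abort on divide-by-zero instead of continuing with Inf/NaN.
   feenableexcept() is a GNU extension declared in fenv.h when _GNU_SOURCE
   is defined; compile with, e.g., "icc fpe_trap.c -lm" or "gcc fpe_trap.c -lm". */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* Turn selected floating-point exceptions into SIGFPE signals. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

    volatile double zero = 0.0;
    double x = 1.0 / zero;          /* raises SIGFPE and aborts the program */

    printf("never reached: %f\n", x);
    return 0;
}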
The Intel compilers supply their own maths library, libimf, which is linked before the standard libm by default. If the -lm link option is used then this behaviour changes and libm is linked before libimf.
OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelise loops and specify certain properties of variables. Note that OpenMP and Open MPI are unrelated.
Fortran code with OpenMP directives is compiled as:
$ ifort -O3 -qopenmp my_openmp_prog.f -o my_openmp_prog.exe
$ gfortran -O3 -fopenmp -lgomp my_openmp_prog.f -o my_openmp_prog.exe
C code with OpenMP directives is compiled as:
$ icc -O3 -qopenmp my_openmp_prog.c -o my_openmp_prog.exe
$ gcc -O3 -fopenmp -lgomp my_openmp_prog.c -o my_openmp_prog.exe
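For reference, a minimal C program using OpenMP directives might look like the following sketch (the file name my_openmp_prog.c above is only a placeholder):

/* A small OpenMP example: parallelise a loop over an array. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];       /* static so it is not on the thread stack */
    double sum = 0.0;

    /* Distribute the loop iterations across the available threads. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}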
OpenMP is a shared memory parallelism model, so only one host (node) can be used to execute an OpenMP application. Gadi's Cascade Lake nodes have 48 CPU cores, so it makes no sense to try to run an OpenMP application with more than 48 threads on these nodes.
You should time your OpenMP code on a single CPU core and then on increasing numbers of cores to find the optimal number of threads for running it.
There is an overhead in starting and ending any parallel work distribution construct – an empty parallel loop takes much longer than an empty serial loop, and that overhead grows with the number of threads used. Meanwhile, the time to do the real work has (hopefully) decreased by using more threads. So you can end up with timelines like the following for a parallel work distribution region:
            4 cpus      8 cpus
 time       -------     -------
   |        startup     startup
   |        -------     -------
   V        work        work
            _______     _______
            cleanup     cleanup
            -------     -------
Bottom line: the amount of work in a parallel loop (or section) has to be large compared with the startup time. The startup cost is on the order of tens of microseconds, or the time taken to do thousands of floating point operations. Add another order of magnitude because you are splitting the work over O(10) threads, and at least another because you want the work to dominate the startup cost, and you quickly find that a parallelised loop needs O(million) operations to scale well.
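One simple way to check this is to time the parallel region itself with omp_get_wtime() and compare runs at different thread counts; the following is only an illustrative sketch:

/* Time a parallel loop to see whether it scales with the thread count. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 10000000;
    static double a[10000000];      /* static to avoid thread-stack limits */
    double t0 = omp_get_wtime();    /* wall-clock time before the region */

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.5 * i + 1.0;

    double t1 = omp_get_wtime();    /* wall-clock time after the region */
    printf("%d threads: %.6f seconds\n", omp_get_max_threads(), t1 - t0);
    return 0;
}

Run it with OMP_NUM_THREADS set to 1, 2, 4 and so on, and compare the reported times.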
Segmentation violations can also be caused by the thread stack size being too small. Change this by setting the environment variable OMP_STACKSIZE, for example:
$ export OMP_STACKSIZE="10M"
MPI is a parallel programming interface for explicitly passing messages between parallel processes – you must have added message passing constructs to your program. To enable your program to use MPI, you must include the MPI header file in your source and link to the MPI libraries when you compile.
Both Open MPI and Intel MPI are supported on Gadi. You can check the versions installed in Gadi with a module query:
$ module avail openmpi
or
$ module avail intel-mpi
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load intel-mpi/2021.1.1
$ module load intel-compiler/2021.1.1    # To use mpiifort, mpiicc and mpiicpc
For more details on using modules, see our modules help guide here.
For Fortran, compile with one of the following commands:
$ mpif77 -O3 my_mpi_prog.f -o my_mpi_prog.exe
$ mpif90 -O3 my_mpi_prog.f90 -o my_mpi_prog.exe
$ mpiifort -O3 my_mpi_prog.f -o my_mpi_prog.exe
Passing the -show option to the corresponding wrapper command gives the full list of options that the wrapper passes on to the backend compiler. For example, typing mpif77 -show with Intel MPI gives something like ifort -I'/apps/intel-mpi/2021.1.1/include' -L'/apps/intel-mpi/2021.1.1/lib/release' -L'/apps/intel-mpi/2021.1.1/lib' -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker '/apps/intel-mpi/2021.1.1/lib/release' -Xlinker -rpath -Xlinker '/apps/intel-mpi/2021.1.1/lib' -lmpifort -lmpi -lrt -lpthread -Wl,-z,now -Wl,-z,relro -Wl,-z,noexecstack -Xlinker --enable-new-dtags -ldl.
For C and C++, compile with one of:
$ mpicc -O3 my_mpi_prog.c -o my_mpi_prog.exe
$ mpiicc -O3 my_mpi_prog.c -o my_mpi_prog.exe
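For reference, a minimal MPI program in C looks like the following sketch (my_mpi_prog.c above is only a placeholder name):

/* Minimal MPI program: each rank reports its rank and the communicator size. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down MPI cleanly */
    return 0;
}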
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries, which are shared.
For Fortran, compile with one of the following commands:
# Load modules, always specify version number.
$ module load openmpi/4.0.2
$ module load intel-mpi/2021.1.1
$ module load intel-compiler/2021.1.1    # To use mpiifort, mpiicc and mpiicpc
$ mpif77 -O3 -fopenmp -lgomp my_hybrid_prog.f -o my_hybrid_prog.exe      # Open MPI
$ mpif90 -O3 -fopenmp -lgomp my_hybrid_prog.f90 -o my_hybrid_prog.exe    # Open MPI
$ mpiifort -O3 -qopenmp my_hybrid_prog.f -o my_hybrid_prog.exe           # Intel MPI
For C and C++, compile with one of:
# Load modules, always specify version number.
$ module load openmpi/4.0.2
$ module load intel-mpi/2021.1.1
$ module load intel-compiler/2021.1.1    # To use mpiifort, mpiicc and mpiicpc
$ mpicc -O3 -fopenmp -lgomp my_hybrid_prog.c -o my_hybrid_prog.exe    # Open MPI
$ mpiicc -O3 -qopenmp my_hybrid_prog.c -o my_hybrid_prog.exe          # Intel MPI
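A hybrid code combines both models; the following C sketch assumes that the MPI_THREAD_FUNNELED threading level is sufficient (only the master thread makes MPI calls) and uses the placeholder file name from above:

/* Hybrid MPI + OpenMP sketch: every MPI rank runs an OpenMP parallel region. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request a threading level where only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}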
Passing the -show option to the corresponding wrapper command gives the full list of options that the wrapper passes on to the backend compiler.
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries, which are shared.
Details of MPI process binding to CPU cores for MPI and hybrid (MPI + OpenMP) applications are available here.
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). It enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelisable part of the computation. To enable your programs to use CUDA, you must include the CUDA header file in your source and link to the CUDA libraries when you compile.
You can check the CUDA versions installed in Gadi with a module query:
$ module avail cuda
We normally recommend using the latest version available, and always recommend specifying the version number in the module command:
$ module load cuda/11.2.2
For more details on using modules, see our modules help guide here.
Compile with the following command:
$ nvcc -O3 my_cuda_prog.cu -o my_cuda_prog.exe -lcudart
Use the nvcc --help command to see all the options of the nvcc command.
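For reference, a minimal CUDA source file of the kind compiled above might look like the following sketch (my_cuda_prog.cu is only a placeholder name):

/* Minimal CUDA example: add 1.0 to every element of an array on the GPU. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float h_x[1024] = {0.0f};       /* host array, initialised to zero  */
    float *d_x;                      /* device array                     */

    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    add_one<<<(n + 255) / 256, 256>>>(d_x, n);   /* launch the kernel */

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);

    printf("h_x[0] = %f\n", h_x[0]);   /* expect 1.0 */
    return 0;
}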
Compile with the following commands:
# Load modules, always specify version number.
$ module load openmpi/4.0.2
$ module load cuda/11.2.2
$ mpicc -O3 -lm -lstdc++ -c my_hybrid_mpi_cuda_main_prog.c
$ nvcc -O3 -lcudart -c my_hybrid_mpi_cuda_prog.cu
$ mpicc -o my_hybrid_mpi_cuda_main_prog my_hybrid_mpi_cuda_main_prog.o my_hybrid_mpi_cuda_prog.o -lm -lstdc++ -lcudart
Passing the -show option to the mpicc wrapper gives the full list of options that the wrapper passes on to the backend compiler.
As mentioned above, do not use the -fast option: it sets the -static option, which conflicts with the MPI libraries, which are shared.
Before running MPI jobs on many processors you should run some smaller benchmarking and timing tests to see how well your application scales. There is an art to this and the following points are things you should consider when setting up timing examples. For scaling tests you need a typically sized problem that does not take too long to run. So, for example, it may be possible to time just a few iterations of your application. Test runs should be replicated several times as there will be some variation in timings.
One cause of timing variation is the memory state of the nodes. You can clear memory for the job by running mpirun memhog size, where size is the real memory required per MPI task (the size takes an "m" or "g" suffix). Beware of causing excessive paging, or of getting your job killed by PBS, by trying to memhog too much memory.
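For example (the memory size and process counts below are illustrative only, and my_mpi_prog.exe is the placeholder name used earlier):
$ mpirun memhog 2g                  # clear and touch 2 GiB per MPI task before timing
$ mpirun -np 4 ./my_mpi_prog.exe
$ mpirun -np 8 ./my_mpi_prog.exe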
There is also an overhead in starting up and shutting down the MPI environment itself; Open MPI exposes many tunable runtime parameters (see ompi_info -a), but you cannot avoid this overhead entirely.