
Arm provides two separate modules on raijin: arm-reports, for users to characterise and understand the performance of HPC application runs, and arm-forge, for developers to debug, profile, optimise, edit and build applications for high performance.

Note: the NCI Arm licence allows arm-reports to be run on up to 2048 CPUs and arm-forge on up to 128 CPUs.


To produce a performance report like the ones below, simply add perf-report in front of your MPI launch command.

For openmpi/1.x, make sure to replace mpirun with mpiexec.
For openmpi/2.x, please use mpirun.

A sample PBS script for running a parallel MPI job under perf-report is as follows:

#PBS -q express
#PBS -l walltime=01:00:00
#PBS -l mem=48GB
#PBS -l ncpus=48
#PBS -l jobfs=10gb
#PBS -l wd

module load openmpi/1.10.2
module load arm-reports/18.0
perf-report mpiexec my_program >& output


On job completion you will see two output files, a .txt and a .html, with names like my_program_${PBS_NCPUS}p_..._datetime.html


Normally you run Arm Performance Reports simply by putting perf-report in front of the command you wish to measure, but some programs need special handling: "bowtie2", for example, is actually a Perl script that calls several different programs. Before running the alignment, edit the "bowtie2" script and add perf-report to the command that it runs:

my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);

like this:

my $cmd = "perf-report $align_prog$debug_str …
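The same change can be scripted rather than made by hand. The sketch below demonstrates this on a one-line sample of the wrapper script (the file name bowtie2_sample is a placeholder; in practice you would run the sed command on your copy of the "bowtie2" script itself):

```shell
# Create a one-line sample matching the $cmd assignment shown above.
cat > bowtie2_sample <<'EOF'
my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
EOF
# Insert "perf-report " in front of the aligner command, keeping a backup.
sed -i.bak 's/"\$align_prog/"perf-report $align_prog/' bowtie2_sample
cat bowtie2_sample
```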




A more detailed Performance Reports user guide can be found in Arm's documentation, along with example reports characterising various applications.


With arm-forge, you can debug with arm DDT or profile with arm MAP.

You will need X11 forwarding enabled when logging in to raijin in order to use Arm DDT or MAP. Make sure to submit an interactive job with the "-X" flag, which enables X11 forwarding to the compute node, e.g.

ssh -X

qsub -I -X -q expressbw -lwalltime=02:00:00 -lncpus=28 -lmem=32gb -ljobfs=10gb

Make sure to request some jobfs disk in your job, as the Arm tools create a lot of output under your jobfs directory.

In order to use DDT or MAP, your code must be compiled with the -g debug option. With the Intel compiler, also add the -O0 flag. We also recommend that you do not run with optimisation turned on, i.e. avoid flags such as -fast.

For openmpi/1.x, make sure to replace mpirun with mpiexec.
For openmpi/2.x, please use mpirun.

Make sure to replace mpirun with mpiexec in the command line that runs your MPI program; otherwise you will see error messages like:

The target program encountered an error before it initialised the MPI environment.
Thread -1 exited.
Check that Arm MAP is using the correct MPI implementation for your system and that your program can start without Arm MAP. If you contact the NCI help desk, we'll be happy to help you further.


DDT is a parallel debugger which can be run with up to 128 processors on raijin. It can be used to debug serial, OpenMP and MPI codes.

Totalview users will find DDT has very similar functionality and an intuitive user interface. All of the primary parallel debugging features from Totalview are available with DDT.

Launch the debugger with the ddt command followed by the MPI launcher and the name of the executable to debug:

module load arm-forge/18.0

mpicc -g -O0 -o my_program my_program.c 
ddt mpiexec ./my_program # will open X windows on the compute node, need interactive session (qsub -I -X [options] job.pbs)
ddt -connect mpiexec ./my_program # Have to have a remote client running.
ddt -offline mpiexec ./my_program # Generate debug report for offline session.
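For non-interactive debugging, ddt -offline can also be run inside a normal batch job. The fragment below is a sketch only: the module versions, resource requests and report filename are illustrative, and you should check ddt --help for the exact offline output options in your Forge version.

```shell
#PBS -q express
#PBS -l walltime=01:00:00
#PBS -l ncpus=48
#PBS -l mem=48GB
#PBS -l jobfs=10gb
#PBS -l wd

module load openmpi/1.10.2
module load arm-forge/18.0
# Write the debug report to a file instead of opening the GUI.
ddt -offline -o debug_report.html mpiexec ./my_program
```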

Python Debugging

To debug Python scripts, start the Python interpreter that will execute the script under DDT. To get line level resolution, rather than function level resolution, you must also insert %allinea_python_debug% before your script when passing arguments to Python. For example:

ddt --start -np 4 /usr/bin/python %allinea_python_debug% my_script.py
(where my_script.py stands for your own Python script)
Open the 'Stacks' view and select a Python frame to see the Python local variables.

Note: DDT does not search in your PATH when launching executables, so you must specify the full path to Python.
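Since DDT will not search your PATH, you can resolve the interpreter's absolute path first. A sketch (python3 and my_script.py are placeholder names; the ddt line is shown as a comment because it needs an interactive session):

```shell
# Find the absolute path of the interpreter DDT should launch.
PYBIN=$(command -v python3)
echo "$PYBIN"
# Then pass the full path to DDT, e.g.:
# ddt --start -np 4 "$PYBIN" %allinea_python_debug% my_script.py
```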

A more detailed DDT user guide can be found in Arm's documentation.


To collect performance data, MAP uses two small libraries: the MAP sampler (map-sampler) and MPI wrapper (map-sampler-pmpi) libraries. These must be linked with your program. There are somewhat strict rules regarding the linking order of your object files and these libraries (please read the User Guide for detailed information), but if you follow the instructions printed by the MAP utility scripts, it is very likely your code will run with MAP.

Before you rebuild your program, load the arm-forge module, then recompile with the -g option to keep debugging symbols, together with the optimisation flags you would normally use. This builds a dynamically-linked executable:

module load arm-forge/18.0
mpicc -g -o my_program my_program.c 
map mpiexec ./my_program # will open X windows on the compute node, need interactive session (qsub -I -X [options] job.pbs)
map -profile mpiexec ./my_program # Creates a .map profile file which can be opened later with: map <file>.map

A more detailed MAP user guide can be found in Arm's documentation.

For UM users

UM users can use the Arm tools via the workflow tool Rose.

Within Rose you have a job launch script (essentially a wrapper for mpiexec/exec) that takes a number of environment variables to specify arguments etc. for tasks.

To use ddt (for example):

- Set ROSE_LAUNCHER=ddt for the UM task.

- Add --connect mpiexec to the start of your ROSE_LAUNCHER_PREOPTS for the UM task.

Make sure to load the arm-forge (or arm-reports) module before launching the front-end debugger/profiler.

Note: to use Arm MAP instead, set ROSE_LAUNCHER=map and ROSE_LAUNCHER_PREOPTS=--profile mpiexec.
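Putting the two DDT settings together, the UM task's environment might look like this. This is a sketch following Rose app-config conventions; the exact section and file (e.g. an [env] section in the app's rose-app.conf) depend on how your suite is laid out:

```shell
[env]
ROSE_LAUNCHER=ddt
ROSE_LAUNCHER_PREOPTS=--connect mpiexec
```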