
Overview

Arm provides two separate modules on Gadi:

  1. arm-reports for users to characterise and understand the performance of HPC application runs, and
  2. arm-forge for developers to debug, profile, optimise, edit and build applications for high performance. 

More information:

  1. https://www.arm.com/products/development-tools/server-and-hpc/forge
  2. https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge/arm-performance-reports

How to use 


You can check the versions installed on Gadi with a module query:

$ module avail arm-reports
$ module avail arm-forge

We normally recommend using the latest version available, and we always recommend specifying the version number with the module command:

$ module load arm-reports/20.0.2
$ module load python3-as-python # To fix a wrapper error in arm-reports
$ module load arm-forge/21.0

For more details on using modules see our software applications guide.

Make sure to submit your job with a jobfs disk request, as all of the Arm tools write a large amount of output to jobfs.

arm-reports 

With arm-reports, you can generate the performance report of an HPC application.

An example PBS job submission script named arm_reports_job.sh is provided below. It requests exclusive access to a compute node on Gadi for 30 minutes through the normal queue, with 48 CPUs, 128 GiB of memory, and 400 GiB of local disk, charged to the project a00. It also requests that the job start in the directory from which it was submitted.

This script should be saved in the working directory from which the analysis will be done. To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests at the top of the job script files according to the information available in our queue structure guide.

Note that if your application does not run in parallel, you should set the number of CPU cores to 1 and scale the memory and jobfs requests down accordingly, to avoid wasting compute resources.

#!/bin/bash
 
#PBS -P a00
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=128GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd
 
# Load modules, always specify version number.
module load openmpi/4.1.1
module load arm-reports/20.0.2
module load python3-as-python # To fix a wrapper error in arm-reports
 
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`
 
# Run application
perf-report mpirun -np $PBS_NCPUS ./mpi_program >& profile_output

The NCI Arm licence allows users to run arm-reports on up to 2048 CPUs.

To run the job you would use the PBS command:

 $ qsub arm_reports_job.sh

When the job completes, you will see two output files: mpi_program_${PBS_NCPUS}p_..._<datetime>.txt and mpi_program_${PBS_NCPUS}p_..._<datetime>.html.

You can copy the HTML file to your local computer and open it in a web browser to view the graphical report. A sample report with a description of each section is available at https://developer.arm.com/documentation/101137/2002/Interpreting-performance-reports.
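
For example, you could fetch the report with scp from your local computer; the username, project, directory, and report file name below are placeholders only:

# Run on your local computer, not on Gadi. The username, path and file name
# are placeholders for illustration.
$ scp abc123@gadi.nci.org.au:/scratch/a00/abc123/work/mpi_program_48p_1node_2021-01-01_12-00.html .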

Normally you run Arm Performance Reports simply by putting perf-report in front of the command you wish to measure. However, some programs are themselves wrapper scripts: "bowtie2", for example, is a Perl script that calls several different programs. In that case, before running the alignment, edit the "bowtie2" script and add perf-report to the command that it runs, changing:

 my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);

to:

 my $cmd = "perf-report $align_prog$debug_str ...

A more detailed Performance Reports user guide can be found here:
https://developer.arm.com/docs/101137/latest/introduction

You can also find example performance reports characterising various applications here:
https://developer.arm.com/products/software-development-tools/hpc/documentation/characterizing-hpc-codes-with-arm-performance-reports

arm-forge 

With arm-forge, you can profile with Arm MAP or debug with Arm DDT.

Compile the application as normal, but add the -g option. With the Intel compiler, also add the -O0 flag. We recommend that you do not compile with optimisation flags such as -fast.

# Load modules, always specify version number.
$ module load openmpi/4.1.1
$ module load arm-forge/21.0
 
$ mpicc -g -o mpi_program mpi_program.c

Do not compile your program with optimisation flags while you are debugging it. Compiler optimisations can "rewrite" your program and produce machine code that does not necessarily match your source code.
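
As a sketch, a debug build with the Intel compiler might look like the following; the intel-compiler module version shown is an assumption, so check module avail intel-compiler for the versions actually installed:

# Module version below is an assumption; check `module avail intel-compiler`.
$ module load intel-compiler/2021.2.0
$ module load arm-forge/21.0

$ icc -g -O0 -o serial_program serial_program.c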

MAP 

To collect performance data, MAP uses two small libraries: the MAP sampler (map-sampler) and the MPI wrapper (map-sampler-pmpi) libraries. These must be linked with your program. There are somewhat strict rules regarding the linking order of your object files and these libraries (please read the User Guide for details), but if you follow the instructions printed by the MAP utility scripts, your code will most likely run with MAP.
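
On Gadi, dynamically linked programs are the common case and the sampler libraries are normally picked up for you. If you do need to build and link them yourself (for example, for a statically linked program), a rough sketch using the make-profiler-libraries utility shipped with Arm Forge might look like the following; the directory name is a placeholder, and the exact link flags to use are the ones the utility prints for your build:

$ module load openmpi/4.1.1
$ module load arm-forge/21.0

# Build the MAP sampler libraries in a directory of your choice
# (the directory name here is a placeholder).
$ mkdir -p $HOME/map-libs && cd $HOME/map-libs
$ make-profiler-libraries

# Re-link your program with the link flags printed by make-profiler-libraries
# (keeping the ordering it specifies), then profile as usual with:
$ map -profile mpirun -np $PBS_NCPUS ./mpi_program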

DDT 

DDT is a parallel debugger which can be run with up to 128 processors on Gadi. It can be used to debug serial, OpenMP and MPI codes.

TotalView users will find DDT has very similar functionality and an intuitive user interface. All of the primary parallel debugging features from TotalView are available with DDT.

Offline Profiling/Debugging 

An example PBS job submission script named arm_forge_job.sh is provided below. It requests exclusive access to a compute node on Gadi for 30 minutes through the normal queue, with 48 CPUs, 128 GiB of memory, and 400 GiB of local disk, charged to the project a00. It also requests that the job start in the directory from which it was submitted.

This script should be saved in the working directory from which the analysis will be done. To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests at the top of the job script file according to the information available in our queue structure guide.

Note that if your application does not run in parallel, you should set the number of CPU cores to 1 and scale the memory and jobfs requests down accordingly, to avoid wasting compute resources.

#!/bin/bash
 
#PBS -P a00
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=128GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd
 
# Load modules, always specify version number.
module load openmpi/4.1.1
module load arm-forge/21.0
 
# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`
 
# Run application
map -profile mpirun -np $PBS_NCPUS ./mpi_program >& map_output # To create a profile to be viewed later.
ddt -offline mpirun -np $PBS_NCPUS ./mpi_program >& ddt_output # Generate debug report for offline session.

The NCI Arm licence allows users to run arm-forge on up to 128 CPUs.

To run the job you would use the PBS command:

$ qsub arm_forge_job.sh

When the job completes, map will generate an mpi_program_${PBS_NCPUS}p_..._<datetime>.map file and ddt will generate an mpi_program_${PBS_NCPUS}p_..._<datetime>.html file. To view and analyse the .map file, install the arm-forge remote client on your local computer. You can download it from https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge#remote-client.
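
Alternatively, if you are logged in to Gadi with X11 forwarding (see the interactive section below), you can open the profile directly in the MAP GUI on Gadi; the file name below is a placeholder for whatever your job produced:

# The .map file name is a placeholder; use the file your job generated.
$ module load arm-forge/21.0
$ map mpi_program_48p_1node_2021-01-01_12-00.map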

A more detailed MAP user guide can be found here: https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/map

Interactive Profiling/Debugging 

Log in to Gadi with X11 (X Windows) forwarding. On Linux/Mac/Unix, add the -Y option to your SSH command so that SSH forwards the X11 connection to your local computer. On Windows, we recommend using MobaXterm (http://mobaxterm.mobatek.net) as it uses X11 forwarding automatically.
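
For example, from a Linux or macOS terminal (the username abc123 below is a placeholder for your NCI username):

# Replace abc123 with your NCI username.
$ ssh -Y abc123@gadi.nci.org.au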

You can find information about MobaXterm and X11 forwarding in our connecting to Gadi guide.

Start an interactive PBS job with the following command on Gadi. It requests 4 CPUs, 10 GiB of memory, and 30 GiB of local disk on a compute node in the normal queue for 30 minutes, charged to the project a00. It also requests that the job start in the directory from which it was submitted.

To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests in the qsub command below according to the information available in our queue structure guide. Note that if your application does not run in parallel, you should set the number of CPU cores to 1 and scale the memory and jobfs requests down accordingly, to avoid wasting compute resources.

Also note that you must add -l storage=scratch/ab12+gdata/yz98 to the qsub command below if the job needs access to /scratch/ab12/ and /g/data/yz98/. Details can be found in our PBS guide.

$ qsub -I -X -P a00 -q normal -l ncpus=4,mem=10GB,jobfs=30GB,walltime=00:30:00,wd

When the interactive job starts on Gadi, execute the following commands:

# Load modules, always specify version number.
$ module load openmpi/4.1.1
$ module load arm-forge/21.0
 
# Interactive MAP profiling
$ map mpirun -np $PBS_NCPUS ./mpi_program # This will open X windows on the compute node
 
# Interactive DDT debugging
$ ddt mpirun -np $PBS_NCPUS ./mpi_program # This will open X windows on the compute node
$ ddt -connect mpirun -np $PBS_NCPUS ./mpi_program # Requires the arm-forge remote client to be running

Python Program Debugging 

To debug a Python script, start the Python interpreter that will execute the script under DDT. To get line-level resolution, rather than function-level resolution, you must also insert %allinea_python_debug% before your script when passing arguments to Python. For example:

# Load modules, always specify version number.
# Only Python 3.5 - 3.8 are supported.
# Only openmpi up to 4.0.5 are supported.
$ module load python3/3.8.5
$ module load openmpi/4.0.5
$ module load arm-forge/21.0
 
$ ddt python3 %allinea_python_debug% ./python_serial.py
$ ddt mpirun -np $PBS_NCPUS python3 %allinea_python_debug% ./python_mpi.py

When the debug window opens, do one of the following:

  1. If you are using MPI, select the Python frame from within the Stacks view.
  2. Otherwise, press Play/Continue and DDT will stop at the first line of your Python script.

A more detailed DDT user guide can be found here: https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/ddt

For UM users 


UM users can use the Arm tools via the Rose workflow tool:

https://github.com/metomi/rose

Within Rose there is a job launch script (essentially a wrapper for mpirun/exec) that takes a number of environment variables to specify arguments and options for tasks.

To use ddt (for example):

  1. Set ROSE_LAUNCHER=ddt for the UM task.
  2. Add --connect mpirun to the start of your ROSE_LAUNCHER_PREOPTS for the UM task.

Make sure to load the arm-forge or arm-reports module (with a specific version number) and to launch the front-end debugger/profiler.

Note: to use Arm MAP instead, set ROSE_LAUNCHER=map and ROSE_LAUNCHER_PREOPTS=--profile mpirun.
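
If your suite lets you set task environment variables through a shell snippet, a rough sketch of the DDT case could look like the following; where exactly these variables are set (a rose-app.conf [env] section, the task's environment block in suite.rc, or a task shell snippet) depends on your suite and is an assumption here, not a prescribed configuration:

# A rough sketch only: set the Rose launcher variables for the UM task so
# that the Rose job launch wrapper starts the executable under DDT.
export ROSE_LAUNCHER=ddt
export ROSE_LAUNCHER_PREOPTS="--connect mpirun"

# For MAP profiling instead:
# export ROSE_LAUNCHER=map
# export ROSE_LAUNCHER_PREOPTS="--profile mpirun"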

Authors: Mohsin Ali