Overview

Linaro Forge is a tool suite that provides the following tools on Gadi:

  1. Linaro DDT for parallel high-performance application debugging,
  2. Linaro MAP for performance profiling and optimization advice, and
  3. Linaro Performance Reports for summarizing and characterizing both scalar and MPI application performance.

Linaro Forge supports many parallel architectures and programming models, including MPI, OpenMP and GPUs. It provides cross-platform support for the latest compilers and C++ standards, and for Intel, 64-bit Arm, AMD, OpenPOWER, NVIDIA GPU, AMD GPU and Intel Xe-HPC GPU hardware.

Linaro Forge provides native remote clients for Windows, Mac OS X, and Linux. A remote client can be used to connect to the cluster and run, debug, profile, edit, and compile the application files.

More information is available at: https://www.linaroforge.com/download-documentation

How to use 


You can check the versions installed on Gadi with a module query:

$ module avail linaro-forge

We normally recommend using the latest version available and always recommend specifying the version number in the module command:

$ module load linaro-forge/24.0.2

For more details on using modules see our software applications guide.

Make sure to request some jobfs disk in your job submission, as the Linaro Forge tools write a large amount of output to jobfs.
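
A jobfs request is made with a PBS resource directive in the job script (or on the qsub command line); the value below is only an example and should be adjusted to your needs:

#PBS -l jobfs=400GB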

Compiler Flags

The recommended sets of compilation flags for profiling are listed below; an example compile command is shown after the list.

  • Arm Compiler for Linux: -g1 -O3 -fno-inline-functions -fno-optimize-sibling-calls

  • Cray Fortran: -G2 -O3 -h ipa0

  • Cray Clang C and C++: -g1 -O3 -fno-inline-functions -fno-optimize-sibling-calls

  • GNU: -g1 -O3 -fno-inline -fno-optimize-sibling-calls

  • Intel: -debug minimal -O3 -fno-inline -no-ip -no-ipo -fno-omit-frame-pointer -fno-optimize-sibling-calls

  • NVIDIA HPC: -g -O3 -Meh_frame -Mnoautoinline
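
For example, an MPI C code could be compiled for MAP profiling with the GNU flags above; the source and binary names here are placeholders:

$ mpicc -g1 -O3 -fno-inline -fno-optimize-sibling-calls -o mpi_program mpi_program.c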

When compiling the program that you want to debug, you must add the debug flag to your compile command. For most compilers this is -g.

Do not compile your program with optimisation flags while you are debugging it. Compiler optimisations can "rewrite" your program and produce machine code that does not necessarily match your source code.
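
For example, a debug build with the GNU compiler might look like the following; the file names are placeholders and -O0 turns optimisation off:

$ mpicc -g -O0 -o mpi_program mpi_program.c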

Offline Profiling/Debugging 

An example PBS job submission script named linaro_forge_job.sh is provided below. It requests 48 CPUs, 128 GiB memory, and 400 GiB local disk on a compute node on Gadi from the normal queue, for exclusive access to the node for 30 minutes, charged to project a00. It also requests that the job enter the working directory once it starts.

This script should be saved in the working directory from which the analysis will be done. To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests at the top of the job script file according to the information available in our queue structure guide.

Note that if your application does not run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs accordingly to avoid wasting compute resources.

#!/bin/bash

#PBS -P a00
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=128GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Load modules, always specify version number.
module load openmpi/4.1.7
module load python3/3.12.1
module load linaro-forge/24.0.2

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`

# Run application
# ---------------
# Generate only a profile map file. This map file can be used to view and generate profile reports.
map --profile mpirun -np $PBS_NCPUS ./mpi_program >& map_output
map --profile mpirun -np $PBS_NCPUS python3 ./python_mpi.py >& python_map_output

# Generate a profile map file, a text profile report file and an html profile report file.
# It also shows a profile summary in the terminal.
map --profile --report=txt,html,summary mpirun -np $PBS_NCPUS ./mpi_program >& map_output
map --profile --report=txt,html,summary mpirun -np $PBS_NCPUS python3 ./python_mpi.py >& python_map_output

# Generate text and HTML profile reports from an existing map file
perf-report <program_name>.map

# Alternatively, generate text and HTML profile reports by running the program under perf-report directly
perf-report mpirun -np $PBS_NCPUS ./mpi_program >& map_output
perf-report mpirun -np $PBS_NCPUS python3 ./python_mpi.py >& python_map_output

# Generate a debug report from an offline DDT session
ddt --offline mpirun -np $PBS_NCPUS ./mpi_program >& ddt_output
ddt --offline mpirun -np $PBS_NCPUS python3 %allinea_python_debug% ./python_mpi.py >& python_ddt_output

The NCI Linaro Forge licence allows users to run ddt and map on up to 256 CPUs, and perf-report on up to 4000 CPUs.

Normally you run Linaro Forge Performance Reports simply by putting perf-report in front of the command you wish to measure. However, some programs, such as bowtie2, are actually Perl scripts that call several other programs. In that case, before running the alignment, edit the bowtie2 script and add perf-report to the command that it runs:

my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);

like this:

my $cmd = "perf-report $align_prog$debug_str ...

To submit the job, use the PBS command:

$ qsub linaro_forge_job.sh
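
You can check the job's status while it is queued or running, and after it finishes, with qstat; the job ID below is a placeholder:

$ qstat -swx 12345678.gadi-pbs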

Upon job completion, map and perf-report will have generated the following files:

  • For a non-Python program: mpi_program_${PBS_NCPUS}p_${PBS_NNODES}n_${OMP_NUM_THREADS}t_<datetime>.{map,html,txt}
  • For a Python program: python3_allinea_ddt_trace_py_${PBS_NCPUS}p_${PBS_NNODES}n_${OMP_NUM_THREADS}t_<datetime>.{map,html,txt}

and ddt will generate the following files:

  • For a non-Python program: mpi_program_${PBS_NCPUS}p_${PBS_NNODES}n_${OMP_NUM_THREADS}t_<datetime>.html
  • For a Python program: python3<.XY>_${PBS_NCPUS}p_${PBS_NNODES}n_${OMP_NUM_THREADS}t_<datetime>.html

You can copy the HTML file to your local computer and open it in your favourite web browser to view the graphical report.
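
For example, from your local computer you could copy a report with scp; the remote path and file name below are placeholders:

$ scp <username>@gadi.nci.org.au:/scratch/a00/<username>/mpi_program_48p_1n_1t_<datetime>.html .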

To view and analyse the performance of your application from the <program_name>.map file, install and use the Linaro Forge remote client on your local computer. You can also view and export the performance report using the Reports menu.

Since linaro-forge/24.0.2 is loaded on Gadi, a remote client of version 24.0.* must be installed, configured, and running on your local machine. Configure the remote client by setting /apps/linaro-forge/24.0.2 as the Remote Installation Directory and <username>@gadi.nci.org.au as the Host Name.

A remote client can be downloaded from https://www.linaroforge.com/download-documentation.

Interactive Profiling/Debugging 

Log in to Gadi with X11 (X-Windows) forwarding. On Linux/Mac/Unix, add the -Y option to your SSH command so that the X11 connection is forwarded to your local computer. For Windows, we recommend using MobaXterm (http://mobaxterm.mobatek.net) as it enables X11 forwarding automatically.
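
For example, on Linux or Mac:

$ ssh -Y <username>@gadi.nci.org.au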

You can find information about MobaXterm and X-forwarding in our Connecting to Gadi guide.

Start an interactive PBS job with the following command on Gadi. It requests 4 CPUs, 10 GiB memory, and 30 GiB local disk on a compute node on Gadi from the normal queue for 30 minutes, charged to project a00. It also requests that the job enter the working directory once it starts.

To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests in the qsub command below according to the information available in our queue structure guide. Note that if your application does not run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs accordingly to avoid wasting compute resources.

Also note that you must add -l storage=scratch/ab12+gdata/yz98 to the qsub command below if the job needs access to /scratch/ab12/ and /g/data/yz98/ (an example is shown after the command). Details can be found in our PBS guide.

$ qsub -I -X -P a00 -q normal -l ncpus=4,mem=10GB,jobfs=30GB,walltime=00:30:00,wd
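
For example, with the storage directive from the note above included, the command becomes (the project and storage codes are placeholders):

$ qsub -I -X -P a00 -q normal -l ncpus=4,mem=10GB,jobfs=30GB,walltime=00:30:00,storage=scratch/ab12+gdata/yz98,wd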

When the interactive job starts on Gadi, execute the following commands:

# Load modules, always specify version number.
$ module load openmpi/4.1.7
$ module load python3/3.12.1
$ module load linaro-forge/24.0.2

# Interactive MAP profiling (use either 1 or 2 below)
# ---------------------------------------------------
# 1. This will open X windows on the compute node
$ map mpirun -np $PBS_NCPUS ./mpi_program
$ map mpirun -np $PBS_NCPUS python3 ./python_mpi.py

# 2. This will open windows in the remote client running on your local computer
$ map --connect mpirun -np $PBS_NCPUS ./mpi_program
$ map --connect mpirun -np $PBS_NCPUS python3 ./python_mpi.py

# Interactive DDT debugging (use either 1 or 2 below)
# ---------------------------------------------------
# 1. This will open X windows on the compute node
$ ddt mpirun -np $PBS_NCPUS ./mpi_program
$ ddt mpirun -np $PBS_NCPUS python3 %allinea_python_debug% ./python_mpi.py

# 2. This will open windows in the remote client running on your local computer
$ ddt --connect mpirun -np $PBS_NCPUS ./mpi_program
$ ddt --connect mpirun -np $PBS_NCPUS python3 %allinea_python_debug% ./python_mpi.py

For UM Users

UM users can use the Linaro Forge tools via the Rose workflow tool (https://github.com/metomi/rose).

Within Rose there is a job launch script (essentially a wrapper for mpirun/exec) that takes a number of environment variables to specify arguments and options for tasks.

To use DDT (for example):

  • Set ROSE_LAUNCHER=ddt for the UM task. 
  • Add --connect mpirun to the start of your ROSE_LAUNCHER_PREOPTS for the UM task.

Make sure to load the linaro-forge module (with a specific version number) before launching the front-end debugger/profiler.

Note: To use map instead, set ROSE_LAUNCHER=map and ROSE_LAUNCHER_PREOPTS=--profile mpirun. A sketch of the task environment settings is shown below.
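
As a sketch only (the exact place to set these depends on how your suite defines the UM task's environment), the DDT settings described above amount to something like:

# Hypothetical environment settings for the UM task in a Rose suite;
# prepend --connect mpirun to any pre-options the task already uses.
ROSE_LAUNCHER=ddt
ROSE_LAUNCHER_PREOPTS=--connect mpirun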


Authors: Mohsin Ali
