Note: Linaro acquired the Arm Forge software tools business in January 2023.
Arm provides two separate modules on Gadi: arm-reports and arm-forge.
You can check the versions installed on Gadi with a module query:
$ module avail arm-reports
$ module avail arm-forge
We normally recommend using the latest version available, and we always recommend specifying the version number with the module command:
$ module load arm-reports/20.0.2
$ module load python3-as-python   # To fix a wrapper error in arm-reports
$ module load arm-forge/21.0
For more details on using modules see our software applications guide.
With arm-reports, you can generate a performance report for an HPC application.

An example PBS job submission script named arm_reports_job.sh is provided below. It requests 48 CPUs, 128 GiB memory, and 400 GiB local disk on a compute node on Gadi from the normal queue, for exclusive access for 30 minutes against the project a00. It also requests that the job start in the working directory from which it was submitted.
This script should be saved in the working directory from which the analysis will be done. To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests at the top of the job script file according to the information available in our queue structure guide.
Note that if your application does not run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs requests accordingly to avoid wasting compute resources.
#!/bin/bash

#PBS -P a00
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=128GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Load modules, always specify version number.
module load openmpi/4.1.1
module load arm-reports/20.0.2
module load python3-as-python   # To fix a wrapper error in arm-reports

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`

# Run application
perf-report mpirun -np $PBS_NCPUS ./mpi_program >& profile_output
To run the job you would use the PBS command:
$ qsub arm_reports_job.sh
When the job completes, you will see two output files: mpi_program_${PBS_NCPUS}p_..._<datetime>.txt and mpi_program_${PBS_NCPUS}p_..._<datetime>.html.
You can copy the HTML file to your local computer and open it in a web browser to view the graphical report. A sample report with descriptions is available at https://developer.arm.com/documentation/101137/2002/Interpreting-performance-reports.
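For example, you might copy the report to your local machine with scp (a sketch only: the username, project code, and working directory path are illustrative placeholders, and gadi.nci.org.au is assumed to be the Gadi login host):
# Run this on your local computer; the wildcard matches the generated report name.
$ scp "abc123@gadi.nci.org.au:/scratch/a00/abc123/workdir/mpi_program_48p_*.html" .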
Normally, you run Arm Performance Reports simply by putting perf-report in front of the command you wish to measure. However, some programs, such as bowtie2, are actually Perl scripts that call several different programs. In that case, before running the alignment, edit the bowtie2 script and add perf-report to the command that it runs:
my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
so that it becomes:
my $cmd = "perf-report $align_prog$debug_str ...
A more detailed Performance Reports user guide can be found here:
https://developer.arm.com/docs/101137/latest/introduction
You can also see example performance reports characterising various applications here:
https://developer.arm.com/products/software-development-tools/hpc/documentation/characterizing-hpc-codes-with-arm-performance-reports
With arm-forge, you can profile with Arm MAP or debug with Arm DDT.
Compile the application as normal, but with the -g option added. Add the -O0 flag when using the Intel compiler. We also recommend that you do not build with aggressive optimisation flags such as -fast.
# Load modules, always specify version number.
$ module load openmpi/4.1.1
$ module load arm-forge/21.0
$ mpicc -g -o mpi_program mpi_program.c
To collect performance data, MAP uses two small libraries: the MAP sampler (map-sampler) and the MPI wrapper (map-sampler-pmpi). These must be used with your program, and there are fairly strict rules about the linking order of your object files and these libraries (please read the User Guide for detailed information). However, if you follow the instructions printed by the MAP utility scripts, your code will most likely run with MAP.
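As an illustrative sketch only (this is the generic Arm MAP workflow rather than a Gadi-specific instruction): the make-profiler-libraries utility bundled with arm-forge can generate the sampler libraries and prints the exact link flags to use, so treat the flags below as placeholders and follow what the utility prints for your compiler:
# Generate the MAP sampler libraries in the current directory; the command
# prints the link line to use for your compiler and MPI.
$ make-profiler-libraries

# Illustrative link step for a dynamically linked MPI program -- replace the
# flags with the ones printed by make-profiler-libraries.
$ mpicc -g -o mpi_program mpi_program.o \
        -L$PWD -lmap-sampler-pmpi -lmap-sampler -Wl,-rpath=$PWD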
DDT is a parallel debugger which can be used to debug serial, OpenMP and MPI codes.
TotalView users will find DDT has very similar functionality and an intuitive user interface. All of the primary parallel debugging features from TotalView are available with DDT.
An example PBS job submission script named arm_forge_job.sh is provided below. It requests 48 CPUs, 128 GiB memory, and 400 GiB local disk on a compute node on Gadi from the normal queue, for exclusive access for 30 minutes against the project a00. It also requests that the job start in the working directory from which it was submitted.
This script should be saved in the working directory from which the analysis will be done. To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests at the top of the job script file according to the information available in our queue structure guide.
Note that if your application does not run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs requests accordingly to avoid wasting compute resources.
#!/bin/bash

#PBS -P a00
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=128GB
#PBS -l jobfs=400GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Load modules, always specify version number.
module load openmpi/4.1.1
module load arm-forge/21.0

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`

# Run application
map -profile mpirun -np $PBS_NCPUS ./mpi_program >& map_output   # To create a profile to be viewed later.
ddt -offline mpirun -np $PBS_NCPUS ./mpi_program >& ddt_output   # Generate debug report for offline session.
To run the job you would use the PBS command:
$ qsub arm_forge_job.sh
When the job completes, map will generate a mpi_program_${PBS_NCPUS}p_..._<datetime>.map file and ddt will generate a mpi_program_${PBS_NCPUS}p_..._<datetime>.html file. You need to install the Arm Forge remote client on your local computer to view and analyse the .map file. You can download it from https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge#remote-client.
A more detailed MAP user guide can be found here: https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/map
Log in to Gadi with X11 (X-Windows) forwarding. On Linux/Mac/Unix, add the -Y option to your SSH command to request that SSH forward the X11 connection to your local computer. For Windows, we recommend using MobaXterm (http://mobaxterm.mobatek.net), as it uses X11 forwarding automatically.
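For example, from a Linux or macOS terminal (a sketch only: replace abc123 with your NCI username; gadi.nci.org.au is assumed to be the Gadi login host):
$ ssh -Y abc123@gadi.nci.org.au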
You can find information about MobaXterm and X-forwarding in our Connecting to Gadi guide.
Start an interactive PBS job with the following command on Gadi. It requests 4 CPUs, 10 GiB memory, and 30 GiB local disk on a compute node on Gadi from the normal queue, for exclusive access for 30 minutes against the project a00. It also requests that the job start in the working directory from which it was submitted.
To change the number of CPU cores, memory, or jobfs required, simply modify the appropriate PBS resource requests in the qsub command below according to the information available in our queue structure guide. Note that if your application does not run in parallel, set the number of CPU cores to 1 and adjust the memory and jobfs requests accordingly to avoid wasting compute resources.
Also note that you must add -l storage=scratch/ab12+gdata/yz98 to the qsub command below if the job needs access to /scratch/ab12/ and /g/data/yz98/; a variant with the storage directive is sketched after the command. Details can be found in our PBS guide.
$ qsub -I -X -P a00 -q normal -l ncpus=4,mem=10GB,jobfs=30GB,walltime=00:30:00,wd
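For example, a variant of the same command that also requests the storage mentioned above might look like this (the project code and storage paths are the illustrative placeholders used throughout this page):
$ qsub -I -X -P a00 -q normal \
       -l ncpus=4,mem=10GB,jobfs=30GB,walltime=00:30:00,storage=scratch/ab12+gdata/yz98,wd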
When the interactive job starts on Gadi, execute the following commands:
# Load modules, always specify version number.
$ module load openmpi/4.1.1
$ module load arm-forge/21.0

# Interactive MAP profiling
$ map mpirun -np $PBS_NCPUS ./mpi_program           # This will open X windows on the compute node

# Interactive DDT debugging
$ ddt mpirun -np $PBS_NCPUS ./mpi_program           # This will open X windows on the compute node
$ ddt -connect mpirun -np $PBS_NCPUS ./mpi_program  # Have to have a remote client running
To debug Python scripts, start the Python interpreter that will execute the script under DDT. To get line level resolution, rather than function level resolution, you must also insert %allinea_python_debug%
before your script when passing arguments to Python. For example:
# Load modules, always specify version number.
# Only Python 3.5 - 3.8 are supported.
# Only openmpi versions up to 4.0.5 are supported.
$ module load python3/3.8.5
$ module load openmpi/4.0.5
$ module load arm-forge/21.0
$ ddt python3 %allinea_python_debug% ./python_serial.py
$ ddt mpirun -np $PBS_NCPUS python3 %allinea_python_debug% ./python_mpi.py
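If you would rather debug non-interactively inside a batch job, the same Python invocation can in principle be combined with the offline DDT mode shown earlier (a sketch only; python_mpi.py is the illustrative script name used above):
# Offline DDT debugging of an MPI Python script; writes a debug report
# that can be inspected after the job finishes.
$ ddt -offline mpirun -np $PBS_NCPUS python3 %allinea_python_debug% ./python_mpi.py >& ddt_python_output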
When the debug window opens, you must do the following:
A more detailed DDT user guide can be found here: https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/ddt
UM users can use the Arm tools via the Rose workflow tool:
https://github.com/metomi/rose
Within Rose there is a job launch script (essentially a wrapper around mpirun/exec) that takes a number of environment variables to specify arguments for tasks.
To use ddt (for example):
- Set ROSE_LAUNCHER=ddt for the UM task.
- Add --connect mpirun to the start of your ROSE_LAUNCHER_PREOPTS for the UM task.
Make sure to load the arm-forge / arm-reports modules (with a specific version number) and to launch the front-end debugger/profiler.
Note: to use Arm MAP instead, set ROSE_LAUNCHER=map and ROSE_LAUNCHER_PREOPTS=--profile mpirun, as in the sketch below.
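As a minimal sketch of the two settings above (where exactly they are set depends on your suite; in practice they usually go into the task's Rose/cylc environment configuration rather than a plain shell script):
# Debug the UM task with DDT, reverse-connecting to a running remote client.
# (Prepend --connect mpirun to any existing ROSE_LAUNCHER_PREOPTS content.)
export ROSE_LAUNCHER=ddt
export ROSE_LAUNCHER_PREOPTS="--connect mpirun"

# Or, to profile with Arm MAP instead:
export ROSE_LAUNCHER=map
export ROSE_LAUNCHER_PREOPTS="--profile mpirun"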