Analyzing the CPU time, system time and I/O time of a serial application provides basic performance information that helps users and developers understand how their programs behave and find ways to improve them. Some general-purpose profilers, such as HPCToolkit and OpenSpeedShop, also support parallel jobs and can be very useful for analyzing the performance of parallel applications.
This document describes how to use the general performance analysis tools available on NCI NF compute systems. For further help with performance profilers and tracers, please email help@nci.org.au.
HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to large-scale supercomputers. HPCToolkit provides accurate measurements of a program’s work, resource consumption, and inefficiency, correlates these metrics with the program’s source code, works with multilingual, fully optimized binaries, has very low measurement overhead, and scales to large parallel systems. HPCToolkit’s measurements support analysis of a program’s execution cost, inefficiency, and scaling characteristics both within and across nodes of a parallel system.
Measurement of application performance takes two different forms depending on whether your application is dynamically or statically linked. To monitor a dynamically linked application, simply use hpcrun to launch the application. To monitor a statically linked application, link your application using hpclink.
$ hpcrun [options] prog.exe [arguments]
$ mpirun hpcrun [options] prog.exe [arguments]
$ hpclink <linker> -o prog.exe <linker-arguments>
If no options are specified to hpcrun, wall-clock time is measured for prog.exe. Otherwise, specify the PAPI events to be measured for prog.exe. A list of available events can be retrieved by running the following command:
$ hpcrun -L prog.exe
A sample PBS job script that uses hpcrun with the measurement events passed through an environment variable is as follows:
#PBS -q normal
#PBS -l ncpus=32
#PBS -l walltime=1:00:00
#PBS -l vmem=16GB
#PBS -wd
module load openmpi/1.6.5
module load hpctoolkit
export HPCRUN_EVENT_LIST="WALLCLOCK@5000"
mpirun -np 32 hpcrun prog.exe
A sample PBS job script that uses hpcrun with the measurement events passed as a command-line option is as follows:
#PBS -q normal
#PBS -l ncpus=32
#PBS -l walltime=1:00:00
#PBS -l mem=16GB
module load openmpi/1.6.5
module load hpctoolkit
mpirun -np 32 hpcrun -e WALLCLOCK@5000 prog.exe
In the above examples, 5000 is the sampling period for each individual measurement. A larger period means a lower sampling frequency, and therefore lower HPCToolkit overhead. In general, the overhead of HPCToolkit is around 1% to 3%.
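The relation between period and sampling frequency can be checked with a little arithmetic. The sketch below assumes that the WALLCLOCK period is given in microseconds; please confirm this against the HPCToolkit documentation for the installed version. The function name samples_per_second is a hypothetical helper, not part of HPCToolkit.

```shell
# Convert a WALLCLOCK sampling period into an approximate number of
# samples per second.
# Assumption: the value after "@" is a period in microseconds.
samples_per_second() {
    period_us=$1
    # 1 second = 1000000 microseconds; integer division is enough here.
    echo $((1000000 / period_us))
}

samples_per_second 5000     # the period used in the job scripts above
samples_per_second 20000    # a longer period: fewer samples, lower overhead
```

Under that assumption, a period of 5000 corresponds to roughly 200 samples per second per thread, and quadrupling the period reduces this to roughly 50.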
Some other useful measurements include:
and more. For a complete list of measurable events, please refer to hpcrun -L prog.exe or the PAPI Preset Events list.
Note: the available measurement events differ between systems. Please use hpcrun -L prog.exe to make sure an event is available and measurable.
To measure multiple events at once, use the following format for the event options or the environment variable:
-e WALLCLOCK@5000 -e PAPI_LD_INS@4000001 -e PAPI_SR_INS@4000001
export HPCRUN_EVENT_LIST="WALLCLOCK@5000;PAPI_LD_INS@4000001;PAPI_SR_INS@4000001"
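When the set of events changes between runs, both forms can be generated from a single list. The following is a minimal sketch; event_list_env and event_list_opts are hypothetical helper names introduced here for illustration, and the event names are simply those used above.

```shell
# Join event specifications into the two forms shown above.
# event_list_env  A B C  ->  "A;B;C"          (for HPCRUN_EVENT_LIST)
# event_list_opts A B C  ->  "-e A -e B -e C" (for the hpcrun command line)
event_list_env() {
    out=""
    for ev in "$@"; do
        # Prepend the separator only when something has been collected already.
        out="${out:+$out;}$ev"
    done
    printf '%s\n' "$out"
}

event_list_opts() {
    out=""
    for ev in "$@"; do
        out="${out:+$out }-e $ev"
    done
    printf '%s\n' "$out"
}

HPCRUN_EVENT_LIST="$(event_list_env WALLCLOCK@5000 PAPI_LD_INS@4000001 PAPI_SR_INS@4000001)"
export HPCRUN_EVENT_LIST
echo "$HPCRUN_EVENT_LIST"
```

Keeping the events in one place makes it less likely that the environment variable and the command-line options drift apart between job scripts.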
hpcrun will generate a directory in your job's working directory, named as follows:
hpctoolkit-<prog.exe>-measurements-<jobid>
Use the following sequence of commands to process the raw measurements in hpctoolkit-<prog.exe>-measurements-<jobid>.
$ hpcstruct prog.exe
This will generate a prog.exe.hpcstruct file which contains the code structure of prog.exe.
For a serial program:
$ hpcprof -S prog.exe.hpcstruct -I <source code directory>/'*' hpctoolkit-<prog.exe>-measurements-<jobid>
For a parallel (MPI/OpenMP) program, use either:
$ hpcprof --force-metric --metric=<metrics option> -S prog.exe.hpcstruct -I <source code directory>/'*' hpctoolkit-<prog.exe>-measurements-<jobid>
Values for the --metric (-M) option include:
sum: show (only) summary metrics (Sum, Mean, StdDev, CoefVar, Min, Max)
thread: show only thread metrics
sum+: show both thread and summary metrics
Please refer to hpcprof --help for more details.
or:
$ hpcprof-mpi -S prog.exe.hpcstruct -I <source code directory>/'*' hpctoolkit-<prog.exe>-measurements-<jobid>
After hpcprof or hpcprof-mpi has run, it generates a database that can be presented graphically. It is a directory with a name like:
hpctoolkit-<prog.exe>-database-<jobid>
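The post-processing steps and directory names above can be collected into one place. The sketch below only prints the commands for a serial program rather than running them, so it can be tried without HPCToolkit loaded. The function name and the example job id and source directory are hypothetical placeholders.

```shell
# Print (do not run) the post-processing commands for one measurement run.
# Arguments: program name, job id, source code directory (all placeholders).
hpctoolkit_postprocess_plan() {
    prog=$1
    jobid=$2
    srcdir=$3
    # Directory names follow the patterns described in the text above.
    meas="hpctoolkit-${prog}-measurements-${jobid}"
    db="hpctoolkit-${prog}-database-${jobid}"
    echo "hpcstruct ${prog}"
    echo "hpcprof -S ${prog}.hpcstruct -I ${srcdir}/'*' ${meas}"
    echo "hpcviewer ${db}"
}

hpctoolkit_postprocess_plan prog.exe 1234567 ./src
```

Printing the plan first is a cheap way to check the directory names before committing to a long hpcprof run.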
To visualise the HPCToolkit profile data on Raijin, you need to log in to Raijin with an X display, e.g. using ssh -Y. Sample instructions for Raijin are listed below.
$ ssh -Y raijin
$ module load hpctoolkit
$ hpcviewer hpctoolkit-<prog.exe>-database-<jobid>
Two different metrics are presented: inclusive and exclusive, denoted by “I” and “E” respectively in the metric panel of hpcviewer.
attributed strictly to this call site.
OpenSpeedShop (OSS) is a community effort by The Krell Institute with current direct funding from DOE’s NNSA and Office of Science. It builds on a broad list of community infrastructures, most notably Dyninst and MRNet from UW, libmonitor from Rice, and PAPI from UTK. OpenSpeedShop is an open source, multi-platform Linux performance tool, initially targeted at performance analysis of applications running on both single-node and large-scale platforms.
OpenSpeedShop is explicitly designed with usability in mind and is intended for application developers and computer scientists. The base functionality includes:
In addition, OpenSpeedShop is designed to be modular and extensible. It supports several levels of plug-ins which allow users to add their own performance experiments.
OpenSpeedShop development is hosted by the Krell Institute. The infrastructure and base components of OpenSpeedShop are released as open source code primarily under LGPL.
To use OSS, please load the module as follows:
$ module load openspeedshop
OSS provides different profiling options, called experiments, for specific performance analysis.
iot: similar to io, except that more information is gathered, such as bytes moved, file names, etc. Note: this is a tracing-like experiment.
mpit: records each MPI function call event with specific data for display using a GUI or a command line interface (CLI). Note: this is a tracing-like experiment.
OSS provides a convenience command for each of the above experiments:
osspcsamp: for pcsamp
ossusertime: for usertime
osshwc: for hwc, similar to HPCToolkit
osshwcsamp: for hwcsamp, similar to HPCToolkit
osshwctime: for hwctime, similar to HPCToolkit
ossio: for io
ossiot: for iot
ossmpi: for mpi
ossmpit: for mpit
ossmpiotf: for mpiotf
ossfpe: for fpe
A sample PBS job script is shown below:
#PBS -q normal
#PBS -l ncpus=32
#PBS -l walltime=2:00:00
#PBS -l mem=32GB
module load openmpi/1.6.5
module load openspeedshop
export OPENSS_RAWDATA_DIR=/short/$PROJECT/$USER/tmp
OSS_Cmd "mpirun -n 32 mpi_prog.exe"
OPENSS_RAWDATA_DIR needs to be set to a path on a shared file system. We recommend using /short/$PROJECT/$USER/tmp. OSS_Cmd stands for one of the OSS commands listed in the section above, such as ossmpi or ossio.
A .openss profile data file will be generated after the job completes. It is usually named as follows:
mpi_prog.exe-<OSS experiment name>-openmpi.openss
$ openss -cli -f mpi_prog.exe-<OSS experiment name>-openmpi.openss
openss>> expview
$ openss -f mpi_prog.exe-<OSS experiment name>-openmpi.openss
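The naming conventions above (a convenience command formed by prefixing the experiment name with oss, and an output file named after the executable and the experiment) can be sketched as two small helpers. Both function names are hypothetical, and the openmpi suffix is taken from the usual file name quoted above; the actual name may differ with other MPI implementations.

```shell
# Map an OSS experiment name to its convenience command, e.g. mpi -> ossmpi.
oss_command() {
    printf 'oss%s\n' "$1"
}

# Build the usual output file name.
# $1 = executable name, $2 = experiment name.
# Assumption: the "openmpi" suffix matches the MPI module used for the run.
oss_output_file() {
    printf '%s-%s-openmpi.openss\n' "$1" "$2"
}

exp=mpi
echo "job script line: $(oss_command "$exp") \"mpirun -n 32 mpi_prog.exe\""
echo "view afterwards: openss -cli -f $(oss_output_file mpi_prog.exe "$exp")"
```

This only prints the lines to use, so it can be run on a login node to check names before submitting a job.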
For detailed OSS commands and viewer usage, please refer to the OpenSpeedShop User Guide or the OpenSpeedShop cheat sheet.
The gprof profiler provides information on the most time-consuming subprograms in your code. Profiling the executable prog.exe will lead to profiling data being stored in gmon.out which can then be interpreted by gprof as follows:
$ ifort -p -o prog.exe prog.f
$ ./prog.exe
$ gprof ./prog.exe gmon.out
For the GNU compilers, do:
$ gfortran -pg -o prog.exe prog.f
$ ./prog.exe
$ gprof ./prog.exe gmon.out
Compilation:
$ mpif90 -pg -g -o prog.exe prog.f
PBS script:
...
$ mpirun /apps/pgprof/parallel_gprof prog.exe
The code of parallel_gprof:
export GMON_OUT_PREFIX=gmon.out.$PBS_JOBID.$OMPI_COMM_WORLD_RANK
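The idea behind this wrapper can be sketched as follows. This is an illustration only, not the actual contents of /apps/pgprof/parallel_gprof: the function names gmon_prefix and run_profiled and the fallback values nojob and 0 are hypothetical. It relies on glibc appending the process ID to GMON_OUT_PREFIX, so each MPI rank writes its own profile file instead of overwriting a shared gmon.out.

```shell
# Build the per-rank prefix for gmon output files.
gmon_prefix() {
    # Fall back to placeholder values when run outside PBS or Open MPI.
    printf 'gmon.out.%s.%s\n' "${PBS_JOBID:-nojob}" "${OMPI_COMM_WORLD_RANK:-0}"
}

# Wrapper body: export the prefix, then replace the shell with the program.
# Intended to be called as the last step of a wrapper script started by
# mpirun, e.g.  run_profiled "$@"
run_profiled() {
    GMON_OUT_PREFIX=$(gmon_prefix)
    export GMON_OUT_PREFIX
    exec "$@"
}

# Show the prefix this process would use.
gmon_prefix
```

After the job, each rank's file can be passed to gprof in place of gmon.out, for example gprof ./prog.exe followed by the name of one rank's file.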