Analysing the CPU time, system time and I/O time of a serial application provides basic performance information that helps users and developers understand how their programs behave and find ways to improve them. Some general profilers, such as HPCToolkit and OpenSpeedShop, also support parallel jobs and can be very useful for analysing the performance of parallel applications.

This document describes how to use the general performance analysis tools that are available on NCI compute systems. For further help with using performance profilers and tracers, please send email to help@nci.org.au.

 


Contents

  1. General Performance Analysis Tools
    1. HPCToolKit
      1. Usage
      2. Sampling Frequency and Measurements
      3. Profile Data Parse
      4. Graphical Viewer
    2. Open|SpeedShop
      1. Usage
      2. Profile Data Viewing
    3. gprof
      1. gprof for sequential programs
      2. gprof for parallel programs
  2. Useful Links


HPCToolkit

HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to large-scale supercomputers. HPCToolkit provides accurate measurements of a program’s work, resource consumption, and inefficiency, correlates these metrics with the program’s source code, works with multilingual, fully optimized binaries, has very low measurement overhead, and scales to large parallel systems. HPCToolkit’s measurements support analysis of a program’s execution cost, inefficiency, and scaling characteristics both within and across nodes of a parallel system.

Usage

 

Load HPCToolkit module

```bash
module load hpctoolkit
```

Collect Profile Measurements

Measurement of application performance takes two different forms depending on whether your application is dynamically or statically linked. To monitor a dynamically linked application, simply use hpcrun to launch the application. To monitor a statically linked application, link your application using hpclink.   

  • Dynamically linked binaries
    • To monitor a sequential or multithreaded application, use:

      ```bash
      hpcrun [options] prog.exe [arguments]
      ```

    • To monitor an MPI application, use:

      ```bash
      mpirun hpcrun [options] prog.exe [arguments]
      ```

  • Statically linked binaries
    • To link hpcrun’s monitoring code into prog.exe, use:

      ```bash
      hpclink <linker> -o prog.exe <linker-arguments>
      ```
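Which of the two routes applies depends on how prog.exe was linked. The following is a minimal sketch of one way to check this, assuming the standard ldd utility is available on the system. The function name choose_launcher is made up for this illustration and is not part of HPCToolkit.

```bash
#!/bin/bash
# Hypothetical helper: suggest hpcrun for a dynamically linked binary
# and hpclink for anything ldd cannot resolve (e.g. a static binary).
choose_launcher() {
    local binary="$1"
    if [ ! -f "$binary" ]; then
        echo "error: $binary not found" >&2
        return 1
    fi
    # ldd exits with an error for files that are not dynamically linked.
    if ldd "$binary" >/dev/null 2>&1; then
        echo "hpcrun"
    else
        echo "hpclink"
    fi
}

# /bin/sh is used here only because it exists on most systems.
choose_launcher /bin/sh
```

In practice the argument would be the path to prog.exe. The check is only a heuristic: ldd also fails on files that are not executables at all, so the result should be read as a hint.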

If no options are specified to hpcrun, wall-clock time is measured for prog.exe. Otherwise, please specify the PAPI events to be measured for prog.exe. A list of the available PAPI events can be retrieved by running the following command:

```bash
hpcrun -L prog.exe
```

A sample PBS job script for using hpcrun with measurements passed through environment variables looks like the following:

```bash
#PBS -q normal
#PBS -l ncpus=32
#PBS -l walltime=1:00:00
#PBS -l mem=16GB
#PBS -l wd

module load openmpi/1.6.5
module load hpctoolkit

export HPCRUN_EVENT_LIST="WALLCLOCK@5000"
mpirun -np 32 hpcrun prog.exe
```

A sample PBS job script for using hpcrun with measurements passed as options looks like the following:

```bash
#PBS -q normal
#PBS -l ncpus=32
#PBS -l walltime=1:00:00
#PBS -l mem=16GB
#PBS -l wd

module load openmpi/1.6.5
module load hpctoolkit

mpirun -np 32 hpcrun -e WALLCLOCK@5000 prog.exe
```

 


Sampling Frequency and Measurements

In the above example, the number after @ (5000) is the sampling period of the measurement. The larger this number, the lower the sampling frequency and the lower the associated overhead of HPCToolkit. In general, the overhead of HPCToolkit is around 1% to 3%.
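To make the relation between the number after @ and the sampling frequency concrete, here is a small arithmetic sketch. It assumes that the WALLCLOCK value is a period expressed in microseconds, as described in the HPCToolkit documentation; the function name is illustrative only.

```bash
#!/bin/bash
# Convert a WALLCLOCK sampling period (assumed to be in microseconds)
# into the approximate number of samples taken per second.
samples_per_second() {
    local period_us="$1"
    echo $(( 1000000 / period_us ))
}

samples_per_second 5000    # WALLCLOCK@5000  -> 200
samples_per_second 20000   # WALLCLOCK@20000 -> 50
```

For PAPI events the number counts event occurrences rather than time, so the resulting sampling frequency depends on how often the event fires in the application.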

Some other useful measurements include:   

  • WALLCLOCK: Wall-clock time spent in each function or on outstanding instructions
  • PAPI_FP_INS: Floating point instructions (x87)
  • PAPI_VEC_SP: Single precision vector/SIMD instructions
  • PAPI_VEC_DP: Double precision vector/SIMD instructions
  • PAPI_LD_INS: Load instructions
  • PAPI_SR_INS: Store instructions
  • PAPI_BR_INS: Branch instructions
  • and more; the full list is printed by hpcrun -L prog.exe

Note: the available measurement events differ between systems. Please make sure that an event is available and measurable by checking the output of hpcrun -L prog.exe.

To measure multiple events at once, use either of the following forms, given as hpcrun options or as an environment variable:

  • -e WALLCLOCK@5000 -e PAPI_LD_INS@4000001 -e PAPI_SR_INS@4000001
  • export HPCRUN_EVENT_LIST="WALLCLOCK@5000;PAPI_LD_INS@4000001;PAPI_SR_INS@4000001"
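Both forms carry the same information, so they can be generated from a single list of events. The sketch below shows one way to do this in a job script; the variable names events, event_list and event_opts are chosen for this example only.

```bash
#!/bin/bash
# Events and sampling periods taken from the two forms shown above.
events="WALLCLOCK@5000 PAPI_LD_INS@4000001 PAPI_SR_INS@4000001"

event_list=""   # ';'-separated form for HPCRUN_EVENT_LIST
event_opts=""   # repeated "-e EVENT" form for the hpcrun command line
for ev in $events; do
    event_list="${event_list:+$event_list;}$ev"
    event_opts="${event_opts:+$event_opts }-e $ev"
done

export HPCRUN_EVENT_LIST="$event_list"
echo "$HPCRUN_EVENT_LIST"
echo "$event_opts"
```

With the option form, the job script would then run mpirun -np 32 hpcrun $event_opts prog.exe, leaving $event_opts unquoted so that it splits into separate arguments.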


Profile Data Parse

hpcrun generates a directory named as follows in your job's working directory:

```bash
hpctoolkit-<prog.exe>-measurements-<jobid>
```

Please follow this sequence to parse the raw measurements in hpctoolkit-<prog.exe>-measurements-<jobid>.

Recovering Program Structure

```bash
hpcstruct prog.exe
```

This generates a prog.exe.hpcstruct file, which contains the code structure of prog.exe.

Parse the Raw Measurements

For a serial program:

```bash
hpcprof -S prog.exe.hpcstruct -I <source code directory>/'*' hpctoolkit-<prog.exe>-measurements-<jobid>
```

For a parallel (MPI/OpenMP) program, use either:

```bash
hpcprof --force-metric --metric=<metrics option> -S prog.exe.hpcstruct -I <source code directory>/'*' hpctoolkit-<prog.exe>-measurements-<jobid>
```

Options for --metric (or -M) include:

  • sum: show (only) sum over threads/processes metrics (default)
  • stats: show (only) sum, mean, standard dev, coef of var, min, and max over threads/processes metrics
  • thread: show only thread metrics

Please refer to hpcprof --help for more details.

or:

```bash
hpcprof-mpi -S prog.exe.hpcstruct -I <source code directory>/'*' hpctoolkit-<prog.exe>-measurements-<jobid>
```

Note that hpcprof-mpi does not compute 'thread' metrics.

A database that can be presented graphically is generated after hpcprof or hpcprof-mpi has run. It is a directory with a name like:

```bash
hpctoolkit-<prog.exe>-database-<jobid>
```
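The post-processing steps above always follow the same order, so it can help to see them written out for one concrete run. The sketch below only prints the commands instead of executing them. The function name, the source directory ./src and the job id 1234567 are placeholders invented for this example, and the serial form of hpcprof is used.

```bash
#!/bin/bash
# Print (rather than execute) the post-processing commands for one run.
# Arguments: program name, source directory, job id (all placeholders).
hpctoolkit_postprocess_cmds() {
    local prog="$1" srcdir="$2" jobid="$3"
    local measurements="hpctoolkit-${prog}-measurements-${jobid}"
    echo "hpcstruct ${prog}"
    echo "hpcprof -S ${prog}.hpcstruct -I ${srcdir}/'*' ${measurements}"
    echo "hpcviewer hpctoolkit-${prog}-database-${jobid}"
}

hpctoolkit_postprocess_cmds prog.exe ./src 1234567
```

The last printed command opens the resulting database in the graphical viewer, which is described in the next section.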

Graphical Viewer


To visualise the HPCToolkit profile data on Raijin, you need to log in to Raijin with an X display, e.g. using ssh -Y. Detailed sample instructions for Raijin are listed below:

```bash
ssh -Y abc123@raijin.nci.org.au
module load hpctoolkit
hpcviewer hpctoolkit-<prog.exe>-database-<jobid>
```

Two different metrics are presented, inclusive and exclusive, denoted by “I” and “E” respectively in the metric panel of hpcviewer.

  • “I” indicates the inclusive measurement: the sum of all costs attributed to this call site and any of its descendants.
  • “E” indicates the exclusive measurement: the sum of all costs attributed strictly to this call site.
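As a small worked example of the difference, consider a made-up call tree in which main calls two leaf routines. The routine names and the costs below are invented purely for illustration and do not come from any real profile.

```bash
#!/bin/bash
# Hypothetical exclusive costs (in seconds) for a tiny call tree:
#   main calls solve and output; solve and output call nothing else.
main_exclusive=1
solve_exclusive=6
output_exclusive=3

# For leaf call sites the inclusive cost equals the exclusive cost.
solve_inclusive=$solve_exclusive
output_inclusive=$output_exclusive

# Inclusive cost of main = its own cost plus that of all descendants.
main_inclusive=$(( main_exclusive + solve_inclusive + output_inclusive ))

echo "main: E=${main_exclusive} I=${main_inclusive}"
```

Here main has a small exclusive cost but a large inclusive cost, which indicates that the time is spent in the routines it calls rather than in main itself.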

Open|SpeedShop

OpenSpeedShop (OSS) is a community effort by The Krell Institute with current direct funding from DOE’s NNSA and Office of Science. It builds on a broad list of community infrastructures, most notably Dyninst and MRNet from UW, libmonitor from Rice, and PAPI from UTK. OpenSpeedShop is an open source, multi-platform Linux performance tool, initially targeted at performance analysis of applications running on both single-node and large-scale platforms.

OpenSpeedShop is explicitly designed with usability in mind and is intended for application developers and computer scientists. The base functionality includes:

  • Sampling Experiments  
  • Support for Callstack Analysis  
  • Hardware Performance Counters  
  • MPI Profiling and Tracing  
  • I/O Profiling and Tracing  
  • Floating Point Exception Analysis   

In addition, OpenSpeedShop is designed to be modular and extensible. It supports several levels of plug-ins which allow users to add their own performance experiments.   

OpenSpeedShop development is hosted by the Krell Institute. The infrastructure and base components of OpenSpeedShop are released as open source code primarily under LGPL.   

Usage

Load openspeedshop module

```bash
module load openspeedshop
```

 

Experiment types

OSS provides different profiling options, called experiments, for specific performance analysis.

  • pcsamp: periodic sampling of the program counter gives a low-overhead view of where the time is being spent in the user application.
  • usertime: periodic sampling of the call path allows the user to view inclusive and exclusive time spent in application routines. It also allows the user to see which routines called which routines. Several views are available, including the “hot” path.
  • hwc: PAPI hardware events are counted at the machine instruction, source line, and function levels.
  • hwcsamp: similar to hwc, except that sampling is based on time, not PAPI event overflows. Also, up to six events may be sampled during the same experiment.
  • hwctime: similar to hwc, except that call path sampling is also included.
  • io: accumulated wall-clock duration of I/O system calls: read, readv, write, writev, open, close, dup, pipe, create and others.
  • iot: similar to io, except that more information is gathered, such as bytes moved, file names, etc. Note: this is a tracing-like experiment.
  • mpi: captures the time spent in and the number of times each MPI function is called. The trace format option displays the data for each call, showing its start and end time.
  • mpit: records each MPI function call event with specific data for display using a GUI or a command line interface (CLI). Note: this is a tracing-like experiment.
  • mpiotf: similar to mpit, except that it writes the MPI call trace to Open Trace Format (OTF) files to allow viewing with Vampir or converting to formats of other tools.
  • fpe: finds where each floating-point exception occurred. A trace collects each exception with its type and the call stack contents. These measurements are exact, not statistical.

OSS Commands Matching for Each Experiment

OSS provides a convenience command for each of the above experiments:

  • osspcsamp: for pcsamp
  • ossusertime: for usertime
  • osshwc: for hwc, similar to HPCToolkit
  • osshwcsamp: for hwcsamp, similar to HPCToolkit
  • osshwctime: for hwctime, similar to HPCToolkit
  • ossio: for io
  • ossiot: for iot
  • ossmpi: for mpi
  • ossmpit: for mpit
  • ossmpiotf: for mpiotf
  • ossfpe: for fpe
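As the list shows, each convenience command is simply the experiment name prefixed with oss. A job script that selects the experiment through a variable could derive the command as in the sketch below; the function name oss_command_for is invented for this example and is not provided by OSS.

```bash
#!/bin/bash
# Map an experiment name to its convenience command by prefixing "oss".
# The accepted names are copied from the experiment types listed above.
oss_command_for() {
    local experiment="$1"
    case "$experiment" in
        pcsamp|usertime|hwc|hwcsamp|hwctime|io|iot|mpi|mpit|mpiotf|fpe)
            echo "oss${experiment}"
            ;;
        *)
            echo "unknown experiment: ${experiment}" >&2
            return 1
            ;;
    esac
}

oss_command_for usertime   # prints: ossusertime
```

Rejecting unknown names early avoids submitting a job that would only fail once it starts running.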

Sample PBS Job Script for OSS

A sample PBS job script is shown below:

```bash
#PBS -q normal
#PBS -l ncpus=32
#PBS -l walltime=2:00:00
#PBS -l mem=32GB
#PBS -l wd

module load openmpi/1.6.5
module load openspeedshop

export OPENSS_RAWDATA_DIR=/short/$PROJECT/$USER/tmp
OSS_Cmd "mpirun -n 32 mpi_prog.exe"
```

OPENSS_RAWDATA_DIR needs to be set to a path on a shared file system. We recommend using /short/$PROJECT/$USER/tmp. Make sure that the directory exists before the job runs.

Note that OSS_Cmd stands for one of the OSS commands listed in the section above, such as ossmpi, ossio, etc.
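One way to make sure the raw data directory exists is to create it at the start of the job script. The following is a minimal sketch; the function name ensure_rawdata_dir is hypothetical, and the check only covers existence and write permission, not whether the path is on a shared file system.

```bash
#!/bin/bash
# Create the raw data directory if needed, check that it is writable,
# and export it for OSS. Returns non-zero if either step fails.
ensure_rawdata_dir() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    if [ ! -w "$dir" ]; then
        echo "error: $dir is not writable" >&2
        return 1
    fi
    export OPENSS_RAWDATA_DIR="$dir"
}

# Intended use in the job script, before the OSS command is run:
#   ensure_rawdata_dir "/short/$PROJECT/$USER/tmp" || exit 1
```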


Profile Data Viewing

An .openss profile data file is generated after the job completes. It is usually named as follows:

```bash
mpi_prog.exe-<OSS experiment name>-openmpi.openss
```

Interactive Command Line:

```bash
openss -cli -f mpi_prog.exe-<OSS experiment name>-openmpi.openss
openss>> expview
```
 

Interactive GUI:

```bash
openss -f mpi_prog.exe-<OSS experiment name>-openmpi.openss
```

For detailed OSS commands and viewer usage, please refer to the OpenSpeedShop User Guide or the OpenSpeedShop cheat sheet.


Gprof

Gprof for Sequential Programs

The gprof profiler provides information on the most time-consuming subprograms in your code. Profiling the executable prog.exe will lead to profiling data being stored in gmon.out which can then be interpreted by gprof as follows:   

Usage

For the Intel compilers do:

```bash
ifort -p -o prog.exe prog.f
./prog.exe
gprof ./prog.exe gmon.out
```

For the GNU compilers do:

```bash
gfortran -pg -o prog.exe prog.f
./prog.exe
gprof ./prog.exe gmon.out
```

 


 

Gprof For Parallel Programs

Usage

Compilation:

```bash
mpif90 -pg -g -o prog.exe prog.f
```

PBS script:

```bash
...
mpirun /apps/pgprof/parallel_gprof prog.exe
```

The code of parallel_gprof:

```bash
export GMON_OUT_PREFIX=gmon.out.$PBS_JOBID.$OMPI_COMM_WORLD_RANK
...
```
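The idea behind this wrapper is that every MPI rank writes its profile to a separate file instead of all ranks overwriting one gmon.out. The sketch below illustrates that idea only. It is not the content of /apps/pgprof/parallel_gprof, which may do more, and the function name run_with_rank_profile and the job id and rank values are made up for the demonstration.

```bash
#!/bin/bash
# Hypothetical stand-in for a per-rank gprof wrapper: set GMON_OUT_PREFIX
# so that each MPI rank gets its own profile file, then run the program.
run_with_rank_profile() {
    # PBS_JOBID is set by PBS; OMPI_COMM_WORLD_RANK by Open MPI's mpirun.
    export GMON_OUT_PREFIX="gmon.out.${PBS_JOBID}.${OMPI_COMM_WORLD_RANK}"
    "$@"
}

# Demonstration with made-up values in place of a real job and rank.
PBS_JOBID=1234567
OMPI_COMM_WORLD_RANK=0
run_with_rank_profile sh -c 'echo "$GMON_OUT_PREFIX"'
```

With glibc, the process ID is expected to be appended to this prefix, so each resulting file can then be passed to gprof together with prog.exe in the same way as gmon.out in the sequential case.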

Useful Links