Determining exactly how well an MPI application is performing on current HPC systems is a challenging task. Analysis of the CPU time, system time and I/O time of a serial application can provide basic performance information, but for a parallel application the (wasted) time spent waiting on communication is not visible from “outside the application”. MPI performance analysis tools provide insight into this “internal” computation versus communication behaviour and, as a result, an understanding of the application’s parallel performance. They can reveal potential issues such as load imbalance, synchronization contention and much more. As well as pointing out the limitations of an MPI application, access to this profiling information can assist users in optimizing the application to achieve greater scalability.
MPI performance analysis is normally performed at two levels. The first level, called MPI summary profiling or simply MPI profiling, aggregates statistics at run time and provides a performance overview of the whole job execution. The second level, called MPI tracing, collects the MPI event history of an application execution and provides fine-grained information for each MPI function call (every message passed) along the execution timeline.
This document describes how to use the MPI performance analysis tools, including profilers and tracers, available on NCI compute systems.
MPI Profiling
An MPI profiler aggregates “whole run” statistics at run time, e.g. the total amount of time spent in MPI, the total number of messages or bytes sent, etc. As this information is available on a per-rank basis, issues such as load imbalance are exposed.
Typically the overhead of collecting this summary profiling data is very low (~1%) and the volume of profiling data collected is also very low. During runtime, information collection is local to each process and simply involves updating counters each time an MPI call is made. The profiling library only invokes communication during report generation, typically at the end of the run, to merge results from all of the tasks into one output file. As a result, it is feasible to include the use of an MPI profiler in all production runs.
Note that (currently) no profiling information will be produced if the execution does not complete normally (i.e. does not call MPI_Finalize()).
On NCI NF compute systems, two different lightweight MPI profilers are installed: IPM and mpiP. Both tools require minimal effort to invoke, and we recommend that you use them regularly. Note that they are only applicable to Open MPI applications.
IPM
The following versions of IPM are available:

IPM module         Open MPI version           Notes
ipm/0.983-nci      openmpi/1.6.5 or less
ipm/0.983-cache    openmpi/1.6.5 or less      L1, L2 and L3 cache misses
ipm/2.0.2          openmpi/1.7.* and 1.8.*
ipm/2.0.5          openmpi/1.10.2
Usage

Using IPM does not require code recompilation. Instead, LD_PRELOAD is used to dynamically load the IPM library (libipm.so) as a wrapper to the MPI runtime. A simple PBS job script using the IPM profiler with an MPI executable (prog.exe) is shown below:

#!/bin/bash
#PBS -l ncpus=2
module load openmpi
module load ipm
mpirun prog.exe > output

Currently, IPM is available for openmpi version 1.4.1 and above.

Users can also define the IPM log directory and log file name by setting the following environment variables in the PBS job script before mpirun. A good example of an IPM log directory and file name is:

export IPM_LOGDIR=/short/$PROJECT/ipm_logs
export IPM_LOGFILE=$PBS_JOBID.$USER.$PROJECT.`date +%s`

NOTE: For some applications, defining IPM_LOGDIR in .bashrc or .cshrc is compulsory to successfully generate the IPM profile data file. These applications usually have the following features: more to be added …
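Putting the pieces above together, a complete job script might look like the following. This is only a sketch: the ncpus and walltime values are placeholders, and the log directory is the example location used above.

#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=01:00:00

module load openmpi
module load ipm

# Collect the IPM profile in a dedicated directory with a descriptive file name
export IPM_LOGDIR=/short/$PROJECT/ipm_logs
export IPM_LOGFILE=$PBS_JOBID.$USER.$PROJECT.`date +%s`
mkdir -p $IPM_LOGDIR

mpirun prog.exe > output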
Profile Data

By default IPM produces a summary of the performance information for the application on stdout. IPM also generates an XML data file which will be named something like (if the user has not defined the IPM_LOGFILE environment variable):

your_username.1231369287.321103.0

e.g. jxc900.1231369287.321103.0
Graphical Parser and Viewer

The XML data file can be used to generate a graphical webpage in one of two ways.

Use a lightweight browser on the NCI machine

To visualise the IPM XML data on Raijin, you need to log in to Raijin with an X display, e.g. using ssh -X or ssh -Y, or with VNC. A detailed sample instruction on Raijin is listed below.

ssh -X raijin
module load openmpi
module load ipm
ipm_view IPM_XML_file

Use your favourite browser on your laptop/desktop

Alternatively, the IPM XML data file can be parsed to HTML format. A detailed sample instruction on Raijin is:

ssh raijin
module load openmpi
module load ipm
module load ploticus
ipm_parse -html <IPM_XML_file>

The ipm_parse command will generate a directory containing the parsed IPM profile data with graphs. The directory will be named something like:

a.out_1_your_username.1231369287.321103.0_ipm_${jobid}

You can secure copy the directory to your local disk. A sample instruction is as follows.

scp -r a.out_1_your_username.1231369287.321103.0_ipm_${jobid} user@your_local_machine:path_to_store/

Then you can view it with your favourite web browser, e.g. firefox, on your desktop:

firefox path_to_store/a.out_1_your_username.1231369287.321103.0_ipm_${jobid}/index.html
Integration with Hardware Performance Counters (HPM)

IPM can be integrated with hardware performance counters to profile useful information such as GFlops, cache misses, etc. PAPI is used for this purpose. Currently, IPM-HPM is only available for Open MPI version 1.4.3. To use IPM with HPM, load the following module:

module load ipm/0.983-cache

A sample PBS job script will be similar to the following:

#!/bin/bash
#PBS -l ncpus=2
module load openmpi
module load ipm/0.983-cache
export IPM_HPM=PAPI_FP_OPS,PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L2_TCA
mpirun prog.exe > output
Pre-defined PAPI Eventsets

The IPM_HPM environment variable can be set to one of the following predefined PAPI event sets on Raijin:

PAPI_FP_OPS,PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L2_TCA
PAPI_FP_OPS,PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L2_TCA,PAPI_L2_TCM
PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L2_STM,PAPI_L2_LDM,PAPI_L2_TCM,PAPI_L2_DCM
PAPI_TOT_CYC,PAPI_TOT_INS,PAPI_L3_TCM,PAPI_L3_LDM,PAPI_SR_INS,PAPI_LD_INS
MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM,MEM_UNCORE_RETIRED:OTHER_CORE_L2_HITM
PAPI_FP_OPS,PAPI_FP_INS,PAPI_DP_OPS,PAPI_VEC_DP
PAPI_FP_OPS,PAPI_FP_INS,PAPI_SP_OPS,PAPI_VEC_SP
PAPI_FP_OPS,PAPI_RES_STL,PAPI_TOT_CYC,PAPI_TOT_INS
PAPI_FP_OPS,PAPI_TLB_DM,PAPI_TLB_IM
PAPI_L1_DCA,PAPI_L2_DCM

Customize Your Own PAPI Eventsets

You can also customise your own PAPI event set for the IPM_HPM environment variable using the papi_event_chooser command. Both PAPI pre-defined events and native hardware events can be used.

module load papi
papi_event_chooser -help
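For example, assuming the standard PAPI command-line utilities are provided by the papi module, papi_event_chooser can be used to check which additional events can be counted together with events you have already selected (the events below are only an illustration):

module load papi
# List the preset events that are compatible with the two events already selected
papi_event_chooser PRESET PAPI_FP_OPS PAPI_TOT_CYC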
NCI Customised IPM Settings

Message sizes rounded to powers of 2

export IPM_ROUNDED=yes

This feature significantly reduces memory usage and postprocessing time, during both the run and parsing.

IPM Check-pointing

export IPM_CHECKPOINT=yes
export IPM_CHKPT_INTERVAL=3600

This feature allows IPM to checkpoint its profile status at $IPM_CHKPT_INTERVAL second intervals. The checkpointed profile data will be stored in files named

a.out_1_your_username.1231369287.321103.0_ipm_${jobid}.${rank}

These files can be merged (using the cat command) into a single XML file, and parsed with ipm_parse.
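As an illustration only (the merged file name below is arbitrary), the per-rank checkpoint files could be combined and parsed like this:

# Merge the per-rank checkpoint files and generate an HTML report
cat a.out_1_your_username.1231369287.321103.0_ipm_${jobid}.* > merged_ipm.xml
module load ploticus
ipm_parse -html merged_ipm.xml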
mpiP
Usage

Using mpiP does not require code recompilation. Instead, LD_PRELOAD is used to dynamically load the mpiP library (libmpiP.so) as a wrapper to the MPI runtime. A sample PBS job script using the mpiP profiler with an MPI executable (prog.exe) is shown below:

#!/bin/bash
#PBS -l ncpus=2
module load openmpi
module load mpiP
mpirun prog.exe > output

Currently, mpiP is available for openmpi version 1.3.3 and above.

Profile Data

The mpiP profiler generates a text-based output file named something like:

prog.exe.${np}.25972.1.mpiP

Graphical Viewer

To visualise the mpiP profile data on Raijin, you need to log in to Raijin with an X display, e.g. using ssh -X or ssh -Y, or with VNC. A detailed sample instruction on Raijin is listed below.

ssh -X raijin
module load openmpi
module load mpiP
mpirun prog.exe
mpipview prog.exe.${np}.XXXXX.1.mpiP
mpipview is able to map MPI callsites in the profile data to source code. This requires the MPI program to be compiled with the -g option and linked with libunwind, as follows.

module load openmpi
module load mpiP
mpicc -g -o prog.exe prog.c -lmpiP -lm -lbfd -liberty -lunwind
mpirun prog.exe
Cooperation with General Profilers
Because MPI profilers only profile MPI function calls, they are not sufficient to reveal other details of an application. To gain a better understanding of a user program, for example:
which portions of the user program spend the most time,
what the memory behaviour of the program is, including the number of load/store instructions, cache misses, etc.,
how many bus transactions the program makes,
it is necessary to use a general-purpose profiler, as sketched below.
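As one possible approach, not specific to the MPI tools described here, a general-purpose profiler such as the Linux perf tool can be wrapped around each MPI rank. The snippet below is a sketch only and assumes perf is available on the compute nodes; OMPI_COMM_WORLD_RANK is set by Open MPI at run time.

module load openmpi
# Run each rank under "perf stat" and write one counter report per rank
mpirun -np 2 sh -c 'exec perf stat -o perf.rank${OMPI_COMM_WORLD_RANK}.txt ./prog.exe'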
MPI Tracing
An MPI tracer collects an event history, which is commonly displayed on a timeline. Tracing data can provide a great deal of detail, but data volumes are large and the overhead of collection may be non-trivial. Often the collection of traces has to be limited in both duration and number of CPUs to be feasible. The use of MPI tracing is strongly encouraged during the development or tuning of parallel applications, but it should not be used in production runs.
MPI I/O Profiling
Darshan
Darshan is designed to capture an accurate picture of application I/O behaviour, including properties such as patterns of access within files, with minimal overhead. It is developed by Argonne National Laboratory, and the latest version installed is 3.0.1 (May 2016). This version has been built against different versions of Open MPI; for example, darshan/3.0.1-ompi-1.10.2 has been built against Open MPI version 1.10.2. It is recommended to use matching versions of Open MPI and Darshan to obtain the most accurate profile results. The current default version is 2.3.1. Darshan was originally developed on the IBM Blue Gene/P series supercomputers deployed at Argonne and is portable across a wide variety of systems.

Usage

Load Open MPI version 1.10.2 and the Darshan build against the same version of Open MPI, then run your application under Darshan:

module load openmpi/1.10.2
module load darshan/3.0.1-ompi-1.10.2
export DARSHAN_LOG=/logdir # (must be an absolute path)
mpirun # your exe

If DARSHAN_LOG is set, it will be used; otherwise the logs will go to /short/public/darshan_logs/[year]/[month]/[day]/

To generate a PDF summary report from the logs:

darshan-job-summary.pl /logdir/***.darshan
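For reference, a complete PBS job script combining the steps above might look like the following. This is a sketch only: the ncpus and walltime values and the log directory under /short/$PROJECT are placeholders.

#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=01:00:00

module load openmpi/1.10.2
module load darshan/3.0.1-ompi-1.10.2

# Write Darshan logs to a project-owned directory (must be an absolute path)
export DARSHAN_LOG=/short/$PROJECT/darshan_logs
mkdir -p $DARSHAN_LOG

mpirun prog.exe > output

# After the run (this can also be done interactively on the login node),
# generate the PDF summary from the newly created log
darshan-job-summary.pl $DARSHAN_LOG/*.darshan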