Determining exactly how well an MPI application is performing on current HPC systems is a challenging task. Analysis of the CPU time, system time and I/O time of a serial application can provide basic performance information, but for a parallel application the (wasted) time spent waiting on communication is not visible from “outside the application”. MPI performance analysis tools provide insight into this “internal” computation versus communication behaviour and, as a result, an understanding of the application’s parallel performance. They can reveal potential issues such as load imbalance, synchronization contention and much more. As well as pointing out the limitations of an MPI application, access to this profiling information can assist users in optimizing the application to achieve greater scalability.

MPI performance analysis is normally performed at two levels. The first level, called MPI summary profiling or simply MPI profiling, aggregates statistics at run time and provides a performance overview of the whole job execution. The second level, called MPI tracing, collects the MPI event history of an application execution and provides fine-grained information for each MPI function call (every message passed) along the execution timeline.

This document describes how to use the MPI performance analysis tools, including profilers and tracers, that are available on NCI compute systems.

An MPI profiler aggregates “whole run” statistics at run time, e.g. the total amount of time spent in MPI, the total number of messages or bytes sent, etc. As this information is available on a per-rank basis, issues such as load imbalance are exposed.
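
For illustration, consider the deliberately imbalanced code below (the workload distribution is invented for this example). Ranks that finish their local work early wait inside MPI_Reduce, and a summary profiler attributes that waiting time to MPI on those ranks, so the imbalance shows up clearly in the per-rank report.

    /* imbalance.c - sketch of a load-imbalanced MPI code.
     * Ranks are given deliberately unequal amounts of local work, so the
     * faster ranks wait inside MPI_Reduce; a summary profiler reports that
     * waiting as MPI time on those ranks, exposing the imbalance. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Invented workload: rank r performs (r + 1) * N iterations,
         * so higher ranks take longer than lower ranks. */
        const long N = 50 * 1000 * 1000;
        double local = 0.0;
        for (long i = 0; i < (rank + 1) * N; i++)
            local += 1.0 / (double)(i + 1);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", total);

        MPI_Finalize();   /* needed for the profiler to write its report */
        return 0;
    }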

Typically the overhead of collecting this summary profiling data is very low (~1%) and the volume of profiling data collected is also very small. During the run, information collection is local to each process and simply involves updating counters each time an MPI call is made. The profiling library only performs communication during report generation, typically at the end of the run, when it merges the results from all of the tasks into one output file. As a result, it is feasible to include an MPI profiler in all production runs.
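
This low overhead is typically achieved through the MPI standard’s profiling interface (PMPI): the profiling library supplies its own MPI_* entry points that update local counters and then call the underlying PMPI_* routines. The wrapper below is only a minimal sketch of that idea; it is not the actual IPM or mpiP implementation, and the counter names are invented.

    /* Sketch of PMPI interposition as used by MPI summary profilers. */
    #include <mpi.h>

    static long   send_calls   = 0;    /* per-process counters: updating   */
    static double send_seconds = 0.0;  /* them involves no communication   */

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = PMPI_Wtime();
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        send_seconds += PMPI_Wtime() - t0;
        send_calls++;
        return rc;
    }

    /* At MPI_Finalize time the library gathers these counters from all
     * ranks (e.g. with PMPI_Reduce) and writes a single report - the only
     * point at which the profiler itself communicates. */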

Note that (currently) no profiling information will be produced if the execution does not complete normally (i.e. does not call MPI_Finalize()).   

Two different lightweight MPI profilers, IPM and mpiP, are installed on NCI NF compute systems. Both require minimal effort to invoke – we recommend that you use them regularly. Note that they are only applicable to Open MPI applications.




Cooperation with General Profilers 

Because MPI profilers only record MPI function calls, they are not sufficient to reveal other details of the application. To gain a fuller understanding of the user’s program beyond its MPI behaviour, it is necessary to use a general purpose profiler.

An MPI tracer collects an event history. It is common to display such an event history on a timeline. Tracing data can provide a great deal of interesting detail, but the data volumes are large and the overhead of collection may be non-trivial. Often trace collection has to be limited in both duration and number of CPUs to be feasible. The use of MPI tracing is strongly encouraged during the development or tuning of parallel applications, but it should not be used in production runs.
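
One common way to keep trace volumes manageable is to restrict collection to a region of interest from within the source code. Many tracing tools interpret the standard MPI_Pcontrol() call as a collection switch (level 0 for off, level 1 for on is a widespread convention), but the exact semantics depend on the tool and should be checked in its documentation. A minimal sketch, with hypothetical setup_phase() and solver_phase() routines:

    /* Sketch: limiting trace collection to one phase of the run via the
     * standard MPI_Pcontrol() hook.  The 0 = off / 1 = on interpretation
     * is a common tool convention, not guaranteed by the MPI standard. */
    #include <mpi.h>

    /* Hypothetical application phases; empty stubs for illustration. */
    static void setup_phase(void)  { /* ... initialisation, not of interest ... */ }
    static void solver_phase(void) { /* ... communication-heavy region to trace ... */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Pcontrol(0);       /* ask the tracer to stop collecting */
        setup_phase();

        MPI_Pcontrol(1);       /* collect events only for this phase */
        solver_phase();
        MPI_Pcontrol(0);

        MPI_Finalize();
        return 0;
    }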