Debugging parallel MPI programs is quite different from traditional memory debugging of serial programs, because the execution of a parallel program is often non-deterministic. It is therefore essential to be able to inspect both the message queues, for debugging MPI communication, and the stack traces, for memory examination.
On NCI compute systems, two debugging and inspection tools are installed: Padb and Totalview. This document describes how to use these parallel program inspection and debugging tools. For further help with using performance profilers and tracers, please send email to email@example.com.
Padb (Parallel Application Debugger) is a job inspection tool for examining and debugging parallel programs. Primarily it simplifies the process of gathering stack traces on compute clusters, but it also supports a wide range of other functions. Padb supports a number of parallel environments and works out of the box on the majority of clusters. It is an open-source, non-interactive, scriptable command-line tool intended for use by programmers and system administrators alike.
The latest installed version is padb/3.3. Type the following to load it:
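A minimal sketch, assuming the module is named exactly as the version string above and that the site uses environment modules. The guard is only there so the snippet does nothing harmful on a host without the `module` command:

```shell
# Module name taken from the version string in the text.
PADB_MODULE="padb/3.3"

# `module` is a shell function provided by environment modules;
# skip the load on hosts where it is not defined.
if command -v module >/dev/null 2>&1; then
    module load "$PADB_MODULE"
else
    echo "environment modules not available; cannot load $PADB_MODULE"
fi
```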
Show current active jobs under PBS:
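A minimal sketch, assuming padb's `--show-jobs` option and that padb detects the resource manager on its own. The command is printed first and only run where padb is actually on the `PATH`:

```shell
# List the active jobs that padb can find.
PADB_CMD="padb --show-jobs"
echo "Running: $PADB_CMD"

# Run only where padb is installed (for example after loading its module).
# $PADB_CMD is left unquoted on purpose so the shell splits it into words.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```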
Target a specific jobid, and report its process state:
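One possible form, assuming padb's `--proc-summary` mode is the intended one here. The job ID `1234567` is a hypothetical placeholder; take the real one from the job listing:

```shell
JOBID="${JOBID:-1234567}"     # hypothetical PBS job ID; substitute your own

# Per-process summary (state, CPU, memory) for every rank of the job.
PADB_CMD="padb --proc-summary $JOBID"
echo "Running: $PADB_CMD"

# Run only where padb is installed; unquoted so the command is word-split.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```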
Target a specific jobid, and report its MPI message queues, stack traces, etc.:
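A sketch of this, assuming padb's `--full-report=<jobid>` option, which gathers several reports in one pass. The job ID is a hypothetical placeholder, and because the output can be long it is piped through a pager here:

```shell
JOBID="${JOBID:-1234567}"     # hypothetical PBS job ID; substitute your own

# Combined report for the whole job.
PADB_CMD="padb --full-report=$JOBID"
echo "Running: $PADB_CMD"

# Run only where padb is installed; page the (potentially long) output.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD | less
fi
```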
Target a specific jobid, and report its stack trace for a given MPI process (rank):
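A minimal sketch, assuming padb's `--stack-trace`, `--tree` and `--rank` options. Both the job ID and the rank number below are hypothetical placeholders:

```shell
JOBID="${JOBID:-1234567}"     # hypothetical PBS job ID; substitute your own
RANK="${RANK:-0}"             # MPI rank to inspect (placeholder value)

# Stack trace of a single rank, displayed in tree form.
PADB_CMD="padb --stack-trace --tree --rank $RANK $JOBID"
echo "Running: $PADB_CMD"

# Run only where padb is installed; unquoted so the command is word-split.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```

Omitting `--rank` should report on all ranks of the job, with identical traces merged by `--tree`.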
Target a specific jobid, and report its stack trace including information about parameters and local variables for a given MPI process (rank):
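A hedged sketch: the configuration keys `stack_shows_params` and `stack_shows_locals`, passed with `-O`, are assumed from padb's list of configuration options and should be checked against `padb --help` on the installed version. The job ID and rank are hypothetical placeholders:

```shell
JOBID="${JOBID:-1234567}"     # hypothetical PBS job ID; substitute your own
RANK="${RANK:-0}"             # MPI rank to inspect (placeholder value)

# ASSUMPTION: these two configuration keys enable parameter and
# local-variable output; verify the exact spelling for your padb version.
PADB_OPTS="-O stack_shows_params=1 -O stack_shows_locals=1"

PADB_CMD="padb --stack-trace --tree --rank $RANK $PADB_OPTS $JOBID"
echo "Running: $PADB_CMD"

# Run only where padb is installed; unquoted so the command is word-split.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```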
MPI Message Queue
Target a specific jobid, and report its MPI message queues:
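A minimal sketch, assuming padb's `--mpi-queue` mode; the job ID is a hypothetical placeholder. Pending sends and receives that never match up across ranks are a common sign of a communication deadlock:

```shell
JOBID="${JOBID:-1234567}"     # hypothetical PBS job ID; substitute your own

# Dump the MPI message queues of every rank in the job.
PADB_CMD="padb --mpi-queue $JOBID"
echo "Running: $PADB_CMD"

# Run only where padb is installed; unquoted so the command is word-split.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```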
Process Progress Watch
Target a specific jobid, and report its MPI process progress over a period of time:
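A hedged sketch, assuming padb's `--mpi-watch` mode combined with `--watch` to repeat the query at intervals; confirm both options with `padb --help` on the installed version. The job ID is a hypothetical placeholder:

```shell
JOBID="${JOBID:-1234567}"     # hypothetical PBS job ID; substitute your own

# ASSUMPTION: --watch makes padb repeat the --mpi-watch query until interrupted.
PADB_CMD="padb --mpi-watch --watch $JOBID"
echo "Running: $PADB_CMD"

# Run only where padb is installed. Stop the display with Ctrl-C.
if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```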
For more detailed usage please refer to PADB’s “Mode of operation” web page, http://padb.pittman.org.uk/modes.html, or PADB’s help information:
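For example, the built-in help can be displayed as follows (a sketch that only runs padb where it is installed):

```shell
# Print padb's option summary.
PADB_CMD="padb --help"
echo "Running: $PADB_CMD"

if command -v padb >/dev/null 2>&1; then
    $PADB_CMD
fi
```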
Totalview can be used to debug parallel MPI or OpenMP programs. Introductory information and user guides on using Totalview are available from this site.
First load the module to use Totalview:
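A minimal sketch, assuming the module is simply called `totalview` (an assumption; `module avail totalview` shows the versions actually installed):

```shell
# ASSUMPTION: the default Totalview module is named "totalview".
TV_MODULE="totalview"

# `module` is a shell function provided by environment modules;
# skip the load on hosts where it is not defined.
if command -v module >/dev/null 2>&1; then
    module load "$TV_MODULE"
else
    echo "environment modules not available; cannot load $TV_MODULE"
fi
```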
Compile code with the -g option. For example, for an MPI program:
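A sketch for a C program, assuming the `mpicc` compiler wrapper is available. The file names `my_mpi_prog.c` and `my_mpi_prog` are hypothetical; a Fortran code would use the corresponding Fortran wrapper in the same way:

```shell
SRC="my_mpi_prog.c"           # hypothetical source file
EXE="my_mpi_prog"             # hypothetical executable name

# -g adds the debugging symbols that the debugger needs.
COMPILE_CMD="mpicc -g -o $EXE $SRC"
echo "Running: $COMPILE_CMD"

# Compile only if the wrapper and the source file are both present.
if command -v mpicc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    $COMPILE_CMD
fi
```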
Start Totalview. For example, to debug an MPI program using 4 MPI processes:
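One common way to do this is to start Totalview on `mpirun` itself and pass the launcher arguments after `-a`. This is a sketch under that assumption; the site may prefer a different launch method, and `./my_mpi_prog` is a hypothetical executable name:

```shell
NPROCS=4                      # number of MPI processes, as in the text
EXE="./my_mpi_prog"           # hypothetical executable built with -g

# Everything after -a is handed to mpirun rather than to Totalview.
TV_CMD="totalview mpirun -a -np $NPROCS $EXE"
echo "Running: $TV_CMD"

# Launch only where Totalview is installed and the executable exists.
if command -v totalview >/dev/null 2>&1 && [ -x "$EXE" ]; then
    $TV_CMD
fi
```

Totalview is a graphical application, so an X11 display (for example via `ssh -X`) is needed for the windows to appear.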
Note that to ensure that Totalview can obtain information on all variables, compile with no optimisation. This is the default if -g is used with no specific optimisation level.
Totalview shows source code for mpirun when it first starts an MPI job. A GUI like the following is generated. Click on GO: all the processes will start up, and you will be asked whether you want to stop the parallel job. Click YES at this point if you want to insert breakpoints. The source code will then be shown, and you can click on any line where you wish to stop.
If your source code is in a different directory from the one where you started Totalview, you may need to add the path to Search Path under the File menu. Right-clicking on any subroutine name will “dive” into the source code for that routine, and breakpoints can be set there.
The procedure for viewing variables in an MPI job is a little complicated. If your code has stopped at a breakpoint, right-click on the variable or array of interest in the stack frame window. If the variable cannot be displayed, choose “Add to expression list” and the variable will appear listed in a new window. If it is marked “Invalid compilation scope” in this new window, right-click again on the variable name in this window and choose “Compilation scope”. Change this to “Floating” and the value of the variable or array should appear. Right-clicking on it again and choosing “Dive” will give you values for arrays. In this window you can choose “Laminate” then “Process” under the View menu to see the values on different processors.
Under the Tools option on the top toolbar of the window displaying the variable values, you can choose Visualize to display the data graphically, which can be useful for large arrays.
It is also possible to use Totalview for memory debugging, for showing graphical representations of outstanding MPI messages, and for setting action points based on the evaluation of a user-defined expression. See the Totalview User Guide for more information.