
Debugging parallel MPI programs is quite different from traditional memory debugging of serial programs, because parallel programs can behave non-deterministically. It is therefore essential to be able to examine both the MPI message queues and the stack traces of the running processes.

On NCI compute systems, the following two debug/inspection tools are installed:

padb

Padb (Parallel Application Debugger) is a job inspection tool for examining and debugging parallel programs. Primarily it simplifies the process of gathering stack traces on compute clusters, but it also supports a wide range of other functions. Padb supports a number of parallel environments and works out of the box on the majority of clusters. It is an open source, non-interactive, scriptable command-line tool intended for use by programmers and system administrators alike.

The latest installed version is padb/3.3. Type the following to load it:

module load padb/3.3

Usage

Common Usage

Show the currently active jobs under PBS:

padb --show-jobs

Target a specific jobid and report its process state:

padb <jobid> --proc-summary

Target a specific jobid and report its MPI message queues, stack traceback, etc.:

padb --full-report=<jobid>

Stack Trace

Target a specific jobid, and report its stack trace for a given MPI process (rank):

padb <jobid> --stack-trace --tree --rank <MPI rank id>

Target a specific jobid, and report its stack trace including information about parameters and local variables for a given MPI process (rank):

padb <jobid> --stack-trace --tree --rank <MPI rank id> -O stack-shows-locals=1 -O stack-shows-params=1

MPI Message Queue

Target a specific jobid and report its MPI message queues:

padb <jobid> --mpi-queue

Process Progress Watch

Target a specific jobid, and report its MPI process progress over a period of time:

padb <jobid> --mpi-watch --watch -O watch-clears-screen=no

For more detailed usage please refer to PADB’s “Mode of operation” web page, http://padb.pittman.org.uk/modes.html, or PADB’s help information:

padb -h
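
The individual padb commands above can be combined into one helper when you want a quick snapshot of a job. The following is a minimal sketch only: the function name padb_snapshot and the PADB override variable are hypothetical conveniences of this example and are not part of padb. It uses only the padb options already shown on this page.

```shell
# padb_snapshot <jobid> <MPI rank id>
# Runs, in order, the process summary, the stack trace for one rank,
# and the MPI message queue report for the given job.
# PADB is an assumption of this sketch: set PADB=echo to preview the
# commands without running padb.
padb_snapshot() {
    if [ "$#" -ne 2 ]; then
        echo "usage: padb_snapshot <jobid> <MPI rank id>" >&2
        return 1
    fi
    # Process state of every rank in the job
    ${PADB:-padb} "$1" --proc-summary
    # Stack trace of the selected rank
    ${PADB:-padb} "$1" --stack-trace --tree --rank "$2"
    # Pending sends and receives
    ${PADB:-padb} "$1" --mpi-queue
}
```

After loading the padb module and defining the function in your shell, something like padb_snapshot <jobid> 0 > snapshot.txt would collect the three reports for rank 0 in one file.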


Totalview

Totalview is an all-purpose debugger, particularly suitable for MPI, OpenMP and threaded code. Totalview is a product of Rogue Wave Software.

Usage

To set up your paths to use Totalview, load the module:

module load totalview/8.7.0-3

For more details on modules see our modules help guide.

Totalview can either be run interactively or under the batch system. Please remember to add the PBS flag -l software=totalview to your job submission script.

Single Node Jobs

To debug serial or OpenMP processes, compile as normal but with the -g option added and give your executable name as an argument to totalview. If your executable takes arguments, put them after a -a:

cc -g -o job.exe job.c 
totalview ./job.exe -a {arguments for job.exe}

Alternatively, you can set arguments in the Startup Parameter box that appears first. Set breakpoints in the source by clicking on the line numbers. Press G over the Totalview source window to start running, or step through the code using S or N. You can also use the arrows at the top of the window.

MPI Jobs

To use Totalview on VU, consider the following example of a 4-processor MPI program:

mpicc -g -o mpiprog mpiprog.c
mpirun --debug -n 4 mpiprog -a {arguments for mpiprog}

Press G over the totalview mpirun window or click the green Go arrow. This starts up mpirun which initiates the MPI processes. Totalview then acquires these and asks if you would like to stop them immediately at startup. Answer yes so that you can set breakpoints.

You should now see a totalview debugging window with the source code displayed. If you are not seeing the code then you should add the correct paths to File->Search Path in the top toolbar.  

Set breakpoints by clicking on the line number. 

From here on, you can use the coloured buttons to step through the code or run it. If you wish to look at individual threads then choose the options under Thread in the top toolbar. You can also use capital letter options (menu under Process) to control all threads and processes simultaneously, such as S or N. Lower case options only control the "current" thread/process. At the end of the debugging of a single job execution the debugger ends up in an assembler view of mpirun. Just type G again to get back to the main debug window.

More details on how to use Totalview, in particular on how to view variables, can be found in the NCI NF user guide and on the ETNUS Totalview web site.

MPI Jobs on more than 16 cpus

To use Totalview on more than 16 cpus you need to replace the default use of rsh to start up Totalview processes on different nodes. To do this, start Totalview as for smaller MPI jobs, then choose File -> Preferences and select Launch Strings. Then, in Enable single debug server launch, change %C to /opt/pbs/default/bin/pbs_tmrsh and remove both of the double quotes. Click OK, then continue as above.

Memory Debugging

If you want to use Totalview for memory debugging, you need to relink your code with the following options: -L$TVLIB -ltvheap_64 -Wl,-rpath,$TVLIB
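
As an illustration of where these options go on a compile and link line, here is a minimal sketch. It assumes that TVLIB is set in your environment (for example by the totalview module), that cc is your compiler, and that you are building a single C source file. The function name tv_heap_link and the DRY_RUN variable are hypothetical helpers for this example only.

```shell
# tv_heap_link <source file> <output executable>
# Compiles with -g and links against the Totalview heap library using
# the options listed above.
# DRY_RUN is an assumption of this sketch: set DRY_RUN=1 to print the
# command instead of running it.
tv_heap_link() {
    if [ -z "${TVLIB:-}" ]; then
        echo "TVLIB is not set; load the totalview module first" >&2
        return 1
    fi
    ${DRY_RUN:+echo} cc -g -o "$2" "$1" -L"$TVLIB" -ltvheap_64 -Wl,-rpath,"$TVLIB"
}
```

For example, DRY_RUN=1 tv_heap_link job.c job.exe shows the full command so that you can copy the options into your own Makefile or build script.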

If a program is crashing and you want to use the debugger for a traceback, ensure that you compile with the -g flag and remove your shell's core file size limit. The limit should be changed in the batch script, before the line that runs the program executable.

csh syntax: limit coredumpsize unlimited
bash syntax: ulimit -c unlimited

Then run totalview using the coredump file.
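
The batch script side of this can be sketched as follows. This is an illustration under stated assumptions, not a required form: ./job.exe is the hypothetical executable name from the single node example, and instead of requesting "unlimited" outright the script raises the soft core file limit to whatever the hard limit allows, which also works on systems where the hard limit is finite.

```shell
#!/bin/bash
# (PBS directives for your job go here)

# Raise the soft core file size limit up to the hard limit.
# Where the hard limit is unlimited this has the same effect as
# "ulimit -c unlimited".
ulimit -S -c "$(ulimit -H -c)"
echo "core file size limit: $(ulimit -S -c)"

# Run the program only after the limit has been changed.
# ./job.exe is a hypothetical name; replace it with your executable.
exe=./job.exe
if [ -x "$exe" ]; then
    "$exe"
else
    echo "executable $exe not found" >&2
fi
```

If the program crashes and leaves a core file, Totalview is normally started with the executable followed by the core file name, for example totalview ./job.exe <corefile>. Check the Totalview documentation for the exact form on this version.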

Compiling with -g together with optimization levels higher than -O2 may hide some variable information from Totalview, so it may not be able to display the current values of those variables.

This version of Totalview is licensed with 1024 tokens, so it can run jobs with up to approximately 1020 processors. We are also licensed for the Replay Engine, which supports reverse debugging.