MP_Lite: Debugging help

One of the frustrating things about most message-passing libraries is their lack of feedback when something goes wrong. Error messages, when they are returned at all, are often too cryptic to be useful.

MP_Lite has many ways of helping the user to debug a problem, whether it is in the application code or a problem with the hardware/software environment itself. I consider the user-friendly nature of MP_Lite to be at least as important as the improvements to the communication performance. This can go a long way toward cutting down the debugging time during code development.

Status file

The mpstat command prints the status file for a run, which gives a single-line description of the current state of each process. Handshaking information is shown during a TCP startup, which may help diagnose a problem such as an unreachable machine that needs to be rebooted. If a node waits longer than the panic time of 1 minute for a message, a stalled warning is posted for that node. If a node waits longer than the bail time of 5 minutes, a bailed error is posted and an abort signal is logged that the other nodes will read the next time they check the status file. In this way, a stalled message on one node can result in a graceful abort of the run, with the error logged in the status file and in the .nodeX log files, along with a dump of the pending messages in each queue on each node.
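The timeout rules described above can be sketched as a small classification function. This is only an illustration of the logic, not MP_Lite's source: the names classify_wait, wait_label, wait_state, PANIC_SECS and BAIL_SECS are hypothetical, as is the "ok" label. Only the 1-minute and 5-minute thresholds and the words "stalled" and "bailed" come from the text.

```c
#include <stdio.h>

/* Thresholds taken from the text; the macro names are hypothetical. */
#define PANIC_SECS  60.0   /* 1 minute: a "stalled" warning is posted */
#define BAIL_SECS  300.0   /* 5 minutes: a "bailed" error is posted   */

typedef enum { WAIT_OK, WAIT_STALLED, WAIT_BAILED } wait_state;

/* Classify how long a node has been waiting for a message to arrive. */
wait_state classify_wait(double waited_secs)
{
    if (waited_secs > BAIL_SECS)  return WAIT_BAILED;
    if (waited_secs > PANIC_SECS) return WAIT_STALLED;
    return WAIT_OK;
}

/* Word that would appear in the one-line status entry for a node.
   "ok" is a placeholder chosen for this sketch. */
const char *wait_label(wait_state s)
{
    switch (s) {
    case WAIT_STALLED: return "stalled";
    case WAIT_BAILED:  return "bailed";
    default:           return "ok";
    }
}
```

For example, wait_label(classify_wait(120.0)) yields "stalled", while a wait of 301 seconds is classified as "bailed", the point at which the text says an abort signal is logged for the other nodes.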

Log files

Each process generates a .nodeX log file in the current working directory where X is the relative node number (0 .. nprocs-1). Handshaking information is written to these log files as the processes start up, warning and error messages are logged during the run, and some statistical information may be logged at the end of a run.

While the mpstat command provides an overview of the status of all the nodes, the individual log files provide more detailed information about each process. As mentioned above, if a run is aborted due to a message time-out, or a control-C or kill command is issued, the current state of the incoming and outgoing message queues will be dumped to these log files if possible. This can be very helpful in determining the state of the message-passing system that caused the lockup.

Some statistical information is printed to the log files at the end of runs on some systems. This can be useful in optimizing code by providing a clue to whether data is being buffered, for instance. If this is the case, then preposting receives may prevent buffering of data and increase the communication rate, as well as decreasing the memory usage.

Simple debugging

As explained in the timing section, MP_Lite has two functions that are useful for timing a section of code, which may be an entire subroutine or function. The MP_Enter() function is put at the beginning of the section of interest, and the matching MP_Leave() function goes at the end. A common use for these is to put them at the beginning and end of all major subroutines and functions in a code in order to get a breakdown of the time spent in each section.

These statements can also be compiled to print a statement to the screen each time they are encountered in a run. To do this, you may need to edit the all: tcp line of the makefile to choose the appropriate system type if other than tcp. Then simply type make debug1 to compile in the debugging statements.

When the code is run, an MP_Sync() is done at the start of each MP_Enter('text') function to provide a barrier synchronization, then node 0 prints the statement -->ENTER text to the screen. The matching MP_Leave(0.0) function will print -->LEAVE text to the screen. This provides a very quick method for determining the section of code where the problem is occurring, which should at least help narrow down the problem.
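The enter/leave tracing described above can be imitated in a few lines of C. This is a single-process sketch under stated assumptions, not the MP_Lite implementation: trace_enter and trace_leave are hypothetical stand-ins for the debug1 behaviour of MP_Enter() and MP_Leave(), the barrier that MP_Sync() provides is omitted, and the use of a stack to pair each leave with its enter is a guess at how the matching name could be recovered.

```c
#include <stdio.h>

#define TRACE_MAX_DEPTH 64

/* Names of the sections currently entered, innermost last. */
static const char *trace_stack[TRACE_MAX_DEPTH];
static int trace_depth = 0;

/* Stand-in for MP_Enter(name) compiled with debug1: remember the section
   name and let node 0 announce it. Returns -1 if nesting is too deep. */
int trace_enter(FILE *out, int node, const char *name)
{
    if (trace_depth >= TRACE_MAX_DEPTH) return -1;
    trace_stack[trace_depth++] = name;
    if (node == 0) fprintf(out, "-->ENTER %s\n", name);
    return 0;
}

/* Stand-in for the matching MP_Leave(): pop the innermost section and let
   node 0 announce it. Returns -1 if there is no section to leave. */
int trace_leave(FILE *out, int node)
{
    if (trace_depth <= 0) return -1;
    const char *name = trace_stack[--trace_depth];
    if (node == 0) fprintf(out, "-->LEAVE %s\n", name);
    return 0;
}
```

If a run hangs, the last -->ENTER line without a matching -->LEAVE line names the section to inspect first. Only node 0 prints, so the screen shows one line per event rather than one per process.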

Advanced debugging of the message-passing system

You should only need to use this if you are doing development work with MP_Lite to expand its functionality. There are built-in debugging statements in most of the modules. To turn them on, modify the all: tcp line of the makefile if necessary, then use make debug2 or make debug3 to compile the MP_Lite library. The debug2 option prints information to the .nodeX log files, while the debug3 option prints the same kind of information in more detail.

A lot of information is printed out, so the log files can become large very quickly. Deciphering the data is clearly not for the faint of heart. One method that can help is to turn the debugging on only for the section of code where a bug is suspected. This can be done using the MP_Set() function as illustrated below. MP_Set() with a variable of debug and a value of 0 to 3 will change the debug level during a run, and therefore control the amount of information that is printed to the log files or the screen.
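The idea of raising the debug level only around a suspect section can be sketched with a level-gated logger. Everything here is a hypothetical stand-in written for illustration: set_debug_level plays the role that MP_Set with the debug variable plays in MP_Lite, debug_log is an invented helper, and clamping out-of-range levels to 0..3 is an assumption rather than documented behaviour.

```c
#include <stdio.h>

/* Current debug level: 0 = quiet, 3 = most detailed. */
static int debug_level = 0;

/* Stand-in for setting the debug variable at run time. Values outside
   0..3 are clamped (an assumption). Returns the previous level so the
   caller can restore it after the suspect section. */
int set_debug_level(int level)
{
    int previous = debug_level;
    if (level < 0) level = 0;
    if (level > 3) level = 3;
    debug_level = level;
    return previous;
}

/* Write msg only if the current level is at least min_level.
   Returns 1 if the message was written, 0 if it was suppressed. */
int debug_log(FILE *out, int min_level, const char *msg)
{
    if (debug_level < min_level) return 0;
    fprintf(out, "[debug%d] %s\n", min_level, msg);
    return 1;
}
```

A typical pattern is to save the value returned by set_debug_level(3) just before the suspect code and pass it back to set_debug_level() just after, so that the log files grow only while that section runs.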

Fortran debugging functions

CALL MP_Enter('Function name')
CALL MP_Leave(0.0d0)

CALL MP_Set('debug', level)

C debugging functions

MP_Enter("Function name");
MP_Leave(0.0);

MP_Set("debug", level);   /* level = 0 .. 3 */