One of the frustrating things about most message-passing libraries is
their lack of feedback when something goes wrong. Error messages, when
returned, are often cryptic, if they are useful at all.
MP_Lite has many ways of helping the user to debug a problem, whether
it is in the application code or a problem with the hardware/software
environment itself. I consider the user-friendly nature of MP_Lite to be
at least as important as the improvements to the communication performance.
This can go a long way toward cutting down debugging time during code
development.

The mpstat command prints the status file for a run, which gives a
one-line description of the current state of each process. Handshaking
information is shown during a TCP startup, which may help to diagnose a
problem if one machine is unreachable and needs to be rebooted. If a
node waits past the panic time of 1 minute for a message, a stalled
warning is posted for that node. If a node waits past the bail time of
5 minutes for a message to arrive, a bailed error is posted and an abort
signal is logged that other nodes can read the next time they check the
status file.
In this way, a stalled message on one node may result in a graceful
abort of the run, with the error logged in the status file and the
.nodeX log files along with a dump of the pending messages in each queue
on each node.
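
The two time-outs described above can be pictured with a minimal sketch. This is not MP_Lite source code: the names classify_wait, wait_status, PANIC_TIME_S and BAIL_TIME_S are hypothetical, and only the 1 minute and 5 minute thresholds come from the text.

```c
#include <assert.h>

/* Thresholds taken from the text: 1 minute panic time, 5 minute bail time. */
#define PANIC_TIME_S 60.0
#define BAIL_TIME_S  300.0

/* Hypothetical status codes for illustration; not MP_Lite identifiers. */
typedef enum { WAIT_OK, WAIT_STALLED, WAIT_BAILED } wait_status;

/* Classify how long a node has been waiting for a message to arrive. */
wait_status classify_wait(double waited_s)
{
    assert(waited_s >= 0.0);              /* a wait time cannot be negative */

    if (waited_s > BAIL_TIME_S)
        return WAIT_BAILED;               /* error posted, abort signal logged */
    if (waited_s > PANIC_TIME_S)
        return WAIT_STALLED;              /* warning posted, run continues */
    return WAIT_OK;
}
```

A node that has waited 90 seconds would be reported as stalled under this sketch, and one that has waited 6 minutes as bailed.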
Each process generates a .nodeX log file in the current working directory
where X is the relative node number (0 .. nprocs-1).
Handshaking information is written to
these log files as the processes start up, warning and error messages
are logged during the run, and some statistical information may be
logged at the end of a run.
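
To make the naming scheme concrete, here is a small sketch that builds the log file name for a given node. The helper node_log_name is hypothetical and is not part of MP_Lite; it only assumes the .nodeX pattern stated above, with X written as a plain decimal number.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write ".nodeX" for the given relative node number
   into buf. Returns 0 on success, -1 if buf is too small to hold the name. */
int node_log_name(char *buf, size_t len, int node)
{
    assert(node >= 0);                    /* node numbers run 0 .. nprocs-1 */

    int n = snprintf(buf, len, ".node%d", node);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

With this helper, node 0 maps to .node0 and node 12 maps to .node12, so a script can open the log for whichever node mpstat reports as stalled.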
While the mpstat command provides an overview of the status of
all the nodes, the individual log files provide more detailed information
about each process. As mentioned above, if a run is aborted due to a
message time-out, or a control-C or kill command is issued, the current
state of the incoming and outgoing message queues will be dumped to these
log files if possible. This can be very helpful in determining the state
of the message-passing system that caused the lockup.
Some statistical information is printed to the log files at the end
of runs on some systems. This can be useful in optimizing code by
providing a clue to whether data is being buffered, for instance. If
this is the case, then preposting receives may prevent buffering of data
and increase the communication rate, as well as decreasing the memory
usage.

As explained in the timing section,
MP_Lite has two functions that are useful for timing a section of
code, which may be an entire subroutine or function.
The MP_Enter() function is put at the beginning of the section of
interest, and the matching MP_Leave() function goes at the end.
A common use for these is to put them at the beginning and end of all
major subroutines and functions in a code in order to get a breakdown
of the time spent in each section.
These calls can also be compiled so that they print a line to the
screen each time they are encountered in a run.
To do this, you may need to edit
the all: tcp line of the makefile to choose the appropriate system type
if other than tcp. Then simply type make debug1 to compile in
the debugging statements.
When the code is run, an MP_Sync() is done at the start of each
MP_Enter('text') function to provide a barrier synchronization, then node 0
prints the statement -->ENTER text to the screen. The
matching MP_Leave(0.0) function will print -->LEAVE text
to the screen. This provides a very quick way to find the section of
code where the problem is occurring, which should at least narrow the
search.
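
The sketch below shows how the bracketing might look in C and what the resulting trace would be for nested sections. The MP_Enter() and MP_Leave() bodies here are stand-ins written so that the example compiles without MP_Lite; they are not the library code, their argument types are assumed from the calls shown above, and they skip the MP_Sync() barrier and the node 0 check because the sketch runs as a single process. The routines timestep and solve, and the trace buffer, are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32

/* Stand-in state: stack of open section names and a copy of the output. */
static const char *section_stack[MAX_DEPTH];
static int depth = 0;
static char trace[256];

/* Print one trace line and keep a copy so it can be inspected later. */
static void emit(const char *tag, const char *name)
{
    char line[64];
    snprintf(line, sizeof line, "-->%s %s\n", tag, name);
    fputs(line, stdout);
    strncat(trace, line, sizeof trace - strlen(trace) - 1);
}

/* Stand-in for MP_Enter(): remember the section name and announce it. */
void MP_Enter(const char *name)
{
    assert(depth < MAX_DEPTH);            /* sketch only tracks 32 levels */
    section_stack[depth++] = name;
    emit("ENTER", name);
}

/* Stand-in for MP_Leave(): announce the most recently entered section.
   The argument is ignored here; 0.0 is passed as in the text. */
void MP_Leave(double value)
{
    (void)value;
    assert(depth > 0);                    /* every leave needs a matching enter */
    emit("LEAVE", section_stack[--depth]);
}

/* Hypothetical application routines bracketed with the two calls. */
void solve(void)
{
    MP_Enter("solve");
    /* ... computation and message passing would go here ... */
    MP_Leave(0.0);
}

void timestep(void)
{
    MP_Enter("timestep");
    solve();
    MP_Leave(0.0);
}
```

Calling timestep() once prints ENTER timestep, ENTER solve, LEAVE solve, LEAVE timestep in that order. If a run hangs, the last ENTER line without a matching LEAVE names the section to look at first.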
Advanced debugging of the message-passing system
You should only need to use this if you are doing development work
with MP_Lite to expand its functionality. There are built-in debugging
statements in most of the modules that can be turned on by modifying the
all: tcp line of the makefile if necessary and then using
make debug2 or make debug3 to compile the MP_Lite library.
The debug2 option prints information to the .nodeX log files, while the
debug3 option simply prints more detailed information.
A lot of information is printed out, so the log files can become
large very quickly. Deciphering the data is clearly not for the faint of
heart. One method that can help is to turn the debugging on only for
the section of code where a bug is suspected. This can be done using the
MP_Set() function as illustrated below. Calling MP_Set() with the
variable debug and a value from 0 to 3 changes the debug level during a
run, and therefore controls the amount of information that is printed
to the log files or the screen.
Fortran debugging functions
CALL MP_Enter('Function name')
CALL MP_Set('debug', level)
C debugging functions
MP_Set("debug", level) /* level = 0 .. 3 */
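
As a rough illustration of turning debugging on only around a suspect section, the sketch below raises the level before one routine and lowers it again afterwards. The MP_Set() body is a stand-in that merely records the value so the example compiles without MP_Lite; its argument types are assumed from the call shown above. The routines suspect_exchange and run_step, and the variables debug_level and level_seen_inside, are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the library's internal debug setting. */
static int debug_level = 0;

/* Stand-in for MP_Set(): handles only the "debug" variable from the text. */
void MP_Set(const char *var, int value)
{
    if (strcmp(var, "debug") == 0) {
        assert(value >= 0 && value <= 3); /* valid levels are 0 .. 3 */
        debug_level = value;
    }
}

/* Records the level in effect inside the suspect routine, for inspection. */
static int level_seen_inside = -1;

/* Hypothetical routine where the bug is suspected. */
void suspect_exchange(void)
{
    level_seen_inside = debug_level;
    /* ... the message-passing calls under suspicion would go here ... */
}

void run_step(void)
{
    MP_Set("debug", 3);                   /* detailed output for this section */
    suspect_exchange();
    MP_Set("debug", 0);                   /* quiet again for the rest of the run */
}
```

Keeping the high debug level confined to one routine in this way limits how quickly the log files grow.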