MP_Lite: Tracing the communications

MP_Lite has built-in trace facilities that can dump the start and finish times of each communication to a .trace.X file for each node X. Below is an example of a trace dump from node 0, which is communicating with node 1.

   3.363730 -  3.363903  nd 0  -->  nd 1          4 bytes / 173 usec
   3.363913 -  3.363984  nd 0  <--  nd 1          4 bytes / 71 usec
   3.363994 -  3.367812  nd 0  -->  nd 1     100000 bytes / 3818 usec
   3.367823 -  3.368922  nd 0  <--  nd 1     100000 bytes / 1098 usec
   3.368934 -  3.370115  nd 0  -->  nd 1      22304 bytes / 1181 usec
   3.370127 -  4.747949  nd 0  <--  nd 1      22304 bytes / 1377822 usec

The right arrow --> indicates a send from node 0 to node 1, while the left arrow <-- indicates a receive from node 1. Each line shows the start and end times of the communication, the number of bytes transferred, and the elapsed time in microseconds. In the case above, the 4-byte exchanges are a manual handshake that guarantees the receive is preposted before the larger data exchanges. One receive took 1.378 seconds because a packet was dropped and had to be retransmitted after the time-out period. This trace file helped identify that TCP on AIX was dropping packets, which eventually led to a fix.
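Reading a long trace by eye gets tedious, so a small script can pick out stalls like the 1.378 second receive. The following is a minimal Python sketch, not part of MP_Lite. The line pattern is inferred only from the sample dump above, and the names parse_trace_line and slow_events and the 100000 usec threshold are illustrative assumptions.

```python
import re

# Pattern for one trace line, inferred from the sample dump only;
# the real format may differ between MP_Lite versions.
TRACE_LINE = re.compile(
    r"\s*(?P<start>\d+\.\d+)\s*-\s*(?P<end>\d+\.\d+)\s+"
    r"nd\s+(?P<node>\d+)\s+(?P<arrow>-->|<--)\s+nd\s+(?P<peer>\d+)\s+"
    r"(?P<nbytes>\d+)\s+bytes\s*/\s*(?P<usec>\d+)\s+usec"
)


def parse_trace_line(line):
    """Return a dict describing one trace line, or None if it does not match."""
    m = TRACE_LINE.match(line)
    if m is None:
        return None
    return {
        "start": float(m["start"]),
        "end": float(m["end"]),
        "node": int(m["node"]),
        "peer": int(m["peer"]),
        "direction": "send" if m["arrow"] == "-->" else "recv",
        "nbytes": int(m["nbytes"]),
        "usec": int(m["usec"]),
    }


def slow_events(lines, threshold_usec=100_000):
    """Return parsed events that took at least threshold_usec (hypothetical cutoff)."""
    events = (parse_trace_line(line) for line in lines)
    return [e for e in events if e is not None and e["usec"] >= threshold_usec]


# The sample dump from node 0 shown above.
SAMPLE = """\
   3.363730 -  3.363903  nd 0  -->  nd 1          4 bytes / 173 usec
   3.363913 -  3.363984  nd 0  <--  nd 1          4 bytes / 71 usec
   3.363994 -  3.367812  nd 0  -->  nd 1     100000 bytes / 3818 usec
   3.367823 -  3.368922  nd 0  <--  nd 1     100000 bytes / 1098 usec
   3.368934 -  3.370115  nd 0  -->  nd 1      22304 bytes / 1181 usec
   3.370127 -  4.747949  nd 0  <--  nd 1      22304 bytes / 1377822 usec
"""

if __name__ == "__main__":
    for event in slow_events(SAMPLE.splitlines()):
        print(event["direction"], event["nbytes"], "bytes", event["usec"], "usec")
```

On the sample dump, only the final 22304-byte receive exceeds the cutoff. A fixed threshold is the simplest choice; comparing each event against others of the same size would be more robust but needs more data than one short trace provides.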

To use the trace facilities, edit the makefile to change the all: tcp line if needed, then type make trace and link the resulting library into your code. Run the code as you normally would; at completion there should be a .trace.X file for each node X. The times may not line up exactly between nodes, but the clocks are started at roughly the same time in the MP_Init() function.
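After a run, it can be handy to collect the per-node files before inspecting them. The Python sketch below is an illustration, not part of MP_Lite. It assumes the files are named exactly .trace.X with X a decimal node number and that they sit in a single directory; the helper name find_trace_files is hypothetical.

```python
import os
import re


def find_trace_files(directory="."):
    """Map node number -> path for each file named .trace.X in directory.

    Assumes X is a decimal node number; adjust the pattern if your
    build names the files differently.
    """
    found = {}
    for name in os.listdir(directory):
        m = re.fullmatch(r"\.trace\.(\d+)", name)
        if m:
            found[int(m.group(1))] = os.path.join(directory, name)
    # Sort by node number so nodes are reported in order.
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for node, path in find_trace_files().items():
        print("node", node, "->", path)
```

Because the file names begin with a dot, they are hidden from a plain ls on Unix-like systems; listing the directory from a script sidesteps that.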