MP_Lite: Problems???

ssh/rsh

You must be able to ssh or rsh to all the machines you are trying to run on without being prompted for a password. Simply try ssh other_host ls, or the same with rsh, to see if you get a list of the files in your home directory on the other machine. If this does not work, you will need to talk with your system administrator or read the man pages on ssh/rsh. For rsh, you may just need to create a .rhosts file in your home directory that lists the machines you will be accessing. You might also try pinging the remote machine as a sanity check.

If you are going to use rsh instead of ssh, edit the first few lines of the bin/mprun script to choose rsh. In some cases, we have needed to turn off X11 and authentication agent forwarding by using ssh -x -a. Do this if you get .Xauthority errors.
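The password-free login check can be scripted so that it fails quickly instead of hanging at a prompt. The following is a minimal Python sketch, not part of MP_Lite. The function names are hypothetical, and the BatchMode and ConnectTimeout settings are standard OpenSSH options added here as an assumption so that ssh refuses to prompt for a password.

```python
import subprocess


def build_check_command(host, remote_shell="ssh", no_forwarding=False):
    """Return the argument list for a non-interactive 'ls' on host."""
    cmd = [remote_shell]
    if remote_shell == "ssh":
        # BatchMode=yes makes ssh fail rather than ask for a password.
        cmd += ["-o", "BatchMode=yes", "-o", "ConnectTimeout=5"]
        if no_forwarding:
            # Same flags as 'ssh -x -a': no X11 or agent forwarding.
            cmd += ["-x", "-a"]
    cmd += [host, "ls"]
    return cmd


def can_login_without_password(host, remote_shell="ssh",
                               no_forwarding=False, timeout=15):
    """Return True if 'ls' runs on host without any prompt."""
    cmd = build_check_command(host, remote_shell, no_forwarding)
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait for typed input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, or the remote side never answered.
        return False
    return result.returncode == 0
```

Calling can_login_without_password("host1") for each machine you plan to pass to mprun should return True before you try a run. Pass remote_shell="rsh" if you have switched mprun to rsh.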

Compilation errors

MP_Lite has been tested on current versions of many flavors of Unix. To run on this variety of operating systems, it has to stick pretty close to ANSI standards. If you have problems compiling MP_Lite on a system or OS level that has not been tested, please send me (turner@ameslab.gov) the entire output from the compiler as well as the configuration of the machine (type, OS and level, etc.).

Run-time errors

The first test you should do is to compile MP_Lite and the pong program and run pong on two nodes. For two Unix machines, type make tcp for the most robust version, then make pong to compile and link pong.c. Use mprun -np 2 -h host1 host2 pong & to run the code on host1 and host2. You may also wish to run in SMP mode with two processes on one machine using mprun -np 2 -smp host1 pong &.

If pong freezes up for any reason, try using the mpstat command or looking in the .nodeX files as described in the debugging section to determine what the problem is. If pong runs to completion, it will provide you with a quick and dirty measure of the communication performance between the two machines.

Bail times

The mplite.h file has a parameter called BAIL_TIME that is set to 5 minutes by default. If a node waits longer than that for a message, it will exit gracefully and notify the other nodes so that they can abort too. If your code waits longer than this for messages to arrive, simply edit the mplite.h file to change BAIL_TIME, then do a make clean and recompile MP_Lite.
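The bail-out behavior can be pictured with a small sketch. This is not MP_Lite's code: it is a Python illustration of a wait loop that gives up after a deadline, and wait_for_message, poll, clock, sleep and interval are hypothetical names. It also leaves out the step where the real library notifies the other nodes.

```python
import time

# Mirrors the 5-minute default described above, expressed in seconds.
BAIL_TIME = 5 * 60


def wait_for_message(poll, bail_time=BAIL_TIME,
                     clock=time.monotonic, sleep=time.sleep,
                     interval=0.01):
    """Poll until a message arrives or bail_time seconds have passed.

    poll() returns a message, or None if nothing has arrived yet.
    Returns the message, or None if the wait exceeded bail_time.
    """
    deadline = clock() + bail_time
    while True:
        message = poll()
        if message is not None:
            return message
        if clock() >= deadline:
            # Waited too long: the caller should shut down cleanly.
            return None
        sleep(interval)
```

A code that legitimately waits longer than the limit between messages would hit the None branch, which is why raising BAIL_TIME is the fix in that case.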

Maximum number of nodes

The maximum number of nodes is initially set to 1024 in the mplite.h file. There is nothing wrong with changing this, but if you are trying to run MP_Lite on more than 1024 workstations you will need to consider other limitations, such as the fact that large TCP buffers are allocated for every other node. MP_Lite is not designed or tested for huge systems, so be aware of its limitations if you are pushing the envelope.
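To get a feel for how the per-node TCP buffers add up, here is a rough back-of-the-envelope sketch in Python. The 64 KiB buffer size and the assumption of one send plus one receive buffer per peer are placeholders for illustration only. They are not MP_Lite's actual values, so substitute the sizes from your own build.

```python
def tcp_buffer_bytes_per_process(nprocs, buffer_bytes, buffers_per_peer=2):
    """Estimate the buffer memory one process holds for all its peers.

    Each process keeps buffers for every other node, so the total grows
    linearly with the number of nodes in the run.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    return (nprocs - 1) * buffers_per_peer * buffer_bytes


# Hypothetical 64 KiB buffers on a 1024-node run, reported in MiB.
print(tcp_buffer_bytes_per_process(1024, 64 * 1024) / 2**20)  # prints 127.875
```

With larger buffers the same arithmetic quickly reaches gigabytes per process, which is the kind of limitation to check before raising the node count.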

Maximum number of active messages

Each module sets its own maximum number of active messages. This limit is set to keep the memory usage down, and can easily be increased if needed. If you receive a warning message in the .nodeX log files, simply edit the limit in that module, recompile the MP_Lite library, and relink it to your code.

All modules have the maximum number of active messages set to 10,000 by default. The shmem.c module has circular queue sizes set to 100, which limits the number of pending messages to or from a given node to 100. The experimental shm.c module has the maximum number of SMP processes set to 1000. The maximum number of segments is also set to 1000, which limits the number of active messages to the same value.
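The way a fixed-size circular queue caps the number of pending messages can be shown with a short sketch. This is a generic Python illustration, not the code in shmem.c, and the class and method names are hypothetical. Only the default capacity of 100 comes from the text above.

```python
class CircularQueue:
    """Fixed-capacity FIFO queue stored in a ring of slots."""

    def __init__(self, capacity=100):
        self._slots = [None] * capacity
        self._head = 0    # index of the oldest pending entry
        self._count = 0   # number of entries currently pending

    def __len__(self):
        return self._count

    def push(self, item):
        """Add item; return False if the queue is already full."""
        if self._count == len(self._slots):
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = item
        self._count += 1
        return True

    def pop(self):
        """Remove and return the oldest item, or None if empty."""
        if self._count == 0:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item
```

Once 100 entries are pending, push returns False until an earlier entry is consumed, which is the situation in which a larger queue size would be needed.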