MP_Lite:
Problems???
ssh/rsh
You must be able to ssh or rsh to all the machines you are trying to
run on without being prompted for a password. Simply try
ssh other_host ls or the same with rsh to see if you get a
list of the files in your home directory on the other machine.
If this does not work, you will need to talk with your system administrator
or read the man pages on ssh/rsh. If you are using rsh, you may just need
to create a .rhosts file in your home directory that includes a list of the
machines that you will be accessing. You might also try pinging the remote
machine as a sanity check.
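The check described above can be scripted for a list of hosts. The following is a minimal sketch, assuming OpenSSH: the BatchMode=yes option makes ssh fail rather than prompt for a password, and ConnectTimeout bounds the wait for an unreachable machine. The function name check_hosts and the host names are placeholders, not part of MP_Lite.

```shell
#!/bin/sh
# Sketch: report which hosts accept a non-interactive ssh login.
# check_hosts is a hypothetical helper, not part of MP_Lite.
check_hosts() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes: fail instead of prompting for a password.
        # stdin comes from /dev/null so ssh never waits on the terminal.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" ls \
               </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED (password prompt, unreachable, or ssh error)"
            failed=1
        fi
    done
    return "$failed"
}

# Check any hosts given on the command line,
# e.g.: sh check_hosts.sh host1 host2
if [ "$#" -gt 0 ]; then
    check_hosts "$@"
fi
```

A host reported as FAILED here would most likely also cause mprun to hang or prompt, so it is worth fixing before going further.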
If you are going to use rsh instead of ssh, edit the first few
lines in the bin/mprun script to choose rsh. In some cases, we have
needed to turn off X11 forwarding and authentication agent forwarding by
using ssh -x -a. Do this if you get .Xauthority errors.
Compilation errors
MP_Lite has been tested on current versions of many flavors of Unix.
In order to run on this variety of operating systems, it has to stick
pretty close to ANSI standards. If you have problems compiling MP_Lite on a
system or OS level that has not been tested, please send me
(turner@ameslab.gov) the entire output from the compiler as well as the
configuration of the machine (type, OS and level, etc.).
Run-time errors
The first test you should do is to compile MP_Lite and the pong program
and run pong on two nodes. For two Unix machines, type make tcp for
the most robust version, then make pong to compile and link pong.c.
Use mprun -np 2 -h host1 host2 pong & to run the code on
host1 and host2. You may also wish to run in SMP mode with
two processes on one machine using mprun -np 2 -smp host1 pong &.
If pong freezes up for any reason, try using the mpstat command
or looking in the .nodeX files as described in the
debugging section to determine what the
problem is. If pong runs to completion, it will provide you with a
quick and dirty measure of the communication performance between the
two machines.
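The build-and-run sequence above can be collected into a small script. This is a sketch only: the run wrapper, the DRY_RUN variable, and the pong_test function are hypothetical conveniences, and it assumes it is started from the MP_Lite directory with mprun on the PATH. Unlike the commands in the text, it runs pong in the foreground so that a failure stops the sequence.

```shell
#!/bin/sh
# run: print the command when DRY_RUN is set, otherwise execute it.
# (Hypothetical helper for previewing the steps without building.)
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# pong_test HOST1 HOST2: build the TCP version of MP_Lite, build pong,
# then run pong on the two hosts. Stops at the first step that fails.
pong_test() {
    run make tcp &&
    run make pong &&
    run mprun -np 2 -h "$1" "$2" pong
}
```

Setting DRY_RUN=1 before calling pong_test host1 host2 prints the three commands instead of running them, which is a cheap way to confirm the host names before starting a real run.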
Bail times
The mplite.h file has a parameter called BAIL_TIME that
is set to 5 minutes by default. If a node waits longer than that, it
exits gracefully and notifies the other nodes so that they can abort too.
If your code waits longer than this for messages to arrive, edit the
mplite.h file to change BAIL_TIME, then run make clean and recompile
MP_Lite.
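One way to locate the setting before editing it is to search the header. This is a sketch, assuming it is run from the directory containing mplite.h; show_param is a hypothetical helper. The exact form and units of the BAIL_TIME line are not reproduced here, so inspect the line in your own copy and edit it by hand instead of relying on an automated substitution.

```shell
#!/bin/sh
# show_param NAME [FILE]: print the lines (with line numbers) that mention
# NAME. Hypothetical helper; FILE defaults to mplite.h in the current
# directory.
show_param() {
    grep -n "$1" "${2:-mplite.h}"
}

# Typical sequence, shown as comments because it modifies the build:
#   show_param BAIL_TIME      # find the line to edit
#   ${EDITOR:-vi} mplite.h    # change the value by hand
#   make clean                # discard objects built with the old value
#   make tcp                  # rebuild MP_Lite (the tcp target used earlier)
```

Running make clean first matters because object files compiled against the old header would otherwise keep the old value.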
Maximum number of nodes
The maximum number of nodes is initially set to 1024 in the mplite.h
file. There is nothing wrong with changing this, but if you are trying
to run MP_Lite on more than 1024 workstations you will need to consider
other limitations, such as the large TCP buffers that are allocated
for every other node. MP_Lite is not designed or tested for huge systems,
so be aware of its limitations if you are pushing the envelope.
Maximum number of active messages
Each module sets its own maximum number of active messages. This
limit is set to keep the memory usage down, and can easily be increased
if needed. If you receive a warning message in the .nodeX log files,
simply edit the limit in that module, recompile the MP_Lite library, and
relink it to your code.
All modules have the maximum number of active messages set to 10,000 by
default.
The shmem.c module has circular queue sizes set to 100 that would limit
the number of pending messages to or from a given node to 100.
The experimental shm.c module has the maximum number of SMP processes
set to 1000. The maximum number of segments is also set to 1000, which
limits the number of active messages to 1000 as well.
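To check whether any node logged a warning about these limits, the .nodeX files can be searched in one pass. This is a sketch, assuming the log files are named .node0, .node1, and so on in the run directory and that the message contains the word "warning" in some capitalization. Neither detail is confirmed here, so adjust the pattern to match what your logs actually show. scan_node_logs is a hypothetical helper.

```shell
#!/bin/sh
# scan_node_logs [DIR]: print matching lines from the .node* log files in
# DIR (default: current directory), prefixed by file name and line number.
scan_node_logs() {
    dir="${1:-.}"
    # The extra /dev/null argument makes grep print file names even when
    # only one log file exists. Errors (e.g. no logs found) are suppressed.
    grep -in 'warning' "$dir"/.node* /dev/null 2>/dev/null
}
```

The function returns a non-zero status when nothing matches, so it can also be used in a shell conditional after a run.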