This page includes optimization tips for everything from the application
layer to the message-passing layer to, in the case of workstation/PC
clusters, the hardware. The application and message-passing optimization
information should apply to most parallel environments, while the hardware
discussions will be much more specific to a given type of system.
Tuning an application for maximum performance
The MP_Enter() and MP_Leave() functions provide an easy
way to measure the performance of both computation-intensive and
communication-intensive sections of a code. MP_Lite does as much as possible to
provide the highest level of performance to the interprocessor communications,
but it can only do as much as the application programmer allows it to.
MP_Lite achieves its high level of performance by streamlining the flow
of the message data, bypassing the usual buffering of the data when at all
possible. If the application programmer uses the standard MP_Send() and MP_Recv()
functions, there is no guarantee that the MP_Recv() will be preposted before
the data arrives at the node. When it is not, the data must be buffered,
which MP_Lite does on the source side.
This adds an extra memory-to-memory copy, can delay the start of the
actual transfer of data, and leads to extra memory usage.
For large messages, it is usually best to streamline this process by
guaranteeing that an MP_ARecv() is preposted before the MP_Send() occurs,
so that the data will go directly into user space without being buffered.
This can be done by adding a handshaking stage where the destination
node posts its MP_ARecv() then sends a dummy message to the source node
signaling that it is ready to receive the data. The source node blocks on
this dummy message, then sends the large message. This ensures that no
buffering is done, but does double the latency, so it is only useful
for large messages.
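The handshake described above can be sketched without MP_Lite itself. The following is a minimal illustration that uses a POSIX socketpair() and fork() in place of two nodes. All function names (write_all, read_all, source_send, dest_recv, handshake_demo) are hypothetical and are not part of MP_Lite. The kernel socket still buffers internally, so this shows only the ordering of the protocol, not the zero-copy benefit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write exactly n bytes, retrying on short writes. Returns 0 on success. */
static int write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t k = write(fd, buf, n);
        if (k <= 0) return -1;
        buf += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Read exactly n bytes, retrying on short reads. Returns 0 on success. */
static int read_all(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t k = read(fd, buf, n);
        if (k <= 0) return -1;
        buf += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Source side: block on the 1-byte "ready" token, then send the payload. */
static int source_send(int fd, const char *data, size_t n) {
    char ready;
    if (read_all(fd, &ready, 1) != 0) return -1;
    return write_all(fd, data, n);
}

/* Destination side: the user buffer already exists (the "preposted"
 * receive), so signal readiness and then read straight into it. */
static int dest_recv(int fd, char *user_buf, size_t n) {
    char ready = 'R';
    if (write_all(fd, &ready, 1) != 0) return -1;
    return read_all(fd, user_buf, n);
}

/* Child process plays the source, parent plays the destination.
 * Returns 0 if the payload arrived intact, -1 otherwise. */
int handshake_demo(size_t n) {
    int sv[2];
    assert(n > 0);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                      /* child: source node */
        close(sv[0]);
        char *data = malloc(n);
        if (!data) _exit(1);
        memset(data, 0x5A, n);
        int rc = source_send(sv[1], data, n);
        free(data);
        close(sv[1]);
        _exit(rc == 0 ? 0 : 1);
    }
    close(sv[1]);                        /* parent: destination node */
    char *user_buf = calloc(n, 1);
    int ok = (user_buf != NULL) && (dest_recv(sv[0], user_buf, n) == 0);
    for (size_t i = 0; ok && i < n; i++)
        if (user_buf[i] != 0x5A) ok = 0;
    free(user_buf);
    close(sv[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling handshake_demo(1 << 20) runs one exchange of a 1 MB payload. The extra round trip for the ready token is the added latency mentioned above, which is why the technique pays off only for large messages.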
MP_Lite has synchronous send and receive functions MP_SSend()
and MP_SRecv() that look exactly the same as their normal counterparts,
but do this handshaking automatically. Look at the code for each in mpcore.c
if you want to see exactly how they work. Since these are synchronous
functions, be careful to avoid deadlock. For example, if two
nodes both called MP_SSend() and then the matching MP_SRecv(), both
would block in MP_SSend(), each waiting for the other node
to signal it to go ahead with the data transfer.
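One common way to avoid this situation in a pairwise exchange is to have the two nodes start with opposite operations. The sketch below picks the order from the node numbers. The names first_step and exchange_is_safe are hypothetical helpers for illustration and are not MP_Lite functions.

```c
#include <assert.h>

enum step { STEP_SEND, STEP_RECV };

/* Decide which blocking operation a node performs first when it
 * exchanges data with a peer: the lower-numbered node sends first,
 * the higher-numbered node receives first. */
enum step first_step(int my_node, int peer_node) {
    assert(my_node != peer_node);
    return (my_node < peer_node) ? STEP_SEND : STEP_RECV;
}

/* The exchange cannot stall on the first call as long as the two
 * nodes do not both start with the same blocking operation. */
int exchange_is_safe(int node_a, int node_b) {
    return first_step(node_a, node_b) != first_step(node_b, node_a);
}
```

A node would then call MP_SSend() followed by MP_SRecv() when first_step() returns STEP_SEND, and the reverse order otherwise, so that every synchronous send meets a receive that is already waiting.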
Other message-passing libraries automatically buffer everything on the
destination side (at least once), so preposting receives does not always
improve performance much, if at all.
To get the maximum performance out of your system, you will need to use
the programming practices described above, together with a library such
as MP_Lite that allows you to take full advantage of them.
The MP_Lite receive functions can use a negative message tag to
indicate a wildcard that will match any tag. If message tags are used,
be careful to receive the messages on the destination node in the same
order that they were sent. Otherwise, these out-of-order messages will
cause the first message to be buffered on the destination side.
The receive functions can also use a negative source to indicate
a wildcard that will match a message from any source. In this case, only the
number of bytes and the tag are used to match the message. While this is
convenient at times, it
can be very costly. In the TCP module, this requires that the system search
all TCP buffers for an incoming message instead of just the active ones.
This programming method should be avoided if at all possible, and only
used where performance is not a factor.
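The matching rules described above can be sketched as follows. This is an illustrative model only: the structures msg_header and recv_request and the functions request_matches and find_match are hypothetical, and the assumption that the byte count always takes part in the match is inferred from the text rather than taken from the MP_Lite source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a message waiting on the destination node. */
struct msg_header {
    int source;
    int tag;
    size_t nbytes;
};

/* Hypothetical posted receive; a negative source or tag is a wildcard. */
struct recv_request {
    int source;
    int tag;
    size_t nbytes;
};

/* Return 1 if the request accepts the message, 0 otherwise. */
int request_matches(const struct recv_request *req,
                    const struct msg_header *msg) {
    if (req->nbytes != msg->nbytes) return 0;
    if (req->tag >= 0 && req->tag != msg->tag) return 0;
    if (req->source >= 0 && req->source != msg->source) return 0;
    return 1;
}

/* Scan the waiting messages in arrival order and return the index of
 * the first match, or -1 if none matches. With a wildcard source the
 * scan cannot be restricted to messages from a single node. */
int find_match(const struct recv_request *req,
               const struct msg_header *queue, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        if (request_matches(req, &queue[i])) return i;
    return -1;
}
```

When find_match() returns an index greater than zero, every message ahead of it arrived first but was not asked for yet, which corresponds to the out-of-order case where earlier messages have to be held in a buffer.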
Choosing a message-passing library
While MP_Lite offers better performance in some environments
and is easier to use, it is not a complete MPI implementation.
If you are starting from scratch, the MP_Lite
syntax is somewhat simpler to use and you may want to look into it.
I suspect most people will stick with the MPI syntax since it is a standard
and will provide the most flexibility if some of the more advanced MPI
functionality is needed.
Tuning the OS in workstation clusters
One of the most important lessons that we have learned repeatedly is
that you cannot just slap a cluster together and expect to get good performance.
You have to measure the performance at each stage, and in many cases tune
the system.
NetPIPE is a great utility for measuring the point-to-point communication performance,
and has been invaluable in tuning the performance and tracking down problems
in various layers of the network.
MP_Lite increases the TCP buffers to their maximum size and uses them
to buffer incoming and outgoing messages. If these buffers are large enough,
no additional buffering, which would require memory copies, is needed.
It is therefore beneficial to increase the maximum TCP buffer size to
around 1 MB if possible.
For Linux, the following lines can be put into the
/etc/sysctl.conf file to increase the maximum TCP buffer sizes.
# Increase the maximum socket buffer sizes (values in bytes; 524288 = 512 KB)
net.core.rmem_max = 524288
net.core.wmem_max = 524288
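To check what the settings above actually allow, a program can request a buffer size on a socket and read back what the kernel granted. The following is a minimal sketch using the standard setsockopt() and getsockopt() calls with SO_RCVBUF and SO_SNDBUF. The function name probe_tcp_buffers is hypothetical, and that MP_Lite sizes its buffers through these same calls is an assumption, not something taken from its source.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket, request `requested` bytes for both the receive
 * and send buffers, and report the sizes the kernel granted.
 * Returns 0 on success, -1 on any failure. */
int probe_tcp_buffers(int requested, int *granted_rcv, int *granted_snd) {
    assert(requested > 0);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int rc = -1;
    socklen_t len = sizeof(int);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof requested) == 0 &&
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof requested) == 0 &&
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, granted_rcv, &len) == 0) {
        len = sizeof(int);
        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, granted_snd, &len) == 0)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

On Linux the request is silently capped at net.core.rmem_max and net.core.wmem_max, and the value read back is double the granted amount because the kernel reserves space for its own bookkeeping. Comparing the result for a 1 MB request before and after editing /etc/sysctl.conf shows whether the new limits took effect.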
In general, the faster the network you are trying to build, the more
tuning you will need to do. Fast Ethernet cards and drivers are fairly
good, but most of the Gigabit Ethernet cards require some tuning.