MP_Lite: Optimizing a system

This page includes optimization tips for everything from the application layer down to the message-passing layer and, in the case of workstation/PC clusters, the hardware. The application and message-passing optimization information should apply to most parallel environments, while the hardware discussions are much more specific to a given type of system.

Tuning an application for maximum performance

The MP_Enter() and MP_Leave() functions provide an easy way to measure the performance of both computation-intensive and communication-intensive sections of a code. MP_Lite does as much as possible to deliver the highest level of interprocessor communication performance, but it can only do as much as the application programmer allows it to.
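For example, a compute section and a communication section can be bracketed separately so that each shows up as its own entry in the profile. The sketch below assumes MP_Enter() takes a descriptive name for the section and MP_Leave() takes an operation count used to report a rate (pass zero if no rate is wanted); check the MP_Lite source for the exact prototypes, and note that multiply() and exchange_boundaries() are just hypothetical application routines.

  MP_Enter("multiply");            /* start timing the compute section        */
  multiply(a, b, c, n);            /* hypothetical compute routine            */
  MP_Leave(2.0*n*n*n);             /* credit ~2*n^3 flops to this section     */

  MP_Enter("exchange");            /* time the communication separately       */
  exchange_boundaries(c, n);       /* hypothetical communication routine      */
  MP_Leave(0.0);                   /* no operation count for this section     */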

MP_Lite achieves its high level of performance by streamlining the flow of the message data, bypassing the usual buffering whenever possible. If the application programmer uses the standard MP_Send() and MP_Recv() functions, there is no guarantee that the MP_Recv() will be preposted before the data arrives at the node, so the data must be buffered (on the source side, in the case of MP_Lite). This adds an extra memory-to-memory copy, can delay the start of the actual data transfer, and uses extra memory.

For large messages, it is usually best to streamline this process by guaranteeing that an MP_ARecv() is preposted before the MP_Send() occurs, so that the data will go directly into user space without being buffered. This can be done by adding a handshaking stage where the destination node posts its MP_ARecv() then sends a dummy message to the source node signaling that it is ready to receive the data. The source node blocks on this dummy message, then sends the large message. This ensures that no buffering is done, but does double the latency, so it is only useful for large messages.
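The sketch below illustrates this handshake. It assumes MP_Lite calls of the form MP_Send(buf, nbytes, dest, tag), MP_Recv(buf, nbytes, source, tag), MP_ARecv(buf, nbytes, source, tag, &msg_id), and MP_Wait(&msg_id); check mpcore.c for the exact prototypes. The READY_TAG name and the myproc/src_node/dest_node variables are placeholders for this example.

  #define DATA_TAG   10
  #define READY_TAG  11

  int msg_id, ready = 0;

  if (myproc == dest_node) {
     /* Destination: prepost the receive, then tell the source to go. */
     MP_ARecv(big_buf, nbytes, src_node, DATA_TAG, &msg_id);
     MP_Send(&ready, sizeof(int), src_node, READY_TAG);
     MP_Wait(&msg_id);                    /* data lands directly in big_buf */
  } else if (myproc == src_node) {
     /* Source: block until the destination is ready, then send the data. */
     MP_Recv(&ready, sizeof(int), dest_node, READY_TAG);
     MP_Send(big_buf, nbytes, dest_node, DATA_TAG);
  }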

MP_Lite has synchronous send and receive functions, MP_SSend() and MP_SRecv(), that look exactly the same as their normal counterparts but do this handshaking automatically. Look at the code for each in mpcore.c if you want to see exactly how they work. Since these are synchronous functions, be careful to avoid deadlock conditions. For example, if two nodes each call MP_SSend() followed by the matching MP_SRecv(), both will block in MP_SSend() waiting for the other node to signal it to go ahead with the data transfer.
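One common way to avoid this is to order the calls by node number, so that one node sends first while its partner receives first. A minimal sketch, again assuming the same argument order as MP_Send()/MP_Recv():

  if (myproc < other_node) {
     MP_SSend(out_buf, nbytes, other_node, tag);   /* lower rank sends first  */
     MP_SRecv(in_buf,  nbytes, other_node, tag);
  } else {
     MP_SRecv(in_buf,  nbytes, other_node, tag);   /* higher rank receives first */
     MP_SSend(out_buf, nbytes, other_node, tag);
  }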

Other message-passing libraries automatically buffer everything on the destination side (at least once), so preposting receives may not improve performance much, if at all. To get the maximum performance out of your system, you will need to use better programming practices along with a library such as MP_Lite that allows you to take full advantage of these techniques.

The MP_Lite receive functions can use a negative message tag as a wildcard that will match any tag. If message tags are used, be careful to receive the messages on the destination node in the same order that they were sent. Otherwise, any message that arrives before its matching receive has been posted will have to be buffered on the destination side.
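For example, if node 0 sends two messages with tags 1 and 2, node 1 should post its receives in that same order; posting the tag 2 receive first forces the tag 1 message to be buffered when it arrives. The calls below assume the MP_Send()/MP_Recv() argument order used earlier.

  /* Node 0 (source) */
  MP_Send(a, na, 1, 1);      /* tag 1 sent first  */
  MP_Send(b, nb, 1, 2);      /* tag 2 sent second */

  /* Node 1 (destination): receive in the same order they were sent */
  MP_Recv(a, na, 0, 1);
  MP_Recv(b, nb, 0, 2);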

The receive functions can also use a negative source to indicate a wildcard that will match a message from any source. In this case, only the number of bytes and the tag are used to match the message. While this is convenient at times, it can be very costly: in the TCP module it requires the system to search all TCP buffers for an incoming message instead of just the active ones. This programming method should be avoided if at all possible, and used only where performance is not a factor.
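A wildcard receive simply passes a negative source, as in the sketch below (same assumed argument order as above); it works, but each such call may force a search of every open TCP buffer.

  /* Accept the next nbytes-byte message with this tag from any node. */
  MP_Recv(buf, nbytes, -1, tag);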

Choosing a message-passing library

While MP_Lite offers better performance in some environments and is easier to use, it is not a complete MPI implementation. If you are starting from scratch, the MP_Lite syntax is somewhat simpler and you may want to look into it. I suspect most people will stick with the MPI syntax, since it is a standard and will provide the most flexibility if some of the more advanced MPI functionality is needed.

Tuning the OS in workstation clusters

One of the most important lessons that we have learned repeatedly is that you cannot just slap a cluster together and expect to get good performance. You have to measure the performance at each stage, and in many cases tune the system. NetPIPE is a great utility for measuring the point-to-point communication performance, and has been invaluable in tuning the performance and tracking down problems in various layers of the network.
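For the TCP layer, the NPtcp module of NetPIPE can be run between a pair of nodes to produce a throughput-versus-message-size curve. A typical invocation looks something like the lines below; check the NetPIPE documentation for the options your version supports.

  # On the receiving node
  NPtcp

  # On the transmitting node, pointing at the receiver
  NPtcp -h receiver_hostname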

MP_Lite increases the TCP buffers to their maximum size and uses them to buffer incoming and outgoing messages. If these buffers are large enough, no additional buffering, which would require memory copies, is needed. It is therefore beneficial to increase the maximum TCP buffer size to around 1 MB if possible. For Linux, the following lines can be put into the /etc/sysctl.conf file to increase the maximum TCP buffer sizes.

  # Increase the maximum socket buffer sizes
  net.core.rmem_max = 524288
  net.core.wmem_max = 524288
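After editing /etc/sysctl.conf, the new limits can be loaded without rebooting:

  # Apply the settings in /etc/sysctl.conf immediately
  sysctl -p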

In general, the faster the network you are trying to build, the more tuning you will need to do. Fast Ethernet cards and drivers are fairly good, but most Gigabit Ethernet cards require some tuning.