MP_Lite: The SHMEM module

SHMEM is the native communication library on the Cray T3E, and is also available on SGI multiprocessor machines. On the Cray T3E, one-sided SHMEM calls can achieve transfer rates of up to 340 MB/sec (2720 Mbps) with a latency of only 2-3 µs. A shmem_put() call allows a node to write data directly into user space on another node, and a shmem_get() function allows it to get data from another node. Both occur without the cooperation of the second node.

The Cray-optimized MPI implementation is written on top of SHMEM. It originally delivered a maximum of only 160 MB/sec with a latency of around 20 µs, which provided the motivation for writing this MP_Lite module. The MP_Lite module shmem.c is written using the same one-sided SHMEM calls, but can deliver around 320 MB/sec with a 12 µs latency. This is achieved by avoiding extra buffering when possible, buffering the data on the source node only when needed to avoid lock-up conditions. The current version of the Cray-optimized MPI library provides a throughput of 300 MB/sec with a 9 µs latency, so there is not much reason to use the MP_Lite module at this point.

Send and Receive queues

Each node has a set of circular send message queues smsg[source].q[QSIZE] where other nodes can post send messages by writing 24 byte headers using shmem_put(). Each header contains a pointer to the data on the source node, the message tag, and the number of bytes. A receive will try to match the tag and number of bytes, then pull the data from the source node using a shmem_get() with the pointer. When a header gets matched and the message pulled from the other node, the header will be deleted and the circular queue is condensed forward to fill the gap. The integer smsg[source].i is used locally on each node to point to the beginning of the active region of the circular queue, and the integer smsg[destination].p is used to keep track of the next element to post to on the remote circular queue.
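The circular-queue bookkeeping described above can be sketched in plain C. The structure layout, field names, and condensing logic below are illustrative guesses at how such a queue works, not the actual MP_Lite definitions:

```c
#include <assert.h>
#include <stddef.h>

#define QSIZE 8

/* Hypothetical 24-byte header: a pointer to the data on the source
 * node, the message tag, and the number of bytes. */
struct msg_header {
    void *ptr;
    int   tag;
    int   nbytes;
};

/* One circular message queue, with an index to the beginning of the
 * active region (as smsg[source].i does) and a count of active
 * headers (illustrative bookkeeping). */
struct msg_queue {
    struct msg_header q[QSIZE];
    int i;   /* start of the active region */
    int n;   /* number of active headers   */
};

/* Scan the active region for a header matching tag and nbytes;
 * return its slot index, or -1 if there is no match. */
static int q_match(struct msg_queue *mq, int tag, int nbytes)
{
    for (int k = 0; k < mq->n; k++) {
        int idx = (mq->i + k) % QSIZE;
        if (mq->q[idx].tag == tag && mq->q[idx].nbytes == nbytes)
            return idx;
    }
    return -1;
}

/* Delete a matched header and condense the circular queue forward to
 * fill the gap: earlier active entries shift up by one slot and the
 * start index advances. */
static void q_delete(struct msg_queue *mq, int idx)
{
    int pos = (idx - mq->i + QSIZE) % QSIZE;   /* offset within active region */
    for (int k = pos; k > 0; k--) {
        int dst = (mq->i + k) % QSIZE;
        int src = (mq->i + k - 1) % QSIZE;
        mq->q[dst] = mq->q[src];
    }
    mq->i = (mq->i + 1) % QSIZE;
    mq->n--;
}
```

Condensing forward keeps the active region contiguous, so matching stays a simple linear scan starting at the index in smsg[source].i.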

Each node likewise has a set of circular receive message queues rmsg[destination].q[QSIZE]. Other nodes can prepost receive headers using shmem_put(). A send will first check these preposted receive headers, and if the tag and number of bytes are matched the source node will shmem_put() the data to the destination node. The header is then deleted and the queue is condensed forward.

Passing the messages

When an MP_ASend() is initiated, the source node checks the receive message queue to see if a matching receive has been preposted. If so, the data is sent using shmem_put() to the destination using the pointer from the posted receive header. The send_done signal is posted to the send message queue on the destination node using shmem_put() to signal that the send has been completed. If no receive was preposted, MP_ASend() will do nothing and let MP_Wait() handle it.

If MP_ASend() failed to complete the send, MP_Wait() will check again for a preposted receive, and complete the send as above if one is found. If there still is no matching receive posted, MP_Wait() must malloc() a send buffer, copy the data in, and post the header with a pointer to the send buffer to the send message queue on the destination node. The source node will also store the same header locally in a send buffer log slog[destination].q[QSIZE] that will be used to clean up the buffer after getting a recv_done signal from the destination node. This extra buffering can reduce the transfer rate, but is necessary to avoid a lock-up condition when the application programmer has not ensured that the receive was preposted before the send was initiated.

A blocking MP_Send() simply calls MP_Wait(). As described above, if a matching receive is found it will shmem_put() the data to the destination node, or it will push the data into a send buffer and post a header to the send message queue on the destination node.
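The send-side decision flow described above can be sketched as follows. So that the sketch runs in a single process, the one-sided shmem_put() is simulated with a plain memory copy and the destination node's queues are ordinary local structures; all names and layouts are hypothetical, not the actual MP_Lite code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative header; the real MP_Lite layout may differ. */
struct header { void *ptr; int tag; int nbytes; };

/* Simulated queues on the destination node. */
struct node_queues {
    struct header rq[8]; int rn;   /* preposted receive headers */
    struct header sq[8]; int sn;   /* send message queue        */
};

/* Stand-in for shmem_put(): copy nbytes to the "remote" address. */
static void sim_put(void *remote, const void *local, int nbytes)
{
    memcpy(remote, local, nbytes);
}

/* Blocking-send path: if a matching receive is preposted, put the data
 * directly and post a send_done header; otherwise buffer the data and
 * post a header pointing at the buffer.  Returns the send buffer (to be
 * freed on recv_done) or NULL if the send completed directly. */
static void *send_path(struct node_queues *dest, void *data, int tag, int nbytes)
{
    for (int k = 0; k < dest->rn; k++) {
        if (dest->rq[k].tag == tag && dest->rq[k].nbytes == nbytes) {
            sim_put(dest->rq[k].ptr, data, nbytes);   /* direct transfer */
            dest->rq[k] = dest->rq[--dest->rn];       /* delete header (order-
                                                         preserving condensing
                                                         omitted for brevity) */
            /* post send_done to the destination's send message queue */
            dest->sq[dest->sn++] = (struct header){ NULL, tag, nbytes };
            return NULL;
        }
    }
    /* No preposted receive: buffer the data and post the header. */
    void *sbuf = malloc(nbytes);
    memcpy(sbuf, data, nbytes);
    dest->sq[dest->sn++] = (struct header){ sbuf, tag, nbytes };
    return sbuf;
}
```

In the real module the queue updates on the destination node are themselves shmem_put() operations, and the buffered-send header is also logged locally in slog[destination] so the buffer can be freed when recv_done arrives.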

When an MP_ARecv() is initiated, the node checks the send message queue for the source node for a matching send, indicating that it is too late to prepost the receive since the source node has already buffered the message data. If a match is found the node uses shmem_get() to pull the data from the send buffer on the source node, then posts a recv_done signal to the receive message queue on the source node to signal that the send buffer can be freed.

If no match is found, the header is preposted to the receive message queue on the source node so that it can handle the data transfer when a matching send header is found. The MP_Wait() for the MP_ARecv() will block in a busy-wait loop until the send_done signal is received, indicating that the source node has completed the transfer.
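The receive-side flow can be sketched in the same style, with shmem_get() simulated by a local memory copy and the source node's queues held in ordinary structures; again, every name here is an illustrative stand-in:

```c
#include <assert.h>
#include <string.h>

/* Illustrative header; the real MP_Lite layout may differ. */
struct header { void *ptr; int tag; int nbytes; };

/* Simulated queues on the source node. */
struct source_queues {
    struct header sq[8]; int sn;   /* buffered sends it has posted to us */
    struct header rq[8]; int rn;   /* receive headers we prepost to it   */
};

/* Stand-in for shmem_get(): pull nbytes from the "remote" buffer. */
static void sim_get(void *local, const void *remote, int nbytes)
{
    memcpy(local, remote, nbytes);
}

/* MP_ARecv path: if the source has already buffered a matching send,
 * pull the data immediately; otherwise prepost the receive header.
 * Returns 1 if the data was pulled, 0 if the receive was preposted. */
static int arecv_path(struct source_queues *src, void *buf, int tag, int nbytes)
{
    for (int k = 0; k < src->sn; k++) {
        if (src->sq[k].tag == tag && src->sq[k].nbytes == nbytes) {
            sim_get(buf, src->sq[k].ptr, nbytes);   /* pull buffered data */
            src->sq[k] = src->sq[--src->sn];        /* delete send header */
            /* here the real module posts recv_done to the source node's
             * receive message queue so the send buffer can be freed    */
            return 1;
        }
    }
    /* No buffered send yet: prepost the receive with a pointer to buf. */
    src->rq[src->rn++] = (struct header){ buf, tag, nbytes };
    return 0;
}
```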

Unfortunately, it is possible for the destination node to prepost a receive after the source node checks the receive message queue but before the source node buffers the message and posts the buffered send to the destination node. This possibility of preposting the message in the 'gap' necessitates some extra care, and some of the additional handshaking described above. When a receive blocks waiting for the source node to finish the send, it must also check for a matching buffered send to arrive. This indicates that the 'gap' condition has occurred, and the source node has already buffered the data. In this case, the destination node takes control and uses shmem_get() to pull the data down. Then the recv_done signal is sent to the source node, indicating that the send buffer can be freed, but it is sent with a negative number of bytes to indicate that the source node must also unpost the matching preposted receive to clean things up completely.

It gets even more complicated, though. Each preposted receive must be treated with suspicion, since it may have come in the gap and therefore may already match a buffered send. A preposted receive can therefore only be used if there is no matching buffered send that has not yet been cleared. If such a buffered send exists, the preposted receive must be ignored since it cannot be trusted yet, and MP_Wait() may have to do a buffered send instead. This maintains the integrity of the system, but may occasionally result in more messages being buffered.

It would be much easier to write a module in which the destination node handled all the transfers. However, this would be less efficient, since letting either side handle the transfer allows it to be initiated at the earliest possible time. Allowing both sides to handle the transfer requires some handshaking, which always leaves a gap of some sort where the handshaking communications can pass each other in transit. While this approach may sound very complicated, some similar mechanism is needed in any such algorithm to ensure the integrity of the message-passing system. The current approach minimizes the handshaking for the most common case, where the receive gets preposted, in order to provide the lowest latency. For messages that get pushed to a send buffer, efficiency has already been lost to the memory-to-memory copy, so if extra handshaking is needed because of a prepost in the gap, there is little effect on the resulting performance.

Receives posted with an unknown source (-1) will simply cycle through all the send message queues until a matching buffered send is posted. This is clearly not very efficient, since it guarantees that the message will be buffered on the source node. More efficient methods would require much more programming than I'm willing to do. My basic philosophy is that all receives should specify the source node unless they are in an area where execution time is not important.

Small messages

For small messages of 8 bytes or less, the send replaces the pointer in the header it posts to the send message queue on the destination node with the actual data. When a receive on the destination node matches this header, it can then just copy the data straight from the pointer location. The latency for these small messages is therefore much smaller since only one communication of 24 bytes is needed.
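One way to implement this inlining is to overlay the data on the 8-byte pointer field of the header. The layout and helper functions below are an illustrative guess, not the actual MP_Lite header:

```c
#include <assert.h>
#include <string.h>

/* Illustrative header: for messages of 8 bytes or less, the 8-byte
 * pointer slot carries the data itself instead of an address. */
struct header {
    union {
        void *ptr;               /* address of data on the source node */
        char  inline_data[8];    /* the data itself, if nbytes <= 8    */
    } u;
    int tag;
    int nbytes;
};

/* Build a send header, inlining small payloads into the pointer slot. */
static struct header make_header(void *data, int tag, int nbytes)
{
    struct header h = { .tag = tag, .nbytes = nbytes };
    if (nbytes <= 8)
        memcpy(h.u.inline_data, data, nbytes);  /* data travels in header  */
    else
        h.u.ptr = data;                         /* receiver must pull data */
    return h;
}

/* Receive side: copy inline data straight out of a matched header.
 * Returns 1 if the data was inlined, 0 if a shmem_get() from
 * h->u.ptr would be needed. */
static int recv_small(struct header *h, void *buf)
{
    if (h->nbytes > 8)
        return 0;
    memcpy(buf, h->u.inline_data, h->nbytes);
    return 1;
}
```

With this layout, a small message costs exactly one 24-byte shmem_put() of the header, which is what keeps the latency so low.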

Performance

As discussed above, this module is optimized for the case where the receive is preposted before the send is encountered. If this does not occur, and the message needs to be buffered on the source node, some performance is lost even though the memory copy rate is very high on the Cray T3E. Receives posted with an unknown source always wait for a buffered send to be posted, and therefore should be avoided in time-critical areas. Statistics on the number of messages that go through the send buffers are reported at the end of each run in the .nodeX log files. Also reported is the number of preposted receives that had to be ignored because of a matching buffered send that had not cleared. I have yet to see a single instance of this in a real code. It needs to be there to protect the integrity of the system, but obviously is not being encountered in real situations.

Small messages of 8 bytes or less are passed with the header, and can take as little as 12 µs. Messages slightly larger require more handshaking and jump to a 19 µs latency.

This module works very well on the Cray T3E, but has not been tested on the SGI multiprocessor machines yet.