MP_Lite:
The SHMEM module
SHMEM is the native communication library on the Cray T3E, and is
also available on SGI multiprocessor machines. On the Cray T3E, one-sided
SHMEM calls can achieve transfer rates of up to 340 MB/sec (2720 Mbps) with
a latency of only 2-3 µs. A shmem_put() call allows a node to write
data directly into user space on another node, and a shmem_get() function
allows it to get data from another node. Both occur without the cooperation
of the second node.
The Cray-optimized MPI implementation is written on top of SHMEM. It
originally only delivered a maximum of 160 MB/sec with a latency around 20 µs,
which provided the motivation for writing this MP_Lite module.
The MP_Lite module shmem.c is also written using the same one-sided SHMEM calls,
but can deliver around 320 MB/sec with a 12 µs latency. This is achieved
by avoiding extra buffering when possible, buffering the data on the
source node only if needed to avoid lock-up conditions.
The current version of the Cray-optimized MPI library provides a throughput
of 300 MB/sec with a 9 µs latency, so there is not much reason to
use the MP_Lite module at this point.
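To put the latency and throughput figures above side by side, one rough option is a linear cost model, T(n) = latency + n / bandwidth. The model and the function name transfer_time_us() are assumptions made for illustration; the figures quoted above are measured maxima, and real transfer curves will not follow such a simple model exactly.

```c
#include <assert.h>

/* Estimated one-way transfer time in microseconds under the simple model
 * T(n) = latency + n / bandwidth. The model is an assumption, not a
 * measurement. 1 MB/sec is taken as 10^6 bytes/sec, which is 1 byte per
 * microsecond, so nbytes / mb_per_sec is already in microseconds. */
double transfer_time_us(double nbytes, double latency_us, double mb_per_sec)
{
    assert(nbytes >= 0.0 && mb_per_sec > 0.0);
    return latency_us + nbytes / mb_per_sec;
}
```

With the MP_Lite figures, transfer_time_us(3840, 12, 320) evaluates 12 + 3840/320 = 24 µs. Under this model the MP_Lite and current Cray MPI estimates meet at 14,400 bytes (57 µs each), with the Cray MPI estimate lower for smaller messages.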
Send and Receive queues
Each node has a set of circular send message queues smsg[source].q[QSIZE]
where other nodes can post send messages by writing 24 byte headers
using shmem_put(). Each header contains a pointer to the data on the
source node, the message tag, and the number of bytes. A receive will
try to match the tag and number of bytes, then pull the data from the
source node using a shmem_get() with the pointer. Once a header has been
matched and the message pulled from the source node, the header is
deleted and the circular queue is condensed forward to fill the gap.
The integer smsg[source].i is used locally on each node to point to the
beginning of the active region of the circular queue, and the integer
smsg[destination].p is used to keep track of the next element to post to
on the remote circular queue.
Each node likewise has a set of circular receive message queues
rmsg[destination].q[QSIZE]. Other nodes can prepost receive headers
using shmem_put(). A send will first check these preposted receive
headers, and if the tag and number of bytes are matched the source node
will shmem_put() the data to the destination node. The header is then
deleted and the queue is condensed forward.
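The queue handling described above can be sketched in plain C without any SHMEM calls. This is a minimal single-process illustration, not the code in shmem.c: the field layout of header_t, the count field, the value of QSIZE, and the function names queue_post() and queue_match() are all assumptions. In the real module the remote node writes headers with shmem_put() and tracks its own posting index, so "condensing forward" is read here as shifting the older headers up into the gap and advancing the start index i.

```c
#include <assert.h>
#include <stdint.h>

#define QSIZE 8  /* hypothetical queue depth; the real value is not given */

/* One 24-byte header: data address on the source node, tag, byte count.
 * The exact field layout is an assumption. */
typedef struct {
    uint64_t addr;    /* address of the data on the source node */
    int64_t  tag;     /* message tag */
    int64_t  nbytes;  /* message length in bytes */
} header_t;

/* A circular queue of headers. i marks the start of the active region;
 * count is hypothetical local bookkeeping for the number of live headers. */
typedef struct {
    header_t q[QSIZE];
    int i;
    int count;
} msg_queue_t;

/* Post a header at the end of the active region.
 * Returns 0 on success, -1 if the queue is full. */
int queue_post(msg_queue_t *mq, header_t h)
{
    assert(mq);
    if (mq->count == QSIZE) return -1;
    mq->q[(mq->i + mq->count) % QSIZE] = h;
    mq->count++;
    return 0;
}

/* Find the oldest header matching tag and nbytes, copy it to *out,
 * delete it, and close the gap. Returns 1 on a match, otherwise 0. */
int queue_match(msg_queue_t *mq, int64_t tag, int64_t nbytes, header_t *out)
{
    assert(mq && out);
    for (int k = 0; k < mq->count; k++) {
        int slot = (mq->i + k) % QSIZE;
        if (mq->q[slot].tag == tag && mq->q[slot].nbytes == nbytes) {
            *out = mq->q[slot];
            /* Condense forward: shift the older headers up by one slot to
             * fill the gap, then advance the start of the active region. */
            for (int j = k; j > 0; j--) {
                mq->q[(mq->i + j) % QSIZE] = mq->q[(mq->i + j - 1) % QSIZE];
            }
            mq->i = (mq->i + 1) % QSIZE;
            mq->count--;
            return 1;
        }
    }
    return 0;
}
```

Shifting toward the tail rather than the head keeps the posting position unchanged, which matters if the poster is a remote node that cannot see local deletions.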
Passing the messages
When an MP_ASend() is initiated, the source node checks the receive
message queue to see if a matching receive has been preposted. If so, the data
is sent using shmem_put() to the destination using the pointer from the
posted receive header. The send_done signal is posted to the send
message queue on the destination node using shmem_put() to signal that the
send has been completed. If no receive was preposted, MP_ASend()
will do nothing and let MP_Wait() handle it.
If MP_ASend() failed to complete the send, MP_Wait()
will check again for a preposted receive, and complete the send as above
if one is found. If there still is no matching receive posted, MP_Wait()
must malloc() a send buffer, copy the data in, and post the header with a
pointer to the send buffer to the send message queue on the destination node.
The source node will also store the same header locally in a send buffer log
slog[destination].q[QSIZE] that will be used to clean up the buffer after
getting a recv_done signal from the destination node.
This extra buffering can reduce the transfer rate, but is necessary to
avoid a lock-up condition when the application programmer has not ensured
that the receive was preposted before the send was initiated.
A blocking MP_Send() simply calls MP_Wait(). As described
above, if a matching receive is found it will shmem_put() the data to the
destination node; otherwise it will copy the data into a send buffer and post
a header to the send message queue on the destination node.
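The send-side decision described above can be sketched as a single function. This is a single-process illustration under stated assumptions: memcpy() stands in for shmem_put(), and the types recv_post_t and send_post_t and the function complete_send() are hypothetical names, not part of MP_Lite.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { SENT_DIRECT, SENT_BUFFERED, SEND_FAILED } send_result_t;

/* Hypothetical stand-in for a preposted receive header. */
typedef struct {
    void *dest;    /* where the receiver wants the data */
    int   tag;
    int   nbytes;
    int   valid;   /* nonzero if a receive has been preposted */
} recv_post_t;

/* Hypothetical stand-in for a buffered-send header posted to the
 * send message queue on the destination node. */
typedef struct {
    void *buf;     /* send buffer on the source node */
    int   tag;
    int   nbytes;
} send_post_t;

/* Complete a send: write directly if a matching receive was preposted,
 * otherwise copy the data into a freshly allocated send buffer and post
 * a header describing it. */
send_result_t complete_send(const void *data, int nbytes, int tag,
                            recv_post_t *prepost, send_post_t *posted)
{
    assert(data && prepost && posted && nbytes > 0);
    if (prepost->valid && prepost->tag == tag && prepost->nbytes == nbytes) {
        memcpy(prepost->dest, data, (size_t)nbytes); /* stands in for shmem_put() */
        prepost->valid = 0;                          /* delete the matched header */
        return SENT_DIRECT;
    }
    void *buf = malloc((size_t)nbytes);
    if (buf == NULL) return SEND_FAILED;
    memcpy(buf, data, (size_t)nbytes);               /* the extra memory copy */
    posted->buf = buf;
    posted->tag = tag;
    posted->nbytes = nbytes;
    return SENT_BUFFERED;
}
```

The caller owns posted->buf after a buffered send and frees it once the receiver reports that the data has been pulled.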
When an MP_ARecv() is initiated, the node first checks the
send message queue from the source node for a matching send.
A match indicates that it is too late to prepost the receive, since
the source node has already buffered the message data. If a match is found
the node uses shmem_get() to pull the data from the send buffer on the
source node, then posts a recv_done signal to the receive message
queue on the source node to signal that the send buffer can be freed.
If no match is found, the header is preposted to the receive message
queue on the source node so that the source node can handle the data transfer
when a matching send is initiated. The MP_Wait() for the MP_ARecv()
will block in a busy-wait loop until the send_done signal is received,
indicating that the source node has completed the transfer.
Unfortunately, it is possible for the destination node to prepost a
receive after the source node checks the receive message queue but before
the source node buffers the message and posts the buffered send to the
destination node. This possibility of preposting the receive in the 'gap'
necessitates some extra care, and some of the additional handshaking described
above. When a receive blocks waiting for the source node to finish the
send, it must also check for a matching buffered send to arrive. This
indicates that the 'gap' condition has occurred, and the source node has
already buffered the data. In this case, the destination node takes
control and uses shmem_get() to pull the data down. Then the recv_done
signal is sent to the source node, indicating that the send buffer can be
freed, but it is sent with a negative number of bytes to indicate that
the source node must also unpost the matching preposted receive to clean
things up completely.
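One pass of the receive-side wait loop, including the 'gap' case, might look like the sketch below. It is a single-process illustration only: memcpy() stands in for shmem_get(), and the types recv_view_t and recv_done_t and the function recv_poll() are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <string.h>

typedef enum { WAIT_PENDING, WAIT_DIRECT_DONE, WAIT_PULLED } wait_state_t;

/* Hypothetical view of what the destination node sees for one receive. */
typedef struct {
    int         send_done;  /* set when the source reports a completed put */
    const void *buffered;   /* non-NULL once a matching buffered send is posted */
    int         nbytes;     /* size of the buffered message */
} recv_view_t;

/* Hypothetical recv_done signal returned to the source node. */
typedef struct {
    int tag;
    int nbytes;  /* negative: source must also unpost the preposted receive */
} recv_done_t;

/* One pass of the busy-wait loop for a receive. preposted is nonzero if
 * this receive already posted a header on the source node. */
wait_state_t recv_poll(const recv_view_t *v, int tag, void *dest,
                       int preposted, recv_done_t *sig)
{
    assert(v && dest && sig);
    if (v->send_done) return WAIT_DIRECT_DONE;  /* source put the data directly */
    if (v->buffered != NULL) {
        /* The source buffered the message: pull it down ourselves. */
        memcpy(dest, v->buffered, (size_t)v->nbytes); /* stands in for shmem_get() */
        sig->tag = tag;
        /* A prepost that crossed a buffered send is the 'gap' case; the
         * negative byte count tells the source to unpost that receive. */
        sig->nbytes = preposted ? -v->nbytes : v->nbytes;
        return WAIT_PULLED;
    }
    return WAIT_PENDING;  /* keep spinning */
}
```

Encoding the cleanup request in the sign of the byte count keeps the signal the same size as an ordinary recv_done.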
It gets even more complicated though. Each preposted receive must
be treated with suspicion since it may have come in the gap, and therefore
may already match a buffered send. Therefore, a preposted receive can
be used only if no matching buffered send is still waiting to be cleared.
If one is, the preposted receive must be ignored since it cannot be
trusted yet, and MP_Wait() may have to do a buffered send instead.
This maintains the integrity of the system, but may occasionally result in
more messages being buffered.
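The trust rule above amounts to a scan of the send buffer log before a preposted receive is used. The sketch below is an assumption-laden illustration: slog_entry_t, its active flag, the value of QSIZE, and the functions prepost_is_trusted() and handle_recv_done() are hypothetical, and the log is modeled as a flat array rather than a circular queue.

```c
#include <assert.h>

#define QSIZE 8  /* hypothetical log depth; the real value is not given */

/* Hypothetical entry in the send buffer log for one destination. */
typedef struct {
    int tag;
    int nbytes;
    int active;  /* nonzero until the recv_done for this buffer arrives */
} slog_entry_t;

/* Returns 1 if a preposted receive with this tag and size may be used,
 * or 0 if a matching buffered send is still outstanding. Ignored
 * preposts are counted so they can be reported in the run statistics. */
int prepost_is_trusted(const slog_entry_t *slog, int tag, int nbytes,
                       int *ignored_count)
{
    assert(slog && ignored_count);
    for (int k = 0; k < QSIZE; k++) {
        if (slog[k].active && slog[k].tag == tag && slog[k].nbytes == nbytes) {
            (*ignored_count)++;
            return 0;
        }
    }
    return 1;
}

/* Process a recv_done signal: clear the matching log entry. Returns 1 if
 * the byte count was negative, meaning the caller must also unpost the
 * matching preposted receive. */
int handle_recv_done(slog_entry_t *slog, int tag, int nbytes)
{
    assert(slog);
    int must_unpost = nbytes < 0;
    int n = must_unpost ? -nbytes : nbytes;
    for (int k = 0; k < QSIZE; k++) {
        if (slog[k].active && slog[k].tag == tag && slog[k].nbytes == n) {
            slog[k].active = 0;  /* the send buffer can now be freed */
            break;
        }
    }
    return must_unpost;
}
```

Once the recv_done has cleared the log entry, a later prepost with the same tag and size is trusted again.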
It would be much easier to write a module that just had the
destination node handle all the transfers. However, this would be less
efficient than letting both sides transfer data, since allowing either side
to move the data initiates the transfer at the earliest possible time. Allowing both sides to handle the
transfer requires some handshaking, which always allows for a gap of some
sort where the handshaking communications pass each other in transit.
While this approach may sound very complicated, some similar mechanism is
needed for any such algorithm to ensure the integrity of the message-passing
system. This current approach minimizes the handshaking for the most common
case where the receive gets preposted in order to provide the lowest latency.
For messages that get pushed to a send buffer, efficiency has already been
lost due to the memory-to-memory copy, so if extra handshaking is needed
because of a prepost in the gap, there is not much effect on the resulting
performance.
Receives posted with an unknown source (-1) will simply cycle through
all the send message queues until a matching buffered send is posted.
This is clearly not very efficient since it guarantees that the message
will be buffered on the source node. More efficient methods would
require much more programming than I'm willing to do. My basic philosophy
is that all receives should specify the source node unless they are
in an area where execution time is not important.
Small messages
For small messages of 8 bytes or less, the send replaces the pointer
in the header it posts to the send message queue on the destination node
with the actual data. When a receive on the destination node matches this
header, it can then just copy the data straight out of the header's pointer field.
The latency for these small messages is therefore much smaller since only
one communication of 24 bytes is needed.
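The small-message path can be pictured as a header whose pointer field doubles as an 8-byte payload. The sketch below is one way to express that in C and is not taken from shmem.c: the union layout, the constant SMALL_MSG_MAX, and the functions make_header() and recv_from_header() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SMALL_MSG_MAX 8  /* messages of 8 bytes or less travel in the header */

/* 24-byte header whose first field holds either the data address on the
 * source node or, for small messages, the data itself. The layout is an
 * assumption. */
typedef struct {
    union {
        uint64_t      addr;                       /* large message */
        unsigned char inline_data[SMALL_MSG_MAX]; /* small message */
    } u;
    int64_t tag;
    int64_t nbytes;
} header_t;

/* Build the header a send would post to the destination node. */
header_t make_header(const void *data, int64_t nbytes, int64_t tag)
{
    header_t h;
    assert(data && nbytes >= 0);
    memset(&h, 0, sizeof h);
    h.tag = tag;
    h.nbytes = nbytes;
    if (nbytes <= SMALL_MSG_MAX)
        memcpy(h.u.inline_data, data, (size_t)nbytes);  /* data rides along */
    else
        h.u.addr = (uint64_t)(uintptr_t)data;           /* receiver must pull */
    return h;
}

/* Returns 1 if the data was copied straight from the header, or 0 if the
 * receiver still has to pull it from the source node. */
int recv_from_header(const header_t *h, void *dest)
{
    assert(h && dest);
    if (h->nbytes <= SMALL_MSG_MAX) {
        memcpy(dest, h->u.inline_data, (size_t)h->nbytes);
        return 1;
    }
    return 0;
}
```

A single double or 64-bit integer fits in the header this way, so a reduction or a flag exchange needs only the one 24-byte transfer.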
Performance
As discussed above, this module is optimized for the case where the
receive is preposted before the send is encountered. If this does
not occur, and the message needs to be buffered on the source node,
some performance is lost even though the memory copy rate is very
high on the Cray T3E. Receives posted with an unknown source always
wait for a buffered send to be posted, and therefore should be avoided
in time-critical areas. Statistics on the number of messages that go
through the send buffers are reported at the end of each run in the
.nodeX log files. Also reported is the number of preposted receives that
had to be ignored because of a matching buffered send that had not cleared.
I have yet to see a single instance of this in a real code. It needs to
be there to protect the integrity of the system, but obviously is not
being encountered in real situations.
Small messages of 8 bytes or less are passed with the header, and can
take as little as 12 µs. Messages slightly larger require more handshaking
and jump to a 19 µs latency.
This module works very well on the Cray T3E, but has not been tested on
the SGI multiprocessor machines yet.