MP_Lite Performance on a dual-Xeon SMP
On most SMP systems, messages are passed through a shared-memory segment.
This requires two copies of the data: one by the source process into the
segment, and a second by the destination process out of the shared-memory segment.
Message-passing systems have traditionally used semaphores
to control which process has access to the segment.
This approach has several shortcomings. First, semaphores are fairly
slow, so they add to the latency of the message-passing system. Second,
the approach does not scale to large SMP nodes, since while one process
is altering the shared segment, all others are locked out.
With MP_Lite, we are investigating several mechanisms for improving the
efficiency of SMP message-passing.
The first simply replaces the semaphore lock with a spin-yield lock.
In this approach, a shared variable is used as the lock, with processes
calling a yield function (which yields control to any other scheduled
process) between checks of the shared variable. A yield takes about a
microsecond on most machines, producing a much quicker locking mechanism.
This approach does not burden the CPU, since a check is made only once
between yields.
The graph below shows that this approach can beat semaphore-based locking.
We are also investigating several lock-free mechanisms that will
avoid the locking overhead entirely.
For these, the shared-memory segment is divided into zones as shown
below, with each process managing a zone for its outgoing messages.
A process writes to another process's zone only to mark a message
for cleanup after it has been read.
The lock-free approaches produce about the same performance as the
other message-passing systems above on these ping-pong tests.
The real benefit should be seen within real applications on large
SMP nodes, but this has yet to be demonstrated.
The in-cache performance is probably better in MP_Lite because the
locking mechanisms are minimized, allowing the data to flow more
smoothly. It is still unclear whether this will produce better results
for real codes.
The performance for large messages (above the cache size) is directly
determined by the memory-copy rate. Optimized memory-copy techniques
can be used on various architectures to greatly improve the performance.
On the Pentium 4 system above, a non-temporal memory copy is used to
improve performance for large messages by 60%.