MP_Lite Performance on a dual-Xeon SMP

On most SMP systems, messages are passed through a shared-memory segment. This requires two copies of the data: one by the source process into the segment, and a second by the destination process out of the shared-memory segment. Message-passing systems have traditionally used semaphores to control which process has access to the segment.

This approach has several shortcomings. First, semaphores are fairly slow, so they add to the latency of the message-passing system. This approach is also not scalable for large SMP nodes, since when one process is altering the shared segment, all others are locked out.

With MP_Lite, we are investigating several mechanisms for improving the efficiency of SMP message-passing. The first simply replaces the semaphore lock with a spin-yield lock. In this approach, a shared variable is used as the lock, with each process calling a yield function (which cedes control to any other runnable process) between checks of the shared variable. A yield takes about a microsecond on most machines, producing a much quicker locking mechanism. This approach does not burden the CPU, since a check is made only once every microsecond. The graph below shows that this approach can beat semaphore-based methods.

We are also investigating several lock-free mechanisms that will scale better. For these, the shared-memory segment is divided into zones as shown below, with each process managing a zone for its outgoing messages. A process only writes to another process's zone to mark a message for cleanup after it is read.

The lock-free approaches produce about the same performance as the other message-passing systems above on these ping-pong tests. The real benefit should be seen within real applications on large SMP nodes, but this remains to be demonstrated.

The in-cache performance is probably better in MP_Lite because locking is minimized, allowing the data to flow more smoothly. It is still unclear whether this will produce better results for real codes.

The performance for large messages (above the cache size) is determined directly by the memory-copy rate. Optimized memory-copy techniques can be used on various architectures to greatly improve the performance. On the Pentium 4 system above, a non-temporal memory copy improves performance for large messages by 60%.