MP_Lite: The SMP module

The TCP module can be used to pass messages between processes on SMP systems. However, the message data is copied many times as it passes through the TCP/IP stack, and the performance is therefore limited by the saturation of the main memory bus. Communication rates between processes on SMP nodes are often very similar to the rates between nodes for fast networks, since both communication paths go through the TCP stack.

Shared-memory segments

Most message-passing systems use a shared-memory segment to transfer messages between processes on the same node. This bypasses the OS entirely, along with the multiple memory-to-memory copies it performs. The source process copies a message into the shared-memory segment, then the receiving process copies the data into its own memory space. This is a 2-copy transfer performed entirely within the message-passing system itself.

Message-passing implementations use many different methods to transfer small messages efficiently, to minimize the time that the shared-memory segment is locked, to notify the destination process when a new message has been posted, and to maximize the memory copy rates.

Minimizing Latency

LAM MPI provides the lowest latency, 1 us, by passing small messages through pre-allocated mailboxes set up between each pair of processes, completely avoiding locking mechanisms. MP_Lite and MPICH2 do not have a comparable special path for small messages, but still maintain a low 2 us latency by minimizing the number of locks needed and by using spin_yield functions as locks rather than more expensive semaphores.

Maximizing Data Flow

Minimizing the number of locks, and the amount of time that a given process locks others out of the shared-memory segment, is the key to an efficient SMP message-passing system in the intermediate range. MP_Lite currently has two versions under development. The shm_lockfifo.c module uses a fast spin_yield function as its locking mechanism, and a unique shared-memory FIFO queue between each pair of processes to pass the header information. Processes therefore only need to lock the primary shared-memory segment during allocation and de-allocation of space for messages, not while traversing the message queues. The graph above shows that this provides very good performance in the intermediate range.

A Scalable Approach

MP_Lite also has a shm_lockfree.c module that is still in development. This method divides the shared-memory segment into regions, with a separate region for the outgoing messages from each process. Since each process manages its own region, and the only time another process writes to that region is to mark a cleanup flag when a message has been read, no locks are needed at all. This should provide a much more scalable system, but it has not yet been tested on large SMP systems.

Large Message Sizes

The performance for message sizes larger than the cache size is a direct function of the memory copy rate. While the memcpy function does a reasonable job on most systems, more highly tuned memory copy functions are available that can provide as much as twice the performance on some systems. The graph above shows that the non-temporal memory copy routine in MP_Lite can increase the speed by 60% on Pentium 4 systems. We are trying to collect many different optimized memory copy routines and package them together to make it easier for other message-passing libraries to use them. Unfortunately, many are copyrighted, so the best we will be able to do for those is point people toward where the source code is available.

Mixed Environments of Distributed SMP Nodes

The examples above are for communications entirely within an SMP node. The same methods can be used in a mixed environment of many SMP nodes in a distributed system: messages between processes on the same node pass through a shared-memory segment, while messages between nodes go via TCP or another module. Currently, MP_Lite can operate in this mixed mode using both TCP and SMP messages, but only if a wildcard is not used for the source of a receive. Handling a wildcard is tricky, since the system must test both the TCP socket buffers and the SMP shared-memory segment for a matching message, and this has not been implemented yet.

Current Research

A Linux kernel_copy.c module is under development that will provide a 1-copy mechanism on Linux systems. This has been done previously within the BIP-SMP project, but it never made it into most MPI implementations. Once the kernel_copy.c module is complete, the MP_Lite shmem.c module will be adapted to test the system in a message-passing environment, and we will then work with the developers of full MPI implementations to bring this method into practical use.