MP_Lite:
The SMP module
The TCP module can be used to pass messages between processes on
SMP systems. However, the message data is copied many times as it
passes through the TCP/IP stack, and the performance is therefore
limited by the saturation of the main memory bus.
Communication rates between
processes on SMP nodes are often very similar to the rates between
nodes for fast networks, since both communication paths
go through the TCP stack.
Shared-memory segments
Most message-passing systems use
a shared-memory segment to transfer messages between
processes. This bypasses the OS entirely, along with the multiple
memory-to-memory copies that it performs. The source process copies
a message into the shared-memory segment, then the receiving
process copies the data into
its own memory space. This approach is a 2-copy transfer
done entirely within the message-passing system itself.
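The 2-copy path can be sketched in a few lines of C. This is only an illustration, not MP_Lite's code: the shm_segment type, the SEG_CAPACITY size, and the segment_put and segment_get helpers are hypothetical names, and the segment is assumed to be already mapped into both processes (for example with shmget/shmat or mmap). Synchronization between the two sides is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout for illustration; not MP_Lite's actual structures. */
enum { SEG_CAPACITY = 4096 };

typedef struct {
    size_t nbytes;               /* length of the message currently staged */
    char   data[SEG_CAPACITY];   /* staging area visible to both processes */
} shm_segment;

/* Copy 1: the source process copies its buffer into the shared segment. */
static void segment_put(shm_segment *seg, const void *src, size_t nbytes)
{
    assert(nbytes <= SEG_CAPACITY);
    memcpy(seg->data, src, nbytes);
    seg->nbytes = nbytes;
}

/* Copy 2: the destination process copies from the segment into its own
 * memory. Returns the number of bytes delivered (truncated to max). */
static size_t segment_get(const shm_segment *seg, void *dst, size_t max)
{
    size_t n = seg->nbytes < max ? seg->nbytes : max;
    memcpy(dst, seg->data, n);
    return n;
}
```

Every byte crosses the memory bus twice, once on each memcpy, which is why the copy rate dominates for large messages.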
Message-passing implementations use many different methods to
transfer small messages efficiently, to minimize the time that the
shared-memory segment is locked, to notify the destination process
when a new message has been posted, and to maximize the memory copy rates.
Minimizing Latency
LAM MPI provides the lowest latency of 1 us by
passing small messages through pre-allocated mailboxes set up between
each pair of processes.
This completely avoids the use of locking mechanisms.
MP_Lite and MPICH2 do not have the same special system for small messages,
but still maintain a low 2 us latency by simply minimizing the number of
locks needed, and using spin_yield functions as locks rather than more
expensive semaphores.
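A spin-and-yield lock of the kind described here might look like the sketch below. It is written with C11 atomics and the POSIX sched_yield call; the spin_yield_lock type and its functions are hypothetical names chosen for this example, and the actual spin_yield code in MP_Lite may differ.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type; it would live inside the shared-memory segment. */
typedef struct {
    atomic_flag held;
} spin_yield_lock;

static void spin_yield_init(spin_yield_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}

/* Try once; returns true if the lock was acquired. */
static bool spin_yield_trylock(spin_yield_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until acquired, giving up the CPU between attempts so that the
 * current holder can run even when processes outnumber processors. */
static void spin_yield_acquire(spin_yield_lock *l)
{
    while (!spin_yield_trylock(l))
        sched_yield();
}

static void spin_yield_release(spin_yield_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

When the lock is free, acquiring it costs a single atomic operation in user space, with no system call, which is where the saving over a semaphore comes from.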
Maximizing Data Flow
Minimizing the number of locks, and the amount of time that a given
process needs to lock others out from accessing the shared-memory
segment, are the keys to providing an efficient SMP message-passing
system in the intermediate range.
MP_Lite has 2 versions of its SMP module currently under development.
The shm_lockfifo.c module uses a fast spin_yield function as a locking
mechanism. It uses a unique shared-memory FIFO queue between each
pair of processes to pass the header information. Processes therefore
only need to lock the primary shared-memory segment during allocation
and de-allocation of space for messages, and not for traversing the
message queues. The graph above shows that this provides very good
performance in the intermediate range.
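The reason a per-pair FIFO needs no lock is that it has exactly one writer and one reader. The sketch below shows one common way to build such a queue with C11 atomics. The msg_header fields, the FIFO_SLOTS size, and the function names are assumptions made for illustration and are not taken from shm_lockfifo.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { FIFO_SLOTS = 8 };   /* assumed size; must be a power of two */

/* Hypothetical header: just enough to locate and match a message. */
typedef struct {
    int    tag;
    size_t nbytes;
    size_t offset;   /* where the payload sits in the main segment */
} msg_header;

/* One FIFO per (source, destination) pair: the sender is the only writer
 * of tail and the receiver is the only writer of head. */
typedef struct {
    atomic_uint head;
    atomic_uint tail;
    msg_header  slot[FIFO_SLOTS];
} pair_fifo;

static void fifo_init(pair_fifo *f)
{
    assert((FIFO_SLOTS & (FIFO_SLOTS - 1)) == 0);
    atomic_init(&f->head, 0);
    atomic_init(&f->tail, 0);
}

/* Sender side. Returns false if the queue is full. */
static bool fifo_push(pair_fifo *f, msg_header h)
{
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (tail - head == FIFO_SLOTS)
        return false;
    f->slot[tail % FIFO_SLOTS] = h;
    /* Publish the header only after it is fully written. */
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return true;
}

/* Receiver side. Returns false if the queue is empty. */
static bool fifo_pop(pair_fifo *f, msg_header *out)
{
    unsigned head = atomic_load_explicit(&f->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = f->slot[head % FIFO_SLOTS];
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return true;
}
```

In this arrangement only the allocation of payload space in the main segment would still need the spin_yield lock; posting and matching headers would not.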
A Scalable Approach
MP_Lite also has a shm_lockfree.c module that is still in development.
This method divides the shared-memory segment into regions, with a
separate region for the outgoing messages from each process.
Since each process manages its own region, and the only time another
process writes to that region is to mark a cleanup flag when a
message has been read, there is no need for locks at all.
This should provide for a much more scalable system, but it has
not yet been tested on large SMP systems.
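The idea can be illustrated with a small sketch in which every state change has a single writer. The slot layout, sizes, and function names below are hypothetical and are not taken from shm_lockfree.c. The sketch also ignores message ordering and tag matching, which a real module would have to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

enum { REGION_SLOTS = 4, SLOT_BYTES = 256 };   /* assumed sizes */
enum { SLOT_FREE = 0, SLOT_POSTED = 1, SLOT_READ = 2 };

typedef struct {
    atomic_int state;    /* FREE/READ -> POSTED by owner, POSTED -> READ by reader */
    int        dest;     /* rank of the destination process */
    size_t     nbytes;
    char       data[SLOT_BYTES];
} out_slot;

/* One region per sending process, holding its outgoing messages. */
typedef struct {
    out_slot slot[REGION_SLOTS];
} out_region;

static void region_init(out_region *r)
{
    for (int i = 0; i < REGION_SLOTS; i++)
        atomic_init(&r->slot[i].state, SLOT_FREE);
}

/* Owner only: reuse any slot that is free or flagged as read.
 * Returns the slot index, or -1 if every slot is still waiting to be read. */
static int region_post(out_region *r, int dest, const void *src, size_t nbytes)
{
    assert(nbytes <= SLOT_BYTES);
    for (int i = 0; i < REGION_SLOTS; i++) {
        out_slot *s = &r->slot[i];
        if (atomic_load_explicit(&s->state, memory_order_acquire) == SLOT_POSTED)
            continue;
        s->dest = dest;
        s->nbytes = nbytes;
        memcpy(s->data, src, nbytes);
        atomic_store_explicit(&s->state, SLOT_POSTED, memory_order_release);
        return i;
    }
    return -1;
}

/* Reader: copy out one message addressed to rank me, then set the cleanup
 * flag. This is the only write a reader makes to another process's region.
 * Returns the number of bytes copied, or -1 if nothing is pending. */
static int region_take(out_region *r, int me, void *dst, size_t max)
{
    for (int i = 0; i < REGION_SLOTS; i++) {
        out_slot *s = &r->slot[i];
        if (atomic_load_explicit(&s->state, memory_order_acquire) != SLOT_POSTED)
            continue;
        if (s->dest != me)
            continue;
        size_t n = s->nbytes < max ? s->nbytes : max;
        memcpy(dst, s->data, n);
        atomic_store_explicit(&s->state, SLOT_READ, memory_order_release);
        return (int)n;
    }
    return -1;
}
```

Because no two processes ever write the same field at the same time, adding more processes adds more regions rather than more contention on a shared lock.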
Large Message Sizes
The performance for message sizes larger than the cache size is
a direct function of the memory copy rate.
While the memcpy function does a reasonable job on most systems,
more highly tuned memory copy functions are available that can provide
as much as twice the performance on some systems.
The graph above shows that the non-temporal memory copy routine
in MP_Lite can increase the speed by 60% on Pentium 4 systems.
We are trying to accumulate many different optimized memory copy routines
and package them together to make it easier for other message-passing
libraries to use. Unfortunately, many are copyright protected, so the best
we will be able to do is to point people to where the source code
is available.
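As an illustration of what a non-temporal copy involves, the sketch below uses the SSE2 streaming-store intrinsic _mm_stream_si128, which writes to memory without pulling the destination into the cache. It is not the routine shipped with MP_Lite: copy_nontemporal is a hypothetical name, the code falls back to memcpy when SSE2 is not available, and a tuned routine would typically add prefetching and wider stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy nbytes from src to dst with streaming stores. */
static void *copy_nontemporal(void *dst, const void *src, size_t nbytes)
{
    assert(nbytes == 0 || (dst != NULL && src != NULL));
#if defined(__SSE2__)
    char *d = dst;
    const char *s = src;

    /* Streaming stores need a 16-byte aligned destination, so the
     * unaligned head is copied normally. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > nbytes)
        head = nbytes;
    memcpy(d, s, head);
    d += head;
    s += head;
    nbytes -= head;

    /* Main loop: unaligned load, cache-bypassing 16-byte store. */
    for (; nbytes >= 16; d += 16, s += 16, nbytes -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }
    _mm_sfence();            /* order the streamed stores before returning */

    memcpy(d, s, nbytes);    /* remaining tail of fewer than 16 bytes */
    return dst;
#else
    return memcpy(dst, src, nbytes);
#endif
}
```

Bypassing the cache on the store side helps mainly when the message is larger than the cache, since the written data would be evicted before the receiver could use it anyway.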
Mixed Environments of Distributed SMP Nodes
The examples above are for communications entirely on an SMP node.
The same methods can be used in a mixed environment where there
are many SMP nodes in a distributed system.
Messages can be passed between SMP processes through a shared-memory
segment, while messages between SMP nodes are passed via TCP or
another module.
Currently, MP_Lite can operate in a mixed mode using both TCP and
SMP messages, but only if a wildcard is not used for the source of
a receive. Wildcard receives are a bit tricky to handle, since the system must
test both the TCP socket buffers and the SMP shared-memory segment for
a matching message, and this just hasn't been implemented yet.
Current Research
A Linux kernel_copy.c module is under development that will provide
a 1-copy mechanism on Linux systems. This has been done previously
within the BIP-SMP project, but never made it into most MPI
implementations. Once the kernel_copy.c module is developed,
the MP_Lite shmem.c module will be adapted to test the system in
a message-passing environment, then we will work with the developers
of full MPI implementations to get this method into practical use.