Message-Passing using MPI

This section will cover the very basics of using MPI for two-sided communications. The MPI standard defines a much larger set of functions than is presented here, but most are not needed unless you are writing very advanced programs. I will also present these functions in a manner that promotes good programming practices that will produce efficient code for any architecture.

Initialization and clean up

A typical MPI run involves many processors, each running the same program but operating on a different part of the data. It is possible to use a master/slave approach as well, where there is one master node that divies out the workload and collects the results. While this approach sounds tempting, it is more difficult to use in general because you need to maintain two separate programs, and also launch the master and slaves separately. I strongly urge you not to take this approach.

The table below shows the f77 and C syntax for the initialization and cleanup functions needed in every MPI program. The MPI_Init() function simply initializes the MPI system. The MPI_Comm_size() function returns the number of nodes that the code is running on, as specified in the mpirun command. The MPI_Comm_rank() function returns the relative node number, between 0 and the number of nodes minus 1, assigned to each of the nodes. This rank is then used in the rest of the program to determine which data each node will work with, etc.

f77
INCLUDE "mpif.h"

CALL MPI_Init( MPI_COMM_WORLD, ierror )
CALL MPI_Comm_size( MPI_COMM_WORLD, nprocs, ierror)
CALL MPI_Comm_rank( MPI_COMM_WORLD, myproc, ierror)
C
#include "mpi.h"

MPI_Init( MPI_COMM_WORLD, &argc, &argv );
MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
MPI_Comm_rank( MPI_COMM_WORLD, &myproc );

The mpif.h file needs to be included at the beginning of each f77 subroutine or function that uses MPI commands. For C, the mpi.f file needs to be included at the beginning of each file that has MPI function calls.

MPI_COMM_WORLD is an MPI communicator, signifying that the command is taking place within the full set of nodes. Simply specify MPI_COMM_WORLD where needed and forget about using communicators unless you are dealing with multiple levels of parallelism.

nprocs is an integer, the number of processors or nodes.

myproc is an integer, the relative node number for each node.

ierror is the return value for each f77 call, providing a way to signal the programmer if an error has occurred (usually ignored). In C, ierror is the return value of the function itself.

Basic sends and receives

The basic 2-sided communication involves a send function on the source node and a receive function on the destination node. There are several types of send and receive functions, each useful for different situations. Unfortunately, the MPI standard does not exactly define how each must be implemented, so you can get varying results depending on the MPI implementation, the architecture you are running on, or the size of the data being transferred. This is clearly a bad situation, but we must make the best of it. To write a portable code, it is therefore necessary to program for the worst case for each function.

f77
CALL MPI_Send(buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, ierror)
CALL MPI_Recv(buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &status, ierror)
C
MPI_Send(&buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD);
MPI_Recv(&buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &status);

The syntax of the basic MPI_Send() and MPI_Recv() functions are shown above. buf points to the first element of the data to be sent, or where the data to be received is to be put. In C, this is a pointer, in Fortran it is just the variable or first array element to be sent or received. count is the number of elements of type MPI_Datatype to be transferred to the destination dest from source.
Again just use MPI_COMM_WORLD and ignore ierror in most cases.

MPI_Datatypes for C MPI_Datatypes for f77
MPI_DOUBLE

MPI_FLOAT
MPI_INT
MPI_BYTES MPI_CHAR
MPI_DOUBLE_PRECISION
MPI_DOUBLE_COMPLEX
MPI_REAL
MPI_INTEGER
MPI_CHARACTER

The MPI_Send() function sends a block of data to the destination node dest. The block starts at the address of buf and with the size calculated from count and the size of the MPI_Datatype specified, which may vary between machines. This function will block at least until the data has been sent or copied to a send buffer, so that the data can not be trampled before it is sent. Depending on the implementation, and possibly the message size, the MPI_Send() may also block until the MPI_Recv() process has started on the destination node. This is an unfortunate situation, since it will force synchronization between the two nodes in some circumstances and not in others. For complete portability, you should therefore program for the worst case and assume that synchronization may occur.

The standard MPI_Send() function may choose whether to use a send buffer for performance reasons, or to avoid blocking. Most of the time, for small messages, this is done automatically. However, for large messages the buffers may not be large enough, and the implementation may choke on them. In this case, you will need to manually allocate a larger buffer and use buffered versions of the send and receive functions. If this need arises, look at the manpages for MPI_Bsend() and MPI_Brecv() in places like the MPICH documentations.

An MPI_Recv() function will block until a message is received with a matching size, message tag, and from the specified source. A source of MPI_ANYSOURCE can be used to accept a message from any source that has the correct size and message tag. A message tag of MPI_ANYTAG is a wildcard that matches any tag.

The message tag provides a secondary method for choosing between messages that come from the same source, allowing out-of-order reception if desired. While this may be convenient, out-of-order reception requires extra buffering which is inefficient, so use this only in areas that are not time-critical.

Using the MPI_ANYSOURCE wildcard may also be convenient at times, but it can also lead to inefficiency. In time-critical areas, my advice is simply to always specify the the source and destination, and not to rely on message tags (set them to 0). This provides the message-passing system with all the information needed to streamline the transfer as much as possible.

Asynchronous communications

It is at times beneficial to use asynchronous communications that initiate the send or receive but do not block on their completion. This provides greater flexibility, and can produce more efficient code. As shown below, asynchronous communications have the same syntax as their blocking counterparts, except that their name includes an 'i' which stands for 'immediate', and they each have an associated message id that is used in conjunction with the MPI_Wait() function to force a block on completion.

f77
CALL MPI_Isend(buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, msg_id, ierror)
CALL MPI_Irecv(buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, msg_id, ierror)
CALL MPI_Wait( msg_id, status, ierror)
C
MPI_Isend(&buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, &msg_id);
MPI_Irecv(&buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &msg_id);
MPI_Wait( &msg_id, &status);

Asynchronous communications have two primary purposes, both geared toward achieving higher performance. They can be used to try to overlap communications with computations when the system has this capability. For example, a source node would initiate an MPI_Isend() as soon as the data was available, then continue performing computations while letting a message coprocessor or NIC handle the data transfer hopefully without affecting the main processor greatly. The source node would block with the MPI_Wait() only when it needed to reuse the memory space being transferred, giving the communication hardware the best chance of transferring the data before the block occurs. The same is true for the destination node, which can post an MPI_Irecv() as soon as possible, giving the communication hardware the best chance to handle the communications behind the scene before the destination node blocks with an MPI_Wait().

Asynchronous communications are also valuable for bypassing communication buffers, and therefore increasing the effective throughput. If a receive is posted before the matching send is initiated, the data will not be buffered on the source node and can flow directly into the application memory on the destination node. In order to guarantee that the receive is preposted, some sort of local synchronization is needed. This can be accomplished with a simple handshaking as shown below, where the destination node preposts the receive then sends a dummy token to the source node to signal that it is ready to receive the data. This essentially doubles the effective latency, but for large messages the latency will not be important while bypassing the send buffer can improve the effective throughput enormously and also cut down the memory usage.

                       node 0 (source node)
     INT token, n=1000;
     DOUBLE array[1000];

        /* Block on the destination's go signal to avoid buffering */

     MPI_Recv( &token, 1, MPI_INT,    1, 0, MPI_COMM_WORLD, NULL);
     MPI_Send(  array, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

                       node 1 (destination node)
     INT token, n=1000, msg_id???;
     MPI_STATUS_TYPE status???;
     DOUBLE array[1000];

        /* Prepost the receive and send the go signal to the source node */

     MPI_Irecv( array, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &msg_id);
     MPI_Send( &token, 1, MPI_INT,    0, 0, MPI_COMM_WORLD);

  /* ... perform computations ... */

     MPI_Wait( &msg_id, &status);

Performance and portability

While MPI has many variants of the send and receive functions, the methods demonstrated above provide both the best performance and portability. Basic MPI_Send() and MPI_Recv() functions should be used when blocking is necessary, and the messages are short. Otherwise, the basic send may behave differently depending on the implementation and message size.

For long messages, preposting receives as demonstrated above guarantees the minimum amount of buffering. Using asynchronous communications also allows communications to occur while the computations continue on most systems. Both of these are very important for achieving the maximum throughput for a long message. The only drawback is that the message latency is doubled, but for long messages this is irrelevant.

Using just a few of the commands, along with good programming practices, provides the best performance in a portable manner. If you think you need more complex functionality, check out the full documentations for MPI.

MPICH - click on 'manual pages' for full MPI documentation

Global functions

There are also many useful global communications that add to the point-to-point communications above. These functions operate across all nodes, and most result in a rough synchronization upon completion.

f77
CALL MPI_Bcast( array, count, MPI_Datatype, root_node, MPI_COMM_WORLD, ierror)

CALL MPI_Allreduce( array, result, count, MPI_Datatype, MPI_Op, MPI_COMM_WORLD, ierror)

CALL MPI_Barrier(MPI_COMM_WORLD, ierror)
C
MPI_Bcast(&array, count, MPI_Datatype, root_node, MPI_COMM_WORLD);

MPI_Allreduce( &array, result, count, MPI_Datatype, MPI_Op, MPI_COMM_WORLD);

MPI_Barrier(MPI_COMM_WORLD);

This table illustrates some of the most common functions. The MPI_Bcast() function broadcasts data from one node to all others. One common use for this is for global I/O functions, where node 0 would open a file, read in the data, and broadcast it to all other nodes.

The MPI_Allreduce() function performs a global reduction of the data across the nodes. Typical uses are to do a summation, maximum, or minimum of a variable or each element of an array. These correspond to the MPI_SUM, MPI_MAX, and MPI_MIN MPI_Op operator types.

Memory overflow problems may arise when operating on very large arrays. Since these functions use binary exchanges to process the data, a node that comes in late may have log2(nprocs) copies of other arrays waiting in receive buffers. Good implementations will avoid this by forcing synchronization at the beginning of the operation, or by pipelining and synchronizing the operations throughout. If you run into any memory problems like this, try putting an MPI_Barrier() function before each large global operation to force synchronization.

Timing

All codes should have timing functions embedded. These will help the user to understand where the time is being spent, pointing out where to concentrate optimization efforts, and ultimately lead to more efficient use of computer time.

Below are the basic MPI timing routines. These return an absolute time in seconds since the program started. Only the relative time should be used, so bracket the area of code of interest and take the difference in time as illustrated below. After summing the time contributions of all compute and communication intensive sections of the code, simply dump an analysis of the timing results at the end of each run.

    DOUBLE PRECISION t0, t_code
    t_code = 0.0d0

    t0 = MPI_Wtime()

c   ... bracket the CPU or communication intensive code of interest ...

    t_code = t_code + ( t0 - MPI_Wtime() )
   

    double t0, t_code=0, MPI_Wtime();
    t0 = MPI_Wtime();

/*  ... bracket the CPU or communication intensive code of interest ...*/

    t_code += ( t0 - MPI_Wtime() );

Running MPI codes

If you MPI implementation has been set up correctly, then compiling and running your code should be fairly easy. Unfortunately, this is not always the case. The MPI standard does not specify a default location, library name, or subdirectory structure.

Vendor-specific MPI implementations will often have the MPI libraries and include files in standard places where the compilers will automatically find them. In this case, you do not need to manually specify a path to the MPI library or include files at compilation time.

Portable implementation such as MPICH may have their own compiler wrappers set up to automatically link the correct MPI libraries. Simply use the mpicc and mpif77 compilers instead of the native cc and f77 compilers.

If these were present or not set up correctly, you will need to track down the MPI library yourself and provide the correct path and name at link time. Try using which mpirun to find where the MPI stuff is located, then dig around any 'lib' directory you can find. The names and paths can vary, but are usually something like libmpi.a or libmpich.a, with the mpi.h and mpif.h include files being in the ./inc subdirectory.

For complete documentation on the mpirun command, do a 'man mpirun'. Its usage will vary depending on the type of parallel system you are running on. Below are some examples for a Unix cluster running LAM/MPI and MPICH. The -O parameter for LAM/MPI is for homogeneous systems, providing much better performance.

                      Unix cluster running LAM/MPI

    mpif77 test.f -o test.x
    mpicc test.c -o test.x

    cat lamhosts
       wakka.lo1
       rikku.lo1

    lamboot -v -b lamhosts

    mpirun -np 2 -O test.x

                      Unix cluster running MPICH

    mpif77 test.f -o test.x
    mpicc test.c -o test.x

    mpirun -np 4 -p4pg ???
    mpirun -np 4 -machinefile test.x


Links to more advanced topics
MPICH - full MPI documentation

Ames Laboratory | Condensed Matter Physics | Disclaimer | ISU Physics