This section will cover the very basics of using MPI for
two-sided communications. The MPI standard defines a much larger
set of functions than is presented here, but most are not needed
unless you are writing very advanced programs. I will also
present these functions in a manner that promotes good programming
practices that will produce efficient code for any architecture.
Initialization and clean up
A typical MPI run involves many processors, each running
the same program but operating on a different part of the
data. It is possible to use a master/slave approach as well,
where there is one master node that divies out the workload
and collects the results. While this approach sounds tempting,
it is more difficult to use in general because you need to maintain
two separate programs, and also launch the master and slaves separately.
I strongly urge you not to take this approach.
The table below shows the f77 and C syntax for the initialization
and cleanup functions needed in every MPI program. The MPI_Init()
function simply initializes the MPI system.
The MPI_Comm_size() function returns the number of nodes that
the code is running on, as specified in the mpirun command.
The MPI_Comm_rank() function returns the relative node number,
between 0 and the number of nodes minus 1, assigned to each of the
nodes. This rank is then used in the rest of the program to
determine which data each node will work with, etc.
f77
INCLUDE "mpif.h"
CALL MPI_Init( MPI_COMM_WORLD, ierror )
CALL MPI_Comm_size( MPI_COMM_WORLD, nprocs, ierror)
CALL MPI_Comm_rank( MPI_COMM_WORLD, myproc, ierror)
|
C
#include "mpi.h"
MPI_Init( MPI_COMM_WORLD, &argc, &argv );
MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
MPI_Comm_rank( MPI_COMM_WORLD, &myproc );
|
The mpif.h file needs to be included at the beginning of
each f77 subroutine or function that uses MPI commands.
For C, the mpi.f file needs to be included at the beginning
of each file that has MPI function calls.
MPI_COMM_WORLD is an MPI communicator, signifying that the
command is taking place within the full set of nodes.
Simply specify MPI_COMM_WORLD where needed and forget about using
communicators unless you are dealing with multiple levels of parallelism.
nprocs is an integer, the number of processors or nodes.
myproc is an integer, the relative node number for each node.
ierror is the return value for each f77 call, providing a way
to signal the programmer if an error has occurred (usually ignored).
In C, ierror is the return value of the function itself.
Basic sends and receives
The basic 2-sided communication involves a send function on
the source node and a receive function on the destination node.
There are several types of send and receive functions, each useful
for different situations. Unfortunately, the MPI standard does not
exactly define how each must be implemented, so you can get varying
results depending on the MPI implementation, the architecture you
are running on, or the size of the data being transferred.
This is clearly a bad situation, but we must make the best of it.
To write a portable code, it is therefore necessary to program for the
worst case for each function.
f77
CALL MPI_Send(buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, ierror)
CALL MPI_Recv(buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &status, ierror)
|
C
MPI_Send(&buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD);
MPI_Recv(&buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &status);
|
The syntax of the basic MPI_Send() and MPI_Recv() functions
are shown above. buf points to the first element of the
data to be sent, or where the data to be received is to be put. In C, this
is a pointer, in Fortran it is
just the variable or first array element to be sent or received.
count is the number of elements of type
MPI_Datatype to be transferred to the destination
dest from source.
Again just use MPI_COMM_WORLD and ignore ierror in most cases.
| MPI_Datatypes for C |
MPI_Datatypes for f77 |
MPI_DOUBLE
MPI_FLOAT
MPI_INT
MPI_BYTES MPI_CHAR
|
MPI_DOUBLE_PRECISION
MPI_DOUBLE_COMPLEX
MPI_REAL
MPI_INTEGER
MPI_CHARACTER
|
The MPI_Send() function sends a block of data to the
destination node dest. The block starts at the address
of buf and with the size calculated from count
and the size of the MPI_Datatype specified, which may vary
between machines.
This function will block at least until the data has been sent or copied
to a send buffer, so that the data can not be trampled before it is sent.
Depending on the implementation, and possibly the message size, the
MPI_Send() may also block until the MPI_Recv() process
has started on the destination node. This is an unfortunate situation,
since it will force synchronization between the two nodes
in some circumstances and not in others. For complete portability,
you should therefore program for the worst case and assume that
synchronization may occur.
The standard MPI_Send() function may choose whether to use
a send buffer for performance reasons, or to avoid blocking.
Most of the time, for small messages, this is done automatically.
However, for large messages the buffers may not be large enough,
and the implementation may choke on them. In this case, you will
need to manually allocate a larger buffer and use buffered versions
of the send and receive functions. If this need arises, look at the
manpages for MPI_Bsend() and MPI_Brecv() in places like
the MPICH documentations.
An MPI_Recv() function will block until a message is received
with a matching size, message tag, and from the specified source.
A source of MPI_ANYSOURCE can be used to accept a message from
any source that has the correct size and message tag.
A message tag of MPI_ANYTAG is a wildcard that
matches any tag.
The message tag provides a secondary method for choosing between
messages that come from the same source, allowing out-of-order reception
if desired. While this may be convenient, out-of-order reception requires
extra buffering which is inefficient, so use this only in areas that
are not time-critical.
Using the MPI_ANYSOURCE wildcard may also be convenient at times,
but it can also lead to inefficiency. In time-critical areas, my advice is
simply to always specify the the source and destination, and not to rely on
message tags (set them to 0). This provides the message-passing system
with all the information needed to streamline the transfer as much as possible.
Asynchronous communications
It is at times beneficial to use asynchronous communications that
initiate the send or receive but do not block on their completion.
This provides greater flexibility, and can produce more efficient
code. As shown below, asynchronous communications have the same syntax as
their blocking counterparts, except that their name includes an 'i' which
stands for 'immediate', and they each have an associated message id
that is used in conjunction with the MPI_Wait() function to
force a block on completion.
f77
CALL MPI_Isend(buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, msg_id, ierror)
CALL MPI_Irecv(buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, msg_id, ierror)
CALL MPI_Wait( msg_id, status, ierror)
|
C
MPI_Isend(&buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, &msg_id);
MPI_Irecv(&buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &msg_id);
MPI_Wait( &msg_id, &status);
|
Asynchronous communications have two primary purposes, both
geared toward achieving higher performance. They can be used to
try to overlap communications with computations when the system
has this capability. For example, a source node would initiate an
MPI_Isend() as soon as the data was available, then continue
performing computations while letting a message coprocessor
or NIC handle the data transfer hopefully without affecting the
main processor greatly. The source node would block with the
MPI_Wait() only when it needed to reuse the memory space
being transferred, giving the communication hardware the best
chance of transferring the data before the block occurs.
The same is true for the destination node, which can post an
MPI_Irecv() as soon as possible, giving the communication
hardware the best chance to handle the communications behind the
scene before the destination node blocks with an MPI_Wait().
Asynchronous communications are also valuable for bypassing
communication buffers, and therefore increasing the
effective throughput. If a receive is posted before the matching
send is initiated, the data will not be buffered on the source node
and can flow directly into the application memory on the destination
node. In order to guarantee that the receive is preposted, some sort
of local synchronization is needed. This can be accomplished with
a simple handshaking as shown below, where the destination node
preposts the receive then sends a dummy token to the source node
to signal that it is ready to receive the data. This essentially
doubles the effective latency, but for large messages the latency
will not be important while bypassing the send buffer can improve
the effective throughput enormously and also cut down the memory
usage.
node 0 (source node)
INT token, n=1000;
DOUBLE array[1000];
/* Block on the destination's go signal to avoid buffering */
MPI_Recv( &token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, NULL);
MPI_Send( array, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
node 1 (destination node)
INT token, n=1000, msg_id???;
MPI_STATUS_TYPE status???;
DOUBLE array[1000];
/* Prepost the receive and send the go signal to the source node */
MPI_Irecv( array, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &msg_id);
MPI_Send( &token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
/* ... perform computations ... */
MPI_Wait( &msg_id, &status);
Performance and portability
While MPI has many variants of the send and receive functions,
the methods demonstrated above provide both the best performance
and portability. Basic MPI_Send() and MPI_Recv()
functions should be used when blocking is necessary, and the
messages are short. Otherwise, the basic send may behave differently
depending on the implementation and message size.
For long messages, preposting receives as demonstrated above
guarantees the minimum amount of buffering. Using asynchronous
communications also allows communications to occur while the computations
continue on most systems. Both of these are very important for
achieving the maximum throughput for a long message. The only
drawback is that the message latency is doubled, but for long
messages this is irrelevant.
Using just a few of the commands, along with good programming
practices, provides the best performance in a portable manner.
If you think you need more complex functionality, check out the
full documentations for MPI.
MPICH - click on 'manual pages'
for full MPI documentation
Global functions
There are also many useful global communications that add to
the point-to-point communications above. These functions operate
across all nodes, and most result in a rough synchronization upon
completion.
f77
CALL MPI_Bcast( array, count, MPI_Datatype, root_node, MPI_COMM_WORLD, ierror)
CALL MPI_Allreduce( array, result, count, MPI_Datatype, MPI_Op, MPI_COMM_WORLD, ierror)
CALL MPI_Barrier(MPI_COMM_WORLD, ierror)
|
C
MPI_Bcast(&array, count, MPI_Datatype, root_node, MPI_COMM_WORLD);
MPI_Allreduce( &array, result, count, MPI_Datatype, MPI_Op, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
|
This table illustrates some of the most common functions.
The MPI_Bcast() function broadcasts data from one node to
all others. One common use for this is for global I/O functions,
where node 0 would open a file, read in the data, and broadcast
it to all other nodes.
The MPI_Allreduce() function performs a global reduction
of the data across the nodes.
Typical uses are to do a summation, maximum, or minimum
of a variable or each element of an array.
These correspond to the MPI_SUM, MPI_MAX,
and MPI_MIN MPI_Op operator types.
Memory overflow problems may arise when operating on very large arrays.
Since these functions use binary exchanges to process the data, a node
that comes in late may have log2(nprocs) copies of other
arrays waiting in receive buffers.
Good implementations will avoid this by forcing synchronization at the
beginning of the operation, or by pipelining and synchronizing the
operations throughout. If you run into any memory problems like this,
try putting an MPI_Barrier() function before each large global
operation to force synchronization.
Timing
All codes should have timing functions embedded. These will help
the user to understand where the time is being spent, pointing out
where to concentrate optimization efforts, and ultimately lead to
more efficient use of computer time.
Below are the basic MPI timing routines. These return an absolute
time in seconds since the program started. Only the relative time should
be used, so bracket the area of code of interest and take the difference
in time as illustrated below. After summing the time contributions
of all compute and communication intensive sections of the code,
simply dump an analysis of the timing results at the end of each
run.
DOUBLE PRECISION t0, t_code
t_code = 0.0d0
t0 = MPI_Wtime()
c ... bracket the CPU or communication intensive code of interest ...
t_code = t_code + ( t0 - MPI_Wtime() )
double t0, t_code=0, MPI_Wtime();
t0 = MPI_Wtime();
/* ... bracket the CPU or communication intensive code of interest ...*/
t_code += ( t0 - MPI_Wtime() );
Running MPI codes
If you MPI implementation has been set up correctly, then
compiling and running your code should be fairly easy.
Unfortunately, this is not always the case.
The MPI standard does not specify a default location,
library name, or subdirectory structure.
Vendor-specific MPI implementations will often have the
MPI libraries and include files in standard places where the compilers
will automatically find them. In this case, you do not need
to manually specify a path to the MPI library or include files
at compilation time.
Portable implementation such as MPICH may have their own
compiler wrappers set up to automatically link the correct
MPI libraries. Simply use the mpicc and mpif77 compilers
instead of the native cc and f77 compilers.
If these were
present or not set up correctly, you will need to track down the MPI
library yourself and provide the correct path and name at
link time. Try using which mpirun to find where the
MPI stuff is located, then dig around any 'lib' directory
you can find. The names and paths can vary, but are usually
something like libmpi.a or libmpich.a, with the mpi.h and
mpif.h include files being in the ./inc subdirectory.
For complete documentation on the mpirun command,
do a 'man mpirun'. Its usage will vary depending on the
type of parallel system you are running on. Below are some examples
for a Unix cluster running LAM/MPI and MPICH. The -O parameter for
LAM/MPI is for homogeneous systems, providing much better performance.
Unix cluster running LAM/MPI
mpif77 test.f -o test.x
mpicc test.c -o test.x
cat lamhosts
wakka.lo1
rikku.lo1
lamboot -v -b lamhosts
mpirun -np 2 -O test.x
Unix cluster running MPICH
mpif77 test.f -o test.x
mpicc test.c -o test.x
mpirun -np 4 -p4pg ???
mpirun -np 4 -machinefile test.x
Links to more advanced topics
MPICH - full MPI documentation
Ames Laboratory |
Condensed Matter Physics |
Disclaimer |
ISU Physics