Parallelization Strategy


The easiest way to parallelize a program is to find someone
who has already done it and steal their code.

1) Compile and run the scalar code

2) Add initialization, clean up, and timing routines

3) Track the memory utilization and major I/O

4) Determine a parallelization strategy

5) Add in communication commands

6) Tune the performance

If you design the code properly, it should run on an arbitrary number of processors, including a single processor. In most cases the parallel code should run as well on one processor as the scalar code. This allows you to maintain a single code that will run on both scalar and multiprocessor systems.

Distribute memory

For most codes, you will want to attack larger problems when you have access to the more processing power that mutliprocessor machines offer. Therefore, the memory size usually scales along with the CPU power.

The parallelization strategy is often driven by the need to spread the memory requirements across the nodes as much as by the need to distribute the CPU load. Determine where the memory is being used in the code, and whether that data can be replicated on each node or if it must be distributed across the nodes.

Load balance

You obviously want to balance the CPU demands so that each node is doing approximately the same amount of work. This is usually not a problem, as it flows from the natural division of the data as determined by the memory, etc. In some cases, such as solving problems on irregular meshes, more effort may be necessary to insure proper load balance. In cases where systems may fluctuate greatly, it may even be necessary to balance the load dynamically by shifting data between steps.

At times, it may be necessary to change an algorithm completely to find one that is more naturally parallel. Before starting, be sure to determine whether this will increase the overall operation count significantly. You don't want a more naturally parallel algorithm that is only beneficial when run on hundreds of nodes.

It is also tempting to try a master/slave approach where there is one master program that divides up the work and handles all I/O for the slave nodes. In general, this is a poor approach since two codes will need to be maintained, launched at startup, etc. It also can either waste the machine that the master program is running on, or overload it if a slave process is operating on the same node.

Local and global input/output

Each input statement must be analyzed to determine whether it should be handled locally or globally. If the input variable is to be the same on all nodes, then a global input is method is needed. One common method is to have node 0 handle all global file access. Node 0 opens each input file, reads the contents, then broadcasts the data to each node using MPI_Bcast() functions. This is cumbersome, but efficient since the data is being read from disk only once. From a programming perspective, it is easier to have each node open the same file and read in its own data, but this requires nprocs accesses to the same data and is very inefficient. Unfortunately, MPI does not provide decent global file access functions that are easy to use while maintaining efficiency.

Global output is easy. If the same nodes all have the same data, only node 0 actually needs to open and write the data to the file.

Local output, where each node writes different data to a common file, can be handled in several ways. Probably the easiest is to simply have all nodes send their data to node 0 to have it write the data out to a file. Local input can be handled in the same manner, where node 0 reads all data in and sends it to the appropriate nodes. Other methods require passing around file handles, which can provide both convenience and efficiency. However, there is nothing like this in the MPI standard yet.

Local input/output can also be done to a local disk. For systems that support this, each node must simply open its own file on local scratch space and write/read the data to it.

The network topology

The network topologies vary greatly between MPP systems, from fully connected to 2D and 3D meshes to fat trees. Because of this diversity in topologies, application programmers commonly either optimize their code for the topology that they will run on the most, or put no effort into this at all. The result of this is that most codes run well on small numbers of processors, but have difficulty achieving decent performance above 32-64 nodes. The reason is that any global communications will invariably result in contention for some common communication channels.

The approach that I recommend is to optimize your code for a virtual topology that can be mapped onto many actually topologies. In this way, you assume fewer actual connections than there may be, but you know where they are and can therefore minimize the bottlenecks during global communications.

For example, there are many codes that simulate 3D spaces. If you choose a 3D mesh topology, the Cray T3E will support this but many others won't. The 3D simulation space can be divided into columns of a 2D mesh, which can be mapped onto most network topologies. This is a portable method of optimizing for the network topology which helps the scaling of any code greatly.


Links to more advanced topics


Ames Laboratory | Condensed Matter Physics | Disclaimer | ISU Physics