Local and global input/output
Each input statement must be analyzed to determine whether it
should be handled locally or globally. If the input variable is
to be the same on all nodes, then a global input is method is needed.
One common method is to have node 0 handle all global file access.
Node 0 opens each input file, reads the contents, then broadcasts
the data to each node using MPI_Bcast() functions. This is
cumbersome, but efficient since the data is being read from disk
only once. From a programming perspective, it is easier to have
each node open the same file and read in its own data, but this
requires nprocs accesses to the same data and is very
inefficient. Unfortunately, MPI does not provide decent global
file access functions that are easy to use while maintaining
efficiency.
Global output is easy. If the same nodes all have
the same data, only node 0 actually needs to open and write the data
to the file.
Local output, where each node writes different data to a common
file, can be handled in several ways. Probably the easiest is to
simply have all nodes send their data to node 0 to have it write
the data out to a file. Local input can be handled in the same
manner, where node 0 reads all data in and sends it to the appropriate
nodes. Other methods require passing around file handles, which
can provide both convenience and efficiency. However, there is
nothing like this in the MPI standard yet.
Local input/output can also be done to a local disk. For systems that
support this, each node must simply open its own file on local scratch
space and write/read the data to it.
The network topology
The network topologies vary greatly between MPP systems, from
fully connected to 2D and 3D meshes to fat trees. Because of this
diversity in topologies, application programmers commonly either
optimize their code for the topology that they will run on the
most, or put no effort into this at all. The result of this is
that most codes run well on small numbers of processors, but have
difficulty achieving decent performance above 32-64 nodes.
The reason is that any global communications will invariably result
in contention for some common communication channels.
The approach that I recommend is to optimize your code for a
virtual topology that can be mapped onto many actually topologies.
In this way, you assume fewer actual connections than there may be,
but you know where they are and can therefore minimize the bottlenecks
during global communications.
For example, there are many codes that simulate 3D spaces. If you
choose a 3D mesh topology, the Cray T3E will support this but many others
won't. The 3D simulation space can be divided into columns of a 2D mesh,
which can be mapped onto most network topologies. This is a portable
method of optimizing for the network topology which helps the scaling
of any code greatly.
Links to more advanced topics
Ames Laboratory |
Condensed Matter Physics |
Disclaimer |
ISU Physics