1994 Gordon Bell Prize Winners

Allan H. Karp
Hewlett-Packard Laboratories

Michael Heath
University of Illinois

Don Heller
Ames Laboratory

Horst Simon
Silicon Graphics

The Gordon Bell Prize significant achievements in the application of parallel processing to scientific and engineering problems. In 1994, finalists were names for work in two categories:

No entry was received in the third category of compiler-generated speedup, which measures how well compiler writers are doing at easing the programming of parallel processors.

Gordon Bell, an independent consultant in Los Altos, California, is sponsoring $2,000 in prizes each year for 10 years to promote practical parallel-processing research. He was so pleased to see the performance measured in tenths of a teraflop (trillions of floating-point operations per second) that he doubled the size of the two performance awards. This is the seventh year of the prize, which Computer administers. The winners were announced November 17 at the Supercomputing '94 meeting in Washington, D.C.

RESULTS

The performance winner modeled problems in structural mechanics; the price/performance winner simulated quantum mechanical interactions. Other interesting--and impressive--entries simulated isotropic turbulence, modeled vapor deposition, computed the internal mobility of an organic molecule, and ran nonfloating-point applications, such as gene sequencing and image processing.

Performance Winner. David Womble, David Greenberg, Stephen Wheat, and Robert Benner of Sandia National Laboratories, Marc Ingber of the University of New Mexico, and Greg Henry and Satya Gupta of Intel Scalable Systems Division modeled problems in structural mechanics using the boundary element method. They were awarded a $2,000 prize for performance for solving several difficult problems at more than 0.14 Tflops on a 1,904-node Intel Paragon.

Price/performance winner. Stefan Goedecker of the Cornell Theory Center and Luciano Colombo of the Università di Milano will divide $500 for their work demonstrating a price/performance of 3 Gflops per million dollars. They simulated the quantum mechanical interactions among 216 silicon atoms on a cluster of eight Hewlettt-Packard workstations at a price/performance that is nearly twice that of the previous best on a general -purpose system.

Honorable mention. An honorable mention award of $1,000 for performance was presented to a large group from the Numerical Wind Tunnel (NWT) project in Japan. They achieved nearly 0.12 Tflops on their 140-processor system simulating isotropic turbulence. They also demonstrated impressive sustained performance with other computational fluid dynamics codes using fewer processors. The award was shared by H. Miyoshi, Foundation for Promotion of Material Science and Technology of Japan; M. Fukuda, T. Iwamiya, T. Nakamura, M. Tuchiya, M. Yoshida, K. Yamamoto, Y. Yamamoto, S. Ogawa, Y. Matsuo, and T. Yamane, National Aerospace Laboratory; M. Takamura, M. Ikeda, S. Okada, Y. Sakamoto, T. Kitamura, and H. Hatama, Fujitsu Limited; and M. Kishimoto, Fujitsu Laboratories Limited.

Finalists. John Shadid, Scott Hutchinson, Harry Moffat, Bruce Hendrickson and Robert Leland of Sandia National Laboratories, along with Garry Hennigan of New Mexico State University, were also named finalists. They submitted a model of a chemically reacting flow running at over 65 Gflops on a 1,904-processor Intel Paragon. They modeled vapor deposition with a three-dimensional model of 19 chemical species in a reactor discretized into 176,000 finite element nodes. Most of the time was spent solving a sparse linear system with over 4 million unknowns.

Impressive entries. There were three other entries, and the decision not to put them on the list of finalists was a difficult one. Yu Hu and Lennart Johnsson of Harvard followed the motion of 60 million self-gravitating particles in only 2.5 microseconds per particle, per time step on a 512-node CM-5. Peter Arbenz and Hans Peter Luthi of the Swiss Federal Institute of Technology used two 16-processor Cray C90s in two cities linked over the Internet to compute the internal mobility of an organic molecule consisting of 63 atoms. Their measured speed of 16 Gflops is not as impressive as the speedup they achieved of 29.5 on 32 processors.

A completely different entry was submitted by Paul Keltcher of Penn State. He described the Micro Grain Array Processor, a custom-built machine capable of running such nonfloating-point applications as cellular automata, gene sequencing, and image processing at rates up to 0.8 Tops (trillions of operations) per second.

It is clear from this year's competition that the large machines are not yet dinosaurs nearing extinction. Although the price/performance category was won by a cluster of workstations, the large machines came within a factor of two by this measure. On the other hand, the large machines ran up to 175 times faster in gigaflops than the cluster. While a factor of two in price/performance is interesting, finishing a job in an hour instead of a week is certain to get some people's attention.

PERFORMANCE WINNER

An airplane made up of a novel, composite material crashes in Switzerland; the new speakers for your stereo sound great but are smaller than your dog; you get a speeding ticket even though you slowed down as soon as your radar detector went off. Problems like these wee solved by this year's winning entry in the performance category submitted by David Womble, David Greenberg, Stephen Wheat, and Robert Benner of Sandia National Laboratories, Marc Ingber of the University of New Mexico, and Greg Henry and Satya Gupta of Intel Scalable Systems Division.

Same numerical method. While the three problems submitted are unrelated, they were all solved using the same numerical method. A key component of these problems, and of a large number of other scientific and engineering applications, is the solution of a partial differential equation (PDE). We are used to seeing finite difference and finite element solutions to these PDEs. There is also a less familiar approach that has certain advantages, the boundary element method (BEM), which was used in this entry.

The basic idea of the BEM is that, for many problems, knowing the solution on an arbitrary surface allows us to compute the solution at any point in space. It is based on a simple idea; any changes in the interior of a domain can be felt outside only by their effect on the surface. For example, the energy produced inside the sun is exactly balanced by the energy flowing through its surface.

Let's say we want to understand how radar reflects off a complex surface such as an F-15 or a Porsche. We can use this knowledge to design a radar system that improves airport safety or makes policy radar more sensitive. The physics involves computing how the surface responds to the incoming radar beam. The desired solution is the intensity of the reflected radar back at the detector.

Finite element method. First, look at solving this problem with the more familiar FEM. We discretize the space surrounding the object, say an airplane, into small volumes and choose a set of basis functions, an approximation of the true solution over each finite element with coefficients to be determined. These basis functions are often a linear interpolation among the vertices of the element. The physical properties of the element tell us how it responds to outside influences. We use a special type of finite element far from the airplane to approximate the boundary condition that applies at infinity.

The solution of the equation constraints the values of the solution in each finite element. For example, we expect the value of the solution on the interface of two elements to be the same no matter which of the two elements is used for the computation. If we assign row and column numbers to each finite element, we construct a sparse matrix that will have nonzero elements only in those few columns corresponding to elements sharing a boundary with the one element represented by the row index. In this way, the matrix embodies the "continuity condition." Solving the linear system consisting of this matrix with a righthand side representing the boundary conditions gives the remaining unknowns. It is then a simple matter to use the model to predict the strength of the reflected signal.

One problem is that the finite element matrices are quite large because a volume is being discretized. Another problem is that solving linear systems with sparse matrices is more complicated than solving those with dense matrices. Of course, the sparse matrices require orders of magnitude less computation for problems of a given size. A method involving dense matrices is competitive only if the matrix is much smaller, as it is in the boundary element method.

Boundary element method. To understand how to solve a problem with the boundary element method, we need to look at the problem a little differently. Say we want to know the response of a single point to the incoming radar. In many cases there is a known formula that solves the problem--the fundamental solution. Next we look at the solution with two points, then three, and so on. Eventually, we will have covered the surface of the airplane with points.

The only thing we don't know after this boundary discretization is the relative strengths of all the point sources. It is possible to write down the equation for these strengths as a linear system with a dense matrix and the right-hand side derived from the boundary conditions. The matrix is dense because the solution due to any point reflector exists at every point is space, even at the other points on the surface. However, the size of the matrix is the number of points used to represent the surface, not the volume of space surrounding the airplane.

The method just described is not the BEM; it is Trefftz method. The difficulty with Trefftz method is that the computed solution does not always converge to the exact result as the number of points increases. A slight change to the formulation, though, makes it possible to guarantee convergence for many problems. Instead of approximating the surface as a collection of points, the BEM divides the surface into patches. The matrix coefficients become integrals of the derivative of the fundamental solution over the patch. In most cases these integrals must be evaluated numerically.

Limits to the method. A key part of the BEM is the need for a fundamental solution, one that is available only for a limited set of problems. Fortunately, these are problems of great interest, such as heat flow, electromagnetic response, acoustics, and elastic deformations. Even here, though, there are limits, since fundamental solutions are not often known unless the medium is homogeneous.

In spite of these limits, there are reasons to use the BEM. The size of the linear system is proportional to the surface area of the system being studied, not its volume. The BEM is also simpler when studying problems external to the domain, such as radar scattering off an airplane, since the fundamental solution embodies the boundary condition at infinity, which must be approximated with other methods.

The winning entry presents the solution to three problems. All problems were solved on an Intel Paragon, a distributed-memory parallel processor consisting of 1,904 nodes connected in a rectangular grid. Each node has two Intel I860 microprocessors, one for computing and one for communication. Since each compute processor can sustain up to 50 Mflops on matrix computations, the best performance is purported to be around 95 Gflops. As we'll see, it is possible to do substantially better.

Applications. The first application is the modeling of a plastic rod strengthened with short fibers. An accurate solution for such composite materials depends on modeling hundreds of individual fibers, each represented by a number of patches. The linear system resulting from a boundary element formulation has more than 56,000 unknowns and was solved in less than 20 minutes at a rate of over 0.1 Tflops on 1,888 nodes of an Intel Paragon.

The second application, which also ran at about 0.1 Tflops on 1,904 nodes, involves radar scattering from a fighter plane. Two methods were used to solve the problem, one for low frequencies and one for high.

The third application involves computing the sound waves reflected from a "localized collection of objects." One might infer that we are referring to a school of fish so we can distinguish the signal from that of a submarine. This problem with 38,000 unknowns was solved in 17 minutes at a rate of 0.143 Tflops, more than twice the rate of last year's winner. This result is all the more impressive when compared against the best performance expected, 95 Gflops.

Solutions. The high performance required several innovations and attention to the detailed behavior of the node processors. First of all is the use of SUNMOS (Sandia University of New Mexico Operating System), an operating system developed by Sandia National laboratories and the University of New Mexico, which contributes in several ways. SUNMOS makes it possible to use both I860 processors on a node for computation, whereas the OSF/1 version shipped with the Paragon uses one for computation and one for communication. A second improvement is the smaller size of SUNMOS, which leaves more memory for the application. Third, SUNMOS provides a more efficient communication layer, allowing, for example, broadcaster messages that don't require the sender to wait until the broadcast is complete.

Algorithmic improvements. Algorithmic improvements also helped. Vector-vector operations, which run at 15 Mflops/node, were replaced by special versions of matrix-matrix operations that use the cache of the I860 efficiently and run at 80 Mflops/node. When dealing with complex matrices, the library routines use the Winograd form of complex multiplication that has more additions and fewer multiplications than the standard form. This choice makes better use of the I860 floating-point processor, which performs additions faster than multiplications, without altering the total number of arithmetic operations. These changes make it possible to compute certain kernels at a rate of 115 Mflops/node.

Speeding up the solution of the linear system was not enough. It was also important to improve the computation of the matrix elements. Each matrix element is the result of a Gaussian quadrature over the surface element and requires between 100 and 30,000 floating-point operations, depending on how strongly the element feels the effects of other parts of the surface. Although they account for about one percent of the total number of floating-point operations, they would take so long if not optimized that the overall performance would be reduced dramatically. In the problems submitted, the optimization keeps them to between 3 and 30 percent of the solution time.

PRICE/PERFORMANCE WINNER

Most people don't realize it, but there are relatively simple facts we don't know about some common materials, such as carbon and silicon. For example, we cannot know from laboratory measurements how well carbon conducts heat at the temperatures and pressures found in the cores of hot stars, information important for our understanding of the formation of heavy elements. It is also very difficult to use experimental data to derive general laws describing how defects move in silicon crystals. Stefan Goedecker of Cornell University and Luciano Colombo of the Università de Milano have been using molecular dynamics simulations to study such systems.

Behavior simulation. In their winning entry, Goedecker and Colombo simulate the behavior of a large number of silicon atoms, 216 to be exact. That doesn't sound like so many, you say. Why didn't they use more atoms? After all, there have been papers reporting simulations of 1,000,000 atoms or 8,000,000 stars. The difficulty is computing the force between the atoms.

Say we want to compute the force between two silicon atoms. When they are isolated in space, they are electrically neutral because the 14 electrons symmetrically surround the nucleus and screen its positive charge. When another atom comes close enough, the electrons from one atom repel the electrons in the other. Since the electrons now have a higher probability of being found on the side of the nucleus opposite the other atom, the positive charge of the nucleus is no longer shielded. In fact, viewed from one atom, the other appears to have a positive charge; viewed from the side opposite the other atom, it appears to be negatively charged.

Simple approximation. Such two-particle interactions can be approximated using a force law that depends only on the separation of the two atoms. It is this simple approximation that makes it possible to follow millions of particles for thousands of time steps. While such an approximation is suitable for following the motion of the atoms, it tells us nothing about the distribution of electrons around the nuclei--just what we need to calculate the properties of interest. Also, there are situations in which the actual forces computed from these laws of classical mechanics are quite different from those computed with the more accurate quantum mechanical relations.

A detailed calculation of the force on the atoms that takes account of the electron distribution is quire complex to compute. In the most direct method, at each time step we would have to solve the Schrödinger equation for each electron in each atom, for each possible orbit, for each electron. Even limiting the calculation to the valence electrons (those responsible for chemical interactions) isn't sufficient. Since this approach is totally impractical for systems with more than a few electrons, researchers have made approximations.

The 1990 Bell Prize winner for price/performance made the "coherent potential approximation," which assumes each electron is moving through a uniform medium with a single defect that represents the average behavior of all the defects. The hard part is including the effect of the moving electron on the properties of the defect. The approach is based on density functional theory, which finds the forces between particles by minimizing a particular representation of the energy. This method must either solve a very large system of partial differential equations or minimize a complicated function at each time step.

Tight-binding approximation. All these ab initio methods, those that compute the forces directly from the laws of quantum mechanics, are hundreds of times too slow to solve the problems Goedecker and Colombo are interested in. They chose to use the tight-binding approximation instead. Tight binding is a semi-empirical method, one that depends on parameters computed from more exact solutions of simpler problems. Such methods fall somewhere between the ab initio methods and those that approximate forces as a simple function of the separation of two particles.

In its simplest form, the tight-binding approximation assumes that the electrons are held so close to their respective atoms that their distribution around the nucleus is affected only by their nearest neighbor atoms. The force between any two atoms is assumed to depend only on some power of the distance between them and the rates at which electrons move between different orbits. These parameters, which can come from experiments or ab initio calculations, are fed into the tight-binding model and replace the detailed calculation that represents the behavior of the individual electrons.

Two difficulties. Even with these simplifications, it would take a prohibitive amount of computer time to follow the motions of a few hundred atoms for long enough to produce interesting results. The difficulty is two-fold. In quantum mechanics, the force on a particle is computed from the Hamiltonian, a matrix that embodies conservation of energy. The tight-binding approximation uses the eigenvalues of the Hamiltonian matrix is some of its key parameters. For many-body systems, the number of rows in the Hamiltonian matrix is proportional to the number of particles. Unfortunately, the computer time needed to find all the eigenvalues of a matrix increases as the cube of the number of rows; doubling the number of particles increases the computer time needed by a factor of eight. If this problem isn't bad enough, the standard method for computing eigenvalues is not easily amenable to parallelization.

Two observations. Goedecker and Colombo make two key observations. First of all, they show that there is a matrix related to the Hamiltonian that is diagonal under a particular change of coordinates. In this coordinate system, a key component of the force is known to be the trace of the matrix, that is, the sum of its diagonal elements. It isn't even necessary to make this change of coordinates. The rules of linear algebra can be used to show that the trace of the matrix doesn't vary when the coordinates are changed in this way. Hence, there is no need to diagonalize the matrix; this contribution to the force is the sum of the diagonal elements of the full matrix.

Next, they show how to approximate this matrix with a polynomial that depends only on the largest and smallest eigenvalues of the Hamiltonian matrix and the temperature of the system. They save a lot of computation, since finding these two eigenvalues is much simpler than finding all of them. On the other hand, the polynomial is not a simple one; one model of liquid carbon used 50 terms in the approximation. The good news is that these changes to the algorithm save one order in the scaling; the computer time now increases only as the square of the number of particles.

The next step is to look at the entries in the new matrix. For many interesting systems, such as insulators or metallic liquids, these entries fall off exponentially away from the diagonal. Those terms that are so small that they can't affect the results don't need to be computed. The result is that there is no need to worry about the interaction of atoms sufficiently far apart. In the problem submitted, only the effects of atoms within 8 angstroms of each other were included. The computer time to follow a system is now proportional to the number of atoms.

The winning entry studied liquid silicon at a temperature of 2,000 degrees Kelvin to determine such thermodynamic properties as heat conductivity. The model consists of 216 atoms. The total force due to atoms more than 8 angstroms apart is so small that the computed energy per atom was nearly identical to that computed by more exact methods.

New/old methods compared. Goedecker and Colombo compared their new method running on a cluster of Hewlett-Packard workstations (HP9000/735) connected with a FDDI network to a standard algorithm running on a NEC SX3. The supercomputer running the old method at 1.2 Gflops takes about 5 seconds per time step; the new method on the workstation running at 780 Mflops also takes about 5 seconds per time step. Because the new method does not vectorize well, the SX3 runs the new code at only about 50 Mflops. Since the cluster costs under $250,000, it runs at over 3 Gflops per million dollars.

As is usually the case, achieving such high performance (half of the theoretical peak of the cluster) requires attention to detail. The key part of the algorithm is the multiplication of a large, sparse matrix times a vector. The matrix has 4 rows for each atom and nonzero entries for those atoms within the cut-off distance. Index arrays help avoid any unnecessary computation. Since accessing the main memory on workstations is often a bottleneck, the calculations were arranged to keep the data for a given column of the matrix in the cache memory of the machine.

One optimization required a rather deep understanding of the machine and compiler. The HP architecture has a single operation that does a multiplication and an independent addition. Try as he might, Goedecker could not get the compiler to generate this combined operation; it insisted on using separate addition and multiplication instructions. He finally resorted to fooling the compiler. He added some unnecessary lines in the source code. The compiler then generated the desired compound operations, and he edited the assembly statements corresponding to the added source code.

PERFORMANCE HONORABLE MENTION

Legend leads us to believe that the Wright brothers designed their planes by the seat of their pants and flew their experiments. If one crashed, they put some bandages on the pilot and tried something else another day. In fact, they weren't nearly so foolhardy; they tested their airfoils first in a primitive wind tunnel of their own design. Even so, they were never sure of the craft's flight characteristics until a pilot took it aloft.

This approach worked for a number of years, but before long planes flew high enough and fast enough that pilots couldn't walk away from designers' mistakes. Without better tools, it was difficult to try out novel ideas. By the 1930s, people were using scale models to test designs. Soon thereafter, there were wind tunnels where flight characteristics could be measured. Such things as lift, drag, and stability could be tested before going to the expense of building the aircraft and risking someone's life.

Although new designs can be tried in a wind tunnel, the expense is high. Building a model can cost tens of thousands of dollars and each measurement thousands of dollars more. Furthermore, it takes a long time to build a model, and designers can wait months to get time in one of the big wind tunnels. Even then, it is difficult for engineers to try enough ideas for the kind of studies they like to do. Additionally, engineers would like to know things about the flow that are difficult or impossible to measure. For example, they want to know the distribution of combustion products inside a jet engine.

Simple yet hard. Enter the Numerical Wind Tunnel. The idea is simple--model the flight characteristics in the computer to eliminate (or at least reduce) the number of wind-tunnel tests. The implementation is hard.

Computational fluid dynamics (CFD) is littered with dimensionless quantities. One of the most significant of these is the Reynolds number. Formally, the Reynolds number is a length scale, such as the wing span, divided by the viscosity of the fluid. In practice, a low Reynolds number means that the eddies surrounding the plane aren't much smaller than the plane. A high Reynolds number means that the eddies are very small and there are a large number of them. The more eddies there are surrounding the plane, the finer the mesh needed to model the flow. Real planes flying at cruising speeds have Reynolds numbers well above 10,000,000, requiring hundreds of millions of mesh points to follow the flow. Modeling this system would take hundreds to thousands of hours on even the fastest supercomputers. Modeling this system in a wind tunnel is even more difficult, since most can't reach Reynolds numbers this high.

Since the exact problem is intractable, approximations must be used. One trick is to note that the very small eddies occurring at high Reynolds numbers don't need to be followed in detail. It is possible to get good results by averaging a large number of them. These averages give good estimates for such things as the viscosity of the turbulent gas. Simulations that make this approximation are called Reynolds-Averaged Navier-Stokes (RANS) models. (Navier-Stokes is the name given to the equation that models viscous flow).

Even with this approximation, the compute requirements are staggering. Typical models require tens of hours even when running at over 1 Gflops. An accurate model of a complete aircraft would need 1,000 hours of CPU time. The Numerical Wind Tunnel (NWT) is a parallel computer designed to make it possible to run some of these models in 10 minutes or less and the largest problems in less than a day. Such rapid turnaround makes parameter studies feasible.

Powerful processors. The design team did not take the same approach used for the Intel Paragon or Thinking Machines' CM-5. Instead of using thousands of processors with modest performance, the NWT consists of 140 powerful vector processors connected by a full crossbar switch. Each processor is a supercomputer in its own right, capable of computing at 1.7 Gflops and having 256 Mbytes of memory. Communication between any pair of processors proceeds at over 800 Mbytes/second. The machine is programmed in Fortran 77 with directives to describe the data and process distributions.

The Bell Prize judges are not bashful about soliciting entries to the competition. When one of us contacted the people at the Numerical Wind Tunnel project in Japan, they responded that they were not interested in gigaflops; they had real problems to solve. The response was music to our ears. It took some doing, but we finally convinced them that we really are interested in the application of parallel processing to real problems.

A large number of applications have been studies on the NWT, including Reynolds-Averaged Navier-Stokes (RANS), simulations of turbulence to determine the parameters to feed into RANS, and chemically reacting flow inside engines, among others. The engineers pick the number of processors based on total compute time, problem size, and natural degrees of parallelization. Often, only a few processors are needed, so the megaflops rate is modest. However, a few very large simulations have been run. One model using a three-dimensional Navier-Stokes solver requiring 60 million grid points ran at 0.12 Tflops. A smaller problem with 1 million grid points would have taken 20 hours on a single processor but finished in 20 minutes on 64 processors of the NWT.

The Numerical Wind Tunnel has also been used to do direct numerical simulation of isotropic turbulence for Reynolds numbers up to 2,000. The model used a cube with 512 points on a side, one of the largest ever applied to this problem. The goal of the simulation is to provide parameters for the Reynolds-averaged models. A key test is finding how much energy is tied up in eddies of different sizes. The simulations show the turbulence evolving to the asymptotic behavior predicted by theory. The simulations also show that the most turbulent regions appear to be connected into tubes. One theory predicted that they would be connected into sheets. Even running at better than 50 Gflops it took more than 24 hours to model a single configuration, and a large number of runs are needed to provide the parameters for a RANS computation.

ONE OTHER FINALIST

The entry honored as this year's fourth finalist was submitted by John Shadid, Scott Hutchinson, Harry Moffat, Bruce Hendrickson, and Robert Leland of Sandia National Laboratories and Garry Hennigan of New Mexico State University. They modeled the complex flow inside a chemical vapor deposition container at more than 65 Gflops on a 1,904-node Intel Paragon.

Chemical vapor deposition (CVD) is used in a number of manufacturing processes. The most notable is building the layers of computer chips, but CVD is also used to strengthen metals and coat lenses. The key advantage of CVD over other approaches is uniformity; the thickness can be controlled to an accuracy of a single molecular layer. Achieving this accuracy is difficult, particularly for those high-temperature applications that involve chemical reactions. Details of the recirculating flow and reaction chemistry determine whether the deposition will meet the required accuracy.

The problem submitted modeled the deposition of SiC at 700 degrees Kelvin. It required following 19 chemical species, over 40 chemical reactions, and the heat flux and fluid flow for a reactor discretized into 176,000 finite elements. Each step of the calculation required the solution of a sparse, linear system of equations with four million unknowns. Realistic three-dimensional models run on the Paragon finished in less that 10 minutes, fast enough to enable engineers to explore a large number of possible configurations.

This group integrated a number of software projects at Sandia (chemical kinetics, sparse matrix solvers, finite element mesh partitioning, SUNMOS). It takes a large, complex, and nimble research team like this one to pull together this combination of large systems, complex modeling, and fast execution.

THE JUDGES

Alan Karp, who chaired the judging committee, is a senior member of the technical staff at Hewlett-Packard Laboratories in Palo Alto, California.

Michael Heath is a professor in the Department of Computer Science and a research scientiest at the National Center for supercomputing Applications at the University of Illinois at Urbana-Champaign, Illinois.

Don Heller is a senior scientist with the Scalable Computing Laboratory, Ames Laboratory, U.S. Department of Energy, at Iowa State University.

Horst Simon is a senior staff member with the Supercomputing Systems Division of Silicon Graphics in Mountain View, California.