Ames Laboratory, Department of Energy, ISU, Ames, Iowa

next up previous
Next: Limitations of Use Up: Future Applications Previous: Buneman-type methods using

Massively parallel compiler optimization

Current commercially-available MPP systems either compile source programs on a single node of the ensemble or use a front-end computer that has a traditional serial architecture. Small-scale parallelism for the phases of compilation can be accomplished via pipelining, but that approach does not scale to thousands of processors.

The Search method described here suggests a strategy for a way of using large numbers of processors during compilation. A small number of processors can do conventional compilation using pipeline parallelism, but basic blocks that look amenable to optimization can be farmed out to the rest of the ensemble and run through the Search process. When conventional compilation is completed, the processors that were given sections to optimize can be polled for the best optimization found so far, and the optimizations can be collected as a final pass. The user might supply a time constraint on compilation, so the search for reduced operation counts can be limited by the user's patience instead of the completion of conventional compilation. Here, 'operation' includes any instruction in the target computer's repertoire, not just arithmetic operations. Also, overlap of instructions and the number of clock cycles for each instruction would have to be used in the cost metric instead of the simplistic operation counts used in Search. It might make more sense to budget clock cycles than operations or instructions for MPP compiler optimization. This is fertile ground for future research.

As a test of the massively parallel compiler optimization concept, we did the following experiment: Given the C expressions

f = a * a * a a * a * b a * b * b b * b * b and
g = a * a * a 3 * a * a * b 2 * a * b * b b * b * b ,

we used the Search program to optimize the calculation of f and g. Then, we tried various C compilers with optimizations enabled.

Unoptimized compilation gave a method to compute f using eight multiplies and three adds. The Search program found that f could be computed with three multiplies and two adds in 0.18 seconds on a 16-processor nCUBE. The Shortcut found by the Search program is (a + b) (a2 + b2). The SUN (Sun Release 4.1) and DEC (Ultrix) compilers were only able to reduce the operation count to seven multiplies and three adds. The Gnu compiler managed six multiplies and three adds.

A naïve computation of g requires 10 multiplications and 3 adds, as found by unoptimized compilation. The SUN and DEC compilers could not find any improvements. The Gnu compiler found a method with nine multiplies and four adds. The optimum method ((a + b)3 - ab2) taking 4 multiplications and 2 adds is found by the Search program in 0.36 seconds on a 64-processor nCUBE.

To test our approach on expressions that appear in real production codes, we have chosen the following subroutine by Sisira Weeratunga of NASA Ames Research Center from the application benchmarks given in the NAS parallel benchmarks [4].

subroutine setiv

do k = 2, nz-1

0zeta = (dfloat(k-1))/(nz-1)

do j = 2, ny-1

eta = (dfloat(j-1))/(ny-1)

do i = 2, nx-1

xi = (dfloat(i-1))/(nx-1)

do m = 1, 5

u(m,i,j,k) = pxi + peta + pzeta pxi peta peta pzeta

pzeta pxi pxi peta pzeta

end do

end do

end do

end do

return

end

We tried to optimize the core of the computation given by the statement inside the loops, which, after a convenient renaming of the variables, is

a + b + c - ab - bc - ca + abc

The table below shows the number of operations used by various compilers in computing this expression.

Table 2
Number of Operations Used by Various Compilers
and the Search Program in Computing a + b + c - ab - bc - ca + abc
CompilerUnoptimizedOptimized
SUN (V 4.1)5 muls, 6 adds4 muls, 6 adds
Ultrix (V 4.2a)5 muls, 6 adds4 muls, 6 adds
Gnu (V 1.36)5 muls, 6 adds4 muls, 6 adds
Search2 muls, 4 adds

An optimal way of computing this expression, as found by the Search program in 24.6 seconds on 32 processors is

c (ab - a - b) - ((ab - a - b) - c)

It appears that even simple algebraic expressions can be improved by a factor of two over current compiler technology using the Search approach.


next up previous
Next: Limitations of Use Up: Future Applications Previous: Buneman-type methods using