Ames Laboratory,
Department of Energy,
ISU,
Ames, Iowa

Besides the speed gained by using the pruning rules, the search may be done in parallel to achieve further speed. The parallel approach is simply that of master-slave load allocation. In the serial version, a depth-first search of the tree is done until a path containing goal expressions is found. In the parallel program, several processors can be used with each processor exploring a subtree of the entire tree. The master-slave approach is inherently suited to Multiple Instruction Multiple Data (MIMD) computers. The primary obstacle to Single Instruction Multiple Data (SIMD) computers is the large number of branching tests from the pruning rules resulting in disparate control flow [3].
For the MIMD approach, one processor is used as the master processor delegating work (subtrees) to the other processors. The master processor does a depth-first search of the entire tree. However, upon reaching a specified number of levels, it delegates the underlying subtree to an idle processor instead of exploring it. The master processor then backtracks and continues the search. Each of the slave processors performs a depth-first search on the subtrees assigned to it and reports the solution to the master, if found. To avoid waiting for work, each slave processor is given its next subtree while it is computing the current one. With this, all the slave processors are busy most of the time. On 256 nodes, efficiencies of 99.8% are typical.
A key parameter in the parallel program is the number of levels the master explores before distributing the subtrees. If this is too small, there might not be enough subtrees to allocate to all the processors. Also, there is greater danger of load imbalance with fewer subtrees. If this is too large, the master processor needs to do most of the work and the slaves remain idle. The choice depends upon the size of the problem and the number of processors used. An appropriate value can easily be found by experimentation. For problems like complex product or matrix product on a system of 16 to 1024 nodes, exploring 3 levels on the master processor seems to be a good choice.
Notice that a superlinear speedup is possible, if we stop as soon as a solution is found. Since the tree is searched in parallel by many processors, there is a good possibility that some processor might get `lucky' and find the solution [19]. We have found that this is indeed the case, most of the time.
