In the 1980s LINPACK benchmark report [4],
there is the annotation ``(rolled BLAS)''
after many of the numbers in the table.
What does this cryptic comment mean?
Well, the BLAS are the Basic Linear Algebra Subprograms [5],
routines such as dot product and vector scaling,
for which the LINPACK benchmark supplies Fortran subroutines.
The loop operation
is given two pages of Fortran with the loop unrolled 2-fold,
4-fold, 8-fold, and 16-fold. A portion of this code is
given in Figure 1.
Unrolling is done to reduce the overhead of loop management:
fewer trips through the loop mean fewer counter updates and branch tests.
On more recent machines and compilers, however, unrolled code
carries a higher performance penalty than rolled code.
So people ``roll'' the unrolled loop back up,
and LINPACK goes faster.
Figure 1: Unrolled LINPACK Code
What went wrong here? The application programmer was guessing at both compiler and hardware behavior. What the programmer wanted to do was multiply a row of a matrix by a constant and add the result to another row. Conventional Fortran forces the programmer to write a ``DO'' loop that performs this sequentially, even though the order of the i subscripts does not matter. The operation offers plenty of parallelism and an opportunity for vector arithmetic as well. Compilers for vector and parallel computers have to do extra work to figure out whether the serial ordering the programmer specified is safe to ignore. All the unrolling does is make that analysis more difficult.
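To make the contrast concrete, here is a minimal sketch of the operation described above (scale one vector by a constant and add it to another), written once as a rolled loop and once hand-unrolled 4-fold. It is an illustration in C, not the Fortran from the LINPACK report, and the function names `axpy_rolled` and `axpy_unrolled4` are hypothetical.

```c
#include <stddef.h>

/* Rolled form: y[i] += a * x[i], one element per trip through the loop.
   Each iteration is independent, so the order of i does not matter. */
void axpy_rolled(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}

/* Hand-unrolled 4-fold (illustrative only): a cleanup loop handles the
   n % 4 leftover elements, then the main loop does four updates per trip. */
void axpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t m = n % 4;

    for (size_t i = 0; i < m; i++) {
        y[i] += a * x[i];
    }
    for (size_t i = m; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both functions compute the same result for any `n`. The unrolled version, however, adds a second loop and a four-statement body, so a compiler has more to examine before it can recognize a single independent element-wise update.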