Ames Laboratory, Department of Energy, Ames, Iowa

next Next: Field Work Proposal 1997. previous Previous: Purpose

High Performance Computing Systems

KC-07-01-02 FWP

20 f. Technical Progress

The major activities of the Scalable Computing Laboratory within the AMS program consist of performance analysis research and the development of algorithms and tools for scientific computing. These two areas have considerable overlap, since application programs make excellent benchmarks and the effort to measure performance often results in the discovery of ways to increase performance.


Technical Progress In FY 1994

Performance Analysis in Language Specifications

Gustafson and Wikstrom prototyped a new approach to program development in which the application programmer is assigned the burden of stating and proving the performance of the program. In response to the end-user performance requirements (answer quality, reliability, total execution time, input data quality), the application programmer describes methods that will guarantee those needs are met based on guarantees from the hardware description and the operating system/compiler. The approach appears to be new and unique; its closest relative is real-time programming, but it extends real-time programming concepts and applies them to scientific computing tasks.

We have designed a novel environment that specifies what must happen in a program to an extraordinary level of performance detail, but does not specify the irrelevant. The data declarations specify the performance of each variable (latency, bandwidth, location) so precisely that rigorous proofs of execution time in clock cycles are possible from the source listing. By specifying local/nonlocal memory references and minimal data dependency, the burden of NP-hard automatic parallelization is lifted from the compiler designer and replaced with the more tractable burden to the application programmer of specifying more about data motion. A number of responsibilities that have become blurred in recent years are clarified by the rules of the environment, freeing the various parties to do what each does best. Specifically, the end user is free to concentrate on the field of science for which the computer is a tool, instead of being pressured to understand details of the inner workings of the hardware and software.

The new paradigm is extremely difficult to use in the sense that the one-time cost of a single solution is high; as a reward, however, the lifetime cost promises to be extremely low. Constructing a mathematically perfect "program" just to solve the quadratic equation consumed several weeks of effort and is only 80% complete. Once complete, the program appears capable of surviving several decades of architecture and performance change and other major changes to the programming environment. The work was done during the summer of 1994 and is expected to lay the groundwork for the "Immortal Code" project that will occupy much of our time and effort over the next several years.

HINT Benchmark

Gustafson and Snell completed their design and public introduction of the HINT benchmark. Many high-speed computers have been measured with the tool, including the Cray T3D, IBM SP-2, Silicon Graphics Power Onyx, nCUBE 2S, Intel Paragon, and MasPar MP-2. The single-number ratings obtainable from HINT curves correlate well with the ratings of performance on the NAS Parallel Benchmarks (computational fluid dynamics application programs), yet take about two orders of magnitude less effort to apply. While we continue to use complete performance curves (QUIPS versus time, where QUIPS is Quality Improvement Per Second) to assist our understanding of system performance, we have also studied the effectiveness of logarithmic integrals of the HINT graphs as scalar summaries of absolute computer capability. These we refer to as "Net QUIPS" values.

The 1840-processor Paragon at Sandia was scrutinized on-site using HINT, to try to understand the difference between SUNMOS and OSF performance. The HINT curve made clear the advantage of SUNMOS; lower latency brought the performance up faster for short times, and because memory use was more economical, the speed was sustainable for longer times.

A minor refinement to HINT allows subroutine call overhead to be measured automatically as a by-product of any HINT run. The driver was refined so that the noise of system interrupts and multiple users is almost completely eliminated except as they affect real-world performance.

Scalability of HINT was studied using the nCUBE 2S and the Intel Paragon, both of which offer a large range of system configurations. The HINT graph for repeated doublings of the number of processors showed equispaced asymptotic performance on a log-log graph, as one would hope. The rise time to that asymptote increased with the number of processors, showing the diameter of the parallel ensemble degrading performance with increased ensemble size. The net result for the scalar Net QUIPS was something slightly less than linear speedup, which in our experience matches application performance for applications scaled according to fixed-time principles.

Benchmarks measure actual speed, well-known to be different from peak or theoretical speed. However, the discrepancy between nominal memory and theoretical memory is not well-understood, and is not the target of any other extant benchmarks. Operating systems often make large sections of memory unavailable to end users for a variety of reasons, causing a difference between nominal and actual storage capacity that is at least as striking as the difference between nominal and actual processing speed. We accomplished a major objective in 1994 by solving this problem with HINT. The HINT graphs now reveal the memory regimes by size and speed. The difference between economical and wasteful operating systems visible both in the graphs and in the Net QUIPS summaries. We have also been able to use a slightly modified version of HINT to concentrate on I/O speed; however, HINT tests only a particular kind of I/O (virtual storage with an unpredictable access pattern) that may not be representative of mainstream I/O-intensive problems. HINT studies performed in 1994 appear to qualitatively match the well-known resistance of large-scale scientific simulation problems to effective paging schemes for virtual memory.

Fixed Time Analysis of CFD Performance

Susan Ying and Anthony Baker have done the first study of computational fluid dynamics codes on parallel computers based on the fixed-time performance analysis principles first developed by Gustafson in 1988. The study tested inviscid models for both subsonic and supersonic cases, with variable numbers of processors of the nCUBE 2S at the SCL. As the number of processors was increased, the problem was scaled to approximate fixed execution time, and the performance was then compared with the time that would have been taken on a single processor. The result was a fair assessment of the scalability of the fluid dynamics model from 1 to 256 processors that corresponds closely to the way different-speed computers are used in practice. Part of the reason for the success of the parallel version was that the I/O was made parallel, not just the computation of timesteps.

Logistic Curves for Composite Performance Comparisons

A model was developed by Gustafson as a result of discussions with William Pardee (consultant) and the Performance Evaluation Workshop held with Japanese performance analysis specialists in 1991. The utility of various parameters such as answer quality, reliability, cost, execution time, cost of program conversion, time for an edit-compile-debug cycle, and display of results are considered as logistic curves ranging from 0 to 1, with rise time and position determined according to the application and the opinion of the user community. By stating the underlying assumption of what constitutes a "good" run explicitly and quantitatively, the difficulty of comparing different computers running different algorithms to solve a problem can be removed. The product of the utility functions also forms a logistic curve, one that seems to reflect the people compose utility from different factors. For example, an answer valid to only one decimal might be unacceptable no matter what the speed, so the product of the near-zero utility of the answer quality and the near-unit utility of the execution time would still be near zero.

Information-Theoretic Operation Complexity Measures

We have made some progress in our quest for a firm theoretical basis for the cost of assembly-level operations in a computer. Gustafson, Aluru and Mulupuru have a working program that finds provably minimal numbers of gate operations to accomplish simple assembly-level tasks. It runs in parallel on the nCUBE 2S. Even the substantial integer computing capability of the nCUBE does not allow us to extend the results to operands of more than a few bits, however, and the approach may have to trade rigor for a reasonable guess to make execution time tractable.

The HINT results have altered our thinking about the value of measures of operation complexity; we have seen too many instances where operations per second correlate inversely with answer quality per second. For example, using a dense matrix method on a sparse matrix improves MFLOPS but increases execution time. It therefore appears that a success in quantifying operation complexity may be neither necessary nor desirable for the understanding and prediction of computer performance.

Performance Analysis of an nCUBE System

Heller has developed a suite of programs that inspects various parts of the nCube/2 and reports their state, including the results of numerous consistency checks of operating system data structures. Some useful information is provided that is not available from any other tools, such as messages in queue and memory allocations. All this is done without reference to application programs, so the tools are best used by the system administrator, or as a simple check on the progress of an application.These tools are ready for distribution to other sites pending agreement with nCube.

Another C or C++ library has been developed that is a set of annotations to an application program to gather and present information about the history of the program. It features logging, 'mini core dumps' at specified intervals, function call tracing, and carefully managed timers, counters and event flags. The package is intended to be simple to use, above the level of ordinary print statements and trace/assert macros, but far below the level of an interactive debugger. It is suitable for use in the development of programs and in production settings where immediate access to a debugger is not always feasible. It will be ready for distribution during the second quarter of 1995 pending further testing at SCL.


Technical Progress Expected In FY 1995

HINT

The
HINT project will proceed in three directions: First, the database will be expanded and automated. Bower will streamline the process by which a HINT run becomes a World Wide Web page containing both the HINT graph and all relevant information concerning the computer system that was tested. It appears possible to gather complete information for several hundred computers per year, independent of the architecture. A particular effort will be made to measure the full range of products from Cray Research, which now sells at least four very different high-end architectures. Second, we will gather data about real application performance and look for correlation with Net QUIPS values to test the hypothesis that HINT is a reliable predictor of relative machine ranking across a wide spectrum of tasks. Third, HINT will be used as a computer engineering tool; for a proposed computer design, HINT simulations can answer questions about whether a design choice will result in higher or lower price-performance. A group of computer engineers in Russia have proposed to construct a parallel system of the newest model Transputers based on HINT, and we expect to collaborate with them.

Commercialization of SCL Techniques

Progress in performance analysis depends heavily on the technology transfer process. We will promote the commercialization of the methods (patented or pending patent) for performance measurement developed by the SCL. While the SCL will continue to gather supercomputer data as a routine background activity that may yield insight into architectural trends in the marketplace, but the less research-oriented collection of consumer computer information will be left to private companies that may be able to sell the information in the form of annual reports.

"Immortal Code"

It is still true that the largest obstacle to widespread acceptance of parallel programming is the lack of a programming model guaranteed to last through many years of hardware advances. While some groups are pursuing C++ and HPF as evolutionary approaches, these environments are very far from solving the parallel porting problem over a long time scale. The new programming model with application programs containing performance requirements will be completed and extended in 1995 by Wikstrom and Gustafson. With a completed example of an "immortal code," we will be in a position to seek funding from other agencies for a major expansion of this work. In 1995 we will concentrate on reducing the immense effort needed to produce provably correct programs that include performance requirements as part of the correctness. There are two things to accomplish with this effort: the model of how to create very-long-shelf-life software, and examples of that model. The examples have considerable intrinsic value, but fall in the Algorithm Development category described elsewhere.

Performance Analysis of an nCUBE System

The succinct analysis of voluminous program state and performance data is an ongoing problem. For example, how to collect performance data per node, reduce it to meaningful summaries, and provide meaningful pointers to odd behavior. Guidance from the programmer can be integrated into application and performance displays or reports.


Technical Progress Expected In FY 1996

The "immortal code" programming model, if successful, will make possible the formation at last of a lasting repository of computer programs. Past attempts to produce libraries of algorithms have failed because they did not effectively deal with the dynamics of computer languages, changing architectures, and other variables. Any effort to make computer programming into a science rests on the assumption that results can be collected, widely understood, and valid for many years as results are in other scientific fields. This is obviously a large and long-term project, but we should be in a good position to begin it in FY '96.

If personnel resources permit, we will experiment with the logistic curve approach described above to see if practitioners of computational physics and chemistry can use it to compare the success of different computational approaches without becoming experts in performance analysis. We suspect that it is an easily learned technique that could become widespread and even standard as part of computational science.

Many of the most important features of computers are difficult to self-test; for example, a computer cannot readily answer how long program conversions take, how often it breaks, or how responsive the operating system is to keystroke input. These things do a great deal to determine the value and generality of computers, yet benchmarks focus only on execution time and correctness of results. One important step out of this rut was provided by Dongarra with a set of loops designed to test the cleverness of compiler optimizers. There are similar dimensions of performance that can be made automatically testable with sufficient effort. We will attempt to create a program that self-tests a computer's capacity in a wider variety of dimensions, such as integer precision, disk latency, error handling, amount of memory, graphic output, etc. as contrasted with the usual tests of floating-point speed measured for scientific benchmarking. The capacity of a computer in these dimensions provides a profile that can be thought of as an "aptitude test." Standard statistical methods can then be brought to bear on subsets of these performance scales with measured performance on business computing, scientific simulation, word processing, real-time control, or any other general category now in use. It should be possible to obtain quantitative information that will resolve such ill-defined terms as "general purpose computer" or "AI computer" by drawing correlations between computers so labeled and statistically significant distinctions in their performance scales.


next Next: Field Work Proposal 1997 previous Previous: Purpose


Contact: Steve Elbert +1-515-294-1307 elbert@AmesLab.gov

The URL for this document is http://www.scl.ameslab.gov
Revised

Page prepared by Maria E. Blanco

[Return] Return to Publications page.