Workshop on Debugging and Performance Tuning for Parallel Computing Systems, Chatham, MA, Oct. 3-5, 1994. Appeared in Debugging and Performance Tuning for Parallel Computing Systems, M.L. Simmons et al., eds., IEEE Computer Society Press, 1996, pp. 233-243.
Production jobs occupy all of a system's memory, I/O capacity and computing cycles for an extended period of time. They are expected to run efficiently and reliably, and will make similar demands on the system and network. When reliability is suspect, as in the first generations of parallel computing, programmers and system managers can take measures to improve their ability to resolve problems. When efficiency is suspect, the same strategies can be adapted for performance monitoring. However, the techniques appropriate for development systems, such as interactive debugging, are often not available in production mode. Advance planning and a few simple but effective tools are required to capture useful information during execution and at termination.
Our main premise is simple:
Production mode limits one's ability to analyze behavior and determine system state.
Consequently, one must plan ahead, and design programs and systems to yield useful information before termination. There are several levels of software support that are needed:
Throughout, we want detailed logs and succinct summaries. In general, the programs and operating system must cooperate to define useful information to be provided in a compact and easily accessible format; this information must be available at all times during execution and must be valid at a point of failure. An agent of the program and OS must accompany the program to help obtain the data; the agent must not fail even if the program does. It would be best if the agent can easily be integrated into other tools. Performance displays that relate to the application itself are enormously useful in the transition from "early adopter" programming staff to "satisfied customer" users and management. They help find behavioral or methodological bugs as well as coding or performance bugs.
We should start with some disclaimers and announce our prejudices. These are opinions based on personal experience, occasionally on reason and insight. Other opinions exist, and might even have merit, :-) and this is certainly not an exhaustive treatment. We have some definite prejudices:
Much of this is due to our "early adopter" mindset, left over from the 1980s, when hardware, system software and application software could be expected to fail with equal (and too high) probability, and such tools as existed were written by the users. For some additional discussion see Ten Steps for Managing Parallel Computing Projects.
Debugging could be as easy as: think, insert print statements, filter the output. It probably isn't, but always try the simple tools first. Performance analysis of a parallel program could be as easy as: run multiple instances of the usual workstation tools, look for anomalies. It probably isn't, but always try the simple tools first. In any event, system and application programs intended for production use should be augmented with tools that can be run by a system operator, whom we view as a knowledgeable person but not a programmer.
The "early adopter" mindset has some problems and consequences. First is the view that local memory is inadequate. This motivates a small microkernel, distributed OS services, distributed application codes, and the strong belief that the SPMD model may not be adequate. Consequently, debugging must involve several cooperating application programs and I/O drivers, written in at least two different languages, and not all compiled with the right options.
Second, the I/O subsystem is inadequate, and even if it weren't, the system resides as one element of a large network. The network is the source of data, of course, but there are at least two major obstacles to confront: the network causes half the problems, and the network is under someone else's control. Debugging network software written by someone else isn't easy, and in any event shouldn't be done by application programmers. Lesser problems involve job management and accounting; in principle, not hard to implement, but still under someone else's control.
On a message-passing system, everyone complains that the communication is slow, but ultimately it is too slow with regard to program development costs, and not with regard to run-time. Since careful attention must be paid to the ratios of communication to computation and overhead to bandwidth, the machine characteristics intrude on the application very early in the development cycle. Performance debugging often reduces to changing some fundamental assumptions after the program is "finished." So, isolating communication into better libraries and using tools such as program generators or HPF to avoid writing communication code altogether is a big help. Single-processor analyses, say from prof or gprof, or from multiple instances of a debugger session, tell you very little about the nature of cooperating programs.
There is one additional point to cover: locally modified OS source code. Assuming you're on good terms with the system vendor, you could have access to some very seductive possibilities. On the plus side, you can fix bugs sooner, write new support tools, experiment with new features, and tune the system to fit local conditions. On the other side, you will void the warranty, create new bugs, and always be one release behind if your applications depend on the new features. Read-only source code is very useful, writable source code is dangerous.
At some point, there is a transition from "early adopter" to "satisfied customer." System verification is an essential tool, to demonstrate stability and to analyze instability, quickly. The stability tests are not simply a matter of fulfilling a purchase contract, but are necessary to win over the users and their managers. Performance displays must relate to the application, not just to the program, in the transition from programmers to users.
Now we should consider development mode vs. production mode. There are a number of distinguishing points, but the one that matters to this discussion is programmer vs. operator. When there's a problem at 3 AM, who gets the call? Who is on the spot, and what skill levels are there? How much time is available before the errant job must be removed from the system? If you know in advance that the job will be killed quickly by a non-expert if something's wrong, then you must plan ahead or the bug will never be found. Interactive debuggers can only be run on development systems by developers, and therefore have limited benefit on a production system.
There are some simple but effective defensive measures that should be installed in all application programs subject to an operator's whim. If something looks wrong, an inquiry program will be run. We might call these oblivious inquiries if they tell you nothing about the internal details of the program. For example, an oblivious system inquiry about component utilization would give CPU and communication link ratings, as does Intel's SPV (SPV = system performance visualization, a fancy front panel). An oblivious process inquiry about component utilization would give a breakdown of system and user time, as does the Unix ps command. If several ratings are near 0, the operator could easily be persuaded that the program is broken. In a multi-system configuration, finding the culprit system is already a challenge. Even if everything is functioning normally, the operator should run the inquiry on a regular basis, for reassurance and to develop a notion of normal behavior; this will make it more likely that abnormal behavior will be correctly diagnosed. This is where a non-oblivious process inquiry may be more useful. It is easily determined which application program is running. The programmer should provide to the operator a specific inquiry program for the application. In normal cases, this would give some application-based performance measures or progress reports, and is a good sales technique as well as diagnostic tool.
The usual performance and debugging tools are either count-driven (such as prof and gprof) or trace-driven (such as ParaGraph). An application-derived agent could be as simple as a debugger script with post-processing. Log files of some type are also useful. Since most count-driven tools depend on normal completion of the program, they have little value as inquiry tools during execution.
Non-performance analysis is related to debugging, but doesn't go quite as far. It's simply a matter of gathering enough information from the system and program to support another run under debugger control. The first level is a system test, to check the hardware; vendor-supplied system tests usually are at such a low and detailed level that they require a system restart. This clearly is not appropriate when there's a half-dead application running. After all, maybe the system is working correctly. The second level is an oblivious system inquiry, including a component inventory, differential node dumps, and quiescence detection. The third level is a set of oblivious process inquiries, including information about memory allocation and requests, differential process dumps, and a record of the last messages sent, received and requested. The final level is a non-oblivious process inquiry, which may run the debugger (scripted or interactive), examine trace-driven data, or run an application-derived agent.
Differential dumps are a simple but useful tool. Consider as analogy two executions of the Unix command od (octal dump) on separate corefiles from the same program, followed by diff (text comparison). Anything reported is interesting, the more so when tied to the source code by a more sophisticated method. Some program variables would be expected to be constant across the processors, some might vary according to the processor number, and others may be drawn at random from a known set. Identifying inconsistencies, or failure to obey calculable relations, such as numerical formulae or set unions, will narrow the set of nodes to be examined in more detail.
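The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dump format (a dictionary of variable names and values per node), the function names `outliers` and `check_relation`, and the sample variables are all hypothetical, and a real tool would first have to map raw dump addresses back to program variables.

```python
from collections import Counter


def outliers(dumps, name):
    """Nodes whose value of `name` differs from the most common value.

    dumps: {node_id: {variable_name: value}}  (hypothetical dump format)
    Use this for variables expected to be constant across processors.
    """
    observed = {node: values.get(name) for node, values in dumps.items()}
    majority, _ = Counter(observed.values()).most_common(1)[0]
    return {node: v for node, v in observed.items() if v != majority}


def check_relation(dumps, name, expected):
    """Nodes where variable `name` differs from expected(node_id).

    Use this for variables that should follow a calculable relation,
    such as a neighbor index derived from the processor number.
    """
    return sorted(node for node, values in dumps.items()
                  if values.get(name) != expected(node))


if __name__ == "__main__":
    # Four nodes in a ring; "left" should be (node - 1) mod 4.
    dumps = {
        0: {"nsteps": 100, "left": 3},
        1: {"nsteps": 100, "left": 0},
        2: {"nsteps": 97, "left": 1},
        3: {"nsteps": 100, "left": 3},
    }
    print("nsteps outliers:", outliers(dumps, "nsteps"))
    print("bad left neighbor:",
          check_relation(dumps, "left", lambda n: (n - 1) % 4))
```

Either function returns a short list of suspect nodes, which is the point of the exercise: the detailed examination can then be confined to those nodes.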
Local examination of the nodes should concentrate on resource insanity (linked lists that aren't, application memory leaks, system memory leaks) and resource starvation (buffer space, local memory bus bandwidth, interprocessor channel bandwidth).
A global examination of the nodes should look for cross-node inconsistencies, and message-passing algorithm faults (clogged pipeline, deadly embrace, misplaced messages).
On a large system, it is likely that only a few nodes need to be examined in detail, and the first problem is to quickly and accurately identify a small set of nodes containing the culprits. The next most likely case is that all nodes are involved in the problem, but the situation is symmetric in some way, and it is still sufficient to look at a few nodes.
Here are some examples.
Now, some of these conditions and measurements are transient, and it is more or less a judgement call for the operator to decide how often to repeat the inquiry, and whether to kill the program or restart a server.
Elaboration. If the system includes some diagnostic hardware, such as a power-on self-test or a secondary diagnostic network and console, it would be useful to maintain state information in a user-accessible location via non-root library functions or inquiry commands.
Concerning resource starvation, if a local resource such as buffer space needs to increase with the number of nodes, then it will be difficult to diagnose large production-mode problems on a small development-mode system. Probably a series of instrumented runs on increasingly large systems will suggest where the problem occurs.
Concerning clogged pipelines, message-passing algorithms can fail if an intermediate node acts on only part of its messages. Either the recipient node (where unconsumed messages reside) should be sped up, or the source node should be throttled. A tool that lists all messages currently in the system is a great help in identifying particular nodes for detailed study, and tracking stray messages.
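As a rough sketch of what such a message-listing tool might compute, the Python fragment below summarizes a list of pending messages by recipient and picks out messages addressed to nodes that do not exist. The `Message` record, the function names, and the way the pending list is obtained are assumptions made for illustration; on a real system the list would come from the OS or the communication library.

```python
from collections import defaultdict
from typing import NamedTuple


class Message(NamedTuple):
    """One undelivered or unconsumed message (hypothetical record)."""
    src: int
    dest: int
    mtype: int
    size: int


def backlog_by_recipient(pending):
    """Return (dest, message_count, byte_count) rows, largest backlog first."""
    totals = defaultdict(lambda: [0, 0])
    for m in pending:
        totals[m.dest][0] += 1
        totals[m.dest][1] += m.size
    rows = [(dest, count, nbytes) for dest, (count, nbytes) in totals.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


def stray_messages(pending, valid_nodes):
    """Messages whose destination is not a node of the current job."""
    return [m for m in pending if m.dest not in valid_nodes]


if __name__ == "__main__":
    pending = [
        Message(0, 1, 10, 64),
        Message(2, 1, 10, 64),
        Message(1, 3, 11, 8),
        Message(0, 9, 10, 16),
    ]
    for dest, count, nbytes in backlog_by_recipient(pending):
        print(f"node {dest}: {count} messages, {nbytes} bytes pending")
    print("stray:", stray_messages(pending, valid_nodes={0, 1, 2, 3}))
```

The node at the top of the backlog list is the natural first candidate for detailed study.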
Concerning deadly embraces, if A waits for B and B waits for A (A < B and B < A), and C waits as a consequence, we want to know that C is not part of the problem. A and B will show up as an equivalence set.
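One way to separate the embracing processes from the bystanders is to treat "waits for" as a directed graph and group the processes that can reach each other. The sketch below assumes the inquiry tool can already produce such a graph as a dictionary; the names `deadlock_sets` and `blocked_bystanders` are hypothetical, and the quadratic reachability computation is chosen for brevity, not for large systems.

```python
def reachable(waits_for, start):
    """Processes reachable from `start` by following waits-for edges."""
    seen = set()
    stack = list(waits_for.get(start, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(waits_for.get(p, ()))
    return seen


def deadlock_sets(waits_for):
    """Groups of processes that wait on each other, directly or indirectly."""
    procs = set(waits_for)
    for targets in waits_for.values():
        procs.update(targets)
    reach = {p: reachable(waits_for, p) for p in procs}
    groups, assigned = [], set()
    for p in sorted(procs):
        if p in assigned or p not in reach[p]:
            continue  # p is not on any waiting cycle
        group = {q for q in reach[p] if p in reach[q]}
        groups.append(sorted(group))
        assigned |= group
    return groups


def blocked_bystanders(waits_for):
    """Processes stuck behind a deadlock set without belonging to one."""
    cyclic = {p for group in deadlock_sets(waits_for) for p in group}
    return sorted(p for p in waits_for
                  if p not in cyclic and reachable(waits_for, p) & cyclic)


if __name__ == "__main__":
    waits_for = {"A": ["B"], "B": ["A"], "C": ["A"], "D": []}
    print("deadlock sets:", deadlock_sets(waits_for))
    print("bystanders:", blocked_bystanders(waits_for))
```

In the example, A and B form the equivalence set, C is reported as blocked behind it, and D is left alone.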
Launching messages into the void is a nasty problem, if the OS deletes them. The library send routine probably returned an appropriate error code that was ignored. Nevertheless, it would help if the OS or communication library kept a count of such disasters, for access by an inquiry program.
Here we consider ways for the application programmer to make it easier to diagnose problems. Most of these require a conscious decision, at the time of program development, to assume that the program will at some point fail to function properly. This is just common sense. Some information must be gathered during execution and made available for extraction. The information could be maintained in a log file, or could be dumped to a file on termination; depending on the programming language, a signal handler may be required. Log files have the disadvantage that they are usually not up-to-date on abnormal termination. The system-defined process status is usually inadequate, as we require counters for activities and errors, timers for activities, and a record of recent activity. The call stack from a debugger (current routine and args, its caller, etc.) is helpful but again incomplete.
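A minimal sketch of such a record, assuming a Python application, might keep counters, accumulated timers, and a bounded history in memory and write them out at termination. The class `ActivityRecord`, the helper `install_exit_dump`, and the JSON output format are illustrative choices, not part of any existing library. Note that a process killed with SIGKILL gets no chance to run the dump.

```python
import atexit
import json
import signal
import sys
import time
from collections import Counter, deque


class ActivityRecord:
    """Counters, activity timers and a short history of recent events."""

    def __init__(self, history=8, clock=time.monotonic):
        self.counts = Counter()
        self.seconds = Counter()
        self.recent = deque(maxlen=history)  # oldest entries fall off
        self._clock = clock

    def count(self, name, n=1):
        self.counts[name] += n

    def note(self, text):
        self.recent.append((self._clock(), text))

    def timed(self, name):
        return _Timer(self, name)

    def report(self):
        return {"counts": dict(self.counts),
                "seconds": dict(self.seconds),
                "recent": [list(entry) for entry in self.recent]}


class _Timer:
    """Context manager that adds elapsed time to one named activity."""

    def __init__(self, record, name):
        self._record, self._name = record, name

    def __enter__(self):
        self._start = self._record._clock()
        return self

    def __exit__(self, *exc_info):
        elapsed = self._record._clock() - self._start
        self._record.seconds[self._name] += elapsed
        self._record.counts[self._name] += 1
        return False  # never swallow the application's exceptions


def install_exit_dump(record, path):
    """Write the report at interpreter exit, including exits on SIGTERM."""
    def dump():
        with open(path, "w") as out:
            json.dump(record.report(), out, indent=2)
    atexit.register(dump)
    # Turn SIGTERM into a normal exit so the atexit hook still runs.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))
```

A program would create one `ActivityRecord`, call `install_exit_dump(record, "status.json")` once at startup, and then use `record.count("send")`, `record.note("entering solver")`, and `with record.timed("solve"):` at the points of interest.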
Every program calls a system-defined exit routine, explicitly or not, which cleans up the memory and actually exits. It would be better to first list the messages found in the communication buffer, including unsatisfied message requests, and give some system-maintained performance measurements. An additional feature would allow the programmer to determine some interesting memory locations, say from application performance measurements, a history or logging mechanism, or from interactive debugger experience. These could be enrolled in a standardized data structure, perhaps with separate rosters for ab/normal exit, via a configuration file or calls to library routines. The program could use this roster to produce a time-stamped sequence of process states. An external agent could obtain the configuration file, then the data from enrolled locations, and produce a current status report. On exit from the application, the program could list data from the appropriate roster, and call the system exit routine.
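The roster idea can be illustrated as follows. Since Python has no raw memory locations to enroll, this sketch enrolls named getter functions instead; the class `Roster`, its methods, and the report format are all hypothetical, and a real implementation would live in the vendor's runtime library and read the enrolled addresses directly.

```python
import json
import sys
import time


class Roster:
    """Named values to report on request or at exit."""

    def __init__(self):
        self._normal = {}
        self._abnormal = {}  # extra entries listed only on abnormal exit

    def enroll(self, name, getter, abnormal_only=False):
        target = self._abnormal if abnormal_only else self._normal
        target[name] = getter

    def snapshot(self, abnormal=False):
        getters = dict(self._normal)
        if abnormal:
            getters.update(self._abnormal)
        state = {}
        for name, getter in getters.items():
            try:
                state[name] = getter()
            except Exception as exc:  # the report must survive bad data
                state[name] = f"<unreadable: {exc!r}>"
        return state

    def log_state(self, stream, abnormal=False):
        """Append one time-stamped process state as a JSON line."""
        entry = {"time": time.time(), "state": self.snapshot(abnormal)}
        stream.write(json.dumps(entry, default=repr) + "\n")
        stream.flush()

    def exit(self, status=0, stream=sys.stderr):
        """List the appropriate roster, then call the system exit routine."""
        self.log_state(stream, abnormal=(status != 0))
        sys.exit(status)


if __name__ == "__main__":
    progress = {"iteration": 0, "residual": 1.0}
    roster = Roster()
    roster.enroll("iteration", lambda: progress["iteration"])
    roster.enroll("residual", lambda: progress["residual"], abnormal_only=True)
    progress["iteration"] = 42
    roster.log_state(sys.stdout)
    roster.exit(1, stream=sys.stdout)
```

Catching exceptions inside `snapshot` reflects the requirement that the reporting path must not fail merely because the program's own data has gone bad.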
Not all applications would require such a collection of services, but we shall assert that all I/O subsystems should be equipped with a library routine and an external agent for device controller / driver / server status, including current status, most-recent activity report, historical log files, and cumulative traffic and error statistics, such as counts, historical averages and ranges, and rates, ESPECIALLY ERROR AND RETRY RATES. Complications can easily occur with multinode I/O subsystems, where one must correlate log files from different programs, and with coupled multivendor subsystems.
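The cumulative statistics named above need very little machinery. The following sketch, with a hypothetical `IOStats` class and field names, accumulates counts, an average, a range, and per-operation error and retry rates for a single device or server; how the driver would call `record` is left as an assumption.

```python
class IOStats:
    """Cumulative traffic and error statistics for one device or server."""

    def __init__(self):
        self.ops = 0
        self.errors = 0
        self.retries = 0
        self.total_bytes = 0
        self.min_bytes = None
        self.max_bytes = None

    def record(self, nbytes, retries=0, error=False):
        """Account for one completed (or failed) transfer."""
        self.ops += 1
        self.retries += retries
        self.errors += 1 if error else 0
        self.total_bytes += nbytes
        if self.min_bytes is None or nbytes < self.min_bytes:
            self.min_bytes = nbytes
        if self.max_bytes is None or nbytes > self.max_bytes:
            self.max_bytes = nbytes

    def summary(self):
        """Counts, average, range, and error and retry rates per operation."""
        n = self.ops
        return {
            "ops": n,
            "mean_bytes": self.total_bytes / n if n else 0.0,
            "range_bytes": (self.min_bytes, self.max_bytes),
            "error_rate": self.errors / n if n else 0.0,
            "retry_rate": self.retries / n if n else 0.0,
        }


if __name__ == "__main__":
    stats = IOStats()
    stats.record(100)
    stats.record(300, retries=2)
    stats.record(200, retries=1, error=True)
    print(stats.summary())
```

An external agent would read the same counters and compare the rates against their historical values, since a rising retry rate is often the first visible sign of trouble.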
Consider a program that originated on an ordinary system and has since been ported to two parallel machines:
serial code        programmer A
parallel code 1    programmer B
parallel code 2    programmer B or C
Which is the "official" version? Where are new features installed first? How far apart are the versions? Is the distance increasing? Usually, A is busy with new features while B and C are improving performance, so the distance is increasing. This is a real problem that can be addressed via language features as in HPF or via careful source code management and a preprocessor. However, when it comes time to debug the program, relating problems back to unprocessed source code will be difficult.
There are some problems specific to explicit message-passing codes. When there is a library function interface to the message-passing hardware, one is likely to lose the ability to enforce type checking on both ends of the message. In any event, matching send and receive source code at compile time is generally impossible. Most errors in message-passing systems could have been caught at compile time, though we have no firm evidence for this claim.
A simple technique is to establish message type management. The message type is an integer tag sent as part of the message header to help identify the message contents. Most vendors define ranges of types for user and system messages. One especially terrible idea is for the vendor library code to use message types in the user range. The MPI context field, adapted from Zipcode, solves this problem and others. Another bad idea is to use one message type for multiple purposes. Indeed, we prefer to have each message type pointing to a few lines of code, with certain message types being captured by the OS. The reasoning is that the debugger or exit routine will yield message type information, and if the types indicate more than one usage then not enough information about the program has been revealed.
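Message type management can be as simple as a single table that every send and receive site consults. The sketch below is an illustration only: the `MessageTypes` class, the user range 1000 to 1999, and the purpose strings are invented for the example and do not correspond to any vendor's library.

```python
class MessageTypes:
    """Registry that gives each message type exactly one purpose."""

    def __init__(self, user_range):
        self._range = user_range
        self._purpose = {}

    def register(self, mtype, purpose):
        """Claim a message type; refuse system types and reuse."""
        if mtype not in self._range:
            raise ValueError(f"type {mtype} is outside the user range")
        if mtype in self._purpose:
            raise ValueError(
                f"type {mtype} already used for {self._purpose[mtype]!r}")
        self._purpose[mtype] = purpose
        return mtype

    def interpret(self, mtype):
        """Human-readable meaning of a type seen in a dump or inquiry."""
        return self._purpose.get(mtype, "unregistered")


if __name__ == "__main__":
    # Hypothetical user range; real ranges are defined by the vendor.
    types = MessageTypes(user_range=range(1000, 2000))
    HALO_EXCHANGE = types.register(1000, "halo exchange in solver step")
    RESIDUAL_SUM = types.register(1001, "global residual reduction")
    print(types.interpret(1001))
    print(types.interpret(1500))
```

Because each type maps to one purpose, a type number reported by the debugger or the exit routine points back to a few lines of source code, and the same table can supply the interpretation column of an inquiry program's message listing.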
Message header information in general helps to resolve errors observed away from the source of the error, such as message transmission problems, observed by the recipient, or even worse, by some node neither the source nor recipient. We might even want to time-stamp messages on send and receive, and not have anomalies or misrepresentations of performance, but this is difficult on systems where the local clocks are not coordinated.
We now describe a program nscan for nCUBE systems, developed by the author while at Shell Development Co. The motivation was primarily to check the system state between production jobs, and secondarily to serve as a diagnostic tool during job execution. No information about the application program is used. The nCUBE/2 nodes are organized as a hypercube with additional I/O subsystems, themselves 16-node hypercubes. nscan can select all or part of the system, and will report at several levels of detail on the operating system, processes, and memory. The OS microkernel is in assembler, and once one is over that considerable hurdle, it is not too hard to understand (OK, maybe I am nuts). We use the same OS feature as required by a debugger: a special message type that the OS captures and answers with data from a given address or register. There is no impact on the application processes beyond the OS handling of these queries. If there is no response, the OS is judged to have failed, and an operator reboot is suggested. The output of nscan is text that is easily searched for summaries and warnings.
The following list indicates the information available from nscan.
hardware and OS version numbers
physical node number, local memory size
last allocation: logical node number, start time
current state, processor special registers
timers: cpu active, idle
message counts: sent, received, transmission errors
most recent message received on each channel: size, src, dest, type
error counts: invalid interrupts, hard and soft memory errors
"interesting" local variables
process identifier, program name, parent process
current state, working directory
timers: cpu active, waiting for message
message counts: sent, received
messages in queue: size, source, type, type interpretation
free space in communication buffer
alarm, sleep wakeup times
memory regions
open file table, controlling I/O node
much more low-level data that is seldom useful
local memory byte counts: total available, occupied, free
allocated and free blocks, queue pointers
sanity checks: expected values and ranges
consistency (relations among local OS variables)
linked lists and queues
A hard memory error (uncorrected ECC) would motivate replacement of the node, as would soft errors (corrected ECC) above a certain threshold. One good feature of the OS, not present initially, is that the ECC counters are preserved across a system reboot that does not cycle electrical power.
One important failing is that the OS does not maintain the source and type of the process' last requested message; this is available only on the runtime stack of a blocked process, and the offset depends on the particular library routine that was invoked. We had previously modified the OS on the nCUBE/1 to account for this and other missing information, but chose not to do so on the nCUBE/2.
Too often, performance measurements are bolted onto a program only after it has been completed, and debugging information is obtained only after the occurrence of a problem. Our primary recommendation is that some judicious choice of state and historical information be maintained and accessible as part of the program design from the beginning. For better or worse (it's getting better), one can expect a parallel program to encounter bugs or performance problems, and it would be prudent to plan ahead. Even without problems, the program and system complexity warrants continuous monitoring.
Thanks to the workshop organizers for their invitation to speak on this topic, and to my former colleagues at Shell Development Co. and Shell Oil Co., where we installed nCUBE systems for seismic data processing, and experimented with many of the ideas described here.
Contact: Don Heller - dheller@cse.psu.edu
Computer Science and Engineering
Pennsylvania State University
220 Pond Laboratory
University Park, PA 16802
(814) 863-1469