Rabbit
A Performance Counters Library
for Intel/AMD Processors and Linux
Don Heller
Associate with the Scalable Computing Laboratory
Ames Laboratory, U.S. D.O.E., Iowa State University
Tour by Examples | Download the Library
Thesis: All programs should be designed to measure their
own performance.
Precondition: Performance measurement software should be
completely portable.
Preclusions: Performance measurement hardware and operating
system interfaces are completely non-portable.
The role of this library is to read and manipulate Intel or AMD processor
hardware event counters in C under the Linux operating system. The
user interfaces have been made general and machine-independent. There is
necessarily a conflict between the desire for future implementations on
other systems, allowing portable self-measured code in several languages,
and the desire to squeeze as much information as soon as possible from
the particular processor at hand.
If we view performance measurement as a scientific experiment, then
use of the library in the application source code is like building an experimental
apparatus, or attaching sensors to an existing apparatus. A few tests
may confirm that the program behaves according to the programmer's predictions.
But why not leave the measuring capability in the program? There
may be something useful that can be learned from more experience with the
program in the field, and perhaps the program can adjust its own behavior
in response to the measurements.
The simplest summary of the library design is this:
-
Decisions about "when to measure" are placed directly in the program.
-
Decisions about "what to measure" are delayed until runtime.
-
Measurements are taken as cleanly as possible.
-
As many options are granted as the processor and operating system allow.
-
The data structures and interfaces are machine-independent as far as possible.
-
Measurement data is available to the program.
There are only a few principal functions and data types in the library:
-
Exclusive access to the hardware counters is negotiated by pmc_open()
and pmc_close().
-
Command-line or input-file options are taken by pmc_getargs()
and stored in a pmc_control_t structure.
-
Events to be counted are given to the hardware by pmc_select()
from a pmc_counter_t structure.
-
pmc_read() produces a pmc_data_t result, marking a moment
in time.
-
pmc_accumulate() produces the difference of two pmc_data_t
values, and adds the difference into a pmc_counter_t.
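The read-then-accumulate pattern in the last two items can be sketched in C. This is a minimal illustration, not the library's API: every name prefixed with sk_ is hypothetical, the struct layouts are assumptions, and the "hardware" is a simulated counter so the example runs anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_NUM_EVENTS 2  /* assumption: two event counters */

/* Hypothetical stand-in for a raw reading: one moment in time. */
typedef struct {
    uint64_t cycles;
    uint64_t events[SK_NUM_EVENTS];
} sk_data_t;

/* Hypothetical stand-in for an accumulator over many intervals. */
typedef struct {
    uint64_t intervals;
    uint64_t cycles;
    uint64_t events[SK_NUM_EVENTS];
} sk_counter_t;

/* Simulated hardware state; a real implementation would read the processor. */
static sk_data_t sk_hw;

/* Advance the simulated counters as if some code had executed. */
static void sk_simulate_work(uint64_t cycles, uint64_t e0, uint64_t e1) {
    sk_hw.cycles += cycles;
    sk_hw.events[0] += e0;
    sk_hw.events[1] += e1;
}

/* Take a snapshot of the counters, marking a moment in time. */
static sk_data_t sk_read(void) { return sk_hw; }

/* Zero an accumulator before the first interval. */
static void sk_counter_init(sk_counter_t *c) { memset(c, 0, sizeof *c); }

/* Add the difference (stop - start) into the accumulator. */
static void sk_accumulate(sk_counter_t *c, const sk_data_t *start,
                          const sk_data_t *stop) {
    assert(stop->cycles >= start->cycles);  /* readings must be ordered */
    c->intervals += 1;
    c->cycles += stop->cycles - start->cycles;
    for (int i = 0; i < SK_NUM_EVENTS; i++)
        c->events[i] += stop->events[i] - start->events[i];
}

/* Measure three intervals, with unmeasured work between them. */
static void sk_demo(sk_counter_t *total) {
    sk_counter_init(total);
    for (uint64_t i = 1; i <= 3; i++) {
        sk_data_t t0 = sk_read();
        sk_simulate_work(1000, 10, 2 * i);   /* the code under test */
        sk_data_t t1 = sk_read();
        sk_accumulate(total, &t0, &t1);
        sk_simulate_work(500, 5, 5);         /* not measured */
    }
}
```

Because each reading is only a snapshot, the work between one interval's stop and the next interval's start never reaches the accumulator. That is the sense in which "when to measure" is decided in the program.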
The Intel Pentium-series processors include a 64-bit cycle counter, and
two 40-bit event counters, with a list of events and additional semantics
that depend on the particular processor. The AMD Athlon processor
has a 64-bit cycle counter, and four 48-bit event counters, with a different
set of events and semantics similar to the Pentium Pro family. The library
abstracts these details into a type system and compile-time constants, allowing
(we hope) further implementations.
Arbitrary programs that do not use the hardware counters directly or
indirectly can be sampled with rabbit, which runs a child process
alongside its own interval timer. For example, 'rabbit sleep
60' is a simple system monitor. rabbit multiplexes
through a list of events, and can manage an output directory with data
files and gnuplot scripts. Summary reports are easily generated.
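One way to picture multiplexing through a list of events is a round-robin schedule: on each timer tick, the next group of events is loaded into the available counters. The sketch below is an assumption made for illustration and is not taken from rabbit's source; the function name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: choose which events to count during timer tick `tick`
 * when `n_events` events share `n_counters` hardware counters.
 * Writes the chosen event indices into `slot` (room for n_counters entries)
 * and returns how many slots were filled. */
static size_t sk_events_for_tick(size_t tick, size_t n_events,
                                 size_t n_counters, size_t *slot) {
    assert(n_counters > 0);
    /* Number of groups needed to cover the whole list, rounding up. */
    size_t n_groups = (n_events + n_counters - 1) / n_counters;
    if (n_groups == 0)
        return 0;  /* empty event list: nothing to select */
    /* Groups are visited in order, wrapping around after the last one. */
    size_t first = (tick % n_groups) * n_counters;
    size_t filled = 0;
    while (filled < n_counters && first + filled < n_events) {
        slot[filled] = first + filled;
        filled++;
    }
    return filled;
}
```

With five events and two counters, ticks 0, 1 and 2 select events {0, 1}, {2, 3} and {4}, and tick 3 starts over at {0, 1}. Each event is therefore observed in only a fraction of the intervals, which is the usual cost of multiplexing.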
There are, of course, some flaws and limitations in the library design
and implementation, centered on the process state and the goal of clean
measurements.
-
Hardware performance counters are defined outside the "architectural"
register set, and they are not saved and restored on process context switches,
either by the hardware or by [unpatched] Linux. The measurements
are therefore attached to the processor, and not to a process or thread.
It is not possible to separate the actions of a daemon, or another user,
from the program under test, but it is possible to separate user code from
system code according to the privilege level. Thus, as always, testing
for program development should be done with as few other processes in the
system as possible.
-
On a dual-processor system, a process context switch could move the process
or thread to the other processor, leaving the selected performance counters
behind. The current implementation makes no serious attempt to deal with
dual processors.
-
The event counters, being rather short, are prone to overflow at high clock
rates; it is no consolation to observe that other processors use even fewer
bits. On some processors, an interrupt can be generated when a counter
overflows, but the library does not watch for it. The user must ensure that
counters are read frequently enough, and ask whether the results seem
reasonable.
-
To negotiate exclusive access to the counters, and to run some privileged
instructions, the /dev/pmc device must be installed. If every access
to the hardware counters goes through the PMC library to /dev/pmc, we can
guarantee clean measurements, but this constraint is not enforceable.
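The counter-overflow limitation noted above can be made concrete with a little arithmetic. A 40-bit counter holds 2^40 counts, so an event that occurs 500 million times per second (a rate chosen only for illustration) wraps it in roughly 2200 seconds, or about 37 minutes. The C sketch below shows that estimate, together with the standard masked-subtraction technique for differencing two readings of a narrow counter. The function names are hypothetical and are not part of the library. The difference is correct only if the counter wrapped at most once between the readings.

```c
#include <assert.h>
#include <stdint.h>

/* Bit mask for a counter that is `bits` wide (1..63). */
static uint64_t sk_mask(unsigned bits) {
    assert(bits >= 1 && bits <= 63);
    return (UINT64_C(1) << bits) - 1;
}

/* Elapsed count between two raw readings of a `bits`-wide counter.
 * Unsigned subtraction followed by masking gives the right answer even
 * when the counter wrapped once between `start` and `stop`. */
static uint64_t sk_wrapped_delta(uint64_t start, uint64_t stop, unsigned bits) {
    return (stop - start) & sk_mask(bits);
}

/* Seconds until a `bits`-wide counter wraps at a steady event rate. */
static double sk_seconds_to_wrap(unsigned bits, double events_per_second) {
    assert(bits >= 1 && bits <= 63);
    assert(events_per_second > 0.0);
    return (double)(UINT64_C(1) << bits) / events_per_second;
}
```

A program that reads the counters at least once within the interval returned by sk_seconds_to_wrap(), computed for the fastest event it expects, can rely on the masked difference. A longer gap between readings loses whole wraps silently.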
The library's principal data types to be understood are
| pmc_control_t | complete description of a measurement experiment |
| pmc_event_set_t | event codes for concurrent measurement |
| pmc_data_t | raw cycle and event counter readings |
| pmc_counter_t | elapsed time and accumulated event counts |
| pmc_cycle_t, pmc_event_t | components of pmc_data_t |
| pmc_selector_t, pmc_intervals_t, pmc_cycles_t, pmc_events_t | components of pmc_counter_t |
The library's principal functions are
| pmc_getargs() | read command-line options |
| pmc_open(), pmc_close() | acquire and release the hardware counters |
| pmc_start() | mark the start of the experiment |
| pmc_select() | select the events for the hardware counters |
| pmc_read() | read the counters |
| pmc_counter_init() | initialize an accumulator |
| pmc_accumulate() | accumulate the counters from a time interval |
| pmc_print_results() | report the results |
For more details, take the Tour by Examples.
At various points along the tour there are philosophical and technical
discussions related to the library design and implementation.
Some related sites and projects, with commentary:
-
Los Alamos National Laboratory, M. Patrick Goda and Michael S. Warren,
pperf.
The software formerly known as perfmon. Modeled on GNU /bin/time,
using Stephan Meyer's /dev/msr.
mperfmon and /dev/msr were the starting point for the present library, but
everything has been rewritten.
-
Parallel Tools Consortium, Performance
Data Standard and API, PerfAPI.
A standardization effort from some partial implementations. The web
site includes an email
archive and extensive
pointers to other sites. Highly recommended if you are willing
to patch Linux.
-
Karen L. Karavanic and Barton P. Miller, Experiment
Management Support for Performance Tuning, SuperComputing'97.
The big picture. Right on!
-
Compaq / Digital Equipment, Alpha processor, Digital Unix or Windows NT.
-
Hewlett-Packard, PA-8000 processor.
-
IBM, Power/PowerPC processors, various systems.
-
E. H. Welbon et al., The
POWER2 Performance Monitor. IBM Journal of Research and Development,
38(5), September 1994. See also U.S. Patent 5,557,548, Method and
system for performance monitoring within a data processing system.
-
Maurice T. Franklin et al., POWER2
Server Performance Analysis, AIXpert, Feb. 1995.
-
Jussi Mäki, POWER2
Hardware Performance Monitor Tools.
-
F. E. Levine and C. P. Roth, A programmer's view of performance monitoring
in the PowerPC microprocessor,
IBM Journal of Research and Development, vol. 41, no. 3, 1997.
-
Intel, Pentium-series processors.
-
Windows 95, 98 or NT (2000),
VTune.
A comprehensive performance analysis environment, much like prof and pixie
with a good interface, but also allowing use of the performance counters.
[no personal experience]
-
Windows 95, PCT (Performance Counter Tool).
An older product, superseded by VTune.
[no personal experience]
-
Bruce Greer and Greg Henry, High
Performance Software on Intel Pentium Pro Processors, or Micro-Ops to TeraFLOPS,
SuperComputing'97. Intel's ASCI Red system, with a good introduction
to the Pentium Pro and associated coding techniques. No mention of
the performance counters.
-
Michael S. Warren et al., Pentium
Pro Inside: I. A Treecode at 430 Gigaflops on ASCI Red, II. Price/Performance
of $50/Mflop on Loki and Hyglac, SuperComputing'97. Loki is the
system for which pperf was written.
-
David Mentré, Using
hardware counters with Linux. Includes pointers to other sites.
-
Harald Hoyer, Linux
kernel patches. Per-task data through /proc/<pid>/msr.
Processor-specific library interface. Intel Pentium implementation
only.
-
Silicon Graphics, MIPS R10000 processor, IRIX.
-
Marco Zagha et al., Performance
Analysis Using the MIPS R10000 Performance Counters, SuperComputing'96.
Design and use of the counters; chip layout, OS support, etc. See
also the SGI IRIX man pages for r10k_counters, perfex, libperfex.
-
Cristina Hristea et al., Measuring
Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro
Benchmarks, SuperComputing'97. A study of the Origin 2000, which
is built from the R10000, and the Sun Ultra Enterprise 10000. Timing
is done with gettimeofday() and millions of repetitions. Go figure.
-
Sun Microsystems, UltraSPARC processors, Solaris.
Performance-Monitoring Counters Library, for Intel/AMD
Processors and Linux
Author: Don Heller, dheller@cse.psu.edu