Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces:
- hardware registers
- pmc_uint32_t
- pmc_uint64_t
- Control Registers -- CR0, CR4, EFLAGS
- Model-Specific Registers -- TSC, PMC control, PMC counters
- Model-specific semantics
- pmc_query, pmc_test, pmc_test.c
- pmc_enable, pmc_disable, pmc_able.c
These are implementation details and various opinions that can be skipped
by most readers.
The Intel processors, having developed over many years with the constraint
of backward-compatibility, have a variety of register lengths, from 8 to
128 bits. The 32-bit registers (EAX, and so on) are covered by the
type pmc_uint32_t, for an unsigned 32-bit integer.
The 64-bit registers, which are really pairs of 32-bit registers, are covered
by the type pmc_uint64_t, for an unsigned 64-bit integer.
These types should not be used outside the library code. In most
cases, there is a more meaningful type, such as pmc_cycles_t,
that should be used instead.
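As a minimal sketch of what "pairs of 32-bit registers" means in practice,
the code below joins a high:low pair (for example EDX:EAX) into one 64-bit
value and splits it again. The typedefs are assumed stand-ins for the
library types, not their actual definitions, and pmc_join64, pmc_high32 and
pmc_low32 are hypothetical helper names.

```c
#include <stdint.h>

/* Assumed stand-ins for the library types; the real definitions may differ. */
typedef uint32_t pmc_uint32_t;
typedef uint64_t pmc_uint64_t;

/* Hypothetical helper: combine a high:low register pair into 64 bits. */
static inline pmc_uint64_t pmc_join64(pmc_uint32_t high, pmc_uint32_t low)
{
    return ((pmc_uint64_t)high << 32) | (pmc_uint64_t)low;
}

/* Hypothetical helper: upper 32 bits of a 64-bit value. */
static inline pmc_uint32_t pmc_high32(pmc_uint64_t value)
{
    return (pmc_uint32_t)(value >> 32);
}

/* Hypothetical helper: lower 32 bits of a 64-bit value. */
static inline pmc_uint32_t pmc_low32(pmc_uint64_t value)
{
    return (pmc_uint32_t)(value & 0xFFFFFFFFu);
}
```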
The control registers CR0, CR2, CR3 and CR4 (CR1 is reserved) and the
EFLAGS register are described in the Intel Architecture publications.
A few of the bits in these registers affect the library code, and can be
modified. Do not attempt these modifications if you are uncertain of the
outcome, and especially not if you cannot reboot the system yourself; some
of the permitted modifications are not friendly.
The Time-Stamp Counter (TSC), and the Performance-Monitoring Counters (PMCs)
and their associated control registers, are Model-Specific Registers (MSRs).
That is, they are not strictly part of the Intel Architecture, and their
implementations and semantics have changed from one Pentium model to the
next. All the MSRs can be read or written with the rdmsr and wrmsr
instructions, but only from kernel mode (CPL 0). The TSC can be read with
rdtsc, and the PMCs with rdpmc, from user mode if the permission bits in
CR4 are set properly (TSD clear for rdtsc, PCE set for rdpmc), and otherwise
only from kernel mode.
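A user-mode read of the TSC might look like the sketch below. It assumes a
GCC-compatible compiler on an x86 target and returns 0 elsewhere; the
function names are hypothetical and are not the library's interface. The
second function executes cpuid first, since rdtsc is not itself a
serializing instruction. If the operating system has disabled rdtsc in user
mode, these reads will fault.

```c
#include <stdint.h>

#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
#define SKETCH_HAVE_RDTSC 1
#else
#define SKETCH_HAVE_RDTSC 0
#endif

/* Hypothetical helper: read the 64-bit TSC, delivered in EDX:EAX. */
static inline uint64_t sketch_read_tsc(void)
{
#if SKETCH_HAVE_RDTSC
    uint32_t low, high;
    __asm__ __volatile__("rdtsc" : "=a"(low), "=d"(high));
    return ((uint64_t)high << 32) | low;
#else
    return 0;   /* not an x86 build: nothing to read */
#endif
}

/* Hypothetical helper: the same read, preceded by cpuid to serialize. */
static inline uint64_t sketch_read_tsc_serialized(void)
{
#if SKETCH_HAVE_RDTSC
    uint32_t low, high;
    /* cpuid (leaf 0) overwrites EAX, EBX, ECX, EDX; rdtsc then refills
       EAX and EDX with the counter value. */
    __asm__ __volatile__("cpuid\n\trdtsc"
                         : "=a"(low), "=d"(high)
                         : "a"(0)
                         : "ebx", "ecx", "memory");
    return ((uint64_t)high << 32) | low;
#else
    return 0;
#endif
}

/* Cycles elapsed between two readings. */
static inline uint64_t sketch_tsc_elapsed(uint64_t start, uint64_t stop)
{
    return stop - start;
}
```

Typical use is to take one reading before and one after the code of
interest and pass both to sketch_tsc_elapsed.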
- rdtsc is architectural, while rdpmc is not. This mainly reflects
  Intel's commitment to support the rdtsc instruction in future processors,
  and the lack of a stated commitment to the Performance-Monitoring
  Counters. However, Intel is committed to the VTUNE product, which makes
  use of the counters; it is presently available only on the Microsoft
  operating systems.
- The TSC is set to 0 at system reset. It presently increments once
  per processor cycle, and is 64 bits wide. Only code running at CPL 0 can
  modify the TSC, and then only the lower 32 bits can be set, while the
  upper 32 bits are cleared. We do not recommend changing the TSC, and the
  library does not permit it. The TSC continues to increment in the halt
  state (HLT instruction).
- The rdtsc instruction is available with all the Pentium processors.
  It is not serializing. On the first Pentium (60/66 MHz), rdtsc takes
  6 clocks at CPL 0 and 11 clocks at CPL 1, 2 or 3; this does not include
  the time to move the data from registers to memory.
- The rdpmc instruction is available only with the Pentium/MMX, Pentium Pro
  and Pentium II or III processors. It is not serializing.
- If, for some reason, it is important to serialize rdtsc and rdpmc, this
  can be done with the cpuid instruction. Compile the library and user
  code with -DPMC_READ_SERIAL.
- The Performance-Monitoring Counters are 40 bits wide, but are accessed
  as 64 bits with rdpmc or rdmsr. The upper 24 bits are not specified
  by Intel; on the Pentium they are 0, while on the Pentium Pro/II/III
  they are the upper 24 bits of the Time-Stamp Counter, and must be cleared
  before use.
- It is tempting to think that the time-stamp and performance counters
  should be saved as part of the process state by the operating system,
  but restoring saved values means writing the counters, which the hardware
  restricts. On the Pentium, all 40 PMC bits can be written correctly from
  memory (and are then zero-extended to 64 bits). On the Pentium Pro, only
  the lower 32 bits can be written, and they are then sign-extended to
  40 bits (and then TSC-extended to 64 bits). The sign-extension makes a
  countdown test possible: a negative count is written, and an interrupt
  can be generated when the upward-counting 40-bit PMC overflows. The
  library ignores these overflows in its present implementation.
- The AMD Athlon uses the control fields of the Pentium Pro family, but it
  has four 48-bit counters that are zero-extended like the Pentium, and a
  different set of events.
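The counter-width rules in the notes above can be sketched as plain integer
arithmetic. All names here are hypothetical and none of this is library
code: the sketch discards the unspecified upper bits of a raw reading,
models the Pentium Pro sign-extension of a 32-bit write, computes the value
to write for a countdown, and takes a difference that survives one wrap of
the counter.

```c
#include <stdint.h>

#define SKETCH_PMC_BITS_INTEL  40   /* Pentium, Pentium Pro/II/III */
#define SKETCH_PMC_BITS_ATHLON 48   /* AMD Athlon */

/* Mask selecting the low `bits` bits (valid for 1 <= bits <= 63). */
static inline uint64_t sketch_pmc_mask(unsigned bits)
{
    return ((uint64_t)1 << bits) - 1;
}

/* Discard the upper bits of a raw 64-bit rdpmc or rdmsr result. */
static inline uint64_t sketch_pmc_value(uint64_t raw, unsigned bits)
{
    return raw & sketch_pmc_mask(bits);
}

/* Pentium Pro style write: a 32-bit value sign-extended to 40 bits. */
static inline uint64_t sketch_sign_extend_32_to_40(uint32_t written)
{
    uint64_t value = written;
    if (written & 0x80000000u)
        value |= 0xFF00000000ULL;   /* copy bit 31 into bits 32..39 */
    return value;
}

/* 32-bit value to write so that a 40-bit counter overflows after
   `events` increments (1 <= events <= 2^31): the negated count. */
static inline uint32_t sketch_countdown_preload(uint32_t events)
{
    return (uint32_t)0 - events;
}

/* Events between two readings, correct across one wrap of the counter. */
static inline uint64_t sketch_pmc_delta(uint64_t start, uint64_t stop,
                                        unsigned bits)
{
    return (stop - start) & sketch_pmc_mask(bits);
}
```

For example, a preload for 1000 events is 0xFFFFFC18, which the hardware
would extend to 0xFFFFFFFC18, exactly 1000 counts short of 2^40.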
The program pmc_test uses calls to /dev/pmc to read the
various control registers and counters. It is not intended as a model
for other programs; /dev/pmc should not be used directly. Sample
output from pmc_test:
pmc_query is a simpler version of pmc_test
that is useful when configuring the installation.
pmc_enable and pmc_disable set and clear the permission bit
for the rdpmc instruction.
After pmc_enable, pmc_query should show that rdpmc is enabled;
rabbit also checks this.
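The permission bits themselves are CR4.PCE (bit 8) for rdpmc and CR4.TSD
(bit 2) for rdtsc. The sketch below works on a copy of the CR4 value only;
changing the real register requires kernel mode, and whether pmc_enable and
pmc_disable are implemented this way is an assumption. All names are
hypothetical.

```c
#include <stdint.h>

#define SKETCH_CR4_TSD (1u << 2)  /* set: rdtsc restricted to CPL 0 */
#define SKETCH_CR4_PCE (1u << 8)  /* set: rdpmc allowed at any CPL  */

/* Return the CR4 image with user-mode rdpmc enabled. */
static inline uint32_t sketch_cr4_enable_rdpmc(uint32_t cr4)
{
    return cr4 | SKETCH_CR4_PCE;
}

/* Return the CR4 image with user-mode rdpmc disabled. */
static inline uint32_t sketch_cr4_disable_rdpmc(uint32_t cr4)
{
    return cr4 & ~SKETCH_CR4_PCE;
}

/* Nonzero if rdpmc may execute at privilege level `cpl` (0..3). */
static inline int sketch_rdpmc_allowed(uint32_t cr4, unsigned cpl)
{
    return cpl == 0 || (cr4 & SKETCH_CR4_PCE) != 0;
}

/* Nonzero if rdtsc may execute at privilege level `cpl` (0..3). */
static inline int sketch_rdtsc_allowed(uint32_t cr4, unsigned cpl)
{
    return cpl == 0 || (cr4 & SKETCH_CR4_TSD) == 0;
}
```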
What do we want in a hardware performance counter?
It helps to view the measurement system as an experimental probe, with
the ultimate goal of integrating measurements into the program design.
In any experiment, noise is bad, unless you are actually trying to measure
the noise. The goal for the typical application programmer is to
acquire data that is not influenced by other processes or by OS activity.
The simplest way to do this may be to get rid of the other processes, and
fend off the OS. Of course, all user programs cause OS activity on
their behalf, and it would be a mistake not to inquire about that activity.
The safest way to get clean data would be for the OS to make the measurements
part of the process state, but this drives up the cost of a context switch
for all programs. It would then be necessary to prove that the increased
cost is small, always, that the measurements are correct, always, and that
all cycles expended are counted somewhere between the kernel and all the
processes.
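To make the bookkeeping concrete, here is a minimal sketch of counters kept
as part of the process state: readings are recorded when a process is
switched in, and the difference is accumulated when it is switched out. The
structure, the function names and the choice of two counters are
assumptions for illustration; no existing kernel interface is implied.

```c
#include <stdint.h>

#define SKETCH_NUM_COUNTERS 2

/* Hypothetical per-process record kept by the operating system. */
struct sketch_pmc_state {
    uint64_t start[SKETCH_NUM_COUNTERS];  /* readings at start of slice */
    uint64_t total[SKETCH_NUM_COUNTERS];  /* accumulated over all slices */
};

/* Called when the process is given the processor. */
static inline void sketch_switch_in(struct sketch_pmc_state *state,
                                    const uint64_t *now)
{
    for (int i = 0; i < SKETCH_NUM_COUNTERS; i++)
        state->start[i] = now[i];
}

/* Called when the process loses the processor. */
static inline void sketch_switch_out(struct sketch_pmc_state *state,
                                     const uint64_t *now)
{
    for (int i = 0; i < SKETCH_NUM_COUNTERS; i++)
        state->total[i] += now[i] - state->start[i];
}
```

If every interval between two switches is charged to exactly one record
(a process or the kernel), the totals add up to the overall change in each
counter, which is the "all cycles counted somewhere" property. The two
extra loops per context switch are the cost that would have to be shown to
be small.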
At minimum, a pure 64-bit cycle counter should be provided in user mode.
The Intel Time-Stamp Counter has all the right characteristics, even though
it is implemented on a processor without 64-bit integer registers.
That a thoroughly 64-bit processor like the DEC Alpha has only a 32-bit
cycle counter, and one that is architecturally polluted by the operating
system, makes the job of acquiring clean data outside the OS far too
difficult. [Watch this space for more slams on the other microprocessors.]
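The difference between 32 and 64 bits is easy to quantify. The sketch below
computes how long a free-running counter lasts before it wraps; the
function name and the 500 MHz clock rate used in the example are
illustrative assumptions.

```c
/* Seconds until a counter of `bits` width (1..64) wraps when it
   increments `hz` times per second. */
static inline double sketch_seconds_to_wrap(unsigned bits, double hz)
{
    double span = (bits >= 64)
        ? 18446744073709551616.0                    /* 2^64 */
        : (double)((unsigned long long)1 << bits);  /* 2^bits */
    return span / hz;
}
```

At an assumed 500 MHz, a 32-bit counter wraps after about 8.6 seconds,
while a 64-bit counter runs for more than a thousand years.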
- Low-level functionality - efficient operation
  - fast access, low overhead
  - always on, or at least always available without much effort
  - single-instruction atomic read access from user mode
  - no need to switch to OS kernel mode (usually)
  - a back-channel to move performance data out of the system without
    perturbing the system under study
  - optional serialization with neighboring instructions in a parallel,
    pipelined, speculative-execution architecture
  - measurements that correspond to compiler optimization techniques,
    allowing feedback to the programmer
- High-level functionality - efficient programming
  - machine-independent interfaces
  - processor implementation-independent
  - processor architecture-independent
  - programming language-independent (dream on!)
  - object oriented?
  - no manual intervention required?
- Note the contradictory goals: the data to be collected is not
  machine-independent, though the control and access methods should be.
  The user wants the best available information that the current system
  can provide, but the code must be portable. The information should be
  specific to the goals of the program, but the information-gathering
  should be as automatic as possible.
- Part of the process state
  - thread state also?
  - what goes into the process state?
    - table of accumulated time and counters
    - selection vector for counters, used at start of time slice
    - counter values taken at start of time slice, if counters are not
      reset
  - counter accumulators, control selectors, saved and restored on context
    switch, including process migration in a multiprocessor
  - is this antithetical to efficiency?
  - prevent other entities from changing the selection and control of the
    counters
  - support for initiating, selecting, gathering, presenting the data
  - feedback to the program? what if the programmer doesn't care?
- Not part of the process state
  - feedback to system management, about all system activities
  - not part of the OS functionality at all?
  - daemon? pseudo-device?
- Available during collection
  - movies
  - performance steering
- Scalable granularity
  - sampling - frequency of measurement under system control, or under
    program control
  - direct intervention - points of measurement selected by the programmer
  - software alligator clip
- Pure cycle counter
  - at least 64 bits, strictly monotonic increasing (for a long time)
  - no OS pollution, no arbitrary reset (these would violate monotonicity
    outside kernel mode)
  - for any other measurement, can derive per-cycle data
  - useful in many circumstances beyond simple time measurement, such as
    transaction time stamps
  - litany of bad examples
  - are there any bad consequences to using this counter for two distinct
    purposes?
- Memory hierarchy effectiveness
  - cache behavior at each level
  - occupancy rates (by cache line)
  - miss rates, eviction rates
  - fraction of cache line actually used before write-back or other
    disposal
  - time to acquire first part of cache line; miss penalty
  - separate counters for global (static) / stack / heap data as used by
    each process
  - This requires the counters to know the virtual address (the process
    address space) as well as the physical address (the system address
    space). To a large extent this presupposes the compiler and operating
    system.
- Instruction stream effectiveness
  - branch prediction success rate
  - speculative execution miss rate
  - instructions (or micro-instructions) per cycle
  - resource stalls, and which resource
  - operation counts, not just execution unit counts
- Network stream effectiveness
  - message-passing statistics
- Miscellaneous low-level considerations
  - two actual counters selected from a larger collection of possibilities
  - more than two counters in future systems?
  - is this a lossy system? are there cycles that are never accumulated
    anywhere? what fraction of time is now devoted to measurements?
  - inlined data collection for efficiency, but shared runtime library for
    portability. which is better?
  - an external monitor like rabbit assumes that no other program will
    change the selection of the counters.
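The point in the outline about deriving per-cycle data from a pure cycle
counter can be sketched as follows: an event counter and the cycle counter
are sampled together at two points, and the ratio of the differences gives,
for example, instructions per cycle. The structure and function names are
hypothetical.

```c
#include <stdint.h>

/* One sample: the cycle counter and one event counter, read together. */
struct sketch_sample {
    uint64_t cycles;
    uint64_t events;
};

/* Events per cycle between two samples; 0.0 if no cycles elapsed. */
static inline double sketch_events_per_cycle(struct sketch_sample start,
                                             struct sketch_sample stop)
{
    uint64_t cycles = stop.cycles - start.cycles;
    uint64_t events = stop.events - start.events;
    return cycles ? (double)events / (double)cycles : 0.0;
}
```

With 6000 events counted over 4000 cycles the result is 1.5; the same ratio
applies to any other event, such as cache misses per cycle.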
Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000