Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces
Control Registers -- CR0, CR4, EFLAGS
Model-Specific Registers -- TSC, PMC control, PMC counters
pmc_query, pmc_test, pmc_test.c
pmc_enable, pmc_disable, pmc_able.c
These are implementation details and various opinions that can be skipped
by most readers.
The Intel processors, having developed over many years with the constraint
of backward-compatibility, have a variety of register lengths, from 8 to
128 bits. The 32-bit registers (EAX, and so on) are covered by the
type pmc_uint32_t, for an unsigned 32-bit integer.
The 64-bit registers, which are really pairs of 32-bit registers, are covered
by the type pmc_uint64_t, for an unsigned 64-bit integer.
These types should not be used outside the library code. In most
cases, there is a more meaningful type, such as pmc_cycles_t,
that should be used instead.
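As an illustration of "pairs of 32-bit registers", here is a minimal sketch of joining and splitting the two halves of a 64-bit value. The typedefs and function names are hypothetical stand-ins built on <stdint.h>; they are not the library's own definitions of pmc_uint32_t and pmc_uint64_t.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the library's integer types. */
typedef uint32_t example_uint32_t;
typedef uint64_t example_uint64_t;

/* Join a high and a low 32-bit half (for example EDX and EAX) into 64 bits. */
static example_uint64_t join_halves(example_uint32_t hi, example_uint32_t lo)
{
    return ((example_uint64_t)hi << 32) | lo;
}

/* Extract the upper 32 bits of a 64-bit value. */
static example_uint32_t high_half(example_uint64_t v)
{
    return (example_uint32_t)(v >> 32);
}

/* Extract the lower 32 bits of a 64-bit value. */
static example_uint32_t low_half(example_uint64_t v)
{
    return (example_uint32_t)(v & 0xFFFFFFFFu);
}
```

The cast before the shift matters: shifting a 32-bit value left by 32 is undefined in C, so the high half must be widened first.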
The control registers CR0 through CR4, and the EFLAGS register, are
described in the Intel Architecture publications. A few of the bits
in these registers affect the library code, and can be modified.
Do not attempt these modifications if you are uncertain of the outcome,
and especially not if you cannot reboot the system yourself; some
of the permitted modifications are unfriendly.
The Time-Stamp Counter, and the Performance-Monitoring Counters
and their associated control registers, are Model-Specific Registers.
That is, they are not strictly part of the Intel Architecture,
and their implementations and semantics have changed from one Pentium model
to the next. All the MSRs can be read or written with the rdmsr
and wrmsr instructions, from kernel mode (CPL 0). The TSC can be
read with rdtsc, and the PMCs with rdpmc, from user mode if the permission
bits are set properly in the control registers, and otherwise only from
kernel mode.
The program pmc_test uses calls to /dev/pmc to read the
various control registers and counters. It is not intended as a model
for other programs; /dev/pmc should not be used directly. Sample
output from pmc_test:
rdtsc is architectural, while rdpmc is not. This mainly reflects
Intel's commitment to support the rdtsc instruction in future processors,
and lack of stated commitment to the Performance-Monitoring Counters.
However, Intel is committed to the
product, which makes use of the counters; it is presently available only
on the Microsoft operating systems.
The TSC is set to 0 at system reset. It presently increments once
per processor cycle, and is 64 bits wide. Only code running at CPL 0 can
modify the TSC, and then only the lower 32 bits can be set, while the upper
32 bits are cleared. We do not recommend changing the TSC, and the library does
not permit it. The TSC will continue to increment in the halt state
The rdtsc instruction is available with all the Pentium processors.
It is not serializing. On the first Pentium (60/66 MHz), rdtsc takes
6 clocks at CPL 0 and 11 clocks at CPL 1, 2 or 3; this does not include the
time to move the data from registers to memory.
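For readers who want to see what a user-mode read of the TSC looks like, here is a minimal sketch, assuming an x86 processor, a GCC-compatible compiler, and that user-mode rdtsc is permitted. The names read_tsc and elapsed_cycles are hypothetical; this is not the library's interface.

```c
#include <stdint.h>

/* Read the 64-bit Time-Stamp Counter. rdtsc returns the low 32 bits
 * in EAX and the high 32 bits in EDX. x86 only; GCC-style inline asm. */
static inline uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
#error "rdtsc is only available on x86 processors"
#endif
}

/* Cycles elapsed between two readings taken on the same processor. */
static inline uint64_t elapsed_cycles(uint64_t start, uint64_t end)
{
    return end - start;
}
```

A measurement then brackets the code of interest with two calls to read_tsc and passes both readings to elapsed_cycles. Because rdtsc is not serializing, very short intervals measured this way may include or exclude neighboring instructions.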
The rdpmc instruction is available only with the Pentium/MMX, Pentium Pro
and Pentium II or III processors. It is not serializing.
If, for some reason, it is important to serialize rdtsc and rdpmc, this
can be done with the cpuid instruction. Compile the library and user code with
The Performance-Monitoring Counters are 40 bits wide, but are accessed
as 64 bits with rdpmc or rdmsr. The upper 24 bits are not specified
by Intel, but on the Pentium they are 0, while on the Pentium Pro/II/III
they are the upper 24 bits of the Time Stamp Counter, and must be cleared
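Clearing the unspecified upper 24 bits of a counter reading is a single mask operation. The following is a minimal sketch of that step on a plain 64-bit integer; the macro and function names are hypothetical, and the library may do this differently.

```c
#include <stdint.h>

/* Width of a Pentium / Pentium Pro performance counter, in bits. */
#define EXAMPLE_PMC_BITS 40

/* Keep the low 40 bits of a raw 64-bit reading and clear the upper 24,
 * whatever the processor happened to place there. */
static uint64_t pmc_clear_upper_bits(uint64_t raw)
{
    const uint64_t mask = (UINT64_C(1) << EXAMPLE_PMC_BITS) - 1;
    return raw & mask;
}
```

With this mask, a raw reading of 0xABCDEF123456789A yields the counter value 0x123456789A.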
It is tempting to think that the time-stamp and performance counters should
be saved as part of the process state by the operating system. On
the Pentium, all 40 PMC bits can be written correctly from memory (and
then zero-extended to 64 bits). On the Pentium Pro, only the lower 32 bits
can be written, and then they are sign-extended to 40 bits (and then TSC-extended
to 64 bits). The reason for doing the sign-extension is to make a
countdown test, as an interrupt can be generated when the upward-moving
40-bit PMC overflows. The library ignores these overflows in its
The AMD Athlon uses the control fields of the Pentium Pro family, but it
has four 48-bit counters that are zero-extended like the Pentium, and a
different set of events.
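The countdown arrangement on the Pentium Pro can be checked with a little arithmetic. The sketch below models, on plain integers, what the text describes: a 32-bit value written to a counter is sign-extended to 40 bits, so writing the negative of n makes the counter overflow after n events. All names are hypothetical, the model covers only the 40-bit Pentium Pro case (not the zero-extended Pentium or Athlon counters), and it assumes n is between 1 and 2^31.

```c
#include <stdint.h>

#define EXAMPLE_PMC_MASK_40 ((UINT64_C(1) << 40) - 1)

/* Model of a Pentium Pro counter write: the low 32 bits are
 * sign-extended to 40 bits. */
static uint64_t ppro_counter_after_write(uint32_t low32)
{
    uint64_t v = low32;
    if (low32 & UINT32_C(0x80000000))
        v |= UINT64_C(0xFF) << 32;      /* copy the sign bit into bits 32..39 */
    return v & EXAMPLE_PMC_MASK_40;
}

/* 32-bit value to write so that the 40-bit counter overflows after
 * n events. Assumes 1 <= n <= 2^31 so that the sign bit is set. */
static uint32_t countdown_preset(uint32_t n)
{
    return (uint32_t)(0u - n);
}

/* Number of increments before a 40-bit counter value wraps to zero. */
static uint64_t events_until_overflow(uint64_t counter40)
{
    return (EXAMPLE_PMC_MASK_40 + 1) - counter40;
}
```

For n = 1000, the preset is 0xFFFFFC18, the counter holds 2^40 - 1000 after the write, and the overflow arrives exactly 1000 events later. Without the sign extension, a 32-bit write could never place the counter that close to the 40-bit limit.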
pmc_query is a simpler version of pmc_test
that is useful when configuring the installation.
pmc_enable and pmc_disable allow you to toggle the permission bit
for the rdpmc instruction.
pmc_query should show that rdpmc is enabled,
and rabbit will check this.
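The permission bits in question live in CR4. In the Intel manuals, CR4.PCE (bit 8) allows rdpmc at any privilege level when set, and CR4.TSD (bit 2) restricts rdtsc to CPL 0 when set. The sketch below only models these bits on an ordinary integer; actually reading or writing CR4 requires kernel mode, and the function names are hypothetical rather than the code in pmc_able.c.

```c
#include <stdint.h>

#define EXAMPLE_CR4_TSD (UINT32_C(1) << 2)  /* Time-Stamp Disable */
#define EXAMPLE_CR4_PCE (UINT32_C(1) << 8)  /* Performance-Counter Enable */

/* rdtsc is allowed in user mode unless TSD is set. */
static int user_rdtsc_allowed(uint32_t cr4)
{
    return (cr4 & EXAMPLE_CR4_TSD) == 0;
}

/* rdpmc is allowed in user mode only if PCE is set. */
static int user_rdpmc_allowed(uint32_t cr4)
{
    return (cr4 & EXAMPLE_CR4_PCE) != 0;
}

/* New CR4 value with user-mode rdpmc enabled; other bits unchanged. */
static uint32_t cr4_set_pce(uint32_t cr4)
{
    return cr4 | EXAMPLE_CR4_PCE;
}

/* New CR4 value with user-mode rdpmc disabled; other bits unchanged. */
static uint32_t cr4_clear_pce(uint32_t cr4)
{
    return cr4 & ~EXAMPLE_CR4_PCE;
}
```

Note the opposite polarity of the two bits: PCE grants a permission when set, while TSD removes one.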
What do we want in a hardware performance counter?
It helps to view the measurement system as an experimental probe, with
the ultimate goal of integrating measurements into the program design.
In any experiment, noise is bad, unless you are actually trying to measure
the noise. The goal for the typical application programmer is to
acquire data that is not influenced by other processes or by OS activity.
The simplest way to do this may be to get rid of the other processes, and
fend off the OS. Of course, all user programs cause OS activity on
their behalf, and it would be a mistake not to inquire about that activity.
The safest way to get clean data would be for the OS to make the measurements
part of the process state, but this drives up the cost of a context switch
for all programs. It would then be necessary to prove that the increased
cost is small, always, that the measurements are correct, always, and that
all cycles expended are counted somewhere between the kernel and all the
At minimum, a pure 64-bit cycle counter should be provided in user mode.
The Intel Time-Stamp Counter has all the right characteristics, though
it is implemented on a processor without 64-bit integer registers.
For a processor like the DEC Alpha, which is thoroughly 64-bit, to have
only a 32-bit cycle counter that is architecturally polluted by the operating
system, makes the job of acquiring clean data far too difficult outside
the OS. [Watch this space for more slams on the other microprocessors.]
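The case for 64 bits is easy to put in numbers. The sketch below computes how long a free-running cycle counter lasts before it wraps, and shows the modular subtraction that a 32-bit counter forces on the user. The 500 MHz clock rate used in the note that follows is an assumed figure for illustration, not a measurement of any particular processor, and the function names are hypothetical.

```c
#include <stdint.h>

/* Seconds until a counter of the given width wraps at the given
 * clock rate (in Hz), assuming one increment per cycle. */
static double seconds_until_wrap(unsigned bits, double hz)
{
    double counts = 1.0;
    unsigned i;
    for (i = 0; i < bits; i++)
        counts *= 2.0;              /* counts = 2^bits */
    return counts / hz;
}

/* Elapsed cycles from a 32-bit counter, correct across at most one
 * wrap between the two readings. */
static uint32_t elapsed32(uint32_t start, uint32_t end)
{
    return end - start;             /* unsigned arithmetic is modulo 2^32 */
}
```

At an assumed 500 MHz, a 32-bit counter wraps in about 8.6 seconds, while a 64-bit counter lasts more than a thousand years. Any interval longer than the wrap time cannot be recovered from a 32-bit counter without outside help.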
Low-level functionality - efficient operation
High-level functionality - efficient programming
fast access, low overhead
always on, or at least always available without much effort
single-instruction atomic read access from user mode
no need to switch to OS kernel mode (usually)
a back-channel to move performance data out of the system without perturbing
the system under study
optional serialization with neighboring instructions in a parallel pipelined
speculative execution architecture
measurements that correspond to compiler optimization techniques, allowing
feedback to the programmer
Part of the process state
programming language-independent (dream on!)
no manual intervention required?
Note the contradictory goals: the data to be collected is not machine-independent,
though the control and access methods should be. The user wants the
best available information that the current system can provide, but the
code must be portable. The information should be specific to the goals
of the program, but the information-gathering should be as automatic as possible.
Not part of the process state
thread state also?
what goes into the process state?
table of accumulated time and counters
selection vector for counters, used at start of time slice
counter values taken at start of time slice, if counters are not reset
counter accumulators, control selectors, saved and restored on context
switch, including process migration in a multiprocessor
is this antithetic to efficiency?
prevent other entities from changing the selection and control of the counters
support for initiating, selecting, gathering, presenting the data
feedback to the program? what if the programmer doesn't care?
Available during collection
feedback to system management, about all system activities
not part of the OS functionality at all?
Pure cycle counter
sampling - frequency of measurement under system control, or under program
direct intervention - points of measurement selected by the programmer
software alligator clip
Memory hierarchy effectiveness
at least 64 bits, strictly monotonic increasing (for a long time)
no OS pollution, no arbitrary reset (these would violate monotonicity outside
for any other measurement, can derive per cycle data
useful in many circumstances beyond simple time measurement, such as transaction
litany of bad examples
are there any bad consequences to using this counter for two distinct purposes?
Instruction stream effectiveness
cache behavior at each level
occupancy rates (by cache line)
miss rates, eviction rates
fraction of cache line actually used before write-back or other disposal
time to acquire first part of cache line; miss penalty
separate counters for global (static) / stack / heap data as used by each
This requires the counters to know the virtual address (the process address
space) as well as the physical address (the system address space). To a
large extent this presupposes the compiler and operating system.
Network stream effectiveness
branch prediction success rate
speculative execution miss rate
instructions (or micro-instructions) per cycle
resource stalls, and which resource
operation counts, not just execution unit counts
Miscellaneous low-level considerations
two actual counters selected from a larger collection of possibilities
more than two counters in future systems?
is this a lossy system? are there cycles that are never accumulated
anywhere? what fraction of time is now devoted to measurements?
inlined data collection for efficiency, but shared runtime library for
portability. which is better?
an external monitor like rabbit assumes that no other program will change
the selection of the counters.
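The items listed above under the process-state question (a table of accumulated time and counters, a selection vector used at the start of a time slice, and counter values taken at the start of the slice) can be sketched as a small data structure. This is one possible arrangement under stated assumptions, not something the library or any kernel is known to implement: all names are hypothetical, two counters are assumed, and the raw readings are passed in as arguments rather than read from hardware.

```c
#include <stdint.h>

#define EXAMPLE_NUM_COUNTERS 2

struct perf_state {
    uint32_t select[EXAMPLE_NUM_COUNTERS];       /* event selection, loaded at slice start */
    uint64_t start_cycles;                       /* cycle count at slice start */
    uint64_t start_count[EXAMPLE_NUM_COUNTERS];  /* counter values at slice start */
    uint64_t total_cycles;                       /* accumulated over all slices */
    uint64_t total_count[EXAMPLE_NUM_COUNTERS];
};

static void perf_state_init(struct perf_state *s)
{
    int i;
    s->start_cycles = 0;
    s->total_cycles = 0;
    for (i = 0; i < EXAMPLE_NUM_COUNTERS; i++) {
        s->select[i] = 0;
        s->start_count[i] = 0;
        s->total_count[i] = 0;
    }
}

/* Record the readings at the start of a time slice (counters not reset). */
static void slice_begin(struct perf_state *s, uint64_t cycles_now,
                        const uint64_t counts_now[EXAMPLE_NUM_COUNTERS])
{
    int i;
    s->start_cycles = cycles_now;
    for (i = 0; i < EXAMPLE_NUM_COUNTERS; i++)
        s->start_count[i] = counts_now[i];
}

/* Add this slice's deltas to the accumulators at the end of the slice. */
static void slice_end(struct perf_state *s, uint64_t cycles_now,
                      const uint64_t counts_now[EXAMPLE_NUM_COUNTERS])
{
    int i;
    s->total_cycles += cycles_now - s->start_cycles;
    for (i = 0; i < EXAMPLE_NUM_COUNTERS; i++)
        s->total_count[i] += counts_now[i] - s->start_count[i];
}

/* Derive a per-cycle rate for counter i from the accumulated totals. */
static double events_per_cycle(const struct perf_state *s, int i)
{
    if (s->total_cycles == 0)
        return 0.0;
    return (double)s->total_count[i] / (double)s->total_cycles;
}
```

Each context switch costs two extra stores and subtractions per counter in this arrangement, which is the overhead the efficiency question above is asking about.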
Author: Don Heller, firstname.lastname@example.org
Last revised: 2 August 2000