Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces
   hardware registers
   Control Registers        -- CR0, CR4, EFLAGS
   Model-Specific Registers -- TSC, PMC control, PMC counters
   Model-specific semantics
   pmc_query, pmc_test, pmc_test.c
   pmc_enable, pmc_disable, pmc_able.c

These are implementation details and various opinions that can be skipped
by most readers.

The Intel processors, having developed over many years with the constraint of backward-compatibility, have a variety of register lengths, from 8 to 128 bits. The 32-bit registers (EAX, and so on) are covered by the type pmc_uint32_t, for an unsigned 32-bit integer. The 64-bit registers, which are really pairs of 32-bit registers, are covered by the type pmc_uint64_t, for an unsigned 64-bit integer. These types should not be used outside the library code. In most cases, there is a more meaningful type, such as pmc_cycles_t, that should be used instead.
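
As a rough illustration of the "pairs of 32-bit registers" point, the sketch below joins a high half and a low half (as rdtsc or rdmsr would leave them in EDX:EAX) into one 64-bit value. The typedefs are stand-ins written for this example, and pmc_join_halves is a hypothetical helper name; the library's own definitions may well differ.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */

/* Stand-in typedefs for illustration only; not copied from the library. */
typedef unsigned int       pmc_uint32_t;
typedef unsigned long long pmc_uint64_t;

/* Fail at compile time if the stand-ins have the wrong width here. */
static_assert(sizeof(pmc_uint32_t) * CHAR_BIT == 32, "pmc_uint32_t must be 32 bits");
static_assert(sizeof(pmc_uint64_t) * CHAR_BIT == 64, "pmc_uint64_t must be 64 bits");

/* Hypothetical helper: combine the high (EDX) and low (EAX) halves. */
static pmc_uint64_t pmc_join_halves(pmc_uint32_t hi, pmc_uint32_t lo)
{
    /* Widen before shifting so the high half is not shifted out. */
    return ((pmc_uint64_t)hi << 32) | (pmc_uint64_t)lo;
}
```

User code would normally not touch these raw types at all, and would work with a more meaningful type such as pmc_cycles_t instead.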

The control registers CR0, CR1, CR2, CR3, CR4 and EFLAGS are described in the Intel Architecture publications. A few of the bits in these registers affect the library code, and can be modified. Do not attempt these modifications if you are uncertain of the outcome, especially if you cannot reboot the system yourself; some of the permitted modifications are not friendly.

The Time-Stamp Counter, and the Performance-Monitoring Counters and their associated control registers, are Model-Specific Registers. That is, they are not strictly part of the Intel Architecture, and their implementations and semantics have changed from one Pentium model to the next. All the MSRs can be read or written with the rdmsr and wrmsr instructions, but only from kernel mode (CPL 0). The TSC can be read with rdtsc, and the PMCs with rdpmc, from user mode if the permission bits are set properly in the control registers, and otherwise only from kernel mode.
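
The permission bits in question live in CR4. In the Intel architecture manuals, TSD (Time Stamp Disable) is bit 2 and PCE (Performance-Monitoring Counter Enable) is bit 8. The sketch below decodes a CR4 value that has already been obtained in kernel mode, and shows what toggling PCE amounts to. All macro and function names here are hypothetical and are not part of the library.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_TSD (UINT32_C(1) << 2)  /* 1 = rdtsc restricted to CPL 0 */
#define CR4_PCE (UINT32_C(1) << 8)  /* 1 = rdpmc allowed at any CPL  */

/* rdtsc works in user mode only while TSD is clear. */
static int user_rdtsc_allowed(uint32_t cr4)
{
    return (cr4 & CR4_TSD) == 0;
}

/* rdpmc works in user mode only while PCE is set. */
static int user_rdpmc_allowed(uint32_t cr4)
{
    return (cr4 & CR4_PCE) != 0;
}

/* Return a copy of cr4 with PCE set or cleared; nothing is written to
   the real register, which can only be done from kernel mode. */
static uint32_t cr4_with_pce(uint32_t cr4, int enable)
{
    uint32_t out = enable ? (cr4 | CR4_PCE) : (cr4 & ~CR4_PCE);
    assert((out & ~CR4_PCE) == (cr4 & ~CR4_PCE));  /* other bits untouched */
    return out;
}
```

Note the opposite polarity of the two bits: TSD must be clear for user-mode rdtsc, while PCE must be set for user-mode rdpmc.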

  1. rdtsc is architectural, while rdpmc is not. This mainly reflects Intel's commitment to support the rdtsc instruction in future processors, and lack of stated commitment to the Performance-Monitoring Counters. However, Intel is committed to the VTUNE product, which makes use of the counters; it is presently available only on the Microsoft operating systems.
  2. The TSC is set to 0 at system reset. It presently increments once per processor cycle, and is 64 bits wide. Only CPL = 0 can modify TSC, and then only the lower 32 bits can be set, while the upper 32 bits are cleared. We do not recommend changing TSC, and the library does not permit it. The TSC will continue to increment in the halt state (HLT instruction).
  3. The rdtsc instruction is available with all the Pentium processors. It is not serializing. On the first Pentium (60/66 MHz), rdtsc is 6 clocks at CPL = 0, 11 clocks at CPL = 1,2,3; this does not include the time to move the data from registers to memory.
  4. The rdpmc instruction is available only with the Pentium/MMX, Pentium Pro and Pentium II or III processors. It is not serializing.
  5. If, for some reason, it is important to serialize rdtsc and rdpmc, this can be done with the cpuid instruction. Compile the library and user code with -DPMC_READ_SERIAL.
  6. The Performance-Monitoring Counters are 40 bits wide, but are accessed as 64 bits with rdpmc or rdmsr. The upper 24 bits are not specified by Intel, but on the Pentium they are 0, while on the Pentium Pro/II/III they are the upper 24 bits of the Time-Stamp Counter, and must be cleared before use.
  7. It is tempting to think that the time-stamp and performance counters should be saved as part of the process state by the operating system. On the Pentium, all 40 PMC bits can be written correctly from memory (and then zero-extended to 64 bits). On the Pentium Pro, only the lower 32 bits can be written, and then they are sign-extended to 40 bits (and then TSC-extended to 64 bits). The reason for doing the sign-extension is to make a countdown test, as an interrupt can be generated when the upward-moving 40-bit PMC overflows. The library ignores these overflows in its present implementation.
  8. The AMD Athlon uses the control fields of the Pentium Pro family, but it has four 48-bit counters that are zero-extended like the Pentium, and a different set of events.
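
The bit arithmetic in items 6 and 7 can be sketched in portable C. The helpers below clear the upper 24 bits of a raw 64-bit read, compute an event count across one wrap of the 40-bit counter, and work out the 32-bit value to write so that the sign-extended counter overflows after n events. Every name is made up for this example; none of it is taken from the library, which (as noted above) currently ignores overflows.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_BITS 40
#define PMC_MASK ((UINT64_C(1) << PMC_BITS) - 1)

/* Clear the upper 24 bits, which carry TSC bits on the Pentium Pro family. */
static uint64_t pmc_value40(uint64_t raw)
{
    return raw & PMC_MASK;
}

/* Events between two raw reads, correct across at most one 40-bit wrap. */
static uint64_t pmc_delta40(uint64_t raw_start, uint64_t raw_end)
{
    return (pmc_value40(raw_end) - pmc_value40(raw_start)) & PMC_MASK;
}

/* 32-bit value to write so that the counter overflows after n events. */
static uint32_t pmc_countdown_preset(uint32_t n)
{
    /* Bit 31 of the result must be set for the sign-extension to help. */
    assert(n >= 1 && n <= UINT32_C(0x80000000));
    return (uint32_t)(0u - n);   /* two's-complement negation of n */
}

/* Model of the Pentium Pro write: 32 bits sign-extended to 40 bits. */
static uint64_t pmc_sign_extend_32_to_40(uint32_t written)
{
    uint64_t v = written;
    if (written & UINT32_C(0x80000000))
        v |= UINT64_C(0xFF) << 32;   /* copy bit 31 into bits 32..39 */
    return v;
}
```

For example, a preset for n = 1000 is 0xFFFFFC18; sign-extended to 40 bits it becomes 2^40 - 1000, so the 1000th increment carries out of bit 39. On the Athlon, with its 48-bit zero-extended counters, the same masking idea would use a width of 48 instead of 40.
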
The program pmc_test uses calls to /dev/pmc to read the various control registers and counters. It is not intended as a model for other programs; /dev/pmc should not be used directly. Sample output from pmc_test:

Pentium, Pentium/MMX, Pentium Pro, Pentium II, Pentium III, AMD Athlon

pmc_query is a simpler version of pmc_test that is useful when configuring the installation. pmc_enable and pmc_disable allow you to toggle the permission bit for the rdpmc instruction. pmc_query should show that rdpmc is enabled, and rabbit will check this.

What do we want in a hardware performance counter?

It helps to view the measurement system as an experimental probe, with the ultimate goal of integrating measurements into the program design. In any experiment, noise is bad, unless you are actually trying to measure the noise. The goal for the typical application programmer is to acquire data that is not influenced by other processes or by OS activity. The simplest way to do this may be to get rid of the other processes, and fend off the OS. Of course, all user programs cause OS activity on their behalf, and it would be a mistake not to inquire about that activity. The safest way to get clean data would be for the OS to make the measurements part of the process state, but this drives up the cost of a context switch for all programs. It would then be necessary to prove that the increased cost is small, always, that the measurements are correct, always, and that all cycles expended are counted somewhere between the kernel and all the processes.

At minimum, a pure 64-bit cycle counter should be provided in user mode. The Intel Time-Stamp Counter has all the right characteristics, though it is implemented on a processor without 64-bit integer registers. A processor like the DEC Alpha, which is thoroughly 64-bit, has only a 32-bit cycle counter that is architecturally polluted by the operating system, and this makes the job of acquiring clean data far too difficult outside the OS. [Watch this space for more slams on the other microprocessors.]

  1. Low-level functionality - efficient operation
  2. High-level functionality - efficient programming
  3. Part of the process state
  4. Not part of the process state
  5. Available during collection
  6. Scalable granularity
  7. Pure cycle counter
  8. Memory hierarchy effectiveness
  9. Instruction stream effectiveness
  10. Network stream effectiveness
  11. Miscellaneous low-level considerations

Author: Don Heller
Last revised: 2 August 2000