Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces
   pmc_start()           -- starting point for elapsed time
   pmc_select()          -- set the counter control registers
   pmc_read()            -- read the counters
   pmc_accumulate()      -- subtract and add the counters
   pmc_reset()           -- reset the counters and controls
   overhead calculation  -- estimate the cost of pmc_read()



Compile with
   gcc -o menu8 -O `pmc_options` menu8.c -lpmc

Try these examples:
   menu8 -g 0
   menu8 -g 0 100

Try these examples (Pentium Pro/II/III):
   menu8 --e 192,194
   menu8 --e 192,194 -Stats 2 100
   menu8 --e 0x43,0x5 100

   #include <pmc_lib.h>
   #include <stdlib.h>     /* for atoi() */

   int main(int argc, char * argv[])
   {
     pmc_control_t Ctl = pmc_control_null;
     pmc_data_t t0, t1;
     pmc_cycles_t elapsed;
     int i, trials = 1;

     /* read command line, initialize internal data structures */
     if (pmc_getargs(stderr, argv[0], &argc, &argv, &Ctl) == FALSE)
       { exit(1); }

     if (argc > 0) { trials = atoi(argv[0]); }
     if (trials < 1) { trials = 0; }

     if (pmc_open(0) == FALSE)          /* open /dev/pmc */
       { exit(1); }

     pmc_start();                       /* starting point for elapsed time */
     pmc_select(&Ctl.counters[0]);      /* set the counter control registers */

     for (i = 0; i < trials; i++)
       {
         pmc_read(&t0);                 /* read the counters */
         /* do something */
         pmc_read(&t1);                 /* read the counters */
         /* counters[0] += (t1 -= t0) */
         elapsed = pmc_accumulate(&Ctl.counters[0], &t1, &t0);
       }

     pmc_close();                       /* close /dev/pmc */

     pmc_print_results(argc, argv, &Ctl);

     exit(0);
   }

Synopsis

   void pmc_start(void);
   int pmc_select(const pmc_counter_t * counter);
   void pmc_read(pmc_data_t * ticks);
   pmc_cycles_t pmc_accumulate
     (pmc_counter_t * counter, pmc_data_t * t1, pmc_data_t * t0);
   int pmc_reset(void);

pmc_start() redefines the time from which the return value of pmc_accumulate() is determined; otherwise, the time of pmc_open() is used.

pmc_select() passes to the hardware the pmc_selector_t component of a pmc_counter_t, changing the hardware registers that control the event counters. It requires a previous successful call to pmc_open(), and returns TRUE (1) if successful, FALSE (0) otherwise.

pmc_read() marks a moment in time, by recording the cycle and event counters. If the option -DPMC_READ_INLINE was used for the installation (you can check this with the pmc_options command), then pmc_read() is inlined, otherwise it is a library function. If the option -DPMC_READ_KERNEL_MODE was used for the installation (again, check pmc_options), then pmc_read() is a library function that makes a kernel system call to the /dev/pmc module; this is required on the original Pentium. Except in the case of using kernel mode, pmc_read() assumes that /dev/pmc has been opened; if not, your results are unreliable. If the option -DPMC_READ_SERIAL was used for the installation, the cpuid instruction is used to serialize instructions. You should consider the values assigned by pmc_read() to be meaningless until after pmc_accumulate() is called.

pmc_accumulate() implements the operation 'counter += (t1 -= t0)' with some statistics as previously described. It returns the cycles since pmc_open(), or the last pmc_start(), up to the input value of t1.

pmc_reset() is useful for house cleaning; it sets the event counters to 0 and clears their control registers.

This example differs from the previous by using the counter Ctl.counters[0]. The events defined on the command line, either by --events for one pair, or -group or -input for a list of pairs, will allocate an array of pmc_counter_t. In most cases, this is easier than declaring and maintaining individual counters, and the full array can be passed to pmc_print_results().

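The operation 'counter += (t1 -= t0)' described above for pmc_accumulate() can be illustrated with a small stand-alone sketch. Everything in it is an assumption made for illustration: sketch_data_t, sketch_counter_t, sketch_start_cycles and sketch_accumulate() are hypothetical names, and the real pmc_data_t and pmc_counter_t layouts (and the extra statistics the library keeps) are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the library's pmc_data_t / pmc_counter_t. */
typedef struct {
    uint64_t cycles;      /* cycle counter at the moment of the read */
    uint64_t event[2];    /* the two event counters */
} sketch_data_t;

typedef struct {
    uint64_t intervals;   /* number of intervals accumulated so far */
    sketch_data_t total;  /* running sums of cycles and events */
} sketch_counter_t;

/* Assumed reference point, standing in for the time of pmc_open()
   or of the last pmc_start(). */
uint64_t sketch_start_cycles = 0;

/* counter += (t1 -= t0).
   Returns the cycles from the reference point up to the INPUT value
   of t1, so the return value is computed before t1 is overwritten. */
uint64_t sketch_accumulate(sketch_counter_t *counter,
                           sketch_data_t *t1, const sketch_data_t *t0)
{
    uint64_t elapsed;
    int i;

    assert(counter != 0 && t1 != 0 && t0 != 0);
    elapsed = t1->cycles - sketch_start_cycles;

    /* t1 -= t0: t1 now holds the length of the interval */
    t1->cycles -= t0->cycles;
    for (i = 0; i < 2; i++)
        t1->event[i] -= t0->event[i];

    /* counter += t1 */
    counter->total.cycles += t1->cycles;
    for (i = 0; i < 2; i++)
        counter->total.event[i] += t1->event[i];
    counter->intervals++;

    return elapsed;
}
```

In this sketch t1 is overwritten with the interval length, which is one way to read the advice that the values assigned by pmc_read() are only meaningful after pmc_accumulate() has been called.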
Sample output, Pentium II

   holmes% menu8 --e 192,194 -Stats 2 100
   --------------------------- performance counters ---------------------------
   Host processor:   holmes.scl.ameslab.gov
   Command executed: 100
   Options: --duration 0,0 --user 1,1 --os 1,1
   Options: --mesi 0xf,0xf --bus_agent 1,1 --compare 0,0 --invert 0,0
   Options: --MMX 0x3f,0x3f
   Options: --Enable 1,1 --PC 1,1 --APIC 0,0

   Event                                              Events       Events/sec
   ---------------------------------------- ---------------- ----------------
   0xc0 192 inst_retired                                2200      92497430.63
   0xc2 194 uops_retired                                9200     386807437.17

   --events 192,194
     0: instruction decode and retire unit, instructions retired
     1: instruction decode and retire unit, micro-operations retired

   Events                                       0xc0         0xc2
              intervals       cycles         event 0      event 1
   total            100        10703            2200         9200
   minimum                       107              22           92
   mean                          107              22           92
   maximum                       110              22           92
   std.dev.                        0               0            0

   Events per cycle                       0.20554985   0.85957208
   minimum          100                   0.20000000   0.83636364
   mean                                   0.20555140   0.85957859
   maximum                                0.20560748   0.85981308
   std.dev.                               0.00056075   0.00234494

   Ratio, event 0/1, event 1/0
   minimum          100          100      0.23913043   4.18181818
   mean                                   0.23913043   4.18181818
   maximum                                0.23913043   4.18181818
   std.dev.                               0.00000000   0.00000000
   holmes%

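The derived figures in the sample output can be checked by hand from the totals (10703 cycles, 2200 and 9200 events). The sketch below shows that arithmetic; the function names are hypothetical, and the 450 MHz clock rate used for events per second is an inference from the Events/sec column, not a value the output states.

```c
#include <assert.h>

/* Events per cycle: total events divided by total cycles. */
double events_per_cycle(unsigned long events, unsigned long cycles)
{
    assert(cycles > 0);
    return (double)events / (double)cycles;
}

/* Events per second, for an ASSUMED clock rate in Hz:
   events / (cycles / clock_hz). */
double events_per_second(unsigned long events, unsigned long cycles,
                         double clock_hz)
{
    assert(cycles > 0);
    return (double)events * clock_hz / (double)cycles;
}

/* Ratio of one event count to the other (event 0/1 or event 1/0). */
double event_ratio(unsigned long numerator, unsigned long denominator)
{
    assert(denominator > 0);
    return (double)numerator / (double)denominator;
}
```

With the totals above, events_per_cycle(2200, 10703) is about 0.2055 and event_ratio(9200, 2200) is about 4.18, in agreement with the first rows of the two derived tables.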
Overhead Calculation

Consider the code sequence
   pmc_read(&t0);
   something
   pmc_read(&t1);
   pmc_accumulate(&counter, &t1, &t0);
Should pmc_accumulate() attempt to remove the effect of calls to pmc_read(), to obtain a better time for "something"? If so, how much better are the results, compared to the additional effort?

Start with the easy case, where pmc_read() only obtains the cycle counter. The obvious technique is to run
   pmc_select(&counter);
   pmc_read(&t0);
   pmc_read(&t1);
   pmc_accumulate(&counter, &t1, &t0);
without subtracting an overhead, in order to estimate the overhead to subtract later. Subject to the usual problems of out-of-order execution, this might not be such a bad idea. There is an Intel Application Note, "Using the RDTSC Instruction for Performance Monitoring", which explains most of the issues. [Here's a nice exercise for the attentive reader - find the consistent bug in their example programs.]

Now consider the two events that are being measured, with their various options (--user and so on). The overhead is calculated by pmc_select() on its first use with this counter; pmc_counter_init() would be the other possibility, but it can be called before pmc_open(). With the -Clean n command-line option (or Ctl.clean = n), the pmc_read(), pmc_read(), pmc_accumulate() sequence is repeated n times, and the minimum of t1 - t0 is used as the overhead. In later use with pmc_accumulate(), if the actual measurement is less than the estimated overhead, the net cost is zero.

What is the cycle and event overhead in practice? You can see this information with pmc_print_results() when -Clean is activated.

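The -Clean scheme described above (repeat a back-to-back read pair n times, keep the minimum difference, and clamp the net cost at zero) might look roughly like the following. This is a sketch under stated assumptions, not the library's code: tick_source_t, fake_ticks(), estimate_overhead() and net_cost() are hypothetical names, and a deterministic fake counter stands in for the hardware counters read by pmc_read().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tick source; stands in for the counter pmc_read() records. */
typedef uint64_t (*tick_source_t)(void);

/* Deterministic fake counter, for demonstration only. Successive reads
   advance by 40, 35, 36, 33, 50, 38 ticks, then the pattern repeats. */
uint64_t fake_ticks(void)
{
    static const uint64_t step[6] = {40, 35, 36, 33, 50, 38};
    static uint64_t now = 0;
    static unsigned k = 0;

    now += step[k++ % 6];
    return now;
}

/* Repeat the read/read pair n times; the minimum of t1 - t0 is taken
   as the overhead estimate. */
uint64_t estimate_overhead(tick_source_t read_ticks, int n)
{
    uint64_t best = UINT64_MAX;
    int i;

    assert(read_ticks != 0 && n > 0);
    for (i = 0; i < n; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        uint64_t delta = t1 - t0;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Net cost of a measurement: if it is less than the estimated
   overhead, the net cost is zero. */
uint64_t net_cost(uint64_t measured, uint64_t overhead)
{
    return measured > overhead ? measured - overhead : 0;
}
```

Taking the minimum rather than the mean keeps interrupts or cache misses in an occasional slow read pair from inflating the estimate, and the clamp keeps the unsigned subtraction from wrapping around.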
Forward References pmc_print_results()

Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000