Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces
pmc_start() -- starting point for elapsed time
pmc_select() -- set the counter control registers
pmc_read() -- read the counters
pmc_accumulate() -- subtract and add the counters
pmc_reset() -- reset the counters and controls
overhead calculation -- estimate the cost of pmc_read()
Compile with
gcc -o menu8 -O `pmc_options` menu8.c -lpmc
Try these examples:
menu8 -g 0
menu8 -g 0 100
Try these examples (Pentium Pro/II/III):
menu8 --e 192,194
menu8 --e 192,194 -Stats 2 100
menu8 --e 0x43,0x5 100
#include <pmc_lib.h>
/* for atoi() */
#include <stdlib.h>
int main(int argc, char * argv[])
{
    pmc_control_t Ctl = pmc_control_null;
    pmc_data_t t0, t1;
    pmc_cycles_t elapsed;
    int i, trials = 1;

    /* read command line, initialize internal data structures */
    if (pmc_getargs(stderr, argv[0], &argc, &argv, &Ctl) == FALSE)
        { exit(1); }
    if (argc > 0) { trials = atoi(argv[0]); }
    if (trials < 1) { trials = 0; }

    if (pmc_open(0) == FALSE)        /* open /dev/pmc */
        { exit(1); }

    pmc_start();                     /* starting point for elapsed time */
    pmc_select(&Ctl.counters[0]);    /* set the counter control registers */

    for (i = 0; i < trials; i++)
    {
        pmc_read(&t0);               /* read the counters */
        /* do something */
        pmc_read(&t1);               /* read the counters */
        /* counters[0] += (t1 -= t0) */
        elapsed = pmc_accumulate(&Ctl.counters[0], &t1, &t0);
    }

    pmc_close();                     /* close /dev/pmc */
    pmc_print_results(argc, argv, &Ctl);
    exit(0);
}
Synopsis
void pmc_start(void);
int pmc_select(const pmc_counter_t * counter);
void pmc_read(pmc_data_t * ticks);
pmc_cycles_t pmc_accumulate
(pmc_counter_t * counter, pmc_data_t * t1, pmc_data_t * t0);
int pmc_reset(void);
pmc_start() resets the starting point from which the return value of
pmc_accumulate() is measured; if pmc_start() is never called, the time of
pmc_open() is used.
pmc_select() passes to the hardware the pmc_selector_t component of a
pmc_counter_t, changing the hardware registers that control the event
counters. It requires a previous successful call to pmc_open(), and
returns TRUE (1) if successful, FALSE (0) otherwise.
pmc_read() marks a moment in time by recording the cycle and event counters.
If the library was installed with -DPMC_READ_INLINE (you can check this with
the pmc_options command), pmc_read() is inlined; otherwise it is a library
function. If it was installed with -DPMC_READ_KERNEL_MODE (again, check
pmc_options), pmc_read() is a library function that makes a system call to
the /dev/pmc module; this is required on the original Pentium. Except in
kernel mode, pmc_read() assumes that /dev/pmc has already been opened; if it
has not, the results are unreliable. If the library was installed with
-DPMC_READ_SERIAL, the cpuid instruction is used to serialize instructions.
Treat the values assigned by pmc_read() as meaningless until pmc_accumulate()
has been called.
pmc_accumulate() implements the operation 'counter += (t1 -= t0)' with some
statistics as previously described. It returns the cycles since pmc_open(),
or the last pmc_start(), up to the input value of t1.
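The arithmetic behind 'counter += (t1 -= t0)' can be sketched in plain C.
This is only an illustration under stated assumptions: snapshot_t, totals_t,
accumulate() and the origin field are hypothetical names invented here, not
the library's types, and the real pmc_accumulate() may keep more statistics
than the minimum and maximum shown.

```c
#include <stdint.h>

#define NUM_EVENTS 2  /* assumption: two event counters, as in the sample run */

/* Hypothetical snapshot of the cycle and event counters;
   the real pmc_data_t layout is not shown in this document. */
typedef struct {
    uint64_t cycles;
    uint64_t events[NUM_EVENTS];
} snapshot_t;

/* Hypothetical running totals; stands in for pmc_counter_t. */
typedef struct {
    uint64_t origin;      /* cycle count at the open or start point */
    uint64_t intervals;   /* number of intervals accumulated so far */
    snapshot_t total;     /* sums over all intervals */
    uint64_t min_cycles;  /* shortest interval seen */
    uint64_t max_cycles;  /* longest interval seen */
} totals_t;

/* counter += (t1 -= t0); returns the cycles from the origin up to the
   input value of t1. */
uint64_t accumulate(totals_t *c, snapshot_t *t1, const snapshot_t *t0)
{
    uint64_t elapsed = t1->cycles - c->origin;  /* taken before t1 changes */
    int i;

    t1->cycles -= t0->cycles;                   /* t1 -= t0 */
    for (i = 0; i < NUM_EVENTS; i++)
        t1->events[i] -= t0->events[i];

    c->total.cycles += t1->cycles;              /* counter += t1 */
    for (i = 0; i < NUM_EVENTS; i++)
        c->total.events[i] += t1->events[i];

    /* track the shortest and longest interval */
    if (c->intervals == 0 || t1->cycles < c->min_cycles)
        c->min_cycles = t1->cycles;
    if (c->intervals == 0 || t1->cycles > c->max_cycles)
        c->max_cycles = t1->cycles;
    c->intervals++;

    return elapsed;
}
```

After the call, t1 holds the differences for the interval rather than raw
counter readings, so in this sketch it is the differences, not the raw
values, that carry meaning.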
pmc_reset() is useful for house cleaning; it sets the event counters to 0
and clears their control registers.
This example differs from the previous one by using the counter
Ctl.counters[0]. Defining events on the command line, either with --events
for one pair or with -group or -input for a list of pairs, causes an array of
pmc_counter_t to be allocated. In most cases, this is easier than declaring
and maintaining individual counters, and the full array can be passed to
pmc_print_results().
Sample output, Pentium II
holmes% menu8 --e 192,194 -Stats 2 100
--------------------------- performance counters ---------------------------
Host processor: holmes.scl.ameslab.gov
Command executed: 100
Options: --duration 0,0 --user 1,1 --os 1,1
Options: --mesi 0xf,0xf --bus_agent 1,1 --compare 0,0 --invert 0,0
Options: --MMX 0x3f,0x3f
Options: --Enable 1,1 --PC 1,1 --APIC 0,0
Event                                              Events       Events/sec
---------------------------------------- ---------------- ----------------
0xc0 192 inst_retired                                2200      92497430.63
0xc2 194 uops_retired                                9200     386807437.17
--events 192,194
0: instruction decode and retire unit, instructions retired
1: instruction decode and retire unit, micro-operations retired
Events 0xc0 0xc2       intervals      cycles     event 0     event 1
total                        100       10703        2200        9200
minimum                                  107          22          92
mean                                     107          22          92
maximum                                  110          22          92
std.dev.                                   0           0           0
Events per cycle                              0.20554985  0.85957208
minimum                      100              0.20000000  0.83636364
mean                                          0.20555140  0.85957859
maximum                                       0.20560748  0.85981308
std.dev.                                      0.00056075  0.00234494
Ratio, event 0/1, event 1/0
minimum                      100         100  0.23913043  4.18181818
mean                                          0.23913043  4.18181818
maximum                                       0.23913043  4.18181818
std.dev.                                      0.00000000  0.00000000
holmes%
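The derived figures in this report follow from the totals by simple
arithmetic, sketched below. The function names are hypothetical, and the
450 MHz clock rate used in the note after the code is an assumption inferred
from the Events/sec column, not something stated in the output.

```c
#include <stdint.h>

/* Events per cycle over a run; 0.0 if no cycles were counted. */
double events_per_cycle(uint64_t events, uint64_t cycles)
{
    return cycles ? (double)events / (double)cycles : 0.0;
}

/* Events per second, given the processor clock rate in Hz.
   The clock rate must be supplied by the caller. */
double events_per_second(uint64_t events, uint64_t cycles, double clock_hz)
{
    return cycles ? (double)events * clock_hz / (double)cycles : 0.0;
}

/* Ratio of one event count to another; 0.0 if the denominator is zero. */
double event_ratio(uint64_t numerator, uint64_t denominator)
{
    return denominator ? (double)numerator / (double)denominator : 0.0;
}
```

For the run above, events_per_cycle(2200, 10703) gives about 0.2055, which
agrees with the first entry on the "Events per cycle" line, and
event_ratio(2200, 9200) gives about 0.2391. If the 450 MHz assumption holds,
events_per_second(2200, 10703, 450e6) reproduces the Events/sec figure for
inst_retired to within rounding.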
Overhead Calculation
Consider the code sequence
pmc_read(&t0);
something
pmc_read(&t1);
pmc_accumulate(&counter, &t1, &t0);
Should pmc_accumulate() attempt to remove the effect of calls to pmc_read(),
to obtain a better time for "something"? If so, how much better are the
results, compared to the additional effort?
Start with the easy case, where pmc_read() only obtains the cycle counter.
The obvious technique is to run
pmc_select(&counter);
pmc_read(&t0);
pmc_read(&t1);
pmc_accumulate(&counter, &t1, &t0);
without subtracting an overhead, in order to estimate the overhead to
subtract later. Subject to the usual problems of out-of-order execution,
this might not be such a bad idea. There is an Intel Application Note,
"Using the RDTSC Instruction for Performance Monitoring", which explains
most of the issues. [Here's a nice exercise for the attentive reader -
find the consistent bug in their example programs.]
Now consider the two events that are being measured, with their various
options (--user and so on). The overhead is calculated by pmc_select()
on its first use with a given counter. pmc_counter_init() would be the other
candidate, but it can be called before pmc_open(), when pmc_read() is not yet
reliable. With the -Clean n command-line option (or Ctl.clean = n), the
sequence pmc_read(), pmc_read(), pmc_accumulate() is repeated n times, and
the minimum of t1 - t0 is used as the overhead. In later calls to
pmc_accumulate(), if the actual measurement is less than the estimated
overhead, the net cost is taken to be zero.
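The minimum-of-n estimate and the clamp to zero can be sketched as follows.
This is an illustration only, not the library's code: estimate_overhead(),
net_cost() and the read_fn type are hypothetical names, and sim_read() is a
made-up tick source (30 ticks per read, with 12 extra ticks on every fourth
read) that stands in for pmc_read() so the example runs without /dev/pmc.

```c
#include <stdint.h>

/* Hypothetical stand-in for pmc_read(): returns the current tick count. */
typedef uint64_t (*read_fn)(void);

/* Simulated tick source, for demonstration only. */
static uint64_t sim_now = 0;
static uint64_t sim_calls = 0;

uint64_t sim_read(void)
{
    /* every read costs 30 ticks; every fourth read costs 12 more */
    sim_now += 30 + ((sim_calls++ % 4 == 3) ? 12 : 0);
    return sim_now;
}

/* Run n back-to-back read pairs and return the smallest t1 - t0.
   Returns 0 if n is not positive. */
uint64_t estimate_overhead(read_fn read_ticks, int n)
{
    uint64_t best = UINT64_MAX;
    int i;

    for (i = 0; i < n; i++) {
        uint64_t t0 = read_ticks();
        uint64_t t1 = read_ticks();
        uint64_t d = t1 - t0;
        if (d < best)
            best = d;
    }
    return n > 0 ? best : 0;
}

/* Net cost of a measured interval after removing the overhead;
   a measurement below the overhead counts as zero. */
uint64_t net_cost(uint64_t measured, uint64_t overhead)
{
    return measured > overhead ? measured - overhead : 0;
}
```

Starting from a fresh state, estimate_overhead(sim_read, 8) sees differences
of 30 and 42 and keeps the smaller one. Taking the minimum rather than the
mean keeps the occasional slow read from inflating the estimate.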
What are the cycle and event overheads in practice? pmc_print_results()
reports this information when -Clean is activated.
Forward References
pmc_print_results()
Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000