Performance-Monitoring Counters Library, for Intel/AMD Processors and Linux
This example introduces
   the Stream benchmark for memory bandwidth
   modifications to use the PMC library
   driver scripts for the Pentium, Pentium Pro, Pentium II/III, Athlon
   interpretations of the measurements for the Pentium Pro


The Stream benchmark, by John D. McCalpin, has been used to expose problems with memory bandwidth on large computer systems.  Stream was one of the first benchmarks to emphasize data movement over calculation.  It reports bytes moved per second, rather than operations performed per second, based on the observations that most scientific computations require a large volume of data, and that processors often have the opportunity to overlap data movement with calculation, even if they do not always do so.  In most systems the memory data rates are much lower than the computing rates from cache, so memory becomes the constraint on performance.

The essence of the program is four loops over large double precision arrays,

#define N 1000000
double a[N], b[N], c[N];
double scalar = 3.0;    /* value used in stream.c */
register int j;

for (j = 0; j < N; j++)    /* copy */
    c[j] = a[j];

for (j = 0; j < N; j++)    /* scale */
    b[j] = scalar*c[j];

for (j = 0; j < N; j++)    /* add */
    c[j] = a[j] + b[j];

for (j = 0; j < N; j++)    /* triad */
    a[j] = b[j] + scalar*c[j];
each surrounded by a clock reading
times[0][k] = second();
for (j = 0; j < N; j++)
    c[j] = a[j];
times[0][k] = second() - times[0][k];
with some simple statistical calculations on times[].  The first change to use the PMC library is
pmc_select(&Ctl.counters[0]);
pmc_read(&in);
{ register int j; for (j = 0; j < N; j++) c[j] = a[j]; }
pmc_read(&out);
pmc_accumulate(&Ctl.counters[0], &out, &in);
times[0][k] = pmc_second(pmc_cycle(&out));
with the declarations and initialization
#include <stdio.h>
#include <stdlib.h>
#include <pmc_lib.h>

int main(int argc, char *argv[])
  {
    pmc_control_t Ctl = pmc_control_null;
    pmc_data_t in, out;

    if (pmc_getargs(stderr, argv[0], &argc, &argv, &Ctl) == FALSE)
      { exit(0); }
    if (pmc_open(0) == FALSE)
      { exit(0); }

    ...

    pmc_close();

    fflush(stdout);
    pmc_print_results(argc, argv, &Ctl);
    exit(0);
  }
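The other three loops are wrapped in the same pattern.  A sketch for the scale loop, assuming (as seems natural) that it uses the next counter slot, Ctl.counters[1], and the next timing row, times[1][k]:

pmc_select(&Ctl.counters[1]);
pmc_read(&in);
{ register int j; for (j = 0; j < N; j++) b[j] = scalar*c[j]; }
pmc_read(&out);
pmc_accumulate(&Ctl.counters[1], &out, &in);
times[1][k] = pmc_second(pmc_cycle(&out));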
To ensure that the same events are used in all four counters, the command line should be of the form stream -n 4 --e 36,38, which creates four counters, each with events 36 and 38.  The original output from stream goes to stdout, and the new performance output, from pmc_print_results(), goes to stderr; both are redirected by the driver script.  The entire survey takes about 11 minutes on a 200 MHz Pentium Pro (66 MHz system bus).

The declaration of the loop index j was changed so the compiler would continue to respect the request to store j in a register.  In the more obvious code

pmc_read(&in);
for (j = 0; j < N; j++) c[j] = a[j];
pmc_read(&out);
j is kept in memory (though effectively in the L1 d-cache), costing a few more cycles per loop iteration.  It is well known that measuring a program perturbs its dynamic behavior, but this example shows that changing the measurement technique can have an even larger impact, by changing the quality of the compiled code itself.  When using any library to gather data for code improvements, inspecting the compiled .s file is essential.


Driver script

#!/bin/sh

# Stream/PMC, driver script for a 200 MHz Pentium Pro, 32 MB memory
LENGTH=1000000
TRIALS=10
OPT="-O -DPMC_P6 -DPMC_MHZ=200 -DN=${LENGTH} -DNTRIALS=${TRIALS}"

# original code
gcc -o stream_d6 ${OPT} stream_d.c -lm
gcc -o stream_d6.s -S ${OPT} stream_d.c

# modified code
gcc -o stream6 ${OPT} stream.c -lm -lpmc
gcc -o stream6.s -S ${OPT} stream.c

# if you rewrite stream6.s, use this instead
# gcc -o stream6 ${OPT} stream6.s -lm -lpmc

# all the events for this processor
Events=`stream6 -c | awk '{print $2}'`

# some events are not valid in one or the other counter
DualEvents="193,17 16,18 20,19"

# original code
stream_d6 1>stream6.out 2>survey6.out

# processor cycles, user and system together
stream6 -n 4 --e 121 1>>stream6.out 2>>survey6.out

# system bus, DRDY clocks, user and system separated
stream6 -n 4 --e 98 --u 1,0 --o 0,1 --b 0 1>>stream6.out 2>>survey6.out
stream6 -n 4 --e 98 --u 1,0 --o 0,1 --b 1 1>>stream6.out 2>>survey6.out

# one event, user and system separated
for e in $Events
do
  stream6 -n 4 --e $e --u 1,0 --o 0,1 1>>stream6.out 2>>survey6.out
done

echo -e "\nDual\n" >> survey6.out

# two events, user only then system only
for d in $DualEvents
do
  stream6 -n 4 --e $d --u 1 --o 0 1>>stream6.out 2>>survey6.out
  stream6 -n 4 --e $d --u 0 --o 1 1>>stream6.out 2>>survey6.out
done

# post-processing, for information per loop iteration
awk -f survey.awk -v t=${TRIALS} -v n=${LENGTH} survey6.out > survey6.sum


Stream output (original code, Pentium Pro)

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 7 microseconds.
Each test below will take on the order of 89857 microseconds.
   (= 12836 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          99.4957       0.1628       0.1608       0.1769
Scale:         99.1848       0.1617       0.1613       0.1636
Add:          109.4426       0.2197       0.2193       0.2224
Triad:        107.5336       0.2236       0.2232       0.2253
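The reported rates follow directly from the byte counts and the best times: Stream charges 16 bytes per iteration to copy and scale (one array read, one written) and 24 bytes to add and triad (two read, one written).

Copy:   16 bytes x 1000000 iterations / 0.1608 s  =  99.50 MB/s
Scale:  16 bytes x 1000000 iterations / 0.1613 s  =  99.19 MB/s
Add:    24 bytes x 1000000 iterations / 0.2193 s  = 109.44 MB/s
Triad:  24 bytes x 1000000 iterations / 0.2232 s  = 107.53 MB/s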


Stream output (modified code, Pentium Pro, sample)

-------------------------------------------------------------
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 116837 microseconds.
   (= 58418 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          99.9378       0.1603       0.1601       0.1613
Scale:         99.5847       0.1610       0.1607       0.1627
Add:          109.7790       0.2187       0.2186       0.2189
Triad:        107.9303       0.2224       0.2224       0.2226
The results show that the multiply operation is almost free, hidden behind memory delays.

Note that the data volume MB is computed with 1024*1024 bytes, while the data rate MB/s is computed with 1000*1000 bytes per second.  This is a typical marketing trick to boost the apparent speed of a computer, though here it is done honestly, to remain consistent with floating-point computation rates, which are usually reported as 1000*1000 operations per second.
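The difference between the two conventions is easy to check from the run above:

total memory:  3 arrays x 1000000 elements x 8 bytes = 24000000 bytes
               24000000 / (1024*1024) = 22.9 MB      (as printed)
               24000000 / (1000*1000) = 24.0 MB      (decimal convention)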


PMC library output (Pentium Pro, sample)

------------------------   performance counters   ------------------------

Host processor: Pentium Pro
Options: --duration 0,0 --user 1,1 --os 1,1
Options: --mesi 0xf,0xf --bus_agent 1,1 --compare 0,0 --invert 0,0
Options: --Enable 1,1 --PC 1,1 --APIC 0,0

Event                                             Events        Events/sec
--------------------------------------    --------------    --------------
0x79 121  cpu_clk_unhalted                     320696965      200000000.00
0x79 121  cpu_clk_unhalted                     320696987      200000013.72
0x79 121  cpu_clk_unhalted                     322024133      200000000.00
0x79 121  cpu_clk_unhalted                     322024166      200000020.50
0x79 121  cpu_clk_unhalted                     437387074      200000000.00
0x79 121  cpu_clk_unhalted                     437387096      200000010.06
0x79 121  cpu_clk_unhalted                     444822800      200000000.00
0x79 121  cpu_clk_unhalted                     444822822      200000009.89
This event is "Number of cycles during which the processor is not halted."  Both user and system events are counted here; later in the survey they are separated.  Although this provides no more information than the cycle counter alone, it confirms that the basic measurement mechanism is working correctly.
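Assuming the four counter pairs are listed in loop order (copy, scale, add, triad), the counts also reproduce the Stream times for the full ten-trial runs:

copy   320696965 cycles / 200 MHz = 1.603 s / 10 trials = 0.1603 s   (RMS time 0.1603)
scale  322024133 cycles / 200 MHz = 1.610 s / 10 trials = 0.1610 s   (RMS time 0.1610)
add    437387074 cycles / 200 MHz = 2.187 s / 10 trials = 0.2187 s   (RMS time 0.2187)
triad  444822800 cycles / 200 MHz = 2.224 s / 10 trials = 0.2224 s   (RMS time 0.2224)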


PMC library output (Pentium Pro, sample)

------------------------   performance counters   ------------------------

Host processor: Pentium Pro
Options: --duration 0,0 --user 1,0 --os 0,1
Options: --mesi 0xf,0xf --bus_agent 1,1 --compare 0,0 --invert 0,0
Options: --Enable 1,1 --PC 1,1 --APIC 0,0

Event                                             Events        Events/sec
--------------------------------------    --------------    --------------
0xc8 200  hw_int_rx                                  162            101.45
0xc8 200  hw_int_rx                                    0              0.00
0xc8 200  hw_int_rx                                  163            101.42
0xc8 200  hw_int_rx                                    0              0.00
0xc8 200  hw_int_rx                                  219            100.43
0xc8 200  hw_int_rx                                    0              0.00
0xc8 200  hw_int_rx                                  223            100.45
0xc8 200  hw_int_rx                                    0              0.00
This event is "Number of hardware interrupts received," and user and system events are now separated.  The hardware interval timer should interrupt 100 times per second, once per operating system time slice (a jiffy, in Linux terms), and the observed rates are indeed 99-102 interrupts per second.  Each timed loop takes about 16 or 22 jiffies, so separating user and system events in the driver script was worthwhile.  The stream program's advice about "20 clock ticks per test" addresses a different problem, the relatively coarse wall clock, which also mixes user and system time.
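Again assuming the counters appear in loop order, the interrupt counts are consistent with the loop times:

copy   162 interrupts / 10 trials = 16.2 per trial of 0.160 s  ->  about 101 per second
add    219 interrupts / 10 trials = 21.9 per trial of 0.219 s  ->  about 100 per second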


PMC library output (Pentium Pro, sample)

------------------------   performance counters   ------------------------

Host processor: Pentium Pro
Options: --duration 0,0 --user 1,0 --os 0,1
Options: --mesi 0xf,0xf --bus_agent 1,1 --compare 0,0 --invert 0,0
Options: --Enable 1,1 --PC 1,1 --APIC 0,0

Event                                             Events        Events/sec
--------------------------------------    --------------    --------------
0xa2 162  resource_stalls                      287600192      179547405.93
0xa2 162  resource_stalls                         581934         363298.58
0xa2 162  resource_stalls                      286838801      178989551.52
0xa2 162  resource_stalls                         225508         140718.67
0xa2 162  resource_stalls                      408582761      187268429.33
0xa2 162  resource_stalls                         267574         122638.95
0xa2 162  resource_stalls                      407692053      183403769.87
0xa2 162  resource_stalls                         628898         282915.16
The event measured here is "Number of cycles during which there are resource-related stalls."  It would be more useful to have information per loop iteration, so we do some additional work on the survey output file (survey6.out in the driver script) to produce a summary of the user-state events.


Awk post-processing script

BEGIN {
  Single = 1
  f = t * n     # t, n assigned by -v option
  print "Events per loop iteration                   " \
        "Copy    Scale      Add    Triad"
}

$1 == "Event" {
  if (Single == 1) {
    getline ; getline ; copy    = $4 / f
    getline ; getline ; scale   = $4 / f
    getline ; getline ; add     = $4 / f
    getline ; getline ; triad   = $4 / f
    printf "%4s %3s %30s %8.4f %8.4f %8.4f %8.4f\n", \
        $1, $2, $3, copy, scale, add, triad
  } else {
    getline                     # skip the dashes
    # dual-event surveys print both events of the pair for each loop in turn
    getline ; event01 = $1 ; event02 = $2 ; event03 = $3 ; copy0 = $4 / f
    getline ; event11 = $1 ; event12 = $2 ; event13 = $3 ; copy1 = $4 / f
    getline ; scale0  = $4 / f
    getline ; scale1  = $4 / f
    getline ; add0    = $4 / f
    getline ; add1    = $4 / f
    getline ; triad0  = $4 / f
    getline ; triad1  = $4 / f
    printf "%4s %3s %30s %8.4f %8.4f %8.4f %8.4f\n", \
        event01, event02, event03, copy0, scale0, add0, triad0
    printf "%4s %3s %30s %8.4f %8.4f %8.4f %8.4f\n", \
        event11, event12, event13, copy1, scale1, add1, triad1
  }
  next
}

$1 == "Dual" { Single = 0 }


Post-processed PMC data, Pentium Pro

Events per loop iteration                   Copy    Scale      Add    Triad
0x79 121               cpu_clk_unhalted  32.0697  32.2024  43.7387  44.4823
0x62  98                bus_drdy_clocks   0.9972   0.9994   1.0061   0.9990
0x62  98                bus_drdy_clocks   8.9937   8.9716  12.0122  12.0136
0x43  67                  data_mem_refs   4.0041   3.0123   3.0118   4.0195
0x45  69                   dcu_lines_in   0.5039   0.5039   0.7561   0.7560
0x46  70                 dcu_m_lines_in   0.2500   0.2500   0.2500   0.2500
0x47  71                dcu_m_lines_out   0.2499   0.2501   0.2504   0.2501
0x48  72           dcu_miss_outstanding  47.7792  50.2285  99.1987  71.8793
0x80 128                     ifu_ifetch  31.9418  32.0281  43.5341  44.3404
0x81 129                ifu_ifetch_miss   0.0000   0.0000   0.0000   0.0000
0x85 133                      itlb_miss   0.0000   0.0000   0.0000   0.0000
0x86 134                  ifu_mem_stall   0.0022   0.0013   0.0018   0.0028
0x87 135                      ild_stall   0.0000   0.0000   0.0000   0.0000
0x79 121               cpu_clk_unhalted  31.9046  32.0122  43.5008  44.3186
0x28  40                      l2_ifetch   0.0000   0.0000   0.0000   0.0000
0x29  41                          l2_ld   0.9882   1.0104   1.0231   0.6909
0x2a  42                          l2_st   0.7723   0.7799   0.5341   0.4303
0x24  36                    l2_lines_in   0.5006   0.5006   0.7508   0.7508
0x26  38                   l2_lines_out   0.5006   0.5006   0.7508   0.7508
0x25  37                 l2_m_lines_inm   0.2502   0.2502   0.2502   0.2502
0x27  39                l2_m_lines_outm   0.2491   0.2500   0.2513   0.2499
0x2e  46                       l2_rqsts   1.7906   1.7618   1.5551   1.1193
0x21  33                         l2_ads   3.2483   3.2875   3.3123   2.8652
0x22  34                   l2_dbus_busy   4.7864   4.7951   5.5883   5.1320
0x23  35                l2_dbus_busy_rd   2.7921   2.7987   2.5864   2.1281
0x62  98                bus_drdy_clocks   8.9936   8.9712  12.0146  12.0105
0x63  99                bus_lock_clocks   0.0000   0.0000   0.0000   0.0000
0x60  96            bus_req_outstanding  45.1040  46.8326  99.7297  76.2825
0x65 101                   bus_tran_brd   0.2506   0.2505   0.5008   0.5008
0x66 102                   bus_tran_rfo   0.2500   0.2472   0.2488   0.2500
0x67 103                   bus_trans_wb   0.2491   0.2499   0.2513   0.2501
0x68 104                bus_tran_ifetch   0.0000   0.0000   0.0000   0.0000
0x69 105                 bus_tran_inval   0.0000   0.0026   0.0013   0.0000
0x6a 106                   bus_tran_pwr   0.0000   0.0000   0.0000   0.0000
0x6b 107                    bus_trans_p   0.0000   0.0000   0.0000   0.0000
0x6c 108                   bus_trans_io   0.0000   0.0000   0.0000   0.0000
0x6d 109                   bus_tran_def   0.0000   0.0000   0.0000   0.0000
0x6e 110                 bus_tran_burst   0.7500   0.7476   1.0011   1.0005
0x70 112                   bus_tran_any   0.7496   0.7506   1.0022   1.0010
0x6f 111                   bus_tran_mem   0.7497   0.7504   1.0023   1.0007
0x64 100                   bus_data_rcv   2.0022   1.9912   2.9985   3.0033
0x61  97                    bus_bnr_drv   0.0000   0.0000   0.0000   0.0000
0x7a 122                    bus_hit_drv   0.0000   0.0000   0.0000   0.0000
0x7b 123                   bus_hitm_drv   0.0000   0.0000   0.0000   0.0000
0x7e 126                bus_snoop_stall   0.0000   0.0000   0.0000   0.0000
0x03   3                      ld_blocks   0.0068   0.0130   0.0073   0.0185
0x04   4                      sb_drains   0.0000   0.0001   0.0001   0.0000
0x05   5               misalign_mem_ref   0.0000   0.0000   0.0000   0.0000
0xd0 208                   inst_decoder   7.0008  15.3740   9.7979   7.0010
0xc0 192                   inst_retired   7.0003   6.0012   6.0005   7.0002
0xc2 194                   uops_retired   9.0025   8.0022   8.0033  10.0030
0x06   6              segment_reg_loads   0.0001   0.0001   0.0001   0.0001
0xc8 200                      hw_int_rx   0.0000   0.0000   0.0000   0.0000
0xc6 198              cycles_int_masked   0.0011   0.0010   0.0014   0.0013
0xc7 199  cycles_int_pending_and_masked   0.0000   0.0000   0.0000   0.0000
0xc4 196                br_inst_retired   1.0001   1.0001   1.0001   1.0001
0xc5 197           br_miss_pred_retired   0.0000   0.0000   0.0000   0.0000
0xc9 201               br_taken_retired   1.0000   1.0000   1.0000   1.0000
0xca 202         br_miss_pred_taken_ret   0.0000   0.0000   0.0000   0.0000
0xe0 224                br_inst_decoded   1.0001   1.0001   1.0002   1.0002
0xe2 226                     btb_misses   0.0033   0.0000   0.0001   0.0000
0xe4 228                       br_bogus   0.0000   0.0000   0.0000   0.0000
0xe6 230                       baclears   0.0000   0.0000   0.0000   0.0000
0xa2 162                resource_stalls  28.7600  28.6839  40.8583  40.7692
0xd2 210             partial_rat_stalls   0.0000   0.0000   0.0000   0.0000
0xc1 193                          flops   0.0000   1.0000   1.0000   2.0000
0x11  17                      fp_assist   0.0000   0.0000   0.0000   0.0000
0xc1 193                          flops   0.0000   0.0000   0.0000   0.0000
0x11  17                      fp_assist   0.0000   0.0000   0.0000   0.0000
0x10  16                fp_comp_ops_exe   0.0000   1.0000   1.0000   2.0000
0x12  18                            mul   0.0000   1.0000   0.0000   1.0000
0x10  16                fp_comp_ops_exe   0.0000   0.0000   0.0000   0.0000
0x12  18                            mul   0.0000   0.0000   0.0000   0.0000
0x14  20                cycles_div_busy   0.0000   0.0000   0.0000   0.0000
0x13  19                            div   0.0000   0.0000   0.0000   0.0000
0x14  20                cycles_div_busy   0.0006   0.0006   0.0009   0.0008
0x13  19                            div   0.0000   0.0000   0.0000   0.0000


Compiled code (GCC excerpts, annotated with the number of micro-operations)

Copy
        xorl %eax,%eax            # 1  j = 0
        .align 4                  #    no-op(s)
  .L37:
        movl a(,%eax,8),%edx      # 1  load a[j] low
        movl a+4(,%eax,8),%ecx    # 1  load a[j] high
        movl %edx,c(,%eax,8)      # 2  store c[j] low
        movl %ecx,c+4(,%eax,8)    # 2  store c[j] high
        incl %eax                 # 1  j++
        cmpl $999999,%eax         # 1  j < N
        jle .L37                  # 1  conditional jump

Scale
        xorl %edx,%edx            # 1  j = 0
        .align 4                  #    no-op(s)
  .L45:
        fldl .LC24                # 1  load scalar
        fmull c(,%edx,8)          # 2  load c[j]; *
        fstpl b(,%edx,8)          # 2  store b[j]
        incl %edx                 # 1  j++
        cmpl $999999,%edx         # 1  j < N
        jle .L45                  # 1  conditional jump

Add
        xorl %edx,%edx            # 1  j = 0
        .align 4                  #    no-op(s)
  .L53:
        fldl a(,%edx,8)           # 1  load a[j]
        faddl b(,%edx,8)          # 2  load b[j]; +
        fstpl c(,%edx,8)          # 2  store c[j]
        incl %edx                 # 1  j++
        cmpl $999999,%edx         # 1  j < N
        jle .L53                  # 1  conditional jump

Triad
        xorl %edx,%edx            # 1  j = 0
        .align 4                  #    no-op(s)
  .L61:
        fldl .LC24                # 1  load scalar
        fmull c(,%edx,8)          # 2  load c[j]; *
        faddl b(,%edx,8)          # 2  load b[j]; +
        fstpl a(,%edx,8)          # 2  store a[j]
        incl %edx                 # 1  j++
        cmpl $999999,%edx         # 1  j < N
        jle .L61                  # 1  conditional jump


Interpretation of the measurements

Instructions and micro-operations retired (events 192, 194).  The Pentium Pro decodes an x86 instruction into micro-operations, and executes the micro-ops on a larger register pool than the ancient architectural registers (EAX, etc).  Instructions are also fetched and executed speculatively, and are only accepted (retired) when the actual control flow is finally determined.  Several micro-ops can be executed in one cycle; another example explores this.  Here we see that the number of instructions retired agrees with the number of instructions in the loop (7, 6, 6, 7) while the number of micro-ops retired is only a little higher (9, 8, 8, 10).  Without the {register int j; ... } declarations, the compiler generates (9, 8, 7, 9) instructions per loop, becoming (12, 11, 13, 13) micro-ops, and running at (34.36, 34.69, 45.21, 45.14) cycles per loop iteration, user and system time.
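The same figures can be read off the compiled-code excerpts above, by summing the micro-operation annotations on the loop bodies:

copy    1+1+2+2+1+1+1  =  9 micro-ops in 7 instructions
scale   1+2+2+1+1+1    =  8 micro-ops in 6 instructions
add     1+2+2+1+1+1    =  8 micro-ops in 6 instructions
triad   1+2+2+2+1+1+1  = 10 micro-ops in 7 instructions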

Floating-point operations (events 193, 17, 16, 18, 20, 19).  The numbers of fl.pt. add, multiply and divide operations are counted.  These counts include certain fixed-point operations that use the fl.pt. unit, and they distinguish operations executed (event 16) from operations retired (event 193), though the two cannot be counted concurrently.  There are no fl.pt. exceptions in these loops.

Branches (events 196, 197, 201, 202, 224, 226, 228, 230).  The one conditional branch in each loop is predicted correctly, except for the final iteration and at the start of a time slice; the branch is taken, and found in the Branch Target Buffer.

Instruction Fetch and Decode (events 121, 128, 129, 133, 134, 135, 40, 104, 208, 192).  One instruction is fetched per cycle; most of these are far ahead of the present loop iteration, and given the correct branch prediction, will be used in later iterations.  The loop is small, so all instructions are found in the L1 i-cache, of course with no i-TLB misses, and essentially no stalls in the instruction fetch pipe stage (less than 1 cycle in 400 iterations).  The scale loop shows far more instructions decoded than retired, but some small modifications to the compiled code can push this effect to another loop or magnify it.

Memory access (events 98, 100, 101, 102, 103, 110, 111, 112).  The external bus (between Level 2 cache and memory, among other system components) delivers 64 bits of data per bus clock when the DRDY (data ready) signal is asserted.  A burst transaction moves one full cache line, or 32 bytes.  Bus requests can be issued every 3 bus clocks, with up to 4 pending requests.  On this system, one bus clock (at 66 MHz) is three processor clocks (at 200 MHz).  An assignment requires a read-for-ownership memory operation prior to the write to memory, even on a single-processor system.  These unit-stride loops touch all the bytes in a cache line, in sequence.  If the loops are compiled to minimize memory traffic, we should expect a pattern like this, per loop iteration:

                      Bus Clocks
Bus Transaction       Copy   Scale   Add   Triad    Events
  read                  1      1      2      2      101
  read for ownership    1      1      1      1      102
    subtotal            2      2      3      3      100
  write-back            1      1      1      1      103, 98 [-b 0]
    total               3      3      4      4      110, 111, 112, 98 [-b 1]
These are all part of burst transactions, so the event counts should be divided by 4 (one 32 byte cache line is consumed in 4 iterations), or multiplied by 3 (change to processor clocks), as appropriate.  The correct branch prediction means that speculative loads (not easily distinguished with this set of performance counters) will prove to be useful.
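A worked example for the copy loop, using the measured values from the post-processed summary below:

per 4 iterations:  one line of a burst-read, one line of c read for ownership,
                   one modified line of c written back  =  3 transactions
transactions       3 / 4 iterations         = 0.75 per iteration   (bus_tran_any   0.7496)
DRDY bus clocks    3 x 4 bus clocks / 4     = 3    per iteration   (the "total" row above)
processor clocks   3 bus clocks x 3         = 9    per iteration   (bus_drdy_clocks  8.99)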

Memory access (events 121, 162, 67, 5).  The best clue that these loops are memory-limited is that all but about three cycles of each loop are resource stalls, and those three cycles are loop control [from the previous analysis].  There are no misaligned memory references, which can be a serious problem for double-precision.

Level 1 Cache access (events 69, 70, 71).  The average cache line counts should be in units of 1/4 line per iteration.  The lines allocated in L1 d-cache per iteration should be (2, 2, 3, 3)/4, assuming scalar is maintained in a register or never leaves cache, and neglecting the loop index and addresses.  This is confirmed.  Only one of these lines is written, so the Modified-state line transit rate is (1, 1, 1, 1)/4 in and out (allocated and removed) per iteration.
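Spelled out, with the measured dcu_lines_in values from the summary (one 32-byte line holds 4 doubles, so each array contributes 1/4 line per iteration):

copy    lines of a and c      2 lines / 4 iterations = 0.50   (measured 0.5039)
scale   lines of c and b      2 / 4                  = 0.50   (0.5039)
add     lines of a, b and c   3 / 4                  = 0.75   (0.7561)
triad   lines of b, c and a   3 / 4                  = 0.75   (0.7560)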

Level 2 Cache access (events 36, 38, 37, 39).  The lines moving in and out of Level 1 cache must also move through Level 2, so the number of lines in and out is again (2, 2, 3, 3)/4, and the number of modified lines in and out is (1, 1, 1, 1)/4, per iteration.

Memory access (events 101, 102, 103, 110, 111, 112).  The lines moving in and out of Level 2 cache must also move through the memory via the external bus, so the number of lines moved on a burst-read transaction is (1, 1, 2, 2)/4, and on both a read-for-ownership transaction and write-back transaction are (1, 1, 1, 1)/4 per iteration.  The total memory bus activity is (3, 3, 4, 4)/4 transactions per loop iteration (burst transactions, memory transactions, and all transactions).  The bus transactions occur at a rate of 4.6-4.8 million per second.
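That rate also follows from the loop times and the per-iteration transaction counts:

copy   1000000 iterations / 0.160 s = 6.2 million/s  x  0.75 transactions  =  4.7 million transactions/s
add    1000000 iterations / 0.219 s = 4.6 million/s  x  1.00 transactions  =  4.6 million transactions/s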

Memory access (events 99, 105, 106, 107, 108, 109, 97, 122, 123, 126).  These are all zero, or nearly so.  Some will be zero on every single-processor system.

Memory access (events 100, 98 [-b 0], 98 [-b 1]).  The number of bus clock cycles per iteration where the processor is receiving data is (2, 2, 3, 3), corresponding to the read and read-for-ownership transactions, and (1, 1, 1, 1) when it is supplying data, corresponding to the write-back transaction.  In total, (9, 9, 12, 12) processor cycles are devoted to data movement in each iteration.
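A rough check of the data rates, derived from the figures above (an added observation, not part of the original analysis):

bytes on the bus per iteration    (3, 3, 4, 4) DRDY bus clocks x 8 bytes  =  (24, 24, 32, 32)
time per iteration at 200 MHz     (32.1, 32.2, 43.7, 44.5) cycles         =  (160, 161, 219, 222) ns
actual bus traffic                about (150, 149, 146, 144) MB/s

This is somewhat higher than the Stream rates of 99-110 MB/s, because Stream does not charge the read-for-ownership traffic for the array being stored.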


Download the original Stream program.

Download the event codes and descriptions, slightly-modified Stream program, Stream with the PMC library, driver script, post-processing script, Stream output, performance monitors output, and performance monitors summary.

Downloads, for each of the Pentium, Pentium/MMX, Pentium Pro, and Pentium II:
   help (-h), version (-v), all events (-l, -c, -d),
   original code, modified code, survey (sh), survey.awk,
   stream.out, survey.out, survey.sum


Performance-Monitoring Counters Library, for Intel Processors and Linux
Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000