Intel Pentium Pro / Pentium II/III
Performance-Monitoring Events, Commentary and [Review Questions]
(under construction)
Diagram -- events and processor/cache/memory components
jpeg format, pdf format (requires Adobe Acrobat Reader)
Suggested event pairs, as a rabbit input file
Nearly complete coverage
Data references related to other events
Events vs. cycles for some events
User mode vs. System mode
References:
"quoted text" without a reference is (with minor modifications) from
Intel Corp., Intel Architecture Software Developer's Manual,
vol. 3: System Programming Guide, 1997, order no. 243192,
Appendix A, Performance-Monitoring Events.
[1] Dileep Bhandarkar and Jason Ding,
"Performance Characterization of the Pentium Pro Processor,"
Proc. Third Int. Symp. on High-Performance Computer Architecture,
Feb. 1-5, 1997, IEEE Computer Society Press, 1997, pp. 288-297.
[2] Intel Corp., Intel Architecture Optimization Manual, 1997,
order no. 242816-003.
[3] Intel Corp., Intel Architecture Optimization Reference Manual,
1999, order no. 245127-001.
In the following event descriptions, the event codes are hex (base 16),
and the arrows represent movement of information in response to a request.
Rabbit command-line options (-this or --that) are used.
Terms and Acronyms:
L1, L2 = Level 1 cache, Level 2 cache.
(Think of the registers as Level 0, the memory as Level 3, and
the paging disk as Level 4.)
IFU = Instruction Fetch Unit, includes the L1 instruction cache.
DCU = Data Cache Unit, includes the L1 data cache.
(L2 is a unified cache.)
TLB = Translation Lookaside Buffer, a small cache for virtual memory
address calculations.
BTB = Branch Target Buffer, used for dynamic branch prediction.
MMX = MMX. Some people claim this means Multi-Media eXtensions.
hit, miss: When the processor needs some information, it is found
(a hit) or not (a miss). A miss at Level i causes a request to
Level i+1, and then a hit or miss at Level i+1. Since Level i+1
is slower than Level i, there will be a delay (the Level i miss
penalty) before data is delivered in response to the request.
The miss ratio (misses at Level i / requests to Level i) is also
important information.
fault: A miss that causes paging activity.
cache line: To improve efficiency, based on spatial locality, 32
bytes (a cache line) are moved between Level 1, Level 2 and memory
in response to any processor request.
cache line management:
... allocate, remove, evict, replace
data bus: Connects the L1 and L2 caches; also called the backside
cache bus.
external bus: Connects the processor to the memory and other devices.
Carries data, addresses, and transaction information. A bus agent
is any device, usually a processor or memory subsystem, that can
initiate a bus transaction. An external bus transaction will
occupy the bus for some number of cycles. The counters can be
configured to react to any bus agent (rabbit option --bus_agent 1)
or only to this processor (--bus_agent 0).
processor cycle, bus cycle: ...
For example, a 200 MHz Pentium Pro with a 66 MHz system bus will
use 3 processor cycles for each bus cycle.
active cycle: A processor cycle when the processor is not halted.
halt state: The Time Stamp Counter will continue to increment in
the halt state (HLT instruction).
Cache Line States:
S = shared (and not modified)
E = exclusive (and not modified)
M = modified (and exclusive)
I = invalid
(Intel prefers the acronym MESI for this protocol.)
The cache line state answers these questions related to consistency:
How does this copy of the line compare to the one in memory?
(the local copy has been modified or has not been modified
since it was obtained from memory)
Is there another copy of the line in another processor's cache?
(if shared then maybe, if exclusive then no)
Lines in the Level 1 instruction cache are either shared or invalid.
Copying a line from memory to the local cache requires a conversation
with the other processors to see if they already hold a copy of the
line.
When reading a line, an M-state line elsewhere must be written back
to memory and changed to S-state, and an E-state line elsewhere must
be changed to S-state; S-state lines elsewhere are unaffected. The
read from memory can then proceed, with the local line in either S-
or E-state.
To establish exclusive ownership before writing to a line, an M-state
line elsewhere must be written back to memory and invalidated, while
all S- or E-state lines elsewhere can simply be invalidated. The line
can then be loaded from memory in M-state and modified. This "read for
ownership" sequence will occur on single- and multi-processor systems.
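The two sequences above can be summarized as lookup tables for the copy of the line held in another processor's cache. This is only a minimal sketch in Python, not a model of the hardware; the names are hypothetical, and leaving an I-state copy unchanged is an assumption that the text does not state explicitly.

```python
# Effect on a copy of the line held in ANOTHER processor's cache.
# Each entry maps the remote state to (new remote state, write-back needed).
ON_REMOTE_READ = {
    "M": ("S", True),    # written back to memory, then shared
    "E": ("S", False),   # changed to shared
    "S": ("S", False),   # unaffected
    "I": ("I", False),   # assumption: an invalid copy stays invalid
}
ON_REMOTE_RFO = {
    "M": ("I", True),    # written back to memory, then invalidated
    "E": ("I", False),   # simply invalidated
    "S": ("I", False),   # simply invalidated
    "I": ("I", False),   # assumption: an invalid copy stays invalid
}


def snoop(remote_state, read_for_ownership):
    """Return (new remote state, write-back needed) for one remote copy."""
    table = ON_REMOTE_RFO if read_for_ownership else ON_REMOTE_READ
    return table[remote_state]
```

For example, snoop("M", False) describes a plain read that finds a modified copy elsewhere, and snoop("E", True) describes a read for ownership that finds an exclusive copy elsewhere.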
Some cache lines or virtual memory pages may be non-cacheable. This
is usually determined by the operating system, or by the user through
a system call. [What is the effect of the C keyword "volatile"?]
Pipeline Stalls:
The processor pipelines must delay execution of an operation if the
required information is not available. These "stalls" are usually
caused by a miss at some level of the cache/memory hierarchy, but
also by a relatively slow or delayed previous operation, or by a
resource conflict in the processor.
Pipeline Latency (Pentium Pro):
... 14-stage pipeline
in-order front end, 8 stages
out-of-order execution, 3 stages, 5 execution units
in-order retirement, 3 stages
A simple register-to-register integer operation executes in one cycle.
operation            cycles,      cycles/result,
                     latency      throughput
load (L1 hit)        3
integer multiply     4            1
fl.pt. add           3            1
fl.pt. multiply      5            2
fl.pt. divide        17 single    not pipelined
                     32 double
                     37 extended
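To see what the two columns of the table mean in practice, the following sketch compares a chain of dependent operations with a run of independent ones. It is an idealized model under stated assumptions: one execution unit, no other stalls, and hypothetical function names. It is not a description of how the Pentium Pro schedules work.

```python
def chain_cycles(count, latency):
    """Cycles for `count` operations where each one needs the previous result."""
    return count * latency


def independent_cycles(count, latency, cycles_per_result):
    """Cycles for `count` independent operations issued back to back."""
    if count == 0:
        return 0
    # The first result appears after `latency` cycles; each later result
    # follows `cycles_per_result` cycles after the one before it.
    return latency + (count - 1) * cycles_per_result


# fl.pt. multiply from the table: latency 5, one result every 2 cycles
print(chain_cycles(10, 5))            # 50
print(independent_cycles(10, 5, 2))   # 23
```

For an operation that is not pipelined, such as the divide, the cycles per result equal the latency, and the two functions give the same answer.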
Speculative Execution:
To compensate for various delays, the processor will attempt to execute
an instruction as soon as its data is available, even before the previous
instructions have all completed. This entails fetching instructions from
predicted paths following a conditional branch; when the branch direction
is resolved, the instructions on the wrong path are discarded, and those
on the right path are accepted (retired). Instructions are retired in
the order in which they appear in the program, so there are no semantic
difficulties. The ability to predict branch direction based on recent
behavior reduces the possibility that speculative work is wasted.
Micro-operations:
An instruction is decoded into one or more micro-operations, which are
then matched with available data and sent to the execution units. The
number of micro-operations executed per cycle is a measure of the internal
parallelism in the processor. On the Pentium Pro, up to five micro-ops
can be in execution in one cycle, and up to three retired per cycle.
Register renaming is used to compensate for the relatively small number
of architectural registers. Up to three micro-ops can be renamed in
one cycle; there are 40 physical registers available.
L1 and L2 cache:
The L1 d-cache can execute one load and one store operation per cycle.
Buffering between the processor and the cache helps to smooth out delays.
               size (KB)        set associativity
               L1 i    L1 d     L1 i    L1 d    L2
Pentium Pro    8       8        4       2       4
Pentium II     16      16
The size of the L2 cache can be determined from the CPUID instruction
(try 'rabbit -p'). A larger L2 cache tends to reduce the average number
of cycles per instruction, and definitely increases the cost of the
processor.
Data Bus Transactions: On the L1 side of L2, these events record the
action taken by the processor that caused a cache line to be copied.
See the next group for the memory side of L2.
2e  requests
28    instruction fetch    L2 to L1 i-cache
29    data load            L2 to L1 d-cache
2a    data store           L1 d-cache to L2
These event count relations should hold:
2e = 28 + 29 + 2a
These are the only events affected by the --mesi option.
The data bus is 8 bytes wide, while a cache line is 32 bytes wide.
On the Pentium Pro, the transaction timing is 4-1-1-1 at the processor
clock rate; that is, 4 cycles for the first 8 bytes, followed by 1
cycle for each of the remaining three 8-byte transfers.
External Bus Transactions: As seen by the event counters,
70 all
   6c i/o
   6f memory
      6e burst
         65 burst read
         66 read for ownership
         67 writeback
         68 instruction fetch
      6b partial
         6a partial write
      69 invalidate
   6d deferred
These event count relations should hold:
6e = 65 + 66 + 67 + 68
6f = 6e + 6b + 69
70 = 6c + 6f + 6d
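These relations, together with 2e = 28 + 29 + 2a for the data bus, can be checked mechanically once the counts have been collected. The sketch below assumes the counts are held in a dictionary keyed by event code; the function name and the tolerance parameter are hypothetical. A nonzero tolerance may be useful when the counts come from separate runs.

```python
# total event -> events that should add up to it (codes are hex)
RELATIONS = {
    0x2e: (0x28, 0x29, 0x2a),
    0x6e: (0x65, 0x66, 0x67, 0x68),
    0x6f: (0x6e, 0x6b, 0x69),
    0x70: (0x6c, 0x6f, 0x6d),
}


def check_relations(counts, tolerance=0.0):
    """Return a list of (event, measured, expected) for relations that fail.

    `tolerance` is a relative error; relations whose events were not all
    measured are skipped.
    """
    failures = []
    for total, parts in RELATIONS.items():
        if total not in counts or any(p not in counts for p in parts):
            continue
        expected = sum(counts[p] for p in parts)
        measured = counts[total]
        if abs(measured - expected) > tolerance * max(measured, expected, 1):
            failures.append((total, measured, expected))
    return failures
```

An empty list means every relation that could be checked held within the tolerance.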
The external bus is 8 bytes wide. With the Pentium Pro, it operates
at 60 or 66 MHz; with the Celeron, Pentium II and Pentium II Xeon
processors, it operates at 66 or 100 MHz. With the Pentium III, it
operates at 100 MHz.
On the Pentium Pro, up to 8 outstanding bus transactions are allowed.
Event 60 will count some of these transactions. The transaction timing
depends on the degree of memory interleaving. On the Pentium Pro, it
can be 14-1-1-1, 14-2-2-2, or 14-4-4-4 for 4-, 2- or 1-way interleaving,
counted in bus cycles.
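The burst timings can be turned into cycle counts per cache line with a little arithmetic. This sketch reads a timing such as 14-1-1-1 as 14 cycles for the first 8-byte chunk and 1 cycle for each of the other three, in the same way as the 4-1-1-1 data bus timing. The function name is hypothetical, and the factor of 3 processor cycles per bus cycle is the 200 MHz / 66 MHz example given earlier, not a general constant.

```python
LINE_BYTES = 32   # cache line size
BUS_BYTES = 8     # bus width


def burst_cycles(first, following, transfers=LINE_BYTES // BUS_BYTES):
    """Cycles to move one cache line as `transfers` bus-width chunks."""
    return first + (transfers - 1) * following


CPU_PER_BUS = 3   # assumption: 200 MHz processor with a 66 MHz bus

for interleave, following in ((4, 1), (2, 2), (1, 4)):
    bus = burst_cycles(14, following)
    print(f"{interleave}-way: {bus} bus cycles, "
          f"{bus * CPU_PER_BUS} processor cycles")
# 4-way: 17 bus cycles, 51 processor cycles
# 2-way: 20 bus cycles, 60 processor cycles
# 1-way: 26 bus cycles, 78 processor cycles
```

The same function gives burst_cycles(4, 1) = 7 processor cycles for the 4-1-1-1 data bus transfer between L1 and L2.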
Subset Relations:
The first event is included in the second event:
85 81
46 45
23 22
25 24
27 26
Miss Relations:
The first event is a miss directly following the second event:
81 80 L1 instruction cache
45 43 L1 data cache
68 28 L2, instruction fetch
Events or cycles.
These events may be configured (with --duration) to count events (--d 0)
or cycles (--d 1):
04, 14, 22, 23, 48, 60, 61, 62, 63, 64, 79, 7a, 7b, 7e, 86, 87, a2,
c6, c7, d2.
However, there is usually no difference in the results. Bummer.
At each cycle,
clocks
79 cpu_clk_unhalted
"Number of cycles during which the processor is not halted."
The halt state is typically used in the OS idle loop.
a2 resource_stalls
"Number of cycles during which there are resource-related stalls."
Other operations could occur when one operation is stalled. Resources
include "register renaming or reorder buffer entries, memory buffer
entries, and execution units." [1]
d2 partial_rat_stalls
"Number of cycles or events for partial stalls."
RAT = register alias table, used for renaming; AX is a partial register
in EAX.
An instruction fetch will induce these events:
memory access
80 ifu_ifetch
processor <-- L1 i-cache.
"Number of instruction fetches, both cacheable and non-cacheable."
Includes speculative fetches for instructions decoded but not executed,
or decoded and executed but not retired. The processor will attempt
one fetch of 16 bytes each active cycle.
81 ifu_ifetch_miss
L1 i-cache <-- L2 cache.
"Number of instruction fetch misses."
Caused by an L1 i-cache miss.
85 itlb_miss
L1 i-cache <-- L2 cache.
"Number of instruction TLB misses."
Caused by an L1 i-cache TLB miss (which would also imply an L1 i-cache
miss).
28 l2_ifetch
L1 i-cache <-- L2 cache.
"Number of L2 instruction fetches." (--mesi)
Caused by an L1 i-cache miss.
[What is the relation between events 81 and 28?]
68 bus_tran_ifetch
L2 cache <-- external bus (memory).
"Number of instruction fetch transactions." (--bus_agent)
Caused by an L2 cache miss.
processor pipelines
86 ifu_mem_stall
processor <-- L1 i-cache <-- L2 cache.
"Number of cycles that the instruction fetch pipe stage is stalled,
including cache misses, ITLB misses, ITLB faults, and victim cache
evictions."
The victim cache holds recently removed TLB entries.
87 ild_stall
"Number of cycles that the instruction length decoder is stalled."
This pipe stage is immediately downstream from the fetch pipe stage.
[What is the longest legal instruction, in bytes?]
instruction decode and retire unit
d0 inst_decoder
"Number of instructions decoded."
Up to three instructions can be decoded in one cycle, generating up
to 6 micro-operations in one cycle (one instruction up to 4 uops,
and two instructions at 1 uop each). A microcode sequencer is used
for complex operations. The decode can stall if its output queue
(6 uops) is not drained by the 5 execution units.
The number of uops per instruction (except for the complex cases)
for the Pentium Pro or Pentium II/III is given in reference [2] or
[3], Appendix C.
branch unit
e0 br_inst_decoded
"Number of branch instructions decoded."
e2 btb_misses
"Number of branches that miss the branch target buffer."
The BTB holds information about recently executed branches, and the
direction actually taken. It is used for branch prediction and
speculative execution; on the Pentium Pro, it has 512 entries.
e4 br_bogus
"Number of bogus branches."
[Is this event bogus?]
e6 baclears
"Number of times BACLEAR is asserted." (static branch prediction)
floating-point execution unit
10 fp_comp_ops_exe
"Number of computational floating-point operations executed."
(counter 0 only)
This includes integer multiply.
11 fp_assist
"Number of floating-point exception cases handled by microcode."
(counter 1 only)
12 mul
"Number of multiplies." (counter 1 only)
This includes integer multiply.
13 div
"Number of divides." (counter 1 only)
14 cycles_div_busy
"Number of cycles during which the divider is busy." (counter 0 only)
On the Pentium II, 35 cycles per divide.
MMX execution unit (Pentium II only)
b0 MMX_instr_exec
"Number of MMX instructions executed."
b1 MMX_sat_instr_exec
"Number of MMX saturating arithmetic instructions executed."
b2 MMX_uops_exec
"Number of MMX micro-operations executed." (--MMX)
On port 0..3, requested by the unit mask, bitwise.
b3 MMX_instr_type_exec
Number of MMX instructions executed (--MMX).
Requested by the unit mask, bitwise:
1 packed multiply
2 packed shift
4 pack operation
8 unpack operation
10 packed logical
20 packed arithmetic
cc fp_MMX_trans
Number of transitions between FP and MMX states (--MMX).
Requested by the unit mask:
0 MMX to FP
1 FP to MMX
cd MMX_assist
"Number of MMX assists (that is, the number of EMMS instructions
executed)."
ce MMX_instr_ret
"Number of MMX instructions retired."
cf PII_0xCF
MMX, saturated arithmetic instructions retired (undocumented)
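For event b3 above, a unit mask that selects several instruction types is the bitwise OR of the individual values. The sketch below assumes that the listed values are hex, like the event codes; the dictionary and function names are hypothetical, and how the resulting mask is passed to rabbit is not shown here.

```python
# Unit mask bits for event b3 (MMX_instr_type_exec), assumed to be hex.
MMX_INSTR_TYPE = {
    "packed multiply":   0x01,
    "packed shift":      0x02,
    "pack operation":    0x04,
    "unpack operation":  0x08,
    "packed logical":    0x10,
    "packed arithmetic": 0x20,
}


def unit_mask(*kinds):
    """OR together the mask bits for the requested instruction types."""
    mask = 0
    for kind in kinds:
        mask |= MMX_INSTR_TYPE[kind]
    return mask


print(hex(unit_mask("packed multiply", "packed arithmetic")))  # 0x21
```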
Streaming SIMD execution unit (Pentium III only)
d8 emon_sse_inst_retired
"Number of Streaming SIMD Extensions instructions retired." (--MMX)
Requested by the unit mask, bitwise:
0 packed and scalar
1 scalar
d9 emon_sse_comp_inst_ret
"Number of Streaming SIMD Extensions computational instructions retired."
(--MMX)
Unit mask as above.
instruction decode and retire unit
c0 inst_retired
"Number of instructions retired."
c2 uops_retired
"Number of micro-operations retired."
Up to 3 per cycle.
c1 flops
"Number of computational floating-point operations retired."
(counter 0 only)
This does not include integer multiply (refer to events 10, 12).
branch unit
c4 br_inst_retired
"Number of branch instructions retired."
c5 br_miss_pred_retired
"Number of mispredicted branches retired."
"Mispredicted branches are not recognized for about 10 to 15 cycles." [1]
c9 br_taken_retired
"Number of taken branches retired."
ca br_miss_pred_taken_ret
"Number of taken mispredicted branches retired."
A data memory access (load or store) will induce these events:
memory access
43 data_mem_refs
processor <--> L1 d-cache.
"All memory references, both cacheable and non-cacheable."
Includes speculative loads for instructions executed but not retired.
There are no speculative stores, but there may be a speculative load
for ownership. The number of bytes referenced is not indicated, and
must be deduced by other means.
The Pentium Pro load latency on an L1 d-cache hit is 3 cycles.
[In C, how many references are caused by the assignments (q = p),
(q = *p), (p = *p) or (p = p->next), if p and q are not initially in
registers, and the types are reasonable?]
05 misalign_mem_ref
"Number of misaligned data memory references."
A compiler will usually align static or global data to the appropriate
boundary, but local (runtime stack) double precision data is easily
misaligned. Misaligned data takes longer to access than aligned data.
A data load will further induce these events:
memory access
45 dcu_lines_in
L1 d-cache <-- L2 cache.
"Total lines allocated in the Data Cache Unit."
Caused by an L1 d-cache miss (if the reference was a hit, the line
would already be allocated). Includes misses on speculative loads
for instructions executed but not retired. The cache line state is
E (exclusive) or S (shared). The bytes in the cache line are the
same as the corresponding bytes in memory, or in another cache.
Two successive misses to one cache line will allocate the line only
once; the second miss appears to be a hit, but may still suffer an
access delay. The L1 d-cache miss ratio, (event 45) / (event 43),
can thus be underestimated. A reference that misses the cache and
crosses a cache line boundary will allocate two lines, so the miss
ratio can also be overestimated. It is necessary to examine some
other event counts and the number of cycles to decide if the miss
ratio has been estimated reliably.
48 dcu_miss_outstanding
L1 d-cache <-- L2 cache.
"Weighted number of cycles while a DCU miss is outstanding."
This event has the unit miss-cycles. That is, in each cycle the
number of unsatisfied data references is added to the event counter.
The aggregate (over all levels of memory) average miss penalty is
(event 48) / (event 45) cycles [explanation follows], except that
"An access that also misses the L2 is shortchanged by 2 cycles
(i.e., if it counts for N cycles, it should be for N+2 cycles).
Subsequent loads to the same cache line will not result in any
additional counts. Count value not precise, but still useful."
"The L1 data cache can accept a new load or store every cycle, and
has a latency of three cycles for loads. It can handle as many as
four simultaneously outstanding misses." [1]
29 l2_ld
L1 d-cache <-- L2 cache.
"Number of L2 data loads." (--mesi)
Caused by an L1 d-cache miss.
[What is the relation between events 45 and 29?]
60 bus_req_outstanding
"Number of bus requests outstanding."
"Counts only DCU full-line cacheable reads, not RFOs, writes,
instruction fetches, or anything else. Counts 'waiting for bus
to complete' (last data chunk received)."
[Can we use the ratio (event 60) / (event 29) to obtain the L2
miss penalty?]
A data store will further induce these events:
load for ownership
45 dcu_lines_in
(see above)
The bytes in the cache line are notionally the same as the
corresponding bytes in memory, or in another cache, but this is only
a temporary situation until the state is changed to M and the store
is executed.
46 dcu_m_lines_in
L1 d-cache <-- L2 cache.
"Number of M-state lines allocated in the Data Cache Unit."
Caused by an L1 d-cache store miss. Included in event 45. The cache
line state is M (modified), and thus exclusive. Copies of the same
line in another cache are invalidated; the copy in memory is not yet
changed.
[What is the relation between events 45, 46 and 29? Is it always
45 = 46 + 29 ?]
48 dcu_miss_outstanding
(see above)
store
03 ld_blocks
"Number of (times the) store buffer blocks."
04 sb_drains
"Number of store buffer drain cycles."
47 dcu_m_lines_out
L1 d-cache --> L2 cache.
"Number of M-state lines evicted from the Data Cache Unit. This
includes evictions via snoop HITM, intervention, or replacement."
That is, caused by normal replacement of a cache line (when the L1
d-cache is full and another line is to be brought in), explicit
write-back (for example, a cache flush), or a snoop hit by another
processor claiming the line in modified state.
[What is the relation between events 46 and 47?]
2a l2_st
L1 d-cache --> L2 cache.
"Number of L2 data stores." (--mesi)
This should not miss the cache, because of the previous load for
ownership.
[What is the relation between events 47 and 2a? Is it always
47 = 2a + 69 ?]
A generic memory access (fetch, load or store) will induce these events:
L2 cache (L1 side)
2e l2_rqsts
L1 (d or i)-cache <--> L2 cache.
"Number of L2 cache requests." (--mesi)
Caused by an L1 cache miss, instruction or data; includes events 28, 29
and 2a. The request refers to a whole cache line.
22 l2_dbus_busy
L1 (d or i)-cache <--> L2 cache.
"Number of cycles during which the data bus was busy."
Caused by an L1 cache miss.
Each cache line should take 7 cycles to move, with the 4-1-1-1 timing.
[What is the L1 cache-miss penalty?]
23 l2_dbus_busy_rd
L1 (d or i)-cache <-- L2 cache.
"Number of cycles during which the data bus was busy transferring
data from the L2 cache to the processor."
Caused by an L1 cache miss. Included in event 22.
[Does this include instruction fetches?]
L2 cache (memory side)
21 l2_ads
"Number of L2 cache address strobes."
[By whom?]
24 l2_lines_in
L2 cache <-- external bus (memory).
"Number of lines allocated in the L2 cache."
Caused by an L2 cache miss.
cf. events 45, 28, 29.
25 l2_m_lines_inm
L2 cache <-- external bus (memory).
"Number of modified lines allocated in the L2 cache."
Caused by an L2 cache store miss. Included in event 24.
cf. event 46.
26 l2_lines_out
L2 cache --> external bus (memory).
"Number of lines removed from the L2 cache for any reason."
[An unmodified cache line could simply be marked invalid, and not
be written back to memory. What actually happens?]
27 l2_m_lines_outm
L2 cache --> external bus (memory).
"Number of modified lines removed from the L2 cache for any reason."
Included in event 26.
cf. event 47.
external bus transaction
65 bus_tran_brd
L2 cache <-- external bus (memory).
"Number of burst read transactions." (--bus_agent)
A burst will move a cache line (32 bytes) in four data transfers
(8 bytes each).
[What is the relation between events 65 and 24?]
66 bus_tran_rfo
L2 cache <-- external bus (memory).
"Number of read-for-ownership transactions." (--bus_agent)
67 bus_trans_wb
L2 cache --> external bus (memory).
"Number of write-back transactions." (--bus_agent)
68 bus_tran_ifetch
L2 cache <-- external bus (memory).
"Number of instruction fetch transactions." (--bus_agent)
6e bus_tran_burst
L2 cache <--> external bus (memory).
"Number of burst transactions." (--bus_agent)
Includes events 65, 66, 67, 68.
6a bus_tran_pwr
L2 cache --> external bus (memory).
"Number of partial write transactions." (--bus_agent)
Included in event 6b.
6b bus_trans_p
L2 cache <--> external bus (memory).
"Number of partial transactions." (--bus_agent)
A partial transaction moves less than a full cache line.
69 bus_tran_inval
"Number of invalidate transactions." (--bus_agent)
6f bus_tran_mem
L2 cache <--> external bus (memory).
"Number of memory transactions." (--bus_agent)
Includes events 6e, 6b, 69.
6c bus_trans_io
"Number of I/O transactions." (--bus_agent)
6d bus_tran_def
"Number of deferred transactions." (--bus_agent)
70 bus_tran_any
"Number of all transactions." (--bus_agent)
Includes events 6f, 6c, 6d.
An external bus transaction will occupy the bus for some number of cycles:
62 bus_drdy_clocks
"Number of clocks during which DRDY (the data ready signal) is
asserted."
"--bus_agent 0 counts bus clocks when this processor is driving DRDY;
--bus_agent 1 counts processor clocks when any agent is driving DRDY."
The DRDY signal is asserted by the bus agent driving the data to
indicate that the data is valid during the data phase of a transaction.
Wait states can be induced by not asserting DRDY during this phase.
On a single-processor system, the other bus agents would be memory
and i/o devices.
64 bus_data_rcv
"Number of bus clock cycles during which this processor is receiving
data."
[Does this include instruction fetches?]
[What does (event 64) / (event 60) represent?]
61 bus_bnr_drv
"Number of bus clock cycles during which this processor is driving
the BNR pin." (BNR = Block Next Request)
63 bus_lock_clocks
"Number of processor clocks during which the LOCK signal is asserted."
(--bus_agent)
7a bus_hit_drv
"Number of bus clock cycles during which this processor is driving
the HIT pin. Includes cycles due to snoop stalls."
7b bus_hitm_drv
"Number of bus clock cycles during which this processor is driving
the HITM pin. Includes cycles due to snoop stalls."
7e bus_snoop_stall
"Number of bus clock cycles during which the bus is snoop stalled."
This should only happen on a multiprocessor.
An interrupt will induce these events:
c8 hw_int_rx
"Number of hardware interrupts received."
On most personal computers, this will be about 100 per second, from
the real-time clock.
c6 cycles_int_masked
"Number of processor cycles for which interrupts are disabled."
c7 cycles_int_pending_and_masked
"Number of processor cycles for which interrupts are disabled
and interrupts are pending."
A memory-management operation will induce these events:
segment register
06 segment_reg_loads
"Number of segment register loads."
d4 seg_rename_stalls
"Number of segment register renaming stalls." (--MMX)
Pentium II only. Use the unit mask to specify segment registers
ES, DS, FS, GS, by or-ing 1, 2, 4, 8.
d5 seg_reg_renames
"Number of segment register renames." (--MMX)
Pentium II, as above.
d6 ret_seg_renames
"Number of segment register rename events retired."
Pentium II only.
A prefetch operation will induce these events (Pentium III only):
07 emon_sse_pre_dispatched
"Number of prefetch/weakly-ordered instructions dispatched;
speculative prefetches are included in counting." (--MMX)
Requested by the unit mask, bitwise:
0 prefetch NTA
1 prefetch T0
2 prefetch T1, prefetch T2
3 weakly ordered stores
4b emon_sse_pre_miss
"Number of prefetch/weakly-ordered instructions that miss
all caches." (--MMX)
Unit mask as above.
These events report a hazardous condition:
52 PII_0x52
self-modifying code sequences (undocumented)
Explanations.
The miss penalty for data references.
Over a period of N processor cycles, for large N, how many Level 1
data-cache misses will have been outstanding, and for how long on
average? Let
N = processor cycles = (actual cycles, but we could also use event 79),
r = data references / cycle = (event 43) / (event 79),
m = L1 d-cache misses per reference = (event 45) / (event 43),
p = aggregate average miss penalty (cycles).
Some of the L1 d-cache misses will also miss L2 or cause a page fault,
so p is not the L1 miss penalty alone. Because p is determined by an
experiment with one program or section of code, this estimate might
not apply to another program or section of code. Nevertheless, here
we go: In N cycles, there will be Nr data references. Among these
Nr references, there will be (Nr)m misses. Event 48, over N cycles,
will count (Nrm)p miss-cycles. Thus,
(event 48)
= N * r * m * p
= (event 79) * (event 43 / event 79) * (event 45 / event 43) * p,
so p = (event 48) / (event 45).
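The quantities r, m and p can be computed directly from four event counts. The sketch below assumes the counts are held in a dictionary keyed by event code; the function name and the sample numbers are hypothetical and are not measurements.

```python
def miss_statistics(counts):
    """Return (r, m, p) as defined above from raw event counts."""
    cycles = counts[0x79]        # N, processor cycles (not halted)
    refs = counts[0x43]          # data memory references
    lines_in = counts[0x45]      # L1 d-cache lines allocated (misses)
    miss_cycles = counts[0x48]   # weighted miss-cycles
    r = refs / cycles            # data references per cycle
    m = lines_in / refs          # L1 d-cache misses per reference
    p = miss_cycles / lines_in   # aggregate average miss penalty (cycles)
    return r, m, p


# Made-up counts, for illustration only.
sample = {0x79: 1_000_000, 0x43: 400_000, 0x45: 20_000, 0x48: 300_000}
r, m, p = miss_statistics(sample)
print(r, m, p)  # 0.4 0.05 15.0
```

Multiplying N * r * m * p with these values gives back the event 48 count, which is the identity used in the derivation.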
[Why was it necessary to assume a measurement over a large number of
cycles, when the number of cycles does not appear directly in the final
equation? Are there any other tacit assumptions in this analysis?]
[The processor can only hold a certain number of outstanding misses
in its buffer. How should this number enter into the analysis?]
[Suppose a miss is outstanding for t cycles, and there is a limit
of b outstanding misses; further misses are delayed in the processor.
If n misses occur in rapid succession, the last outstanding miss is
cleared after t*n/b + b-1 cycles; for simplicity we assume n is a
multiple of b. Event 48 will count
b(b-1)/2 {ramp up over b-1 cycles}
+ b(t-b+1) {steady state for first b misses}
+ b*t*(n-b)/b {steady state for n-b misses; clear b in t cycles}
+ b(b-1)/2 {ramp down over b-1 cycles}
= nt miss-cycles.
Event 45 will count n misses, and the ratio is correct, subject to
the inaccuracies described earlier.]
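The sum above can be checked with a small cycle-by-cycle simulation. This is a sketch under stated assumptions: one new miss can be issued per cycle while a buffer slot is free, every miss stays outstanding for exactly t cycles, t is at least b, and n is a multiple of b. The function names are hypothetical.

```python
def closed_form(n, t, b):
    """Sum of the four terms in the text; algebraically equal to n * t."""
    ramp_up = b * (b - 1) // 2
    first_batch = b * (t - b + 1)
    later_batches = t * (n - b)
    ramp_down = b * (b - 1) // 2
    return ramp_up + first_batch + later_batches + ramp_down


def simulate(n, t, b):
    """Return (miss-cycles counted, cycle at which the last miss clears)."""
    outstanding = []   # remaining cycles for each miss in flight
    issued = 0
    miss_cycles = 0
    cycle = 0
    while issued < n or outstanding:
        if issued < n and len(outstanding) < b:
            outstanding.append(t)   # issue one new miss this cycle
            issued += 1
        miss_cycles += len(outstanding)   # what event 48 would add
        outstanding = [left - 1 for left in outstanding if left > 1]
        cycle += 1
    return miss_cycles, cycle


print(closed_form(8, 10, 4))  # 80
print(simulate(8, 10, 4))     # (80, 23)
```

With n = 8, t = 10 and b = 4, both approaches count 80 miss-cycles, and the last miss clears after t*n/b + b-1 = 23 cycles.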
Preventing Out-of-order Execution For Precise Measurement:
The micro-operations, and thus the instructions, will execute when their
data is available. There is a class of serializing instructions that
will not execute until all prior instructions have completed, and will
prevent subsequent instructions from moving ahead. The CPUID instruction
is the only serializing instruction available in user mode. If the
rdtsc and rdpmc instructions (to read the cycle and event counters) have
been inlined, then serialization is required for precise measurement.
Consider the sequence
instr 1 # end of measured code
rdtsc # read the Time Stamp Counter
instr 2 # start of unmeasured code
To prevent rdtsc from moving ahead of instr 1, and measuring too little,
and to prevent instr 2 from moving ahead of rdtsc, and measuring too
much, the code can be written as
instr 1 # end of measured code
eax = 0 # operand for cpuid
cpuid # writes to eax, ebx, ecx, edx
rdtsc # writes to eax, edx
instr 2 # start of unmeasured code
and the overhead for the two additional instructions should be subtracted
from the measured time. The overhead should be an estimate of the least
possible cycle or event count, to avoid producing a negative result for
the measurement.
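The bookkeeping for the overhead correction can be sketched as follows. Python cannot issue cpuid or rdtsc, so time.perf_counter_ns is used here purely as a stand-in for the serialized counter read; the function names are hypothetical. The point illustrated is only that the overhead estimate is the smallest value seen over many back-to-back reads, so that subtracting it is unlikely to produce a negative result.

```python
import time


def calibrate_overhead(read_counter=time.perf_counter_ns, trials=1000):
    """Smallest difference observed between two back-to-back counter reads."""
    best = None
    for _ in range(trials):
        start = read_counter()
        stop = read_counter()
        delta = stop - start
        if best is None or delta < best:
            best = delta
    return best


def measure(work, overhead, read_counter=time.perf_counter_ns):
    """Counter ticks spent in work(), with the read overhead subtracted."""
    start = read_counter()
    work()
    stop = read_counter()
    return stop - start - overhead
```

A typical use is overhead = calibrate_overhead() once at startup, followed by measure(some_function, overhead) for each region of interest.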
Measurements with rabbit.
The program should have consumed a large fraction of the CPU time;
if not, the test should be repeated when the system is otherwise idle.
Use the '--os 0' option to remove kernel-mode events, but not kernel-
mode cycles. Use the '-s 1' option to reduce the sampling overhead
when averages are the main interest. Use the '-S 2' option to see the
most measurement detail.
1. average cycles per instruction retired
average cycles per micro-operation retired
micro-operations per instruction
rabbit -S 2 --e 0xc0,0xc2 -s 1 --os 0 foo
The measurements are reported as 'Events per cycle', so they must
be inverted to get 'cycles per instruction'.
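The inversion is simple arithmetic. A minimal sketch, assuming the two reported values are instructions retired per cycle (event c0) and micro-operations retired per cycle (event c2); the function name and the sample values are hypothetical.

```python
def retire_ratios(instr_per_cycle, uops_per_cycle):
    """Convert 'Events per cycle' for events c0 and c2 into the three ratios."""
    return {
        "cycles per instruction": 1.0 / instr_per_cycle,
        "cycles per micro-operation": 1.0 / uops_per_cycle,
        "micro-operations per instruction": uops_per_cycle / instr_per_cycle,
    }


# Made-up values, for illustration only.
for name, value in retire_ratios(0.5, 0.75).items():
    print(f"{name}: {value:.3f}")
# cycles per instruction: 2.000
# cycles per micro-operation: 1.333
# micro-operations per instruction: 1.500
```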
Further Topics.
For which events can there be more than one event per cycle?
[rabbit --compare gt0,le0 --e 0xd0 -s 1 foo]
[rabbit --compare gt1,le1 --e 0xd0 -s 1 foo]
[rabbit --compare gt2,le2 --e 0xd0 -s 1 foo]
[rabbit --compare gt3,le3 --e 0xd0 -s 1 foo]
[rabbit --compare gt4,le4 --e 0xd0 -s 1 foo]
d0 decode up to 3 instructions per cycle
c2 retire up to 3 micro-operations per cycle
48 up to 4 outstanding misses (L1 d-cache) in a cycle
...