Intel Pentium Pro / Pentium II/III

Performance-Monitoring Events, Commentary and [Review Questions]

(under construction)

Diagram           -- events and processor/cache/memory components
  jpeg format, pdf format (requires Adobe Acrobat Reader)

Suggested event pairs, as a rabbit input file
  Nearly complete coverage
  Data references related to other events
  Events vs. cycles for some events
  User mode vs. System mode

References:
  "quoted text" without a reference is (with minor modifications) from
    Intel Corp., Intel Architecture Software Developer's Manual,
    vol. 3: System Programming Guide, 1997, order no. 243192,
    Appendix A, Performance-Monitoring Events.

  [1] Dileep Bhandarkar and Jason Ding,
    "Performance Characterization of the Pentium Pro Processor,"
    Proc. Third Int. Symp. on High-Performance Computer Architecture,
    Feb. 1-5, 1997, IEEE Computer Society Press, 1997, pp. 288-297.

  [2] Intel Corp., Intel Architecture Optimization Manual, 1997,
    order no. 242816-003.

  [3] Intel Corp., Intel Architecture Optimization Reference Manual,
    1999, order no. 245127-001.



In the following event descriptions, the event codes are in hex (base 16),
and the arrows represent movement of information in response to a request.
Where an event can be qualified, the relevant rabbit command-line option
(written -this or --that) is shown in parentheses.


Terms and Acronyms:
  L1, L2 = Level 1 cache, Level 2 cache.
    (Think of the registers as Level 0, the memory as Level 3, and
    the paging disk as Level 4.)
  IFU = Instruction Fetch Unit, includes the L1 instruction cache.
  DCU = Data Cache Unit, includes the L1 data cache.
    (L2 is a unified cache.)
  TLB = Translation Lookaside Buffer, a small cache for virtual memory
    address calculations.
  BTB = Branch Target Buffer, used for dynamic branch prediction.
  MMX = MMX.  Some people claim this means Multi-Media eXtensions.
  hit, miss:  When the processor needs some information, it is found
    (a hit) or not (a miss).  A miss at Level i causes a request to
    Level i+1, and then a hit or miss at Level i+1.  Since Level i+1
    is slower than Level i, there will be a delay (the Level i miss
    penalty) before data is delivered in response to the request.
    The miss ratio (misses at Level i / requests to Level i) is also
    important information.
  fault:  A miss that causes paging activity.
  cache line:  To improve efficiency, based on spatial locality, 32
    bytes (a cache line) are moved between Level 1, Level 2 and memory
    in response to any processor request.
  cache line management:
	... allocate, remove, evict, replace
  data bus:  Connects the L1 and L2 caches; also called the backside
    cache bus.
  external bus:  Connects the processor to the memory and other devices.
    Carries data, addresses, and transaction information.  A bus agent
    is any device, usually a processor or memory subsystem, that can
    initiate a bus transaction.  An external bus transaction will
    occupy the bus for some number of cycles.  The counters can be
    configured to react to any bus agent (rabbit option --bus_agent 1)
    or only to this processor (--bus_agent 0).
  processor cycle, bus cycle: ...
    For example, a 200 MHz Pentium Pro with a 66 MHz system bus will
    use 3 processor cycles for each bus cycle.
  active cycle: A processor cycle when the processor is not halted.
  halt state: Entered with the HLT instruction.  The Time Stamp Counter
    continues to increment while the processor is halted.

Cache Line States:
    S = shared (and not modified)
    E = exclusive (and not modified)
    M = modified (and exclusive)
    I = invalid
    (Intel prefers the acronym MESI for this protocol.)
  The cache line state answers these questions related to consistency:
    How does this copy of the line compare to the one in memory?
 	(the local copy has been modified or has not been modified
	 since it was obtained from memory)
    Is there another copy of the line in another processor's cache?
 	(if shared then maybe, if exclusive then no)
  Lines in the Level 1 instruction cache are either shared or invalid.

  Copying a line from memory to the local cache requires a conversation
  with the other processors to see if they already hold a copy of the
  line.
  When reading a line, an M-state line elsewhere must be written back
  to memory and changed to S-state, and an E-state line elsewhere must
  be changed to S-state; S-state lines elsewhere are unaffected.  The
  read from memory can then proceed, with the local line in either S-
  or E-state.
  To establish exclusive ownership before writing to a line, an M-state
  line elsewhere must be written back to memory and invalidated, while
  all S- or E-state lines elsewhere can simply be invalidated.  The line
  can then be loaded from memory in M-state and modified.  This "read for
  ownership" sequence will occur on single- and multi-processor systems.
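
  The two sequences above can be sketched as a pair of small functions
  that give the response of a copy held by another processor.  The
  function names are hypothetical.  The rule for choosing between S and E
  on the reading side is the usual MESI convention and is an assumption
  here, since the text only says "either S- or E-state".

```python
# Sketch of the snoop responses described above (helper names are made up).

def snoop_read(remote_state):
    """Response of a remote copy when another processor reads the line.

    Returns (new_remote_state, writeback_needed).
    """
    if remote_state == "M":
        return "S", True        # write back to memory, then share
    if remote_state == "E":
        return "S", False       # no longer the only copy
    return remote_state, False  # S and I copies are unaffected


def snoop_rfo(remote_state):
    """Response of a remote copy to a read for ownership."""
    # Every remote copy ends up invalid; only an M copy is written back first.
    return "I", remote_state == "M"


def local_state_after_read(remote_states):
    """Assumption (usual MESI rule, not stated in the text): the reader
    holds the line in E if no other cache had a valid copy, otherwise S."""
    return "E" if all(s == "I" for s in remote_states) else "S"


for state in "MESI":
    print(state, snoop_read(state), snoop_rfo(state))
```

  On a single-processor system there are no remote copies, so a read
  would leave the local line in E-state under this assumption.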

  Some cache lines or virtual memory pages may be non-cacheable.  This
  is usually determined by the operating system, or by the user through
  a system call.  [What is the effect of the C keyword "volatile"?]

Pipeline Stalls:
  The processor pipelines must delay execution of an operation if the
  required information is not available.  These "stalls" are usually
  caused by a miss at some level of the cache/memory hierarchy, but
  also by a relatively slow or delayed previous operation, or by a
  resource conflict in the processor.

Pipeline Latency (Pentium Pro):

  ... 14-stage pipeline
	in-order front end, 8 stages
	out-of-order execution, 3 stages, 5 execution units
	in-order retirement, 3 stages

  A simple register-to-register integer operation executes in one cycle.

  operation		latency		throughput
			(cycles)	(cycles/result)
  load (L1 hit)		3
  integer multiply	4		1
  fl.pt. add		3		1
  fl.pt. multiply	5		2
  fl.pt. divide		17 single	not pipelined
			32 double
			37 extended

Speculative Execution:
  To compensate for various delays, the processor will attempt to execute
  an instruction as soon as its data is available, even before the previous
  instructions have all completed.  This entails fetching instructions from
  predicted paths following a conditional branch; when the branch direction
  is resolved, the instructions on the wrong path are discarded, and those
  on the right path are accepted (retired).  Instructions are retired in
  the order in which they appear in the program, so there are no semantic
  difficulties.  The ability to predict branch direction based on recent
  behavior reduces the possibility that speculative work is wasted.

Micro-operations:
  An instruction is decoded into one or more micro-operations, which are
  then matched with available data and sent to the execution units.  The
  number of micro-operations executed per cycle is a measure of the internal
  parallelism in the processor.  On the Pentium Pro, up to five micro-ops
  can be in execution in one cycle, and up to three retired per cycle.
  Register renaming is used to compensate for the relatively small number
  of architectural registers.  Up to three micro-ops can be renamed in
  one cycle; there are 40 physical registers available.

L1 and L2 cache:
  The L1 d-cache can execute one load and one store operation per cycle.
  Buffering between the processor and the cache helps to smooth out delays.
			size (KB)	set associativity
			L1 i	L1 d	L1 i	L1 d	L2
	Pentium Pro	8	8	4	2	4
	Pentium II	16	16
  The size of the L2 cache can be determined from the CPUID instruction
  (try 'rabbit -p').  A larger L2 cache tends to reduce the average number
  of cycles per instruction, and definitely increases the cost of the
  processor.

Data Bus Transactions:  On the L1 side of L2, these events record the
  action taken by the processor that caused a cache line to be copied.
  See the next group for the memory side of L2.
  2e  requests
      28  instruction fetch	L2 to L1 i-cache
      29  data load		L2 to L1 d-cache
      2a  data store		L1 d-cache to L2
  These event count relations should hold:
    2e = 28 + 29 + 2a
  These are the only events affected by the --mesi option.

  The data bus is 8 bytes wide, while a cache line is 32 bytes wide.
  On the Pentium Pro, the transaction timing is 4-1-1-1 at the processor
  clock rate; that is, 4 cycles for the first 8 bytes, followed by 1
  cycle for each of the remaining three 8-byte transfers.
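
  As a small worked example of the 4-1-1-1 timing, the following sketch
  adds up the cycles needed to move one cache line over the data bus.
  The names are illustrative only.

```python
# Data bus timing for one cache line, in processor cycles (Pentium Pro).
DATA_BUS_TIMING = (4, 1, 1, 1)   # one entry per 8-byte transfer
BUS_WIDTH_BYTES = 8


def line_transfer_cycles(timing=DATA_BUS_TIMING):
    """Total cycles to move one cache line with the given burst timing."""
    return sum(timing)


def line_bytes(timing=DATA_BUS_TIMING, width=BUS_WIDTH_BYTES):
    """Bytes moved by the burst: one bus-width transfer per timing entry."""
    return len(timing) * width


print(line_transfer_cycles(), "cycles for", line_bytes(), "bytes")
```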

External Bus Transactions:  As seen by the event counters,
  70  all
      6c  i/o
      6f  memory
          6e  burst
	      65  burst read
	      66  read for ownership
	      67  writeback
	      68  instruction fetch
          6b  partial
	      6a  partial write
          69  invalidate
      6d  deferred
  These event count relations should hold:
    6e = 65 + 66 + 67 + 68
    6f = 6e + 6b + 69
    70 = 6c + 6f + 6d
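
  A minimal sketch of how these relations might be checked against
  measured counts.  The dictionary of counts and the tolerance parameter
  are assumptions: counts collected in separate runs need not agree
  exactly, so a relative tolerance is allowed.

```python
# Event count relations for external bus transactions, keyed by event code.
RELATIONS = {
    0x6E: (0x65, 0x66, 0x67, 0x68),   # burst
    0x6F: (0x6E, 0x6B, 0x69),         # memory
    0x70: (0x6C, 0x6F, 0x6D),         # all
}


def check_relations(counts, tolerance=0.0):
    """Return (event, measured, expected_sum) for each relation that fails.

    counts maps an event code to its measured count (hypothetical input).
    tolerance is a relative error allowed between the two sides.
    """
    failures = []
    for parent, children in RELATIONS.items():
        total = sum(counts[c] for c in children)
        if abs(counts[parent] - total) > tolerance * max(counts[parent], 1):
            failures.append((parent, counts[parent], total))
    return failures


# Made-up counts, for illustration only.
sample = {0x65: 100, 0x66: 20, 0x67: 30, 0x68: 50, 0x6E: 200,
          0x6B: 10, 0x69: 5, 0x6F: 215, 0x6C: 3, 0x6D: 2, 0x70: 220}
print(check_relations(sample))
```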

  The external bus is 8 bytes wide.  With the Pentium Pro, it operates
  at 60 or 66 MHz; with the Celeron, Pentium II and Pentium II Xeon
  processors, it operates at 66 or 100 MHz.  With the Pentium III, it
  operates at 100 MHz.

  On the Pentium Pro, up to 8 outstanding bus transactions are allowed.
  Event 60 will count some of these transactions.  The transaction timing
  depends on the degree of memory interleaving.  On the Pentium Pro, it
  can be 14-1-1-1, 14-2-2-2, or 14-4-4-4 for 4-, 2- or 1-way interleaving,
  counted in bus cycles.
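
  The timings above can be summed to give the bus cycles for one cache
  line, and converted to processor cycles with the 3:1 clock ratio of the
  200 MHz / 66 MHz example given under Terms.  This is only a sketch; it
  ignores any overlap among outstanding transactions, and the names are
  illustrative.

```python
# External bus burst timing on the Pentium Pro, in bus cycles,
# keyed by the degree of memory interleaving.
INTERLEAVE_TIMING = {
    4: (14, 1, 1, 1),
    2: (14, 2, 2, 2),
    1: (14, 4, 4, 4),
}


def line_fill_bus_cycles(ways):
    """Bus cycles to move one 32-byte line with the given interleaving."""
    return sum(INTERLEAVE_TIMING[ways])


def to_processor_cycles(bus_cycles, clock_ratio=3):
    """Convert bus cycles to processor cycles (ratio 3 assumes 200/66 MHz)."""
    return bus_cycles * clock_ratio


for ways in (4, 2, 1):
    bus = line_fill_bus_cycles(ways)
    print(ways, "way:", bus, "bus cycles,", to_processor_cycles(bus),
          "processor cycles")
```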

Subset Relations:
  The first event is included in the second event:
	85	81
	46	45
	23	22
	25	24
	27	26

Miss Relations:
  The first event is a miss directly following the second event:
	81	80	L1 instruction cache
	45	43	L1 data cache
	68	28	L2, instruction fetch
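
  The subset and miss relations lend themselves to a quick sanity check
  and a miss-ratio calculation.  The sketch below assumes the counts are
  available in a dictionary keyed by event code; the function names and
  the sample numbers are made up.

```python
# (first, second): the first event is included in the second event.
SUBSETS = [(0x85, 0x81), (0x46, 0x45), (0x23, 0x22), (0x25, 0x24),
           (0x27, 0x26)]

# name: (miss event, request event)
MISS_PAIRS = {
    "L1 instruction cache": (0x81, 0x80),
    "L1 data cache": (0x45, 0x43),
    "L2, instruction fetch": (0x68, 0x28),
}


def subset_violations(counts):
    """Pairs where the included event was counted more often than its parent."""
    return [(a, b) for a, b in SUBSETS
            if a in counts and b in counts and counts[a] > counts[b]]


def miss_ratios(counts):
    """Miss ratio (misses / requests) for each pair that was measured."""
    return {name: counts[miss] / counts[req]
            for name, (miss, req) in MISS_PAIRS.items()
            if miss in counts and counts.get(req)}


# Made-up counts, for illustration only.
sample = {0x80: 1000, 0x81: 50, 0x85: 5, 0x43: 2000, 0x45: 100,
          0x46: 40, 0x28: 50, 0x68: 10}
print(subset_violations(sample))
print(miss_ratios(sample))
```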

Events or cycles.
  These events may be configured (with --duration) to count events (--d 0)
  or cycles (--d 1):
    04, 14, 22, 23, 48, 60, 61, 62, 63, 64, 79, 7a, 7b, 7e, 86, 87, a2,
    c6, c7, d2.
  However, there is usually no difference in the results.  Bummer.



At each cycle,
  clocks
    79	cpu_clk_unhalted
    	"Number of cycles during which the processor is not halted."
	The halt state is typically used in the OS idle loop.
    a2	resource_stalls
    	"Number of cycles during which there are resource-related stalls."
	Other operations could occur when one operation is stalled.  Resources
	include "register renaming or reorder buffer entries, memory buffer
	entries, and execution units." [1]
    d2	partial_rat_stalls
    	"Number of cycles or events for partial stalls."
	RAT = register alias table, used for renaming.  A partial register
	is a sub-register such as AX, which is part of EAX.

An instruction fetch will induce these events:
  memory access
    80	ifu_ifetch
    	processor <-- L1 i-cache.
    	"Number of instruction fetches, both cacheable and non-cacheable."
	Includes speculative fetches for instructions decoded but not executed,
	or decoded and executed but not retired.  The processor will attempt
	one fetch of 16 bytes each active cycle.
    81	ifu_ifetch_miss
	L1 i-cache <-- L2 cache.
    	"Number of instruction fetch misses."
	Caused by an L1 i-cache miss.
    85	itlb_miss
	L1 i-cache <-- L2 cache.
    	"Number of instruction TLB misses."
	Caused by an L1 i-cache TLB miss (which would also imply an L1 i-cache
	miss).
    28	l2_ifetch
	L1 i-cache <-- L2 cache.
    	"Number of L2 instruction fetches." (--mesi)
	Caused by an L1 i-cache miss.
	[What is the relation between events 81 and 28?]
    68	bus_tran_ifetch
	L2 cache <-- external bus (memory).
    	"Number of instruction fetch transactions." (--bus_agent)
	Caused by an L2 cache miss.
  processor pipelines
    86	ifu_mem_stall
	processor <-- L1 i-cache <-- L2 cache.
    	"Number of cycles that the instruction fetch pipe stage is stalled,
	including cache misses, ITLB misses, ITLB faults, and victim cache
	evictions."
	The victim cache holds recently removed TLB entries.
    87	ild_stall
    	"Number of cycles that the instruction length decoder is stalled."
	This pipe stage is immediately downstream from the fetch pipe stage.
	[What is the longest legal instruction, in bytes?]
  instruction decode and retire unit
    d0	inst_decoder
    	"Number of instructions decoded."
	Up to three instructions can be decoded in one cycle, generating up
	to 6 micro-operations in one cycle (one instruction up to 4 uops,
	and two instructions at 1 uop each).  A microcode sequencer is used
	for complex operations.  The decode can stall if its output queue
	(6 uops) is not drained by the 5 execution units.
	The number of uops per instruction (except for the complex cases)
	for the Pentium Pro or Pentium II/III is given in reference [2] or
	[3], Appendix C.
  branch unit
    e0	br_inst_decoded
    	"Number of branch instructions decoded."
    e2	btb_misses
    	"Number of branches that miss the branch target buffer."
	The BTB holds information about recently executed branches, and the
	direction actually taken.  It is used for branch prediction and
	speculative execution; on the Pentium Pro, it has 512 entries.
    e4	br_bogus
    	"Number of bogus branches."
	[Is this event bogus?]
    e6	baclears
    	"Number of times BACLEAR is asserted." (static branch prediction)
  floating-point execution unit
    10	fp_comp_ops_exe
    	"Number of computational floating-point operations executed."
	(counter 0 only)
	This includes integer multiply.
    11	fp_assist
    	"Number of floating-point exception cases handled by microcode."
	(counter 1 only)
    12	mul
    	"Number of multiplies."  (counter 1 only)
	This includes integer multiply.
    13	div
    	"Number of divides."  (counter 1 only)
    14	cycles_div_busy
    	"Number of cycles during which the divider is busy."  (counter 0 only)
	On the Pentium II, 35 cycles per divide.
  MMX execution unit (Pentium II only)
    b0	MMX_instr_exec
    	"Number of MMX instructions executed."
    b1	MMX_sat_instr_exec
    	"Number of MMX saturating arithmetic instructions executed."
    b2	MMX_uops_exec
    	"Number of MMX micro-operations executed." (--MMX)
	On port 0..3, requested by the unit mask, bitwise.
    b3	MMX_instr_type_exec
    	Number of MMX instructions executed (--MMX).
	Requested by the unit mask, bitwise:
	   1	packed multiply
	   2	packed shift
	   4	pack operation
	   8	unpack operation
	  10	packed logical
	  20	packed arithmetic
    cc	fp_MMX_trans
    	Number of transitions between FP and MMX states (--MMX).
	Requested by the unit mask:
	   0	MMX to FP
	   1	FP to MMX
    cd	MMX_assist
    	"Number of MMX assists (that is, the number of EMMS instructions
	executed)."
    ce	MMX_instr_ret
    	"Number of MMX instructions retired."
    cf	PII_0xCF
    	MMX, saturated arithmetic instructions retired (undocumented)
  Streaming SIMD execution unit (Pentium III only)
    d8	emon_sse_inst_retired
	"Number of Streaming SIMD Extensions instructions retired."  (--MMX)
	Requested by the unit mask, bitwise:
	  0	packed and scalar
	  1	scalar
    d9	emon_sse_comp_inst_ret
	"Number of Streaming SIMD Extensions computational instructions retired."
	(--MMX)
	Unit mask as above.
	
  instruction decode and retire unit
    c0	inst_retired
    	"Number of instructions retired."
    c2	uops_retired
    	"Number of micro-operations retired."
	Up to 3 per cycle.
    c1	flops
    	"Number of computational floating-point operations retired."
	(counter 0 only)
	This does not include integer multiply (refer to events 10, 12).
  branch unit
    c4	br_inst_retired
    	"Number of branch instructions retired."
    c5	br_miss_pred_retired
    	"Number of mispredicted branches retired."
	"Mispredicted branches are not recognized for about 10 to 15 cycles." [1]
    c9	br_taken_retired
    	"Number of taken branches retired."
    ca	br_miss_pred_taken_ret
    	"Number of taken mispredicted branches retired."

A data memory access (load or store) will induce these events:
  memory access
    43	data_mem_refs
    	processor <--> L1 d-cache.
	"All memory references, both cacheable and non-cacheable."
	Includes speculative loads for instructions executed but not retired.
	There are no speculative stores, but there may be a speculative load
	for ownership.  The number of bytes referenced is not indicated, and
	must be deduced by other means.
	The Pentium Pro load latency on an L1 d-cache hit is 3 cycles.
	[In C, how many references are caused by the assignments (q = p),
	(q = *p), (p = *p) or (p = p->next), if p and q are not initially in
	registers, and the types are reasonable?]
    05	misalign_mem_ref
    	"Number of misaligned data memory references."
	A compiler will usually align static or global data to the appropriate
	boundary, but local (runtime stack) double precision data is easily
	misaligned.  Misaligned data takes longer to access than aligned data.

A data load will further induce these events:
  memory access
    45	dcu_lines_in
    	L1 d-cache <-- L2 cache.
	"Total lines allocated in the Data Cache Unit."
	Caused by an L1 d-cache miss (if the reference was a hit, the line
	would already be allocated).  Includes misses on speculative loads
	for instructions executed but not retired.  The cache line state is
	E (exclusive) or S (shared).  The bytes in the cache line are the
	same as the corresponding bytes in memory, or in another cache.
	Two successive misses to one cache line will allocate the line only
	once; the second miss appears to be a hit, but may still suffer an
	access delay.  The L1 d-cache miss ratio, (event 45) / (event 43),
	can thus be underestimated.  Conversely, a reference that misses the
	cache and crosses a cache line boundary will allocate two lines, so
	the miss ratio can be overestimated.  It is necessary to examine some
	other event counts and the number of cycles to decide if the miss
	ratio has been estimated reliably.
    48	dcu_miss_outstanding
	L1 d-cache <-- L2 cache.
    	"Weighted number of cycles while a DCU miss is outstanding."
	This event has the unit miss-cycles.  That is, in each cycle the
	number of unsatisfied data references is added to the event counter.
	The aggregate (over all levels of memory) average miss penalty is
	(event 48) / (event 45) cycles [explanation follows], except that
	"An access that also misses the L2 is shortchanged by 2 cycles
	(i.e., if it counts for N cycles, it should be for N+2 cycles).
	Subsequent loads to the same cache line will not result in any
	additional counts.  Count value not precise, but still useful."
	"The L1 data cache can accept a new load or store every cycle, and
	has a latency of three cycles for loads.  It can handle as many as
	four simultaneously outstanding misses." [1]
    29	l2_ld
	L1 d-cache <-- L2 cache.
    	"Number of L2 data loads." (--mesi)
	Caused by an L1 d-cache miss.
	[What is the relation between events 45 and 29?]
    60	bus_req_outstanding
    	"Number of bus requests outstanding."
	"Counts only DCU full-line cacheable reads, not RFOs, writes,
	instruction fetches, or anything else.  Counts 'waiting for bus
	to complete' (last data chunk received)."
	[Can we use the ratio (event 60) / (event 29) to obtain the L2
	miss penalty?]

A data store will further induce these events:
  load for ownership
    45	dcu_lines_in
	(see above)
	The bytes in the cache line are notionally the same as the
	corresponding bytes in memory, or in another cache, but this is only
	a temporary situation until the state is changed to M and the store
	is executed.
    46	dcu_m_lines_in
	L1 d-cache <-- L2 cache.
	"Number of M-state lines allocated in the Data Cache Unit."
	Caused by an L1 d-cache store miss.  Included in event 45.  The cache
	line state is M (modified), and thus exclusive.  Copies of the same
	line in another cache are invalidated; the copy in memory is not yet
	changed.
	[What is the relation between events 45, 46 and 29?  Is it always
	45 = 46 + 29 ?]
    48	dcu_miss_outstanding
    	(see above)
  store
    03	ld_blocks
    	"Number of (times the) store buffer blocks."
    04	sb_drains
    	"Number of store buffer drain cycles."
    47	dcu_m_lines_out
	L1 d-cache --> L2 cache.
	"Number of M-state lines evicted from the Data Cache Unit.  This
	includes evictions via snoop HITM, intervention, or replacement."
	That is, caused by normal replacement of a cache line (when the L1
	d-cache is full and another line is to be brought in), explicit
	write-back (for example, a cache flush), or a snoop hit by another
	processor claiming the line in modified state.
	[What is the relation between events 46 and 47?]
    2a	l2_st
	L1 d-cache --> L2 cache.
    	"Number of L2 data stores." (--mesi)
	This should not miss the cache, because of the previous load for
	ownership.
	[What is the relation between events 47 and 2a?  Is it always
	47 = 2a + 69 ?]

A generic memory access (fetch, load or store) will induce these events:
  L2 cache (L1 side)
    2e	l2_rqsts
	L1 (d or i)-cache <--> L2 cache.
    	"Number of L2 cache requests." (--mesi)
	Caused by an L1 cache miss, instruction or data; includes events 28, 29
	and 2a.  The request refers to a whole cache line.
    22	l2_dbus_busy
	L1 (d or i)-cache <--> L2 cache.
    	"Number of cycles during which the data bus was busy."
	Caused by an L1 cache miss.
	Each cache line should take 7 cycles to move, with the 4-1-1-1 timing.
	[What is the L1 cache-miss penalty?]
    23	l2_dbus_busy_rd
	L1 (d or i)-cache <-- L2 cache.
    	"Number of cycles during which the data bus was busy transferring
	data from the L2 cache to the processor."
	Caused by an L1 cache miss.  Included in event 22.
	[Does this include instruction fetches?]
  L2 cache (memory side)
    21	l2_ads
    	"Number of L2 cache address strobes."
	[By whom?]
    24	l2_lines_in
	L2 cache <-- external bus (memory).
    	"Number of lines allocated in the L2 cache."
	Caused by an L2 cache miss.
	cf. events 45, 28, 29.
    25	l2_m_lines_inm
	L2 cache <-- external bus (memory).
    	"Number of modified lines allocated in the L2 cache."
	Caused by an L2 cache store miss.  Included in event 24.
	cf. event 46.
    26	l2_lines_out
	L2 cache --> external bus (memory).
    	"Number of lines removed from the L2 cache for any reason."
	[An unmodified cache line could simply be marked invalid, and not
	be written back to memory.  What actually happens?]
    27	l2_m_lines_outm
	L2 cache --> external bus (memory).
    	"Number of modified lines removed from the L2 cache for any reason."
	Included in event 26.
	cf. event 47.
  external bus transaction
    65	bus_tran_brd
	L2 cache <-- external bus (memory).
    	"Number of burst read transactions." (--bus_agent)
	A burst will move a cache line (32 bytes) in four data transfers
	(8 bytes each).
	[What is the relation between events 65 and 24?]
    66	bus_tran_rfo
    	L2 cache <-- external bus (memory).
	"Number of read-for-ownership transactions." (--bus_agent)
    67	bus_trans_wb
	L2 cache --> external bus (memory).
    	"Number of write-back transactions." (--bus_agent)
    68	bus_tran_ifetch
	L2 cache <-- external bus (memory).
    	"Number of instruction fetch transactions." (--bus_agent)
    6e	bus_tran_burst
	L2 cache <--> external bus (memory).
    	"Number of burst transactions." (--bus_agent)
	Includes events 65, 66, 67, 68.
    6a	bus_tran_pwr
	L2 cache --> external bus (memory).
    	"Number of partial write transactions." (--bus_agent)
	Included in event 6b.
    6b	bus_trans_p
	L2 cache <--> external bus (memory).
    	"Number of partial transactions." (--bus_agent)
	A partial transaction moves less than a full cache line.
    69	bus_tran_inval
    	"Number of invalidate transactions." (--bus_agent)
    6f	bus_tran_mem
	L2 cache <--> external bus (memory).
    	"Number of memory transactions." (--bus_agent)
	Includes events 6e, 6b, 69.
    6c	bus_trans_io
    	"Number of I/O transactions." (--bus_agent)
    6d	bus_tran_def
    	"Number of deferred transactions." (--bus_agent)
    70	bus_tran_any
    	"Number of all transactions." (--bus_agent)
	Includes events 6f, 6c, 6d.

An external bus transaction will occupy the bus for some number of cycles:
    62	bus_drdy_clocks
	"Number of clocks during which DRDY (the data ready signal) is
	asserted."
	"--bus_agent 0 counts bus clocks when this processor is driving DRDY;
	--bus_agent 1 counts processor clocks when any agent is driving DRDY."
	The DRDY signal is asserted by the bus agent driving the data to
	indicate that the data is valid during the data phase of a transaction.
	Wait states can be induced by not asserting DRDY during this phase.
	On a single-processor system, the other bus agents would be memory
	and i/o devices.
    64	bus_data_rcv
    	"Number of bus clock cycles during which this processor is receiving
	data."
	[Does this include instruction fetches?]
	[What does (event 64) / (event 60) represent?]
    61	bus_bnr_drv
    	"Number of bus clock cycles during which this processor is driving
	the BNR pin."  (BNR = Block Next Request)
    63	bus_lock_clocks
    	"Number of processor clocks during which the LOCK signal is asserted."
	(--bus_agent)
    7a	bus_hit_drv
    	"Number of bus clock cycles during which this processor is driving
	the HIT pin.  Includes cycles due to snoop stalls."
    7b	bus_hitm_drv
    	"Number of bus clock cycles during which this processor is driving
	the HITM pin.  Includes cycles due to snoop stalls."
    7e	bus_snoop_stall
    	"Number of bus clock cycles during which the bus is snoop stalled."
	This should only happen on a multiprocessor.

An interrupt will induce these events:
    c8	hw_int_rx
    	"Number of hardware interrupts received."
	On most personal computers, this will be about 100 per second, from
	the real-time clock.
    c6	cycles_int_masked
    	"Number of processor cycles for which interrupts are disabled."
    c7	cycles_int_pending_and_masked
    	"Number of processor cycles for which interrupts are disabled
	and interrupts are pending."

A memory-management operation will induce these events:
  segment register
    06	segment_reg_loads
    	"Number of segment register loads."
    d4	seg_rename_stalls
    	"Number of segment register renaming stalls." (--MMX)
	Pentium II only.  Use the unit mask to specify segment registers
	ES, DS, FS, GS, by or-ing 1, 2, 4, 8.
    d5	seg_reg_renames
    	"Number of segment register renames." (--MMX)
	Pentium II, as above.
    d6	ret_seg_renames
    	"Number of segment register rename events retired."
	Pentium II only.
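
  The unit mask for events d4 and d5 is built by or-ing the values given
  above.  A minimal sketch follows; the helper name is hypothetical, and
  how the resulting value is passed to rabbit is not shown here.

```python
# Unit mask bits for the segment register events (d4, d5).
SEGMENT_MASK_BITS = {"ES": 1, "DS": 2, "FS": 4, "GS": 8}


def segment_unit_mask(*registers):
    """Or together the mask bits for the named segment registers."""
    mask = 0
    for name in registers:
        mask |= SEGMENT_MASK_BITS[name.upper()]
    return mask


print(hex(segment_unit_mask("ES", "GS")))
print(hex(segment_unit_mask("ES", "DS", "FS", "GS")))
```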

A prefetch operation will induce these events (Pentium III only):
    07	emon_sse_pre_dispatched
	"Number of prefetch/weakly-ordered instructions dispatched;
	speculative prefetches are included in counting."  (--MMX)
	Requested by the unit mask, bitwise:
	  0	prefetch NTA
	  1	prefetch T0
	  2	prefetch T1, prefetch T2
	  3	weakly ordered stores
    4b	emon_sse_pre_miss
	"Number of prefetch/weakly-ordered instructions that miss
	all caches."  (--MMX)
	Unit mask as above.

These events report a hazardous condition:
    52	PII_0x52
	self-modifying code sequences (undocumented)


Explanations.

  The miss penalty for data references.

  Over a period of N processor cycles, for large N, how many Level 1
  data-cache misses will have been outstanding, and for how long on
  average?  Let
    N = processor cycles = (actual cycles, but we could also use event 79),
    r = data references / cycle = (event 43) / (event 79),
    m = L1 d-cache misses per reference = (event 45) / (event 43),
    p = aggregate average miss penalty (cycles).
  Some of the L1 d-cache misses will also miss L2 or cause a page fault,
  so p is not the L1 miss penalty alone.  Because p is determined by an
  experiment with one program or section of code, this estimate might
  not apply to another program or section of code.  Nevertheless, here
  we go:  In N cycles, there will be Nr data references.  Among these
  Nr references, there will be (Nr)m misses.  Event 48, over N cycles,
  will count (Nrm)p miss-cycles.  Thus,
    (event 48)
    = N * r * m * p
    = (event 79) * (event 43 / event 79) * (event 45 / event 43) * p,
  so p = (event 48) / (event 45).
  [Why was it necessary to assume a measurement over a large number of
  cycles, when the number of cycles does not appear directly in the final
  equation?  Are there any other tacit assumptions in this analysis?]
  [The processor can only hold a certain number of outstanding misses
  in its buffer.  How should this number enter into the analysis?]
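
  The derivation above reduces to a one-line calculation.  The sketch
  below computes r, m and p from a dictionary of counts and confirms that
  N * r * m * p reproduces event 48.  The sample counts are made up, and
  the function name is illustrative.

```python
def miss_penalty_terms(counts):
    """Return (r, m, p) as defined above, from counts keyed by event code."""
    n_cycles = counts[0x79]                 # N, using event 79
    r = counts[0x43] / n_cycles             # data references per cycle
    m = counts[0x45] / counts[0x43]         # L1 d-cache misses per reference
    p = counts[0x48] / counts[0x45]         # aggregate average miss penalty
    return r, m, p


# Made-up counts, for illustration only.
sample = {0x79: 1_000_000, 0x43: 300_000, 0x45: 6_000, 0x48: 120_000}
r, m, p = miss_penalty_terms(sample)
print("r =", r, " m =", m, " p =", p, "cycles")
print("N*r*m*p =", sample[0x79] * r * m * p)
```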

  [Suppose a miss is outstanding for t cycles, and there is a limit
  of b outstanding misses; further misses are delayed in the processor.
  If n misses occur in rapid succession, the last outstanding miss is
  cleared after t*n/b + b-1 cycles; for simplicity, assume that n is a
  multiple of b.  Event 48 will count
    b(b-1)/2		{ramp up over b-1 cycles}
    + b(t-b+1)		{steady state for first b misses}
    + b*t*(n-b)/b	{steady state for n-b misses; clear b in t cycles}
    + b(b-1)/2		{ramp down over b-1 cycles}
    = nt miss-cycles.
  Event 45 will count n misses, and the ratio is correct, subject to
  the inaccuracies described earlier.]
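
  The counting argument can be checked with a small cycle-by-cycle
  simulation.  This is a sketch under simplifying assumptions that are
  not in the text: at most one new miss is issued per cycle, every miss
  is outstanding for exactly t cycles, and t >= b.

```python
def simulate_misses(n, t, b):
    """Simulate n misses, each outstanding t cycles, at most b at a time.

    Returns (miss_cycles, last_clear): the weighted count that event 48
    would accumulate, and the cycle at which the last miss is cleared.
    """
    outstanding = []      # completion cycle of each outstanding miss
    issued = 0
    miss_cycles = 0
    last_clear = 0
    cycle = 0
    while issued < n or outstanding:
        # Misses whose completion cycle has arrived are cleared.
        outstanding = [done for done in outstanding if done > cycle]
        # Issue one new miss if there is a free buffer entry.
        if issued < n and len(outstanding) < b:
            outstanding.append(cycle + t)
            last_clear = max(last_clear, cycle + t)
            issued += 1
        # Event 48 adds the number of outstanding misses in each cycle.
        miss_cycles += len(outstanding)
        cycle += 1
    return miss_cycles, last_clear


n, t, b = 8, 10, 4
count, last = simulate_misses(n, t, b)
print(count, "miss-cycles; compare n*t =", n * t)
print("last miss cleared at cycle", last,
      "; compare t*n/b + b-1 =", t * n // b + b - 1)
```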


Preventing Out-of-order Execution For Precise Measurement:

  The micro-operations, and thus the instructions, will execute when their
  data is available.  There is a class of serializing instructions that
  will not execute until all prior instructions have completed, and will
  prevent subsequent instructions from moving ahead.  The CPUID instruction
  is the only serializing instruction available in user mode.  If the
  rdtsc and rdpmc instructions (to read the cycle and event counters) have
  been inlined, then serialization is required for precise measurement.
  Consider the sequence
	instr 1		# end of measured code
	rdtsc		# read the Time Stamp Counter
	instr 2		# start of unmeasured code
  To prevent rdtsc from moving ahead of instr 1, and measuring too little,
  and to prevent instr 2 from moving ahead of rdtsc, and measuring too
  much, the code can be written as
	instr 1		# end of measured code
	eax = 0		# operand for cpuid
	cpuid		# writes to eax, ebx, ecx, edx
	rdtsc		# writes to eax, edx
	instr 2		# start of unmeasured code
  and the overhead for the two additional instructions should be subtracted
  from the measured time.  The overhead should be an estimate of the least
  possible cycle or event count, to avoid producing a negative result for
  the measurement.
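
  The overhead correction can be sketched as follows.  Python cannot
  issue cpuid or rdtsc directly, so time.perf_counter_ns stands in for
  the counter read; the function names are hypothetical.  The overhead is
  taken as the minimum over many back-to-back reads, following the advice
  to use the least possible count.

```python
import time


def estimate_overhead(read_counter=time.perf_counter_ns, trials=1000):
    """Smallest difference seen between two back-to-back counter reads."""
    best = None
    for _ in range(trials):
        start = read_counter()
        stop = read_counter()
        delta = stop - start
        if best is None or delta < best:
            best = delta
    return best


def measure(func, overhead, read_counter=time.perf_counter_ns):
    """Count for one call of func, with the read overhead subtracted."""
    start = read_counter()
    func()
    stop = read_counter()
    return (stop - start) - overhead


overhead = estimate_overhead()
elapsed = measure(lambda: sum(range(10000)), overhead)
print("overhead:", overhead, "ns; measured:", elapsed, "ns")
```

  Using the minimum rather than the average keeps the correction from
  exceeding the true cost of reading the counter.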


Measurements with rabbit.

  The program should have consumed a large fraction of the CPU time;
  if not, the test should be repeated when the system is otherwise idle.
  Use the '--os 0' option to remove kernel-mode events, but not kernel-
  mode cycles.  Use the '-s 1' option to reduce the sampling overhead
  when averages are the main interest.  Use the '-S 2' option to see the
  most measurement detail.

  1. average cycles per instruction retired
     average cycles per micro-operation retired
     micro-operations per instruction

	rabbit -S 2 --e 0xc0,0xc2 -s 1 --os 0 foo

     The measurements are reported as 'Events per cycle', so they must
     be inverted to get 'cycles per instruction'.
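
     The inversion can be sketched as below, assuming the two 'Events
     per cycle' values for events c0 and c2 have been read from the
     rabbit output by hand.  The function name and the sample values are
     made up.

```python
def derived_rates(instr_per_cycle, uops_per_cycle):
    """Invert events-per-cycle values for events c0 and c2."""
    return {
        "cycles_per_instruction": 1.0 / instr_per_cycle,
        "cycles_per_uop": 1.0 / uops_per_cycle,
        "uops_per_instruction": uops_per_cycle / instr_per_cycle,
    }


# Made-up values, for illustration only.
for name, value in derived_rates(0.5, 1.0).items():
    print(name, "=", value)
```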


Further Topics.

  For which events can there be more than one event per cycle?
  [rabbit --compare gt0,le0 --e 0xd0 -s 1 foo]
  [rabbit --compare gt1,le1 --e 0xd0 -s 1 foo]
  [rabbit --compare gt2,le2 --e 0xd0 -s 1 foo]
  [rabbit --compare gt3,le3 --e 0xd0 -s 1 foo]
  [rabbit --compare gt4,le4 --e 0xd0 -s 1 foo]

  d0  decode up to 3 instructions per cycle
  c2  retire up to 3 micro-operations per cycle
  48  up to 4 outstanding misses (L1 d-cache) in a cycle
  ...


Performance-Monitoring Counters Library, for Intel Processors and Linux
Author: Don Heller, dheller@scl.ameslab.gov
Last revised: 2 August 2000