Add AMD Bergamo to Perfmon
TomTheBear committed Nov 8, 2024
1 parent 49eb72d commit d827ec6
Showing 25 changed files with 1,978 additions and 2 deletions.
32 changes: 32 additions & 0 deletions groups/zen4c/BRANCH.txt
@@ -0,0 +1,32 @@
SHORT Branch prediction miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_BRANCH_INSTR
PMC3 RETIRED_MISP_BRANCH_INSTR

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Branch rate PMC2/PMC0
Branch misprediction rate PMC3/PMC0
Branch misprediction ratio PMC3/PMC2
Instructions per branch PMC0/PMC2

LONG
Formulas:
Branch rate = RETIRED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction rate = RETIRED_MISP_BRANCH_INSTR/RETIRED_INSTRUCTIONS
Branch misprediction ratio = RETIRED_MISP_BRANCH_INSTR/RETIRED_BRANCH_INSTR
Instructions per branch = RETIRED_INSTRUCTIONS/RETIRED_BRANCH_INSTR
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio directly relates the
number of mispredicted branches to the number of all branch instructions.
Instructions per branch is 1/branch rate.
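
As a rough illustration of the formulas above, the following Python sketch derives the four branch metrics from raw event counts. The function name and the sample counts are invented for this example; they are not part of LIKWID or of this commit.

```python
def branch_metrics(retired_instr, retired_branches, retired_misp_branches):
    """Derive the BRANCH group metrics from three raw event counts."""
    return {
        # Branches per retired instruction.
        "branch_rate": retired_branches / retired_instr,
        # Mispredicted branches per retired instruction.
        "branch_misprediction_rate": retired_misp_branches / retired_instr,
        # Fraction of all branches that were mispredicted.
        "branch_misprediction_ratio": retired_misp_branches / retired_branches,
        # Reciprocal of the branch rate.
        "instructions_per_branch": retired_instr / retired_branches,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
m = branch_metrics(1_000_000, 200_000, 4_000)
for name, value in m.items():
    print(f"{name}: {value}")
```

The rate and the ratio differ only in the denominator: the rate is normalized to all instructions, the ratio to branch instructions only.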

39 changes: 39 additions & 0 deletions groups/zen4c/CACHE.txt
@@ -0,0 +1,39 @@
SHORT Data cache miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 DATA_CACHE_ACCESSES
PMC3 ANY_DATA_CACHE_FILLS_ALL

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
data cache requests PMC2
data cache request rate PMC2/PMC0
data cache misses PMC3
data cache miss rate PMC3/PMC0
data cache miss ratio PMC3/PMC2

LONG
Formulas:
data cache requests = DATA_CACHE_ACCESSES
data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
data cache misses = ANY_DATA_CACHE_FILLS_ALL
data cache miss rate = ANY_DATA_CACHE_FILLS_ALL / RETIRED_INSTRUCTIONS
data cache miss ratio = ANY_DATA_CACHE_FILLS_ALL / DATA_CACHE_ACCESSES
-
This group measures the locality of your data accesses with regard to the
L1 cache. Data cache request rate tells you how data intensive your code is
or how many data accesses you have on average per instruction.
The data cache miss rate measures how often it was necessary to get
cache lines from higher levels of the memory hierarchy. Finally, the
data cache miss ratio tells you how many of your memory references required
a cache line to be loaded from a higher level. While the data cache miss rate
might be given by your algorithm, you should try to get the data cache miss ratio
as low as possible by increasing your cache reuse.

22 changes: 22 additions & 0 deletions groups/zen4c/CLOCK.txt
@@ -0,0 +1,22 @@
SHORT Cycles per instruction

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PWR1 RAPL_PKG_ENERGY

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Energy [J] PWR1
Power [W] PWR1/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
Power [W] = RAPL_PKG_ENERGY / time
-
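
The clock and power formulas can be checked by hand with a small Python sketch. It assumes that `inverseClock` is the reciprocal of the nominal core clock in Hz, which is how the metric formula reads; the function names and all sample values are hypothetical.

```python
def clock_mhz(actual_cpu_clock, max_cpu_clock, inverse_clock):
    """Clock [MHz] = 1.E-06 * (FIXC1 / FIXC2) / inverseClock."""
    return 1.0e-6 * (actual_cpu_clock / max_cpu_clock) / inverse_clock


def power_watts(energy_joules, runtime_s):
    """Power [W] = energy [J] / time [s]."""
    return energy_joules / runtime_s


# Assumed nominal clock of 2.25 GHz, so inverseClock = 1 / 2.25e9 s.
inverse_clock = 1.0 / 2.25e9

# The core ran 20% faster than nominal during the measurement.
mhz = clock_mhz(actual_cpu_clock=1.2e9, max_cpu_clock=1.0e9,
                inverse_clock=inverse_clock)

# 150 J consumed by the package over a 2 s run.
watts = power_watts(energy_joules=150.0, runtime_s=2.0)

print(f"clock: {mhz:.1f} MHz, power: {watts:.1f} W")
```

The ratio FIXC1/FIXC2 scales the nominal clock, so a value above 1 indicates that the core ran above its nominal frequency on average.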
30 changes: 30 additions & 0 deletions groups/zen4c/CPI.txt
@@ -0,0 +1,30 @@
SHORT Cycles per instruction

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_UOPS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] PMC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
CPI (based on uops) PMC1/PMC2
IPC PMC0/PMC1


LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
CPI (based on uops) = CPU_CLOCKS_UNHALTED/RETIRED_UOPS
IPC = RETIRED_INSTRUCTIONS/CPU_CLOCKS_UNHALTED
-
This group measures how efficiently the processor works with
regard to instruction throughput. Also important as a standalone
metric is RETIRED_INSTRUCTIONS, as it tells you how many instructions
you need to execute for a task. An optimization might show very
low CPI values but execute many more instructions to achieve them.
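
The caveat about low CPI values can be made concrete with a short Python sketch. The two code variants and their counter values are invented for illustration only.

```python
def cpi(cycles, instructions):
    """CPI = CPU_CLOCKS_UNHALTED / RETIRED_INSTRUCTIONS."""
    return cycles / instructions


def ipc(cycles, instructions):
    """IPC = RETIRED_INSTRUCTIONS / CPU_CLOCKS_UNHALTED."""
    return instructions / cycles


# Hypothetical measurements of two implementations of the same task.
baseline = {"cycles": 2_000_000_000, "instructions": 1_000_000_000}
variant = {"cycles": 3_000_000_000, "instructions": 4_000_000_000}

baseline_cpi = cpi(**baseline)
variant_cpi = cpi(**variant)

# The variant has the better CPI but still needs more cycles overall,
# because it executes four times as many instructions.
variant_is_slower = variant["cycles"] > baseline["cycles"]
print(baseline_cpi, variant_cpi, variant_is_slower)
```

In this made-up case the variant looks better by CPI alone, yet it takes 50% more cycles, which is why the instruction count should be read alongside CPI.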

24 changes: 24 additions & 0 deletions groups/zen4c/DATA.txt
@@ -0,0 +1,24 @@
SHORT Load to store ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 LS_DISPATCH_LOADS
PMC3 LS_DISPATCH_STORES
PMC4 LS_DISPATCH_LOAD_OP_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Load to store ratio (PMC2+PMC4)/(PMC3+PMC4)

LONG
Formulas:
Load to store ratio = (LS_DISPATCH_LOADS+LS_DISPATCH_LOAD_OP_STORES)/(LS_DISPATCH_STORES+LS_DISPATCH_LOAD_OP_STORES)
-
This is a simple metric to determine your load to store ratio.
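
A minimal Python sketch of the ratio follows, assuming (as the formula suggests) that a load-op-store dispatch counts once as a load and once as a store. The function name and the counts are hypothetical.

```python
def load_to_store_ratio(loads, stores, load_op_stores):
    """(LS_DISPATCH_LOADS + LS_DISPATCH_LOAD_OP_STORES) /
    (LS_DISPATCH_STORES + LS_DISPATCH_LOAD_OP_STORES)."""
    # Load-op-store dispatches contribute to both numerator and denominator.
    return (loads + load_op_stores) / (stores + load_op_stores)


# Hypothetical dispatch counts.
ratio = load_to_store_ratio(loads=300, stores=100, load_op_stores=50)
print(f"load to store ratio: {ratio:.3f}")
```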

25 changes: 25 additions & 0 deletions groups/zen4c/DIVIDE.txt
@@ -0,0 +1,25 @@
SHORT Divide unit information

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 DIV_OP_COUNT
PMC3 DIV_BUSY_CYCLES


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Number of divide ops PMC2
Avg. divide unit usage duration PMC3/PMC2

LONG
Formulas:
Number of divide ops = DIV_OP_COUNT
Avg. divide unit usage duration = DIV_BUSY_CYCLES/DIV_OP_COUNT
--
This performance group measures the average latency of divide operations
32 changes: 32 additions & 0 deletions groups/zen4c/ENERGY.txt
@@ -0,0 +1,32 @@
SHORT Power and Energy consumption

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PWR0 RAPL_CORE_ENERGY
PWR2 RAPL_L3_ENERGY



METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
Energy Core [J] PWR0
Power Core [W] PWR0/time
Energy L3 [J] PWR2
Power L3 [W] PWR2/time

LONG
Formulas:
Power Core [W] = RAPL_CORE_ENERGY/time
Power L3 [W] = RAPL_L3_ENERGY/time
-
Ryzen implements the RAPL interface previously introduced by Intel.
This interface enables monitoring of the energy consumed in the core and L3
domains.
AMD does not document which parts of the CPU belong to which domain.

28 changes: 28 additions & 0 deletions groups/zen4c/FLOPS_DP.txt
@@ -0,0 +1,28 @@
SHORT Double Precision MFLOP/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_SSE_AVX_FLOPS_ALL
PMC3 MERGE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
DP [MFLOP/s] 1.0E-06*(PMC2)/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_ALL)/time
-
Profiling group to measure (double-precision) FLOP rate. The event might
have a higher per-cycle increment than 15, so the MERGE event is required. In
contrast to AMD Zen, the Zen2 microarchitecture does not provide events to
differentiate between single- and double-precision.


28 changes: 28 additions & 0 deletions groups/zen4c/FLOPS_SP.txt
@@ -0,0 +1,28 @@
SHORT Single Precision MFLOP/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 RETIRED_SSE_AVX_FLOPS_ALL
PMC3 MERGE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
SP [MFLOP/s] 1.0E-06*(PMC2)/time

LONG
Formulas:
CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_ALL)/time
-
Profiling group to measure (single-precision) FLOP rate. The event might
have a higher per-cycle increment than 15, so the MERGE event is required. In
contrast to AMD Zen, the Zen2 microarchitecture does not provide events to
differentiate between single- and double-precision.


28 changes: 28 additions & 0 deletions groups/zen4c/ICACHE.txt
@@ -0,0 +1,28 @@
SHORT Instruction cache miss rate/ratio

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 ICACHE_FETCHES
PMC2 ICACHE_L2_REFILLS
PMC3 ICACHE_SYSTEM_REFILLS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/PMC0
L1I request rate PMC1/PMC0
L1I miss rate (PMC2+PMC3)/PMC0
L1I miss ratio (PMC2+PMC3)/PMC1

LONG
Formulas:
L1I request rate = ICACHE_FETCHES / RETIRED_INSTRUCTIONS
L1I miss rate = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/RETIRED_INSTRUCTIONS
L1I miss ratio = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/ICACHE_FETCHES
-
This group measures the locality of your instruction code with regard to the
L1 I-Cache.

29 changes: 29 additions & 0 deletions groups/zen4c/L2.txt
@@ -0,0 +1,29 @@
SHORT L2 cache bandwidth in MBytes/s (experimental)

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 REQUESTS_TO_L2_GRP1_ALL_NO_PF
PMC3 L2_PF_HIT_IN_L2_ALLPREF

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L2 bandwidth [MBytes/s] 1.0E-06*(PMC2)*64.0/time
L2 data volume [GBytes] 1.0E-09*(PMC2)*64.0
Prefetch bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
Prefetch data volume [GBytes] 1.0E-09*(PMC3)*64.0

LONG
Formulas:
L2 bandwidth [MBytes/s] = 1.0E-06*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64/time
L2 data volume [GBytes] = 1.0E-09*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64
Prefetch bandwidth [MBytes/s] = 1.0E-06*(L2_PF_HIT_IN_L2_ALLPREF)*64/time
Prefetch data volume [GBytes] = 1.0E-09*(L2_PF_HIT_IN_L2_ALLPREF)*64
-
Profiling group to measure L2 cache load bandwidth including prefetchers.
There are no events to count stores.
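
The bandwidth and data volume formulas multiply a request count by the 64-byte factor used in the metrics. The Python sketch below shows the arithmetic; the function names, request count, and runtime are hypothetical.

```python
CACHE_LINE_BYTES = 64.0  # factor used in the group's metric formulas


def bandwidth_mbytes_per_s(requests, runtime_s):
    """1.0E-06 * requests * 64 / time."""
    return 1.0e-6 * requests * CACHE_LINE_BYTES / runtime_s


def data_volume_gbytes(requests):
    """1.0E-09 * requests * 64."""
    return 1.0e-9 * requests * CACHE_LINE_BYTES


# Hypothetical measurement: one million L2 requests in half a second.
requests = 1_000_000
runtime_s = 0.5

bw = bandwidth_mbytes_per_s(requests, runtime_s)
vol = data_volume_gbytes(requests)
print(f"L2 bandwidth: {bw:.1f} MBytes/s, L2 data volume: {vol:.3f} GBytes")
```

The same two helpers apply to the prefetch metrics by passing the prefetch hit count instead of the request count.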
40 changes: 40 additions & 0 deletions groups/zen4c/L2CACHE.txt
@@ -0,0 +1,40 @@
SHORT L2 cache miss rate/ratio (experimental)

EVENTSET
PMC0 REQUESTS_TO_L2_GRP1_ALL_NO_PF
PMC1 L2_PF_HIT_IN_L2
PMC2 L2_PF_HIT_IN_L3
PMC3 L2_PF_MISS_IN_L3
PMC4 CORE_TO_L2_CACHE_REQUESTS_HITS
PMC5 RETIRED_INSTRUCTIONS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
L2 request rate (PMC0+PMC1+PMC2+PMC3)/PMC5
L2 miss rate ((PMC0+PMC1+PMC2+PMC3)-(PMC4+PMC1))/PMC5
L2 miss ratio ((PMC0+PMC1+PMC2+PMC3)-(PMC4+PMC1))/(PMC0+PMC1+PMC2+PMC3)
L2 accesses (PMC0+PMC1+PMC2+PMC3)
L2 hits (PMC4+PMC1)
L2 misses (PMC0+PMC1+PMC2+PMC3)-(PMC4+PMC1)

LONG
Formulas:
L2 request rate = (REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)/RETIRED_INSTRUCTIONS
L2 miss rate = ((REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)-(CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2))/RETIRED_INSTRUCTIONS
L2 miss ratio = ((REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)-(CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2))/(REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)
L2 accesses = (REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)
L2 hits = CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2
L2 misses = (REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)-(CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2)
-
This group measures the locality of your data accesses with regard to the
L2 cache. L2 request rate tells you how data intensive your code is
or how many L2 accesses you have on average per instruction.
The L2 miss rate measures how often it was necessary to get
cache lines from a higher level (L3 or memory). Finally, the L2 miss ratio tells
you how many of your L2 accesses required a cache line to be loaded from a higher level.
While the L2 miss rate might be given by your algorithm, you should
try to get the L2 miss ratio as low as possible by increasing your cache reuse.
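
Because the L2 metrics combine five events, a small Python sketch may help in following the formulas. The function name, the dictionary keys, and the counts are invented for illustration and are not part of LIKWID.

```python
def l2_cache_metrics(requests_no_pf, pf_hit_l2, pf_hit_l3, pf_miss_l3,
                     core_hits, retired_instr):
    """Derive the L2CACHE group metrics from raw event counts."""
    # Demand requests plus all prefetch requests, wherever they hit.
    accesses = requests_no_pf + pf_hit_l2 + pf_hit_l3 + pf_miss_l3
    # Demand hits plus prefetches that hit in L2.
    hits = core_hits + pf_hit_l2
    misses = accesses - hits
    return {
        "accesses": accesses,
        "hits": hits,
        "misses": misses,
        "request_rate": accesses / retired_instr,
        "miss_rate": misses / retired_instr,
        "miss_ratio": misses / accesses,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
l2 = l2_cache_metrics(requests_no_pf=1000, pf_hit_l2=200, pf_hit_l3=100,
                      pf_miss_l3=50, core_hits=800, retired_instr=10_000)
for name, value in l2.items():
    print(f"{name}: {value}")
```

Note that prefetches hitting in L2 appear in both the access and the hit totals, so they cancel out of the miss count.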



26 changes: 26 additions & 0 deletions groups/zen4c/L3.txt
@@ -0,0 +1,26 @@
SHORT L3 cache bandwidth in MBytes/s

EVENTSET
FIXC1 ACTUAL_CPU_CLOCK
FIXC2 MAX_CPU_CLOCK
PMC0 RETIRED_INSTRUCTIONS
PMC1 CPU_CLOCKS_UNHALTED
PMC2 L2_PF_HIT_IN_L3
PMC3 L2_PF_MISS_IN_L3
PMC4 L2_CACHE_MISS_AFTER_L1_MISS

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI PMC1/PMC0
L3 bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3+PMC4)*64.0/time
L3 data volume [GBytes] 1.0E-09*(PMC2+PMC3+PMC4)*64.0

LONG
Formulas:
L3 bandwidth [MBytes/s] = 1.0E-06*(L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3+L2_CACHE_MISS_AFTER_L1_MISS)*64.0/time
L3 data volume [GBytes] = 1.0E-09*(L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3+L2_CACHE_MISS_AFTER_L1_MISS)*64.0
--
Profiling group to measure L3 cache bandwidth. It measures only loads from L3.
There is no performance event to measure the stores to L3.