Add AMD Bergamo to Perfmon

breiters · Nov 8, 2024 · d827ec6
1 parent 49eb72d
Showing 25 changed files with 1,978 additions and 2 deletions.
diff --git a/groups/zen4c/BRANCH.txt b/groups/zen4c/BRANCH.txt
@@ -0,0 +1,32 @@
+SHORT Branch prediction miss rate/ratio
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  RETIRED_BRANCH_INSTR
+PMC3  RETIRED_MISP_BRANCH_INSTR
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   PMC1/PMC0
+Branch rate   PMC2/PMC0
+Branch misprediction rate  PMC3/PMC0
+Branch misprediction ratio  PMC3/PMC2
+Instructions per branch  PMC0/PMC2
+
+LONG
+Formulas:
+Branch rate = RETIRED_BRANCH_INSTR/RETIRED_INSTRUCTIONS
+Branch misprediction rate = RETIRED_MISP_BRANCH_INSTR/RETIRED_INSTRUCTIONS
+Branch misprediction ratio = RETIRED_MISP_BRANCH_INSTR/RETIRED_BRANCH_INSTR
+Instructions per branch = RETIRED_INSTRUCTIONS/RETIRED_BRANCH_INSTR
+-
+The rates state how often, on average, a branch or a mispredicted branch
+occurred per retired instruction. The branch misprediction ratio states
+directly what fraction of all branch instructions were mispredicted.
+Instructions per branch is 1/branch rate.
+
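The four BRANCH formulas can be checked by hand with a small sketch. The counter values below are hypothetical and the function name `branch_metrics` is illustrative only; it is not part of LIKWID.

```python
def branch_metrics(retired_instructions, retired_branches, retired_mispredicted):
    """Evaluate the BRANCH group formulas from raw counter values."""
    return {
        # RETIRED_BRANCH_INSTR / RETIRED_INSTRUCTIONS
        "branch_rate": retired_branches / retired_instructions,
        # RETIRED_MISP_BRANCH_INSTR / RETIRED_INSTRUCTIONS
        "misprediction_rate": retired_mispredicted / retired_instructions,
        # RETIRED_MISP_BRANCH_INSTR / RETIRED_BRANCH_INSTR
        "misprediction_ratio": retired_mispredicted / retired_branches,
        # RETIRED_INSTRUCTIONS / RETIRED_BRANCH_INSTR
        "instructions_per_branch": retired_instructions / retired_branches,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = branch_metrics(1_000_000, 200_000, 10_000)
for name, value in m.items():
    print(f"{name}: {value}")
```

With these made-up counts, one in five instructions is a branch and one in twenty branches is mispredicted, which shows why the misprediction rate (per instruction) is always smaller than or equal to the misprediction ratio (per branch).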
diff --git a/groups/zen4c/CACHE.txt b/groups/zen4c/CACHE.txt
@@ -0,0 +1,39 @@
+SHORT Data cache miss rate/ratio
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  DATA_CACHE_ACCESSES
+PMC3  ANY_DATA_CACHE_FILLS_ALL
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   PMC1/PMC0
+data cache requests PMC2
+data cache request rate PMC2/PMC0
+data cache misses PMC3
+data cache miss rate PMC3/PMC0
+data cache miss ratio PMC3/PMC2
+
+LONG
+Formulas:
+data cache requests = DATA_CACHE_ACCESSES
+data cache request rate = DATA_CACHE_ACCESSES / RETIRED_INSTRUCTIONS
+data cache misses = ANY_DATA_CACHE_FILLS_ALL
+data cache miss rate = ANY_DATA_CACHE_FILLS_ALL / RETIRED_INSTRUCTIONS
+data cache miss ratio = ANY_DATA_CACHE_FILLS_ALL / DATA_CACHE_ACCESSES
+-
+This group measures the locality of your data accesses with regard to the
+L1 cache. Data cache request rate tells you how data intensive your code is
+or how many data accesses you have on average per instruction.
+The data cache miss rate gives a measure of how often it was necessary to get
+cache lines from higher levels of the memory hierarchy. And finally
+data cache miss ratio tells you how many of your memory references required
+a cache line to be loaded from a higher level. While the data cache miss rate
+might be given by your algorithm, you should try to get data cache miss ratio
+as low as possible by increasing your cache reuse.
+
diff --git a/groups/zen4c/CLOCK.txt b/groups/zen4c/CLOCK.txt
@@ -0,0 +1,22 @@
+SHORT  Cycles per instruction
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PWR1  RAPL_PKG_ENERGY
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   PMC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI PMC1/PMC0
+Energy [J]  PWR1
+Power [W] PWR1/time
+
+LONG
+Formulas:
+CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
+Power [W] = RAPL_PKG_ENERGY / time
+-
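The `Clock [MHz]` and `Power [W]` metrics of the CLOCK group can be reproduced with a few lines of arithmetic. This is a minimal sketch under one assumption that the diff itself does not state: `inverseClock` is taken to be the reciprocal of the nominal clock in Hz. All input values are hypothetical.

```python
def clock_mhz(actual_cycles, max_cycles, inverse_clock):
    """Clock [MHz] = 1.E-06*(FIXC1/FIXC2)/inverseClock."""
    return 1.0e-06 * (actual_cycles / max_cycles) / inverse_clock


def power_watts(energy_joules, runtime_seconds):
    """Power [W] = RAPL_PKG_ENERGY / time."""
    return energy_joules / runtime_seconds


# Assumption: a 2.0 GHz nominal clock, so inverseClock = 1 / 2.0e9 seconds.
inverse_clock = 1.0 / 2.0e9

# Hypothetical readings: FIXC1 (ACTUAL_CPU_CLOCK) and FIXC2 (MAX_CPU_CLOCK).
freq = clock_mhz(3.0e9, 2.0e9, inverse_clock)

# Hypothetical readings: 500 J of package energy over a 2 s run.
power = power_watts(500.0, 2.0)

print(f"{freq:.1f} MHz, {power:.1f} W")
```

The ratio FIXC1/FIXC2 scales the nominal clock, so a ratio of 1.5 on a 2.0 GHz part corresponds to an effective 3.0 GHz in this sketch.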
diff --git a/groups/zen4c/CPI.txt b/groups/zen4c/CPI.txt
@@ -0,0 +1,30 @@
+SHORT  Cycles per instruction
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  RETIRED_UOPS
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   PMC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI PMC1/PMC0
+CPI (based on uops)   PMC1/PMC2
+IPC PMC0/PMC1
+
+
+LONG
+Formulas:
+CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
+CPI (based on uops) = CPU_CLOCKS_UNHALTED/RETIRED_UOPS
+IPC = RETIRED_INSTRUCTIONS/CPU_CLOCKS_UNHALTED
+-
+This group measures how efficiently the processor works with
+regard to instruction throughput. Also important as a standalone
+metric is RETIRED_INSTRUCTIONS as it tells you how many instructions
+you need to execute for a task. An optimization might show very
+low CPI values but execute many more instructions for it.
+
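The caveat in the CPI description, that a lower CPI does not necessarily mean a faster run, follows from cycles = CPI × instructions. The sketch below compares two hypothetical code variants; the numbers are invented for illustration.

```python
def cycles(cpi, instructions):
    """Total unhalted cycles implied by a CPI value and an instruction count."""
    return cpi * instructions


# Hypothetical measurements of two variants of the same task.
baseline = {"cpi": 1.0, "instructions": 1_000_000_000}
rewritten = {"cpi": 0.5, "instructions": 3_000_000_000}

baseline_cycles = cycles(**baseline)
rewritten_cycles = cycles(**rewritten)

print(f"baseline:  CPI {baseline['cpi']}, cycles {baseline_cycles:.3e}")
print(f"rewritten: CPI {rewritten['cpi']}, cycles {rewritten_cycles:.3e}")
```

Here the rewritten variant halves the CPI but triples the instruction count, so it needs 1.5 times as many cycles as the baseline. Reading CPI together with RETIRED_INSTRUCTIONS avoids this trap.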
diff --git a/groups/zen4c/DATA.txt b/groups/zen4c/DATA.txt
@@ -0,0 +1,24 @@
+SHORT Load to store ratio
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  LS_DISPATCH_LOADS
+PMC3  LS_DISPATCH_STORES
+PMC4  LS_DISPATCH_LOAD_OP_STORES
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   PMC1/PMC0
+Load to store ratio (PMC2+PMC4)/(PMC3+PMC4)
+
+LONG
+Formulas:
+Load to store ratio = (LS_DISPATCH_LOADS+LS_DISPATCH_LOAD_OP_STORES)/(LS_DISPATCH_STORES+LS_DISPATCH_LOAD_OP_STORES)
+-
+This is a simple metric to determine your load to store ratio.
+
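In the DATA formula, load-op-store instructions count as both a load and a store, so they appear in the numerator and the denominator. A minimal sketch with hypothetical counter values (the function name is illustrative):

```python
def load_to_store_ratio(loads, stores, load_op_stores):
    """(LS_DISPATCH_LOADS + LS_DISPATCH_LOAD_OP_STORES) /
    (LS_DISPATCH_STORES + LS_DISPATCH_LOAD_OP_STORES)"""
    return (loads + load_op_stores) / (stores + load_op_stores)


# Hypothetical counts: 300 pure loads, 100 pure stores, 50 load-op-stores.
ratio = load_to_store_ratio(300, 100, 50)
print(f"load to store ratio: {ratio:.3f}")
```

Without the load-op-store term, the same counts would give a ratio of 3.0; including it on both sides pulls the ratio toward 1.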
diff --git a/groups/zen4c/DIVIDE.txt b/groups/zen4c/DIVIDE.txt
@@ -0,0 +1,25 @@
+SHORT Divide unit information
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  DIV_OP_COUNT
+PMC3  DIV_BUSY_CYCLES
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI PMC1/PMC0
+Number of divide ops PMC2
+Avg. divide unit usage duration PMC3/PMC2
+
+LONG
+Formulas:
+Number of divide ops = DIV_OP_COUNT
+Avg. divide unit usage duration = DIV_BUSY_CYCLES/DIV_OP_COUNT
+--
+This performance group measures the average latency of divide operations.
diff --git a/groups/zen4c/ENERGY.txt b/groups/zen4c/ENERGY.txt
@@ -0,0 +1,32 @@
+SHORT Power and Energy consumption
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PWR0  RAPL_CORE_ENERGY
+PWR2  RAPL_L3_ENERGY
+
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   PMC1/PMC0
+Energy Core [J]  PWR0
+Power Core [W] PWR0/time
+Energy L3 [J]  PWR2
+Power L3 [W] PWR2/time
+
+LONG
+Formulas:
+Power Core [W] = RAPL_CORE_ENERGY/time
+Power L3 [W] = RAPL_L3_ENERGY/time
+-
+Ryzen implements the RAPL interface previously introduced by Intel.
+This interface allows monitoring the energy consumed in the core and L3
+domains.
+AMD does not document which parts of the CPU belong to which domain.
+
diff --git a/groups/zen4c/FLOPS_DP.txt b/groups/zen4c/FLOPS_DP.txt
@@ -0,0 +1,28 @@
+SHORT Double Precision MFLOP/s
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  RETIRED_SSE_AVX_FLOPS_ALL
+PMC3  MERGE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   PMC1/PMC0
+DP [MFLOP/s]   1.0E-06*(PMC2)/time
+
+LONG
+Formulas:
+CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
+DP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_ALL)/time
+-
+Profiling group to measure (double-precision) FLOP rate. The event might
+have a higher per-cycle increment than 15, so the MERGE event is required. In
+contrast to AMD Zen, the Zen2 microarchitecture does not provide events to
+differentiate between single- and double-precision.
+
+
diff --git a/groups/zen4c/FLOPS_SP.txt b/groups/zen4c/FLOPS_SP.txt
@@ -0,0 +1,28 @@
+SHORT Single Precision MFLOP/s
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  RETIRED_SSE_AVX_FLOPS_ALL
+PMC3  MERGE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   PMC1/PMC0
+SP [MFLOP/s]   1.0E-06*(PMC2)/time
+
+LONG
+Formulas:
+CPI = CPU_CLOCKS_UNHALTED/RETIRED_INSTRUCTIONS
+SP [MFLOP/s] = 1.0E-06*(RETIRED_SSE_AVX_FLOPS_ALL)/time
+-
+Profiling group to measure (single-precision) FLOP rate. The event might
+have a higher per-cycle increment than 15, so the MERGE event is required. In
+contrast to AMD Zen, the Zen2 microarchitecture does not provide events to
+differentiate between single- and double-precision.
+
+
diff --git a/groups/zen4c/ICACHE.txt b/groups/zen4c/ICACHE.txt
@@ -0,0 +1,28 @@
+SHORT Instruction cache miss rate/ratio
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  ICACHE_FETCHES
+PMC2  ICACHE_L2_REFILLS
+PMC3  ICACHE_SYSTEM_REFILLS
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s]   FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI   FIXC1/PMC0
+L1I request rate   PMC1/PMC0
+L1I miss rate    (PMC2+PMC3)/PMC0
+L1I miss ratio   (PMC2+PMC3)/PMC1
+
+LONG
+Formulas:
+L1I request rate = ICACHE_FETCHES / RETIRED_INSTRUCTIONS
+L1I miss rate = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/RETIRED_INSTRUCTIONS
+L1I miss ratio = (ICACHE_L2_REFILLS + ICACHE_SYSTEM_REFILLS)/ICACHE_FETCHES
+-
+This group measures the locality of your instruction code with regard to the
+L1 I-Cache.
+
diff --git a/groups/zen4c/L2.txt b/groups/zen4c/L2.txt
@@ -0,0 +1,29 @@
+SHORT L2 cache bandwidth in MBytes/s (experimental)
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  REQUESTS_TO_L2_GRP1_ALL_NO_PF
+PMC3  L2_PF_HIT_IN_L2_ALLPREF
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  PMC1/PMC0
+L2 bandwidth [MBytes/s] 1.0E-06*(PMC2)*64.0/time
+L2 data volume [GBytes] 1.0E-09*(PMC2)*64.0
+Prefetch bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
+Prefetch data volume [GBytes] 1.0E-09*(PMC3)*64.0
+
+LONG
+Formulas:
+L2 bandwidth [MBytes/s] = 1.0E-06*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64/time
+L2 data volume [GBytes] = 1.0E-09*(REQUESTS_TO_L2_GRP1_ALL_NO_PF)*64
+Prefetch bandwidth [MBytes/s] = 1.0E-06*(L2_PF_HIT_IN_L2_ALLPREF)*64/time
+Prefetch data volume [GBytes] = 1.0E-09*(L2_PF_HIT_IN_L2_ALLPREF)*64
+-
+Profiling group to measure L2 cache load bandwidth including prefetchers.
+There are no events to count stores.
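The L2 bandwidth and data volume metrics both multiply an event count by 64, which the formulas treat as the cache line size in bytes. The following sketch works through the unit conversion with hypothetical values; the constant and function names are illustrative.

```python
CACHE_LINE_BYTES = 64.0  # factor used in the group formulas


def data_volume_gbytes(cache_line_events):
    """L2 data volume [GBytes] = 1.0E-09 * events * 64."""
    return 1.0e-09 * cache_line_events * CACHE_LINE_BYTES


def bandwidth_mbytes_per_s(cache_line_events, runtime_seconds):
    """L2 bandwidth [MBytes/s] = 1.0E-06 * events * 64 / time."""
    return 1.0e-06 * cache_line_events * CACHE_LINE_BYTES / runtime_seconds


# Hypothetical reading: 1e8 REQUESTS_TO_L2_GRP1_ALL_NO_PF events in 2 seconds.
requests = 100_000_000
runtime = 2.0

volume = data_volume_gbytes(requests)
bandwidth = bandwidth_mbytes_per_s(requests, runtime)
print(f"{volume:.2f} GBytes, {bandwidth:.1f} MBytes/s")
```

The prefetch metrics use the same two conversions with L2_PF_HIT_IN_L2_ALLPREF as the event count.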
diff --git a/groups/zen4c/L2CACHE.txt b/groups/zen4c/L2CACHE.txt
@@ -0,0 +1,40 @@
+SHORT L2 cache miss rate/ratio (experimental)
+
+EVENTSET
+PMC0  REQUESTS_TO_L2_GRP1_ALL_NO_PF
+PMC1  L2_PF_HIT_IN_L2
+PMC2  L2_PF_HIT_IN_L3
+PMC3  L2_PF_MISS_IN_L3
+PMC4  CORE_TO_L2_CACHE_REQUESTS_HITS
+PMC5  RETIRED_INSTRUCTIONS
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+L2 request rate (PMC0+PMC1+PMC2+PMC3)/PMC5
+L2 miss rate ((PMC0+PMC1+PMC2+PMC3)-(PMC4+PMC1))/PMC5
+L2 miss ratio ((PMC0+PMC1+PMC2+PMC3)-(PMC4+PMC1))/(PMC0+PMC1+PMC2+PMC3)
+L2 accesses (PMC0+PMC1+PMC2+PMC3)
+L2 hits (PMC4+PMC1)
+L2 misses (PMC0+PMC1+PMC2+PMC3)-(PMC4+PMC1)
+
+LONG
+Formulas:
+L2 request rate = (REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)/RETIRED_INSTRUCTIONS
+L2 miss rate = ((REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)-(CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2))/RETIRED_INSTRUCTIONS
+L2 miss ratio = ((REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)-(CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2))/(REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)
+L2 accesses = (REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)
+L2 hits = CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2
+L2 misses = (REQUESTS_TO_L2_GRP1_ALL_NO_PF+L2_PF_HIT_IN_L2+L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3)-(CORE_TO_L2_CACHE_REQUESTS_HITS+L2_PF_HIT_IN_L2)
+-
+This group measures the locality of your data accesses with regard to the
+L2 cache. L2 request rate tells you how data intensive your code is
+or how many data accesses you have on average per instruction.
+The L2 miss rate gives a measure of how often it was necessary to get
+cache lines from memory. And finally L2 miss ratio tells you how many of your
+memory references required a cache line to be loaded from a higher level.
+While the L2 miss rate might be given by your algorithm, you should
+try to get the L2 miss ratio as low as possible by increasing your cache reuse.
+
+
+
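The L2CACHE metrics all derive from three intermediate sums: accesses, hits, and misses. The sketch below spells out that composition with hypothetical counter values, which makes it easier to see that prefetches that hit in L2 count toward both accesses and hits. The function name is illustrative only.

```python
def l2cache_metrics(requests_no_pf, pf_hit_l2, pf_hit_l3, pf_miss_l3,
                    core_hits, retired_instructions):
    """Evaluate the L2CACHE group formulas from raw counter values."""
    # Demand requests plus all L2 prefetcher outcomes.
    accesses = requests_no_pf + pf_hit_l2 + pf_hit_l3 + pf_miss_l3
    # Demand hits plus prefetches that were already present in L2.
    hits = core_hits + pf_hit_l2
    misses = accesses - hits
    return {
        "accesses": accesses,
        "hits": hits,
        "misses": misses,
        "request_rate": accesses / retired_instructions,
        "miss_rate": misses / retired_instructions,
        "miss_ratio": misses / accesses,
    }


# Hypothetical counts, chosen only to keep the sums easy to follow.
r = l2cache_metrics(
    requests_no_pf=1000, pf_hit_l2=200, pf_hit_l3=100, pf_miss_l3=50,
    core_hits=800, retired_instructions=10_000,
)
for name, value in r.items():
    print(f"{name}: {value}")
```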
diff --git a/groups/zen4c/L3.txt b/groups/zen4c/L3.txt
@@ -0,0 +1,26 @@
+SHORT L3 cache bandwidth in MBytes/s
+
+EVENTSET
+FIXC1 ACTUAL_CPU_CLOCK
+FIXC2 MAX_CPU_CLOCK
+PMC0  RETIRED_INSTRUCTIONS
+PMC1  CPU_CLOCKS_UNHALTED
+PMC2  L2_PF_HIT_IN_L3
+PMC3  L2_PF_MISS_IN_L3
+PMC4  L2_CACHE_MISS_AFTER_L1_MISS
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  PMC1/PMC0
+L3 bandwidth [MBytes/s] 1.0E-06*(PMC2+PMC3+PMC4)*64.0/time
+L3 data volume [GBytes] 1.0E-09*(PMC2+PMC3+PMC4)*64.0
+
+LONG
+Formulas:
+L3 bandwidth [MBytes/s] = 1.0E-06*(L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3+L2_CACHE_MISS_AFTER_L1_MISS)*64.0/time
+L3 data volume [GBytes] = 1.0E-09*(L2_PF_HIT_IN_L3+L2_PF_MISS_IN_L3+L2_CACHE_MISS_AFTER_L1_MISS)*64.0
+--
+Profiling group to measure L3 cache bandwidth. It measures only loads from L3.
+There is no performance event to measure the stores to L3.