forked from RRZE-HPC/likwid
Add support for Intel Granite Rapids and Sierra Forest (RRZE-HPC#639)
* Add support for Intel Granite Rapids and Sierra Forest
* Fix RAPL DRAM energy unit for SPR
* Improve error handling in likwid-sysfeatures
  likwid-sysfeatures now also queries the categories properly, so different categories with the same feature name do not conflict.
* Add missing includes to sysFeatures_common_rapl
  Due to the include order, the missing includes did not cause problems so far, but they would in future files and commits, so this must be fixed.
* Allow hexadecimal numbers in sysFeatures
  To give the user more flexibility when specifying numbers.
* Add missing (void) in sysFeatures_types.h
* Add AMD HSMP sysFeatures support
* Fix APIC mapping in sysFeatures_amd_hsmp
  Likwid and hwloc currently have an incomplete/wrong understanding of APIC IDs and blindly assume a mapping of Linux processor number to APIC ID. This is wrong. For example, AMD EPYC 9354 reports gaps and jumps in its ID order. This commit explicitly queries the APIC ID via CPUID in order to correctly map LikwidDevice_t to APIC IDs.
* Explicitly set DRAM energy unit on Sapphire Rapids
  While the power unit MSR appears to match the value specified in the Intel SDM, the SDM specifies that the energy unit is always 61 uJ. It does not mention reading the unit from the MSR, so we always assume it is 61 uJ.
* Restore old likwid-sysfeatures hwthread behavior
  Commit 8c49e8a introduced device type prefixes, which broke the old device/cpu list behavior of just specifying a range of hardware threads (e.g. 0-12). This commit restores this behavior when no device type prefix is specified. The only remaining difference is that higher level devices (e.g. cores, sockets, etc.) are not implicitly created.
* Fix missing include in sysFeatures_amd
* Finalize Intel Granite Rapids support
* Fix for Intel SPR TMA metrics
* New counter list for SPR
* Remove unneeded read in finalize of Intel SPR and GNR
* Add groups for GNR
* Add support for Intel Sierra Forest (core, uncore, energy)
* Add update of device location for SPR UPI and M3UPI units
  See https://github.com/torvalds/linux/blob/master/arch/x86/events/intel/uncore_snbep.c#L6591-L6640
* Add way to use unnamed perf uncore devices

---------

Co-authored-by: chriswasser <[email protected]>
Co-authored-by: Michael Panzlaff <[email protected]>
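The fixed DRAM energy unit on Sapphire Rapids mentioned in the commit message can be illustrated with a small conversion from raw counter values to joules. This is only a sketch in Python, not LIKWID's C implementation: the function name and the 32-bit wrap-around handling are assumptions made for illustration, and only the 61 uJ unit comes from the commit message.

```python
# Fixed DRAM energy unit, taken from the commit message (61 uJ per count).
DRAM_ENERGY_UNIT_J = 61e-6


def dram_energy_joules(raw_start: int, raw_stop: int, counter_bits: int = 32) -> float:
    """Convert two raw energy counter readings into joules.

    Hypothetical helper: the counter is assumed to be a free-running
    counter_bits wide value, so a wrap-around between the two readings
    is handled with modular arithmetic.
    """
    delta = (raw_stop - raw_start) % (1 << counter_bits)
    return delta * DRAM_ENERGY_UNIT_J


if __name__ == "__main__":
    print(dram_energy_joules(0, 1_000_000))
```

With a fixed unit there is nothing to read from the power unit MSR, which is the design choice the commit message describes.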
1 parent 9e14e9b, commit 638191a
Showing 55 changed files with 12,573 additions and 1,991 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
SHORT Branch prediction miss rate/ratio | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 BR_INST_RETIRED_ALL_BRANCHES | ||
PMC1 BR_MISP_RETIRED_ALL_BRANCHES | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Branch rate PMC0/FIXC0 | ||
Branch misprediction rate PMC1/FIXC0 | ||
Branch misprediction ratio PMC1/PMC0 | ||
Instructions per branch FIXC0/PMC0 | ||
|
||
LONG | ||
Formulas: | ||
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY | ||
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY | ||
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES | ||
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES | ||
- | ||
The rates state how often, on average, a branch or a mispredicted branch occurred | |
per retired instruction. The branch misprediction ratio states what fraction | |
of all branch instructions were mispredicted. | |
Instructions per branch is 1/branch rate. | |
|
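The four branch formulas above can be checked with a short calculation. The following Python sketch is purely illustrative: the function name and the sample counter values are made up, and only the formulas are taken from the group file.

```python
def branch_metrics(instr_retired: int, branches: int, mispredicted: int) -> dict:
    """Derive the branch metrics from three raw event counts.

    instr_retired -> INSTR_RETIRED_ANY
    branches      -> BR_INST_RETIRED_ALL_BRANCHES
    mispredicted  -> BR_MISP_RETIRED_ALL_BRANCHES
    """
    return {
        "branch_rate": branches / instr_retired,
        "misprediction_rate": mispredicted / instr_retired,
        "misprediction_ratio": mispredicted / branches,
        "instructions_per_branch": instr_retired / branches,
    }


if __name__ == "__main__":
    # Made-up counts: 1M instructions, 200k branches, 10k mispredictions.
    for name, value in branch_metrics(1_000_000, 200_000, 10_000).items():
        print(f"{name}: {value}")
```

Note that the misprediction rate is normalized to all instructions, while the misprediction ratio is normalized to branches only, so the ratio is always the larger of the two.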
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
SHORT Power and Energy consumption | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PWR0 PWR_PKG_ENERGY | ||
UBOX0 UNCORE_CLOCKTICKS | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
Uncore Clock [MHz] 1.E-06*UBOX0/time | ||
CPI FIXC1/FIXC0 | ||
Energy [J] PWR0 | ||
Power [W] PWR0/time | ||
|
||
LONG | ||
Formulas: | ||
Power = PWR_PKG_ENERGY / time | ||
Uncore Clock [MHz] = 1.E-06 * UNCORE_CLOCKTICKS / time | ||
- | ||
Icelake implements the RAPL interface. This interface makes it possible to | |
monitor the consumed energy at the package (socket) level. | |
|
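The clock and power metrics of this group follow directly from the raw counts. The sketch below restates them in Python under one assumption that is not spelled out in the group file: `inverseClock` is taken to be the reciprocal of the nominal (base) clock in Hz. All function names are hypothetical.

```python
def clock_mhz(core_cycles: float, ref_cycles: float, base_clock_hz: float) -> float:
    """Clock [MHz] = 1.E-06*(FIXC1/FIXC2)/inverseClock.

    Assumption: inverseClock = 1 / base_clock_hz.
    """
    inverse_clock = 1.0 / base_clock_hz
    return 1.0e-6 * (core_cycles / ref_cycles) / inverse_clock


def uncore_clock_mhz(uncore_ticks: float, time_s: float) -> float:
    """Uncore Clock [MHz] = 1.E-06*UNCORE_CLOCKTICKS/time."""
    return 1.0e-6 * uncore_ticks / time_s


def power_watts(energy_j: float, time_s: float) -> float:
    """Power [W] = PWR_PKG_ENERGY/time."""
    return energy_j / time_s


if __name__ == "__main__":
    # Made-up sample: the core ran 1.5x faster than the reference clock.
    print(clock_mhz(3.0e9, 2.0e9, 2.0e9))
    print(uncore_clock_mhz(2.4e9, 1.0))
    print(power_watts(150.0, 2.0))
```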
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
SHORT Cycle Activities | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 CYCLE_ACTIVITY_CYCLES_L2_MISS | ||
PMC1 CYCLE_ACTIVITY_CYCLES_MEM_ANY | ||
PMC2 CYCLE_ACTIVITY_CYCLES_L1D_MISS | ||
PMC3 CYCLE_ACTIVITY_STALLS_TOTAL | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Cycles without execution [%] (PMC3/FIXC1)*100 | ||
Cycles without execution due to L1D [%] (PMC2/FIXC1)*100 | ||
Cycles without execution due to L2 [%] (PMC0/FIXC1)*100 | ||
Cycles without execution due to memory loads [%] (PMC1/FIXC1)*100 | ||
|
||
LONG | ||
Formulas: | ||
Cycles without execution [%] = CYCLE_ACTIVITY_STALLS_TOTAL/CPU_CLK_UNHALTED_CORE*100 | ||
Cycles without execution due to L1D [%] = CYCLE_ACTIVITY_CYCLES_L1D_MISS/CPU_CLK_UNHALTED_CORE*100 | |
Cycles without execution due to L2 [%] = CYCLE_ACTIVITY_CYCLES_L2_MISS/CPU_CLK_UNHALTED_CORE*100 | |
Cycles without execution due to memory loads [%] = CYCLE_ACTIVITY_CYCLES_MEM_ANY/CPU_CLK_UNHALTED_CORE*100 | ||
-- | ||
This performance group measures the cycles while waiting for data from the cache | ||
and memory hierarchy. | ||
CYCLE_ACTIVITY_STALLS_TOTAL: Total execution stalls. | ||
CYCLE_ACTIVITY_CYCLES_L1D_MISS: Cycles while L1 cache miss demand load is | ||
outstanding. | ||
CYCLE_ACTIVITY_CYCLES_L2_MISS: Cycles while L2 cache miss demand load is | ||
outstanding. | ||
CYCLE_ACTIVITY_CYCLES_MEM_ANY: Cycles while memory subsystem has an | ||
outstanding load. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
SHORT Cycle Activities (Stalls) | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 CYCLE_ACTIVITY_STALLS_L2_MISS | ||
PMC2 CYCLE_ACTIVITY_STALLS_L1D_MISS | ||
PMC3 CYCLE_ACTIVITY_STALLS_TOTAL | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Total execution stalls PMC3 | ||
Stalls caused by L1D misses [%] (PMC2/PMC3)*100 | ||
Stalls caused by L2 misses [%] (PMC0/PMC3)*100 | ||
Execution stall rate [%] (PMC3/FIXC1)*100 | ||
Stalls caused by L1D misses rate [%] (PMC2/FIXC1)*100 | ||
Stalls caused by L2 misses rate [%] (PMC0/FIXC1)*100 | ||
|
||
LONG | ||
Formulas: | ||
Total execution stalls = CYCLE_ACTIVITY_STALLS_TOTAL | ||
Stalls caused by L1D misses [%] = (CYCLE_ACTIVITY_STALLS_L1D_MISS/CYCLE_ACTIVITY_STALLS_TOTAL)*100 | ||
Stalls caused by L2 misses [%] = (CYCLE_ACTIVITY_STALLS_L2_MISS/CYCLE_ACTIVITY_STALLS_TOTAL)*100 | ||
Execution stall rate [%] = (CYCLE_ACTIVITY_STALLS_TOTAL/CPU_CLK_UNHALTED_CORE)*100 | ||
Stalls caused by L1D misses rate [%] = (CYCLE_ACTIVITY_STALLS_L1D_MISS/CPU_CLK_UNHALTED_CORE)*100 | ||
Stalls caused by L2 misses rate [%] = (CYCLE_ACTIVITY_STALLS_L2_MISS/CPU_CLK_UNHALTED_CORE)*100 | ||
-- | ||
This performance group measures the stalls caused by data traffic in the cache | ||
hierarchy. | ||
CYCLE_ACTIVITY_STALLS_TOTAL: Total execution stalls. | ||
CYCLE_ACTIVITY_STALLS_L1D_MISS: Execution stalls while L1 cache miss demand | ||
load is outstanding. | ||
CYCLE_ACTIVITY_STALLS_L2_MISS: Execution stalls while L2 cache miss demand | ||
load is outstanding. |
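This group reports each stall source twice: once as a share of all execution stalls and once as a rate relative to all unhalted core cycles. A minimal Python sketch of both normalizations follows. The helper names are hypothetical, and the zero-denominator guard is an addition for illustration, not something the group file specifies.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage, or 0.0 if whole is zero (assumed guard)."""
    return 100.0 * part / whole if whole else 0.0


def stall_breakdown(core_cycles, stalls_total, stalls_l1d_miss, stalls_l2_miss) -> dict:
    """Compute the stall metrics from the raw counts.

    core_cycles     -> CPU_CLK_UNHALTED_CORE
    stalls_total    -> CYCLE_ACTIVITY_STALLS_TOTAL
    stalls_l1d_miss -> CYCLE_ACTIVITY_STALLS_L1D_MISS
    stalls_l2_miss  -> CYCLE_ACTIVITY_STALLS_L2_MISS
    """
    return {
        "l1d_share_of_stalls": percent(stalls_l1d_miss, stalls_total),
        "l2_share_of_stalls": percent(stalls_l2_miss, stalls_total),
        "stall_rate": percent(stalls_total, core_cycles),
        "l1d_stall_rate": percent(stalls_l1d_miss, core_cycles),
        "l2_stall_rate": percent(stalls_l2_miss, core_cycles),
    }


if __name__ == "__main__":
    # Made-up counts for illustration.
    print(stall_breakdown(1000, 400, 200, 100))
```

The share values tell which cache level dominates the stalls, while the rate values tell how much of the total runtime those stalls cost.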
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
SHORT Load to store ratio | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 MEM_INST_RETIRED_ALL_LOADS | ||
PMC1 MEM_INST_RETIRED_ALL_STORES | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Load to store ratio PMC0/PMC1 | ||
|
||
LONG | ||
Formulas: | ||
Load to store ratio = MEM_INST_RETIRED_ALL_LOADS/MEM_INST_RETIRED_ALL_STORES | ||
- | ||
This is a metric to determine your load to store ratio. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
SHORT Divide unit information | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 ARITH_DIV_COUNT | ||
PMC1 ARITH_DIV_ACTIVE | ||
|
||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Number of divide ops PMC0 | ||
Avg. divide unit usage duration PMC1/PMC0 | ||
|
||
LONG | ||
Formulas: | ||
Number of divide ops = ARITH_DIV_COUNT | ||
Avg. divide unit usage duration = ARITH_DIV_ACTIVE/ARITH_DIV_COUNT | ||
- | ||
This performance group measures the average latency of divide operations. |
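The average divide duration is the number of cycles the divide unit was active divided by the number of divide operations. A short Python sketch, with a hypothetical function name and an assumed guard for the case that no divide was executed:

```python
def avg_divide_duration(div_active_cycles: int, div_count: int) -> float:
    """Avg. divide unit usage duration = ARITH_DIV_ACTIVE/ARITH_DIV_COUNT.

    Returns 0.0 when no divide operation was counted (assumed guard,
    not specified by the group file).
    """
    return div_active_cycles / div_count if div_count else 0.0


if __name__ == "__main__":
    # Made-up counts: 1200 active cycles spread over 100 divides.
    print(avg_divide_duration(1200, 100))
```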
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
SHORT Power and Energy consumption | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
TMP0 TEMP_CORE | ||
PWR0 PWR_PKG_ENERGY | ||
PWR1 PWR_PP0_ENERGY | ||
PWR3 PWR_DRAM_ENERGY | ||
PWR4 PWR_PLATFORM_ENERGY | ||
UBOX0 UNCORE_CLOCKTICKS | ||
|
||
|
||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
Uncore Clock [MHz] 1.E-06*UBOX0/time | ||
CPI FIXC1/FIXC0 | ||
Temperature [C] TMP0 | ||
Energy [J] PWR0 | ||
Power [W] PWR0/time | ||
Energy PP0 [J] PWR1 | ||
Power PP0 [W] PWR1/time | ||
Energy DRAM [J] PWR3 | ||
Power DRAM [W] PWR3/time | ||
Energy PLATFORM [J] PWR4 | ||
Power PLATFORM [W] PWR4/time | ||
|
||
LONG | ||
Formulas: | ||
Power = PWR_PKG_ENERGY / time | ||
Power PP0 = PWR_PP0_ENERGY / time | ||
Power DRAM = PWR_DRAM_ENERGY / time | ||
Power PLATFORM = PWR_PLATFORM_ENERGY / time | ||
- | ||
Icelake implements the RAPL interface. This interface makes it possible to | |
monitor the consumed energy at the package (socket), PP0, DRAM and | |
platform level. | |
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
SHORT Packed AVX MFLOP/s | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE | ||
PMC1 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | ||
PMC2 FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE | ||
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
Packed SP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC2*16.0)/time | ||
Packed DP [MFLOP/s] 1.0E-06*(PMC1*4.0+PMC3*8.0)/time | ||
|
||
LONG | ||
Formulas: | ||
Packed SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime | ||
Packed DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime | ||
- | ||
Packed single and double precision FLOP rates for 256-bit and 512-bit AVX instructions. |
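The multipliers in the formulas (8 and 16 for single precision, 4 and 8 for double precision) are the number of elements that fit into a 256-bit or 512-bit vector register. The Python sketch below derives them from the element width instead of hard-coding them. The function name and sample counts are hypothetical.

```python
def packed_mflops(count_256b: int, count_512b: int, elem_bits: int, time_s: float) -> float:
    """Packed MFLOP/s from 256-bit and 512-bit packed instruction counts.

    elem_bits is 32 for single precision and 64 for double precision,
    which yields the 8/16 and 4/8 multipliers used in the group file.
    """
    flops = count_256b * (256 // elem_bits) + count_512b * (512 // elem_bits)
    return 1.0e-6 * flops / time_s


if __name__ == "__main__":
    # Made-up counts measured over one second.
    print(packed_mflops(1_000_000, 500_000, 32, 1.0))  # single precision
    print(packed_mflops(1_000_000, 500_000, 64, 1.0))  # double precision
```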
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
SHORT Double Precision MFLOP/s | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | ||
PMC1 FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | ||
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | ||
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0+PMC3*8.0)/time | ||
AVX DP [MFLOP/s] 1.0E-06*(PMC2*4.0+PMC3*8.0)/time | ||
AVX512 DP [MFLOP/s] 1.0E-06*(PMC3*8.0)/time | ||
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time | ||
Scalar [MUOPS/s] 1.0E-06*PMC1/time | ||
Vectorization ratio [%] 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3) | ||
|
||
LONG | ||
Formulas: | ||
DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime | ||
AVX DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime | ||
AVX512 DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime | ||
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/runtime | ||
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_DOUBLE/runtime | ||
Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/(FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE) | ||
- | ||
Scalar and packed (128-bit, 256-bit and 512-bit) double precision FLOP rates. | |
|
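The double precision FLOP rate weights each instruction count by its vector width, while the vectorization ratio compares unweighted instruction counts. A Python sketch of both, with a hypothetical function name and an assumed guard for an empty measurement:

```python
def dp_metrics(scalar: int, packed_128: int, packed_256: int, packed_512: int,
               time_s: float) -> dict:
    """Double precision MFLOP/s and vectorization ratio.

    The weights 1, 2, 4 and 8 are the numbers of 64-bit elements per
    scalar, 128-bit, 256-bit and 512-bit operation, as in the group file.
    """
    flops = scalar + 2 * packed_128 + 4 * packed_256 + 8 * packed_512
    packed = packed_128 + packed_256 + packed_512
    total = packed + scalar
    return {
        "mflops": 1.0e-6 * flops / time_s,
        # Assumed guard: report 0 % if no FP instruction was counted.
        "vectorization_pct": 100.0 * packed / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up counts measured over one second.
    print(dp_metrics(100, 100, 100, 100, 1.0))
```

Because the ratio counts instructions and not FLOPs, a code can show a modest vectorization ratio while most of its FLOPs still come from wide vector instructions.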
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
SHORT Single Precision MFLOP/s | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE | ||
PMC1 FP_ARITH_INST_RETIRED_SCALAR_SINGLE | ||
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE | ||
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
SP [MFLOP/s] 1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0+PMC3*16.0)/time | ||
AVX SP [MFLOP/s] 1.0E-06*(PMC2*8.0+PMC3*16.0)/time | ||
AVX512 SP [MFLOP/s] 1.0E-06*(PMC3*16.0)/time | ||
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time | ||
Scalar [MUOPS/s] 1.0E-06*PMC1/time | ||
Vectorization ratio [%] 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3) | ||
|
||
LONG | ||
Formulas: | ||
SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime | ||
AVX SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime | ||
AVX512 SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime | ||
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)/runtime | ||
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_SINGLE/runtime | ||
Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)/(FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE) | |
- | ||
Scalar and packed (128-bit, 256-bit and 512-bit) single precision FLOP rates. | |
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
SHORT L2 cache bandwidth in MBytes/s | ||
|
||
EVENTSET | ||
FIXC0 INSTR_RETIRED_ANY | ||
FIXC1 CPU_CLK_UNHALTED_CORE | ||
FIXC2 CPU_CLK_UNHALTED_REF | ||
FIXC3 TOPDOWN_SLOTS | ||
PMC0 L1D_REPLACEMENT | ||
PMC1 L2_TRANS_L1D_WB | ||
|
||
METRICS | ||
Runtime (RDTSC) [s] time | ||
Runtime unhalted [s] FIXC1*inverseClock | ||
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock | ||
CPI FIXC1/FIXC0 | ||
L2D load bandwidth [MBytes/s] 1.0E-06*PMC0*64.0/time | ||
L2D load data volume [GBytes] 1.0E-09*PMC0*64.0 | ||
L2D evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time | ||
L2D evict data volume [GBytes] 1.0E-09*PMC1*64.0 | ||
L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time | ||
L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0 | ||
|
||
LONG | ||
Formulas: | ||
L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64.0/time | ||
L2D load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64.0 | ||
L2D evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L1D_WB*64.0/time | ||
L2D evict data volume [GBytes] = 1.0E-09*L2_TRANS_L1D_WB*64.0 | ||
L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L2_TRANS_L1D_WB)*64/time | ||
L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L2_TRANS_L1D_WB)*64 | ||
- | ||
Profiling group to measure L2 cache bandwidth. The bandwidth is computed from the | |
number of cache lines allocated in the L1 and the number of modified cache lines | |
evicted from the L1. The group also outputs the total data volume transferred between | |
L2 and L1. Note that this bandwidth also includes data transfers due to a write | |
allocate load on a store miss in L1. It does not include data loaded into the L1 | |
instruction cache. | |
|
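Each counted event in this group corresponds to one 64-byte cache line moved between L1 and L2, which is where the factor 64.0 in the formulas comes from. The Python sketch below restates the bandwidth and volume computation. The function name and sample counts are hypothetical.

```python
CACHE_LINE_BYTES = 64  # factor 64.0 from the group formulas


def l2_traffic(l1d_replacement: int, l1d_writebacks: int, time_s: float) -> dict:
    """L2 <-> L1 bandwidth and data volume.

    l1d_replacement -> L1D_REPLACEMENT (lines loaded into L1)
    l1d_writebacks  -> L2_TRANS_L1D_WB (modified lines evicted from L1)
    """
    load_bytes = l1d_replacement * CACHE_LINE_BYTES
    evict_bytes = l1d_writebacks * CACHE_LINE_BYTES
    total_bytes = load_bytes + evict_bytes
    return {
        "load_mbytes_s": 1.0e-6 * load_bytes / time_s,
        "evict_mbytes_s": 1.0e-6 * evict_bytes / time_s,
        "total_mbytes_s": 1.0e-6 * total_bytes / time_s,
        "total_gbytes": 1.0e-9 * total_bytes,
    }


if __name__ == "__main__":
    # Made-up counts measured over one second.
    print(l2_traffic(1_000_000, 500_000, 1.0))
```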