Add support for Intel Granite Rapids and Sierra Forest (RRZE-HPC#639)

* Add support for Intel Granite Rapids and Sierra Forest

* Fix RAPL DRAM energy unit for SPR

* Improve error handling in likwid-sysfeatures

  likwid-sysfeatures now also queries the categories properly, so different
  categories with the same feature name do not conflict.

* Add missing includes to sysFeatures_common_rapl

  Due to include order the missing includes did not cause problems, but they
  would in future files and commits, so this must be fixed.

* Allow hexadecimal numbers in sysFeatures

  To give the user more flexibility when specifying numbers.

* Add missing (void) in sysFeatures_types.h

* Add AMD HSMP sysFeatures support

* Fix APIC mapping in sysFeatures_amd_hsmp

  Likwid and hwloc currently have an incomplete/wrong understanding of APIC IDs
  and blindly assume a mapping of Linux processor number to APIC ID. This is
  wrong. For example, AMD EPYC 9354 reports gaps and jumps in its ID order.
  This commit explicitly queries the APIC ID via CPUID in order to correctly
  map LikwidDevice_t to APIC IDs.

* Explicitly set DRAM energy unit on Sapphire Rapids

  While the power unit MSR appears to match the value specified in the Intel
  SDM, the SDM specifies that the DRAM energy unit is always 61 uJ. It does not
  mention reading the unit from the MSR, so we always assume it is 61 uJ.

* Restore old likwid-sysfeatures hwthread behavior

  Commit 8c49e8a introduced device type prefixes, which broke the old
  device/cpu list behavior of just specifying a range of hardware threads
  (e.g. 0-12). This commit restores this behavior when no device type prefix
  is specified. The only remaining difference is that higher level devices
  (e.g. cores, sockets, etc.) are not implicitly created.

* Fix missing include in sysFeatures_amd

* Finalize Intel Granite Rapids support

* Fix for Intel SPR TMA metrics

* New counter list for SPR

* Remove unnecessary read in finalize of Intel SPR and GNR

* Add groups for GNR

* Add support for Intel Sierra Forest (core, uncore, energy)

* Add update of device location for SPR UPI and M3UPI units

  See https://github.com/torvalds/linux/blob/master/arch/x86/events/intel/uncore_snbep.c#L6591-L6640

* Add way to use unnamed perf uncore devices

---------

Co-authored-by: chriswasser
Co-authored-by: Michael Panzlaff
breiters · Nov 8, 2024 · 638191a
1 parent 9e14e9b
commit 638191a
Showing 55 changed files with 12,573 additions and 1,991 deletions.
diff --git a/groups/GNR/BRANCH.txt b/groups/GNR/BRANCH.txt
@@ -0,0 +1,32 @@
+SHORT Branch prediction miss rate/ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  BR_INST_RETIRED_ALL_BRANCHES
+PMC1  BR_MISP_RETIRED_ALL_BRANCHES
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Branch rate   PMC0/FIXC0
+Branch misprediction rate  PMC1/FIXC0
+Branch misprediction ratio  PMC1/PMC0
+Instructions per branch  FIXC0/PMC0
+
+LONG
+Formulas:
+Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
+Branch misprediction rate =  BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
+Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
+Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
+-
+The rates state how often, on average, a branch or a mispredicted branch occurred
+per retired instruction. The branch misprediction ratio states directly
+what fraction of all branch instructions were mispredicted.
+Instructions per branch is 1/branch rate.
+
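The group files added in this commit share one plain-text layout: a SHORT description, an EVENTSET that maps counters to events, a METRICS list in which the last whitespace-separated token of each line is the formula, and a LONG description. The following is a minimal Python sketch of how such a file could be read and its formulas evaluated. It is not LIKWID's own parser; `parse_group`, `evaluate`, and the sample counter values are hypothetical, and `eval` is used only because the input here is a trusted string.

```python
GROUP_TEXT = """\
SHORT Branch prediction miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
PMC0  BR_INST_RETIRED_ALL_BRANCHES
PMC1  BR_MISP_RETIRED_ALL_BRANCHES

METRICS
CPI  FIXC1/FIXC0
Branch rate   PMC0/FIXC0
Branch misprediction rate  PMC1/FIXC0
Branch misprediction ratio  PMC1/PMC0
Instructions per branch  FIXC0/PMC0

LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
"""


def parse_group(text):
    """Return (short description, counter->event map, [(metric, formula)])."""
    short, events, metrics = "", {}, []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SHORT"):
            short = line[len("SHORT"):].strip()
        elif line in ("EVENTSET", "METRICS", "LONG"):
            section = line
        elif section == "EVENTSET":
            counter, event = line.split()
            events[counter] = event
        elif section == "METRICS":
            # Everything before the last token is the metric name.
            name, formula = line.rsplit(None, 1)
            metrics.append((name, formula))
        # Lines in the LONG section are documentation and are skipped.
    return short, events, metrics


def evaluate(metrics, values):
    """Evaluate each formula with the counter values bound as variables."""
    return {
        name: eval(formula, {"__builtins__": {}}, dict(values))
        for name, formula in metrics
    }


short, events, metrics = parse_group(GROUP_TEXT)

# Hypothetical raw counter readings for one hardware thread.
counts = {"FIXC0": 1_000_000, "FIXC1": 2_000_000, "PMC0": 200_000, "PMC1": 10_000}
results = evaluate(metrics, counts)

for name, value in results.items():
    print(f"{name}: {value}")
```

With these made-up counts, one in five retired instructions is a branch and one in twenty branches is mispredicted, which is the relationship the LONG section describes.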
diff --git a/groups/GNR/CLOCK.txt b/groups/GNR/CLOCK.txt
@@ -0,0 +1,27 @@
+SHORT Clock frequency, power and energy consumption
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PWR0  PWR_PKG_ENERGY
+UBOX0 UNCORE_CLOCKTICKS
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+Uncore Clock [MHz] 1.E-06*UBOX0/time
+CPI  FIXC1/FIXC0
+Energy [J]  PWR0
+Power [W] PWR0/time
+
+LONG
+Formulas:
+Power =  PWR_PKG_ENERGY / time
+Uncore Clock [MHz] = 1.E-06 * UNCORE_CLOCKTICKS / time
+-
+Granite Rapids implements the RAPL interface. This interface allows
+monitoring the consumed energy on the package (socket) level.
+
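The clock and power formulas in this group can be traced with a few lines of Python. This is only a sketch under one assumption: that `inverseClock` is the reciprocal of the nominal (base) clock frequency in Hz, so that the ratio of unhalted core cycles to reference cycles scales the nominal clock. The function names and sample numbers are hypothetical.

```python
def clock_mhz(core_cycles, ref_cycles, inverse_clock):
    """Clock [MHz] = 1.E-06*(FIXC1/FIXC2)/inverseClock."""
    return 1.0e-06 * (core_cycles / ref_cycles) / inverse_clock


def uncore_clock_mhz(uncore_ticks, runtime_s):
    """Uncore Clock [MHz] = 1.E-06*UBOX0/time."""
    return 1.0e-06 * uncore_ticks / runtime_s


def power_watts(energy_joules, runtime_s):
    """Power [W] = PWR0/time."""
    return energy_joules / runtime_s


# Assumed nominal clock of 2 GHz; inverseClock is taken to be 1/nominal.
inverse_clock = 1.0 / 2.0e9

# The core ran 1.5x as many unhalted cycles as reference cycles.
core_clock = clock_mhz(3.0e9, 2.0e9, inverse_clock)
uncore_clock = uncore_clock_mhz(1.8e9, 1.0)
power = power_watts(250.0, 2.0)

print(f"Clock [MHz]: {core_clock:.1f}")
print(f"Uncore Clock [MHz]: {uncore_clock:.1f}")
print(f"Power [W]: {power:.1f}")
```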
diff --git a/groups/GNR/CYCLE_ACTIVITY.txt b/groups/GNR/CYCLE_ACTIVITY.txt
@@ -0,0 +1,38 @@
+SHORT Cycle Activities
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0 CYCLE_ACTIVITY_CYCLES_L2_MISS
+PMC1 CYCLE_ACTIVITY_CYCLES_MEM_ANY
+PMC2 CYCLE_ACTIVITY_CYCLES_L1D_MISS
+PMC3 CYCLE_ACTIVITY_STALLS_TOTAL
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Cycles without execution [%] (PMC3/FIXC1)*100
+Cycles without execution due to L1D [%] (PMC2/FIXC1)*100
+Cycles without execution due to L2 [%] (PMC0/FIXC1)*100
+Cycles without execution due to memory loads [%] (PMC1/FIXC1)*100
+
+LONG
+Formulas:
+Cycles without execution [%] = CYCLE_ACTIVITY_STALLS_TOTAL/CPU_CLK_UNHALTED_CORE*100
+Cycles without execution due to L1D [%] = CYCLE_ACTIVITY_CYCLES_L1D_MISS/CPU_CLK_UNHALTED_CORE*100
+Cycles without execution due to L2 [%] = CYCLE_ACTIVITY_CYCLES_L2_MISS/CPU_CLK_UNHALTED_CORE*100
+Cycles without execution due to memory loads [%] = CYCLE_ACTIVITY_CYCLES_MEM_ANY/CPU_CLK_UNHALTED_CORE*100
+--
+This performance group measures the cycles while waiting for data from the cache
+and memory hierarchy.
+CYCLE_ACTIVITY_STALLS_TOTAL: Total execution stalls.
+CYCLE_ACTIVITY_CYCLES_L1D_MISS: Cycles while L1 cache miss demand load is
+outstanding.
+CYCLE_ACTIVITY_CYCLES_L2_MISS: Cycles while L2 cache miss demand load is
+outstanding.
+CYCLE_ACTIVITY_CYCLES_MEM_ANY: Cycles while memory subsystem has an
+outstanding load.
diff --git a/groups/GNR/CYCLE_STALLS.txt b/groups/GNR/CYCLE_STALLS.txt
@@ -0,0 +1,39 @@
+SHORT Cycle Activities (Stalls)
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0 CYCLE_ACTIVITY_STALLS_L2_MISS
+PMC2 CYCLE_ACTIVITY_STALLS_L1D_MISS
+PMC3 CYCLE_ACTIVITY_STALLS_TOTAL
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Total execution stalls PMC3
+Stalls caused by L1D misses [%] (PMC2/PMC3)*100
+Stalls caused by L2 misses [%] (PMC0/PMC3)*100
+Execution stall rate [%] (PMC3/FIXC1)*100
+Stalls caused by L1D misses rate [%] (PMC2/FIXC1)*100
+Stalls caused by L2 misses rate [%] (PMC0/FIXC1)*100
+
+LONG
+Formulas:
+Total execution stalls = CYCLE_ACTIVITY_STALLS_TOTAL
+Stalls caused by L1D misses [%] = (CYCLE_ACTIVITY_STALLS_L1D_MISS/CYCLE_ACTIVITY_STALLS_TOTAL)*100
+Stalls caused by L2 misses [%] = (CYCLE_ACTIVITY_STALLS_L2_MISS/CYCLE_ACTIVITY_STALLS_TOTAL)*100
+Execution stall rate [%] = (CYCLE_ACTIVITY_STALLS_TOTAL/CPU_CLK_UNHALTED_CORE)*100
+Stalls caused by L1D misses rate [%] = (CYCLE_ACTIVITY_STALLS_L1D_MISS/CPU_CLK_UNHALTED_CORE)*100
+Stalls caused by L2 misses rate [%] = (CYCLE_ACTIVITY_STALLS_L2_MISS/CPU_CLK_UNHALTED_CORE)*100
+--
+This performance group measures the stalls caused by data traffic in the cache
+hierarchy.
+CYCLE_ACTIVITY_STALLS_TOTAL: Total execution stalls.
+CYCLE_ACTIVITY_STALLS_L1D_MISS: Execution stalls while L1 cache miss demand
+load is outstanding.
+CYCLE_ACTIVITY_STALLS_L2_MISS: Execution stalls while L2 cache miss demand
+load is outstanding.
diff --git a/groups/GNR/DATA.txt b/groups/GNR/DATA.txt
@@ -0,0 +1,23 @@
+SHORT Load to store ratio
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  MEM_INST_RETIRED_ALL_LOADS
+PMC1  MEM_INST_RETIRED_ALL_STORES
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Load to store ratio PMC0/PMC1
+
+LONG
+Formulas:
+Load to store ratio = MEM_INST_RETIRED_ALL_LOADS/MEM_INST_RETIRED_ALL_STORES
+-
+This is a metric to determine your load to store ratio.
+
diff --git a/groups/GNR/DIVIDE.txt b/groups/GNR/DIVIDE.txt
@@ -0,0 +1,25 @@
+SHORT Divide unit information
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  ARITH_DIV_COUNT
+PMC1  ARITH_DIV_ACTIVE
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Number of divide ops PMC0
+Avg. divide unit usage duration PMC1/PMC0
+
+LONG
+Formulas:
+Number of divide ops = ARITH_DIV_COUNT
+Avg. divide unit usage duration = ARITH_DIV_ACTIVE/ARITH_DIV_COUNT
+-
+This performance group measures the average latency of divide operations.
diff --git a/groups/GNR/ENERGY.txt b/groups/GNR/ENERGY.txt
@@ -0,0 +1,43 @@
+SHORT Power and Energy consumption
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+TMP0  TEMP_CORE
+PWR0  PWR_PKG_ENERGY
+PWR1  PWR_PP0_ENERGY
+PWR3  PWR_DRAM_ENERGY
+PWR4  PWR_PLATFORM_ENERGY
+UBOX0 UNCORE_CLOCKTICKS
+
+
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+Uncore Clock [MHz] 1.E-06*UBOX0/time
+CPI  FIXC1/FIXC0
+Temperature [C]  TMP0
+Energy [J]  PWR0
+Power [W] PWR0/time
+Energy PP0 [J]  PWR1
+Power PP0 [W] PWR1/time
+Energy DRAM [J]  PWR3
+Power DRAM [W] PWR3/time
+Energy PLATFORM [J]  PWR4
+Power PLATFORM [W] PWR4/time
+
+LONG
+Formulas:
+Power = PWR_PKG_ENERGY / time
+Power PP0 = PWR_PP0_ENERGY / time
+Power DRAM = PWR_DRAM_ENERGY / time
+Power PLATFORM = PWR_PLATFORM_ENERGY / time
+-
+Granite Rapids implements the RAPL interface. This interface allows
+monitoring the consumed energy on the package (socket), DRAM and
+platform level.
+
diff --git a/groups/GNR/FLOPS_AVX.txt b/groups/GNR/FLOPS_AVX.txt
@@ -0,0 +1,26 @@
+SHORT Packed AVX MFLOP/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
+PMC1  FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
+PMC2  FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE
+PMC3  FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+Packed SP [MFLOP/s]  1.0E-06*(PMC0*8.0+PMC2*16.0)/time
+Packed DP [MFLOP/s]  1.0E-06*(PMC1*4.0+PMC3*8.0)/time
+
+LONG
+Formulas:
+Packed SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
+Packed DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
+-
+Packed AVX single and double precision FLOP rates.
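The constants 8, 16, 4 and 8 in the formulas above are the number of elements that fit into one vector register: vector width in bits divided by element width in bits. The sketch below derives them instead of hard-coding them. The helper names and the instruction counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def flops_per_instruction(vector_bits, element_bits):
    """Number of floating-point elements processed by one packed instruction."""
    return vector_bits // element_bits


def packed_mflops(counts_by_width, element_bits, runtime_s):
    """MFLOP/s from a {vector width in bits: retired instruction count} map."""
    total_flops = sum(
        count * flops_per_instruction(width, element_bits)
        for width, count in counts_by_width.items()
    )
    return 1.0e-06 * total_flops / runtime_s


# Hypothetical counts: 1e6 256-bit and 2e6 512-bit packed instructions in 1 s.
counts = {256: 1.0e6, 512: 2.0e6}

sp_mflops = packed_mflops(counts, element_bits=32, runtime_s=1.0)
dp_mflops = packed_mflops(counts, element_bits=64, runtime_s=1.0)

print(f"Packed SP [MFLOP/s]: {sp_mflops:.1f}")
print(f"Packed DP [MFLOP/s]: {dp_mflops:.1f}")
```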
diff --git a/groups/GNR/FLOPS_DP.txt b/groups/GNR/FLOPS_DP.txt
@@ -0,0 +1,35 @@
+SHORT Double Precision MFLOP/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
+PMC1  FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
+PMC2  FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
+PMC3  FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+DP [MFLOP/s]  1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0+PMC3*8.0)/time
+AVX DP [MFLOP/s] 1.0E-06*(PMC2*4.0+PMC3*8.0)/time
+AVX512 DP [MFLOP/s]  1.0E-06*(PMC3*8.0)/time
+Packed [MUOPS/s]   1.0E-06*(PMC0+PMC2+PMC3)/time
+Scalar [MUOPS/s] 1.0E-06*PMC1/time
+Vectorization ratio [%] 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3)
+
+LONG
+Formulas:
+DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
+AVX DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
+AVX512 DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
+Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/runtime
+Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_DOUBLE/runtime
+Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/(FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)
+-
+Scalar and packed (SSE, AVX, AVX512) double precision FLOP rates.
+
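The vectorization ratio in this group is computed over instruction counts, not over FLOPs, so a 512-bit instruction carries the same weight as a 128-bit one. A small Python sketch makes that explicit. The function name and sample counts are hypothetical, and the guard against a zero denominator is an addition of this sketch, not part of the group formula.

```python
def vectorization_ratio(scalar, packed_128, packed_256, packed_512):
    """Share of packed instructions among all counted FP instructions, in percent."""
    packed = packed_128 + packed_256 + packed_512
    total = packed + scalar
    if total == 0:
        # Assumption of this sketch: report 0 % when no FP instruction retired.
        return 0.0
    return 100.0 * packed / total


# Hypothetical counts: 750 scalar, 100 + 100 + 50 packed instructions.
ratio = vectorization_ratio(scalar=750, packed_128=100, packed_256=100, packed_512=50)
print(f"Vectorization ratio [%]: {ratio:.1f}")
```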
diff --git a/groups/GNR/FLOPS_SP.txt b/groups/GNR/FLOPS_SP.txt
@@ -0,0 +1,35 @@
+SHORT Single Precision MFLOP/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE
+PMC1  FP_ARITH_INST_RETIRED_SCALAR_SINGLE
+PMC2  FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
+PMC3  FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+SP [MFLOP/s]  1.0E-06*(PMC0*4.0+PMC1+PMC2*8.0+PMC3*16.0)/time
+AVX SP [MFLOP/s] 1.0E-06*(PMC2*8.0+PMC3*16.0)/time
+AVX512 SP [MFLOP/s]  1.0E-06*(PMC3*16.0)/time
+Packed [MUOPS/s]   1.0E-06*(PMC0+PMC2+PMC3)/time
+Scalar [MUOPS/s] 1.0E-06*PMC1/time
+Vectorization ratio [%] 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3)
+
+LONG
+Formulas:
+SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE*4+FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
+AVX SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
+AVX512 SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
+Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)/runtime
+Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_SINGLE/runtime
+Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)/(FP_ARITH_INST_RETIRED_SCALAR_SINGLE+FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE)
+-
+Scalar and packed (SSE, AVX, AVX512) single precision FLOP rates.
+
diff --git a/groups/GNR/L2.txt b/groups/GNR/L2.txt
@@ -0,0 +1,38 @@
+SHORT L2 cache bandwidth in MBytes/s
+
+EVENTSET
+FIXC0 INSTR_RETIRED_ANY
+FIXC1 CPU_CLK_UNHALTED_CORE
+FIXC2 CPU_CLK_UNHALTED_REF
+FIXC3 TOPDOWN_SLOTS
+PMC0  L1D_REPLACEMENT
+PMC1  L2_TRANS_L1D_WB
+
+METRICS
+Runtime (RDTSC) [s] time
+Runtime unhalted [s] FIXC1*inverseClock
+Clock [MHz]  1.E-06*(FIXC1/FIXC2)/inverseClock
+CPI  FIXC1/FIXC0
+L2D load bandwidth [MBytes/s]  1.0E-06*PMC0*64.0/time
+L2D load data volume [GBytes]  1.0E-09*PMC0*64.0
+L2D evict bandwidth [MBytes/s]  1.0E-06*PMC1*64.0/time
+L2D evict data volume [GBytes]  1.0E-09*PMC1*64.0
+L2 bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
+L2 data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
+
+LONG
+Formulas:
+L2D load bandwidth [MBytes/s] = 1.0E-06*L1D_REPLACEMENT*64.0/time
+L2D load data volume [GBytes] = 1.0E-09*L1D_REPLACEMENT*64.0
+L2D evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L1D_WB*64.0/time
+L2D evict data volume [GBytes] = 1.0E-09*L2_TRANS_L1D_WB*64.0
+L2 bandwidth [MBytes/s] = 1.0E-06*(L1D_REPLACEMENT+L2_TRANS_L1D_WB)*64/time
+L2 data volume [GBytes] = 1.0E-09*(L1D_REPLACEMENT+L2_TRANS_L1D_WB)*64
+-
+Profiling group to measure L2 cache bandwidth. The bandwidth is computed from the
+number of cache lines allocated in the L1 and the number of modified cache lines
+evicted from the L1. The group also outputs the total data volume transferred between
+L2 and L1. Note that this bandwidth also includes data transfers due to a write
+allocate load on a store miss in L1. It does not include data loaded into the L1
+instruction cache.
+
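The L2 formulas all follow one pattern: each counted event moves one 64-byte cache line, so the data volume is the event count times 64 and the bandwidth is that volume divided by the runtime. The Python sketch below restates this; the function name, the returned keys, and the sample counts are hypothetical.

```python
CACHE_LINE_BYTES = 64.0  # cache line size used in the group formulas


def l2_metrics(l1d_replacement, l1d_writebacks, runtime_s):
    """Bandwidth [MBytes/s] and data volume [GBytes] between L2 and L1."""
    load_bytes = l1d_replacement * CACHE_LINE_BYTES
    evict_bytes = l1d_writebacks * CACHE_LINE_BYTES
    total_bytes = load_bytes + evict_bytes
    return {
        "load_mbytes_s": 1.0e-06 * load_bytes / runtime_s,
        "evict_mbytes_s": 1.0e-06 * evict_bytes / runtime_s,
        "total_mbytes_s": 1.0e-06 * total_bytes / runtime_s,
        "total_gbytes": 1.0e-09 * total_bytes,
    }


# Hypothetical counts: 1e9 lines loaded into L1, 5e8 lines written back, 10 s.
l2 = l2_metrics(l1d_replacement=1.0e9, l1d_writebacks=5.0e8, runtime_s=10.0)

for key, value in l2.items():
    print(f"{key}: {value:.1f}")
```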