[BUG] Memory Bandwidth is Overreported by ~3x on AMD Zen 2 Desktop #527

Open
biergaizi opened this issue May 28, 2023 · 9 comments
biergaizi commented May 28, 2023

Describe the bug
I tried to use likwid-bench to measure memory bandwidth on an AMD Ryzen 5 3500X (Zen 2) desktop system with a single-channel DDR4 3200MT/s DIMM. This has a theoretical bandwidth of around 20 GB/s. But the "sum" and "max" values reported by likwid-perfctr -g MEM are around 60 GB/s.
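
For reference, the peak bandwidth of a DDR4 channel follows from the transfer rate times the bus width. The sketch below assumes the standard 64-bit (8-byte) data bus per channel; the helper name is hypothetical. It gives 25.6 GB/s for one DDR4-3200 channel, so the roughly 20 GB/s quoted above is closer to what a streaming kernel sustains in practice than to the hard limit. Either way, 60 GB/s is out of reach for this configuration.

```python
# Theoretical peak bandwidth of DDR memory.
# Assumption: 64-bit (8-byte) data bus per channel, as on standard DDR4 DIMMs.

def ddr_peak_bandwidth_gb_s(transfers_per_second, channels=1, bus_bytes=8):
    """Return the peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfers_per_second * bus_bytes * channels / 1e9

single_channel = ddr_peak_bandwidth_gb_s(3200e6, channels=1)
print(f"DDR4-3200, 1 channel: {single_channel:.1f} GB/s")  # prints 25.6 GB/s
```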

How does the bandwidth calculation really work? How do I correctly interpret this performance counter?

To Reproduce with a LIKWID command

Run:

likwid-perfctr -g MEM likwid-bench -t daxpy_avx -w S0:1GB:1

The benchmark then reports:

--------------------------------------------------------------------------------
Cycles:			4337216388
CPU Clock:		3599789537
Cycle Clock:		3599789537
Time:			1.204853e+00 sec
Iterations:		16
Iterations per thread:	16
Inner loop executions:	2604166
Size (Byte):		999999744
Size per thread:	999999744
Number of Flops:	1999999488
MFlops/s:		1659.95
Data volume (Byte):	23999993856
MByte/s:		19919.44
Cycles per update:	4.337217
Cycles per cacheline:	34.697740
Loads per update:	2
Stores per update:	1
Load bytes per element:	16
Store bytes per elem.:	8
Load/store ratio:	2.00
Instructions:		874999793
UOPs:			1583332928
--------------------------------------------------------------------------------

But the measured results are:

+----------------------+---------+-------------+------------+------------+------------+------------+------------+
|         Event        | Counter |  HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+----------------------+---------+-------------+------------+------------+------------+------------+------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 13306109196 |    9369907 |   10993340 |    3229449 |   41204353 |   14484508 |
|     MAX_CPU_CLOCK    |  FIXC2  | 11861354232 |   15355152 |   18418500 |    4465116 |   60136164 |   21251376 |
| RETIRED_INSTRUCTIONS |   PMC0  |  2690558104 |     140577 |     194833 |     111628 |   10404413 |    1392965 |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 12474540176 |     431106 |     296028 |     279488 |    7216432 |    1320395 |
|    DRAM_CHANNEL_0    |   DFC0  |           0 |          0 |          0 |          0 |          0 |          0 |
|    DRAM_CHANNEL_1    |   DFC1  |   984741900 |          0 |          0 |          0 |          0 |          0 |
+----------------------+---------+-------------+------------+------------+------------+------------+------------+

+---------------------------+---------+-------------+---------+-------------+--------------+
|           Event           | Counter |     Sum     |   Min   |     Max     |      Avg     |
+---------------------------+---------+-------------+---------+-------------+--------------+
|   ACTUAL_CPU_CLOCK STAT   |  FIXC1  | 13385390753 | 3229449 | 13306109196 | 2.230898e+09 |
|     MAX_CPU_CLOCK STAT    |  FIXC2  | 11980980540 | 4465116 | 11861354232 |   1996830090 |
| RETIRED_INSTRUCTIONS STAT |   PMC0  |  2702802520 |  111628 |  2690558104 | 4.504671e+08 |
|  CPU_CLOCKS_UNHALTED STAT |   PMC1  | 12484083625 |  279488 | 12474540176 | 2.080681e+09 |
|    DRAM_CHANNEL_0 STAT    |   DFC0  |           0 |       0 |           0 |            0 |
|    DRAM_CHANNEL_1 STAT    |   DFC1  |   984741900 |       0 |   984741900 |    164123650 |
+---------------------------+---------+-------------+---------+-------------+--------------+

+-----------------------------+------------+------------+------------+------------+------------+------------+
|            Metric           | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+-----------------------------+------------+------------+------------+------------+------------+------------+
|     Runtime (RDTSC) [s]     |     4.3066 |     4.3066 |     4.3066 |     4.3066 |     4.3066 |     4.3066 |
|     Runtime unhalted [s]    |     3.6962 |     0.0026 |     0.0031 |     0.0009 |     0.0114 |     0.0040 |
|         Clock [MHz]         |  4038.4568 |  2196.7457 |  2148.6913 |  2603.7202 |  2466.6414 |  2453.6653 |
|             CPI             |     4.6364 |     3.0667 |     1.5194 |     2.5037 |     0.6936 |     0.9479 |
| Memory bandwidth [MBytes/s] | 58537.1990 |          0 |          0 |          0 |          0 |          0 |
| Memory data volume [GBytes] |   252.0939 |          0 |          0 |          0 |          0 |          0 |
+-----------------------------+------------+------------+------------+------------+------------+------------+

+----------------------------------+------------+-----------+------------+-----------+
|              Metric              |     Sum    |    Min    |     Max    |    Avg    |
+----------------------------------+------------+-----------+------------+-----------+
|     Runtime (RDTSC) [s] STAT     |    25.8396 |    4.3066 |     4.3066 |    4.3066 |
|     Runtime unhalted [s] STAT    |     3.7182 |    0.0009 |     3.6962 |    0.6197 |
|         Clock [MHz] STAT         | 15907.9207 | 2148.6913 |  4038.4568 | 2651.3201 |
|             CPI STAT             |    13.3677 |    0.6936 |     4.6364 |    2.2280 |
| Memory bandwidth [MBytes/s] STAT | 58537.1990 |         0 | 58537.1990 | 9756.1998 |
| Memory data volume [GBytes] STAT |   252.0939 |         0 |   252.0939 |   42.0157 |
+----------------------------------+------------+-----------+------------+-----------+

The "sum" and "max" values reported by likwid-perfctr -g MEM are around 60 GB/s, around 3x greater than physically possible.
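
The "~3x" figure can be checked directly from the two outputs above. The sketch below only divides the reported numbers; the variable names are made up for illustration. Note that the two tools report different time windows (1.2049 s for the benchmark kernel, 4.3066 s for the likwid-perfctr run), which is presumably why the bandwidth ratio and the data volume ratio differ so much.

```python
# Values copied from the likwid-bench and likwid-perfctr output above.
bench_mbytes_s = 19919.44        # likwid-bench "MByte/s"
perfctr_mbytes_s = 58537.1990    # likwid-perfctr "Memory bandwidth [MBytes/s]"
bench_bytes = 23999993856        # likwid-bench "Data volume (Byte)"
perfctr_gbytes = 252.0939        # likwid-perfctr "Memory data volume [GBytes]"

bandwidth_ratio = perfctr_mbytes_s / bench_mbytes_s
volume_ratio = perfctr_gbytes * 1e9 / bench_bytes

print(f"bandwidth overreported by {bandwidth_ratio:.2f}x")
print(f"data volume overreported by {volume_ratio:.2f}x")
```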

Hardware

The hardware topology is:

Machine (16GB total)
  Package L#0
    NUMANode L#0 (P#0 16GB)
    L3 L#0 (16MB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L3 L#1 (16MB)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  HostBridge
    PCIBridge
      PCI 01:00.1 (SATA)
      PCIBridge
        PCIBridge
          PCI 03:00.0 (Network)
            Net "wlp3s0"
        PCIBridge
          PCI 04:00.0 (Ethernet)
            Net "enp4s0"
    PCIBridge
      PCI 0a:00.0 (VGA)
  Misc(MemoryModule)
  Misc(MemoryModule)
  Misc(MemoryModule)
@biergaizi biergaizi added the bug label May 28, 2023
@TomTheBear
Member

Memory traffic measurements on AMD Zen* systems are not reliable. AMD documented that the metric is "approximate" and the system has to be in NPS1 mode. Based on your hwloc topology, your system is in NPS1 mode. Desktop chips are commonly not tested on our side because we don't have those, only server class chips.

You get the metric formula by running: likwid-perfctr -g MEM -H
The metric used by LIKWID contains a scaling factor (4.0/(num_numadomains/num_sockets)) to fix the NPS1 limitation. For your system, the factor would be 4.0. This could be the reason for the "physically impossible" numbers. Furthermore, the data fabric covers not only the memory controllers but all I/O.
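
As a rough sketch of how the scaling factor enters the result: the snippet below assumes the metric has the form (DFC0+DFC1) * 64 bytes * factor / time. That form is inferred from the numbers in this thread, not copied from the group file, so treat `likwid-perfctr -g MEM -H` as the authoritative source. The function names are hypothetical.

```python
BYTES_PER_EVENT = 64.0  # assumed transfer size per counted data fabric event

def nps_scale(num_numadomains, num_sockets):
    """Scaling factor quoted above: 4.0/(num_numadomains/num_sockets)."""
    return 4.0 / (num_numadomains / num_sockets)

def memory_metrics(dfc0, dfc1, runtime_s, scale):
    """Return (data volume in GBytes, bandwidth in MBytes/s) under the assumed formula."""
    volume_bytes = (dfc0 + dfc1) * BYTES_PER_EVENT * scale
    return volume_bytes / 1e9, volume_bytes / 1e6 / runtime_s

# One socket with one NUMA domain (NPS1), as in the hwloc topology above.
scale = nps_scale(num_numadomains=1, num_sockets=1)
volume_gb, bandwidth_mb_s = memory_metrics(0, 984741900, 4.3066, scale)
print(scale, round(volume_gb, 4), round(bandwidth_mb_s, 1))
```

With the raw DFC1 count and runtime from the first report, this reproduces the 252.0939 GBytes and, up to rounding of the runtime, the roughly 58537 MBytes/s that likwid-perfctr printed. A system with four NUMA domains per socket (NPS4) gets a factor of 1.0.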

But there is definitely something wrong. The kernel reports 23999993856 Bytes (approx. 24 GB) while likwid-perfctr reports 252 GB data volume. Maybe the kernel info is wrong and the data volume is calculated wrong. Is it happening also for other kernels like update_avx?

@biergaizi
Author

But there is definitely something wrong. The kernel reports 23999993856 Bytes (approx. 24 GB) while likwid-perfctr reports 252 GB data volume. Maybe the kernel info is wrong and the data volume is calculated wrong. Is it happening also for other kernels like update_avx?

Running update_avx on all cores (6 threads), the kernel reports 128 GB of DRAM traffic, while likwid-perfctr reports 690 GB. Strangely, the "average" traffic of 115 GB seems to be close to the actual traffic.

$ likwid-perfctr -g MEM likwid-bench -t update_avx -w S0:1GB
--------------------------------------------------------------------------------
CPU name:	AMD Ryzen 5 3500X 6-Core Processor             
CPU type:	AMD K17 (Zen2) architecture
CPU clock:	3.60 GHz
--------------------------------------------------------------------------------
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 6 from 125000000 elements (1000000000 bytes) to 124999968 elements (999999744 bytes)
Allocate: Process running on hwthread 0 (Domain S0) - Vector length 124999968/999999744 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: update_avx
--------------------------------------------------------------------------------
Using 1 work groups
Using 6 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 5 Global Thread 5 running on hwthread 5 - Vector length 20833328 Offset 104166640
Group: 0 Thread 3 Global Thread 3 running on hwthread 3 - Vector length 20833328 Offset 62499984
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 20833328 Offset 0
Group: 0 Thread 4 Global Thread 4 running on hwthread 4 - Vector length 20833328 Offset 83333312
Group: 0 Thread 2 Global Thread 2 running on hwthread 2 - Vector length 20833328 Offset 41666656
Group: 0 Thread 1 Global Thread 1 running on hwthread 1 - Vector length 20833328 Offset 20833328
--------------------------------------------------------------------------------
Cycles:			30698287164
CPU Clock:		3599684322
Cycle Clock:		3599684322
Time:			8.528050e+00 sec
Iterations:		384
Iterations per thread:	64
Inner loop executions:	1302083
Size (Byte):		999999744
Size per thread:	166666624
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	127999967232
MByte/s:		15009.29
Cycles per update:	3.837287
Cycles per cacheline:	30.698295
Loads per update:	1
Stores per update:	1
Load bytes per element:	8
Store bytes per elem.:	8
Load/store ratio:	1.00
Instructions:		5499998608
UOPs:			6999998208
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: MEM
+----------------------+---------+-------------+-------------+-------------+-------------+-------------+-------------+
|         Event        | Counter |  HWThread 0 |  HWThread 1 |  HWThread 2 |  HWThread 3 |  HWThread 4 |  HWThread 5 |
+----------------------+---------+-------------+-------------+-------------+-------------+-------------+-------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 43169698230 | 33670150743 | 33676628176 | 33689950087 | 33710800799 | 33698608955 |
|     MAX_CPU_CLOCK    |  FIXC2  | 39088946700 | 30709607076 | 30712390812 | 30706751412 | 30733032708 | 30718880208 |
| RETIRED_INSTRUCTIONS |   PMC0  |  3136226710 |   916793547 |   921177233 |   942113831 |   948511457 |   950893028 |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 42057649563 | 33382852709 | 33377259671 | 33389652332 | 33423999650 | 33411905706 |
|    DRAM_CHANNEL_0    |   DFC0  |           0 |           0 |           0 |           0 |           0 |           0 |
|    DRAM_CHANNEL_1    |   DFC1  |  2697065040 |           0 |           0 |           0 |           0 |           0 |
+----------------------+---------+-------------+-------------+-------------+-------------+-------------+-------------+

+---------------------------+---------+--------------+-------------+-------------+--------------+
|           Event           | Counter |      Sum     |     Min     |     Max     |      Avg     |
+---------------------------+---------+--------------+-------------+-------------+--------------+
|   ACTUAL_CPU_CLOCK STAT   |  FIXC1  | 211615836990 | 33670150743 | 43169698230 |  35269306165 |
|     MAX_CPU_CLOCK STAT    |  FIXC2  | 192669608916 | 30706751412 | 39088946700 |  32111601486 |
| RETIRED_INSTRUCTIONS STAT |   PMC0  |   7815715806 |   916793547 |  3136226710 |   1302619301 |
|  CPU_CLOCKS_UNHALTED STAT |   PMC1  | 209043319631 | 33377259671 | 42057649563 | 3.484055e+10 |
|    DRAM_CHANNEL_0 STAT    |   DFC0  |            0 |           0 |           0 |            0 |
|    DRAM_CHANNEL_1 STAT    |   DFC1  |   2697065040 |           0 |  2697065040 |    449510840 |
+---------------------------+---------+--------------+-------------+-------------+--------------+

+-----------------------------+------------+------------+------------+------------+------------+------------+
|            Metric           | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+-----------------------------+------------+------------+------------+------------+------------+------------+
|     Runtime (RDTSC) [s]     |    11.8726 |    11.8726 |    11.8726 |    11.8726 |    11.8726 |    11.8726 |
|     Runtime unhalted [s]    |    11.9940 |     9.3547 |     9.3565 |     9.3602 |     9.3660 |     9.3626 |
|         Clock [MHz]         |  3975.0323 |  3946.2665 |  3946.6680 |  3948.9543 |  3948.0193 |  3948.4097 |
|             CPI             |    13.4103 |    36.4126 |    36.2333 |    35.4412 |    35.2384 |    35.1374 |
| Memory bandwidth [MBytes/s] | 58154.6320 |          0 |          0 |          0 |          0 |          0 |
| Memory data volume [GBytes] |   690.4487 |          0 |          0 |          0 |          0 |          0 |
+-----------------------------+------------+------------+------------+------------+------------+------------+

+----------------------------------+------------+-----------+------------+-----------+
|              Metric              |     Sum    |    Min    |     Max    |    Avg    |
+----------------------------------+------------+-----------+------------+-----------+
|     Runtime (RDTSC) [s] STAT     |    71.2356 |   11.8726 |    11.8726 |   11.8726 |
|     Runtime unhalted [s] STAT    |    58.7940 |    9.3547 |    11.9940 |    9.7990 |
|         Clock [MHz] STAT         | 23713.3501 | 3946.2665 |  3975.0323 | 3952.2250 |
|             CPI STAT             |   191.8732 |   13.4103 |    36.4126 |   31.9789 |
| Memory bandwidth [MBytes/s] STAT | 58154.6320 |         0 | 58154.6320 | 9692.4387 |
| Memory data volume [GBytes] STAT |   690.4487 |         0 |   690.4487 |  115.0748 |
+----------------------------------+------------+-----------+------------+-----------+
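
On the "average" being close to the benchmark's figure: the Avg column in the table above is simply the Sum divided by the six hardware threads, and only HWThread 0 carries the data fabric counts. The sketch below reproduces that arithmetic from the printed values, which suggests the match with the benchmark's 128 GB is a coincidence of dividing by six rather than a better estimate.

```python
# "Memory data volume [GBytes]" per hardware thread, from the table above.
per_thread_gbytes = [690.4487, 0, 0, 0, 0, 0]

total_gbytes = sum(per_thread_gbytes)
avg_gbytes = total_gbytes / len(per_thread_gbytes)

# likwid-bench "Data volume (Byte)" converted to GB (1 GB = 1e9 bytes).
bench_gbytes = 127999967232 / 1e9

print(f"Sum {total_gbytes:.4f} GB, Avg {avg_gbytes:.4f} GB, benchmark {bench_gbytes:.1f} GB")
print(f"Sum / benchmark = {total_gbytes / bench_gbytes:.2f}")
```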

@TomTheBear
Member

I just tested perf_event on a Zen2 node and got similar counts. So maybe the hardware unit is just not good. In my case the scaling factor is 1.0, as the system is in NPS4 mode.
Moreover, I saw in various AMD docs that the transfer unit for the data fabric is partly 32 Byte and partly 64 Byte. This could cause some further inaccuracy.

In your case, I would remove the scaling factor and check results again.

$ mkdir -p ~/.likwid/groups/zen2
$ cp $GROUPSFOLDER/zen2/MEM.txt ~/.likwid/groups/zen2/MYMEM.txt
$ vi ~/.likwid/groups/zen2/MYMEM.txt   # remove the scaling factor in the `METRICS` and `LONG` sections
$ likwid-perfctr -g MYMEM ...
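
The manual edit in the steps above could also be scripted. This is only a sketch: the factor string is the one quoted in this thread, but the example metric line is hypothetical and may not match the real group file, so inspect the file (or the output of `likwid-perfctr -g MEM -H`) before applying anything like this.

```python
# Factor as quoted in this thread, including the leading multiplication sign.
SCALE_FACTOR = "*(4.0/(num_numadomains/num_sockets))"

def strip_scale_factor(group_text):
    """Return the group file text with every occurrence of the factor removed."""
    return group_text.replace(SCALE_FACTOR, "")

# Hypothetical metric line, for illustration only.
example = ("Memory bandwidth [MBytes/s] "
           "1.0E-06*(DFC0+DFC1)*(4.0/(num_numadomains/num_sockets))*64.0/time")
stripped = strip_scale_factor(example)
print(stripped)
```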

@biergaizi
Author

biergaizi commented May 29, 2023

I deleted the scaling factor *(4.0/(num_numadomains/num_sockets)) from the source code and reinstalled LIKWID, and now the measured memory bandwidth looks more accurate. It can still be off by 30% in some benchmarks, though; perhaps that is a hardware limitation.

$ likwid-perfctr -g MEM likwid-bench -t daxpy_avx -w S0:1GB:1
--------------------------------------------------------------------------------
CPU name:	AMD Ryzen 5 3500X 6-Core Processor             
CPU type:	AMD K17 (Zen2) architecture
CPU clock:	3.60 GHz
--------------------------------------------------------------------------------
Warning: Sanitizing vector length to a multiple of the loop stride 24 and thread count 1 from 62500000 elements (500000000 bytes) to 62499984 elements (499999872 bytes)
Allocate: Process running on hwthread 0 (Domain S0) - Vector length 62499984/499999872 Offset 0 Alignment 512
Allocate: Process running on hwthread 0 (Domain S0) - Vector length 62499984/499999872 Offset 0 Alignment 512
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: daxpy_avx
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 62499984 Offset 0
--------------------------------------------------------------------------------
Cycles:			4379510736
CPU Clock:		3599877577
Cycle Clock:		3599877577
Time:			1.216572e+00 sec
Iterations:		16
Iterations per thread:	16
Inner loop executions:	2604166
Size (Byte):		999999744
Size per thread:	999999744
Number of Flops:	1999999488
MFlops/s:		1643.96
Data volume (Byte):	23999993856
MByte/s:		19727.56
Cycles per update:	4.379512
Cycles per cacheline:	35.036095
Loads per update:	2
Stores per update:	1
Load bytes per element:	16
Store bytes per elem.:	8
Load/store ratio:	2.00
Instructions:		874999793
UOPs:			1583332928
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: MEM
+----------------------+---------+-------------+------------+------------+------------+------------+------------+
|         Event        | Counter |  HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+----------------------+---------+-------------+------------+------------+------------+------------+------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 13414089221 |   14930116 |   10718375 |   20897519 |    4114426 |   12775651 |
|     MAX_CPU_CLOCK    |  FIXC2  | 11959589448 |   25076700 |   13682124 |   26685936 |    5913036 |   18326628 |
| RETIRED_INSTRUCTIONS |   PMC0  |  2690421180 |     203273 |     864193 |    1245692 |       8058 |    1356898 |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 12596389791 |     335960 |    1313472 |    2357267 |      50844 |    1264603 |
|    DRAM_CHANNEL_0    |   DFC0  |           0 |          0 |          0 |          0 |          0 |          0 |
|    DRAM_CHANNEL_1    |   DFC1  |   984493731 |          0 |          0 |          0 |          0 |          0 |
+----------------------+---------+-------------+------------+------------+------------+------------+------------+

+---------------------------+---------+-------------+---------+-------------+--------------+
|           Event           | Counter |     Sum     |   Min   |     Max     |      Avg     |
+---------------------------+---------+-------------+---------+-------------+--------------+
|   ACTUAL_CPU_CLOCK STAT   |  FIXC1  | 13477525308 | 4114426 | 13414089221 |   2246254218 |
|     MAX_CPU_CLOCK STAT    |  FIXC2  | 12049273872 | 5913036 | 11959589448 |   2008212312 |
| RETIRED_INSTRUCTIONS STAT |   PMC0  |  2694099294 |    8058 |  2690421180 |    449016549 |
|  CPU_CLOCKS_UNHALTED STAT |   PMC1  | 12601711937 |   50844 | 12596389791 | 2.100285e+09 |
|    DRAM_CHANNEL_0 STAT    |   DFC0  |           0 |       0 |           0 |            0 |
|    DRAM_CHANNEL_1 STAT    |   DFC1  |   984493731 |       0 |   984493731 | 1.640823e+08 |
+---------------------------+---------+-------------+---------+-------------+--------------+

+-----------------------------+------------+------------+------------+------------+------------+------------+
|            Metric           | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+-----------------------------+------------+------------+------------+------------+------------+------------+
|     Runtime (RDTSC) [s]     |     4.3339 |     4.3339 |     4.3339 |     4.3339 |     4.3339 |     4.3339 |
|     Runtime unhalted [s]    |     3.7264 |     0.0041 |     0.0030 |     0.0058 |     0.0011 |     0.0035 |
|         Clock [MHz]         |  4037.5319 |  2143.2056 |  2819.9829 |  2818.9238 |  2504.7811 |  2509.4094 |
|             CPI             |     4.6819 |     1.6528 |     1.5199 |     1.8923 |     6.3098 |     0.9320 |
| Memory bandwidth [MBytes/s] | 14538.1570 |          0 |          0 |          0 |          0 |          0 |
| Memory data volume [GBytes] |    63.0076 |          0 |          0 |          0 |          0 |          0 |
+-----------------------------+------------+------------+------------+------------+------------+------------+

+----------------------------------+------------+-----------+------------+-----------+
|              Metric              |     Sum    |    Min    |     Max    |    Avg    |
+----------------------------------+------------+-----------+------------+-----------+
|     Runtime (RDTSC) [s] STAT     |    26.0034 |    4.3339 |     4.3339 |    4.3339 |
|     Runtime unhalted [s] STAT    |     3.7439 |    0.0011 |     3.7264 |    0.6240 |
|         Clock [MHz] STAT         | 16833.8347 | 2143.2056 |  4037.5319 | 2805.6391 |
|             CPI STAT             |    16.9887 |    0.9320 |     6.3098 |    2.8314 |
| Memory bandwidth [MBytes/s] STAT | 14538.1570 |         0 | 14538.1570 | 2423.0262 |
| Memory data volume [GBytes] STAT |    63.0076 |         0 |    63.0076 |   10.5013 |
+----------------------------------+------------+-----------+------------+-----------+

Another one:

$ likwid-perfctr -g MEM likwid-bench -t update_avx -w S0:1GB
--------------------------------------------------------------------------------
CPU name:	AMD Ryzen 5 3500X 6-Core Processor             
CPU type:	AMD K17 (Zen2) architecture
CPU clock:	3.60 GHz
--------------------------------------------------------------------------------
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 6 from 125000000 elements (1000000000 bytes) to 124999968 elements (999999744 bytes)
Allocate: Process running on hwthread 0 (Domain S0) - Vector length 124999968/999999744 Offset 0 Alignment 512
Initialization: First thread in domain initializes the whole stream
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: update_avx
--------------------------------------------------------------------------------
Using 1 work groups
Using 6 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 0 Thread 5 Global Thread 5 running on hwthread 5 - Vector length 20833328 Offset 104166640
Group: 0 Thread 4 Global Thread 4 running on hwthread 4 - Vector length 20833328 Offset 83333312
Group: 0 Thread 3 Global Thread 3 running on hwthread 3 - Vector length 20833328 Offset 62499984
Group: 0 Thread 0 Global Thread 0 running on hwthread 0 - Vector length 20833328 Offset 0
Group: 0 Thread 2 Global Thread 2 running on hwthread 2 - Vector length 20833328 Offset 41666656
Group: 0 Thread 1 Global Thread 1 running on hwthread 1 - Vector length 20833328 Offset 20833328
--------------------------------------------------------------------------------
Cycles:			30701085408
CPU Clock:		3599965906
Cycle Clock:		3599965906
Time:			8.528160e+00 sec
Iterations:		384
Iterations per thread:	64
Inner loop executions:	1302083
Size (Byte):		999999744
Size per thread:	166666624
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	127999967232
MByte/s:		15009.10
Cycles per update:	3.837637
Cycles per cacheline:	30.701093
Loads per update:	1
Stores per update:	1
Load bytes per element:	8
Store bytes per elem.:	8
Load/store ratio:	1.00
Instructions:		5499998608
UOPs:			6999998208
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: MEM
+----------------------+---------+-------------+-------------+-------------+-------------+-------------+-------------+
|         Event        | Counter |  HWThread 0 |  HWThread 1 |  HWThread 2 |  HWThread 3 |  HWThread 4 |  HWThread 5 |
+----------------------+---------+-------------+-------------+-------------+-------------+-------------+-------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 43083054062 | 33578628214 | 33580304342 | 33615693909 | 33602087908 | 33610510415 |
|     MAX_CPU_CLOCK    |  FIXC2  | 39094248924 | 30717370620 | 30719175696 | 30734790624 | 30709689336 | 30721298472 |
| RETIRED_INSTRUCTIONS |   PMC0  |  3138544641 |   917128708 |   919429596 |   935464695 |   933914764 |   940417699 |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 41958268642 | 33276095792 | 33284432156 | 33309565345 | 33327410586 | 33327488126 |
|    DRAM_CHANNEL_0    |   DFC0  |           0 |           0 |           0 |           0 |           0 |           0 |
|    DRAM_CHANNEL_1    |   DFC1  |  2697067901 |           0 |           0 |           0 |           0 |           0 |
+----------------------+---------+-------------+-------------+-------------+-------------+-------------+-------------+

+---------------------------+---------+--------------+-------------+-------------+--------------+
|           Event           | Counter |      Sum     |     Min     |     Max     |      Avg     |
+---------------------------+---------+--------------+-------------+-------------+--------------+
|   ACTUAL_CPU_CLOCK STAT   |  FIXC1  | 211070278850 | 33578628214 | 43083054062 | 3.517838e+10 |
|     MAX_CPU_CLOCK STAT    |  FIXC2  | 192696573672 | 30709689336 | 39094248924 |  32116095612 |
| RETIRED_INSTRUCTIONS STAT |   PMC0  |   7784900103 |   917128708 |  3138544641 | 1.297483e+09 |
|  CPU_CLOCKS_UNHALTED STAT |   PMC1  | 208483260647 | 33276095792 | 41958268642 | 3.474721e+10 |
|    DRAM_CHANNEL_0 STAT    |   DFC0  |            0 |           0 |           0 |            0 |
|    DRAM_CHANNEL_1 STAT    |   DFC1  |   2697067901 |           0 |  2697067901 | 4.495113e+08 |
+---------------------------+---------+--------------+-------------+-------------+--------------+

+-----------------------------+------------+------------+------------+------------+------------+------------+
|            Metric           | HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+-----------------------------+------------+------------+------------+------------+------------+------------+
|     Runtime (RDTSC) [s]     |    11.8723 |    11.8723 |    11.8723 |    11.8723 |    11.8723 |    11.8723 |
|     Runtime unhalted [s]    |    11.9676 |     9.3275 |     9.3280 |     9.3378 |     9.3340 |     9.3363 |
|         Clock [MHz]         |  3967.2724 |  3935.2953 |  3935.2605 |  3937.4063 |  3939.0297 |  3938.5281 |
|             CPI             |    13.3687 |    36.2829 |    36.2012 |    35.6075 |    35.6857 |    35.4390 |
| Memory bandwidth [MBytes/s] | 14539.1157 |          0 |          0 |          0 |          0 |          0 |
| Memory data volume [GBytes] |   172.6123 |          0 |          0 |          0 |          0 |          0 |
+-----------------------------+------------+------------+------------+------------+------------+------------+

+----------------------------------+------------+-----------+------------+-----------+
|              Metric              |     Sum    |    Min    |     Max    |    Avg    |
+----------------------------------+------------+-----------+------------+-----------+
|     Runtime (RDTSC) [s] STAT     |    71.2338 |   11.8723 |    11.8723 |   11.8723 |
|     Runtime unhalted [s] STAT    |    58.6312 |    9.3275 |    11.9676 |    9.7719 |
|         Clock [MHz] STAT         | 23652.7923 | 3935.2605 |  3967.2724 | 3942.1321 |
|             CPI STAT             |   192.5850 |   13.3687 |    36.2829 |   32.0975 |
| Memory bandwidth [MBytes/s] STAT | 14539.1157 |         0 | 14539.1157 | 2423.1860 |
| Memory data volume [GBytes] STAT |   172.6123 |         0 |   172.6123 |   28.7687 |
+----------------------------------+------------+-----------+------------+-----------+
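
To put numbers on "more accurate": the sketch below compares the two runs without the scaling factor against what likwid-bench reported, using only values printed above. The comparison is rough, since likwid-perfctr covers a longer runtime (4.3339 s and 11.8723 s) than the benchmark kernels themselves (1.2166 s and 8.5282 s).

```python
# (benchmark MByte/s, perfctr MBytes/s, benchmark bytes, perfctr GBytes)
runs = {
    "daxpy_avx, 1 thread": (19727.56, 14538.1570, 23999993856, 63.0076),
    "update_avx, 6 threads": (15009.10, 14539.1157, 127999967232, 172.6123),
}

results = {}
for name, (bench_bw, perfctr_bw, bench_bytes, perfctr_gb) in runs.items():
    bw_deviation_pct = (perfctr_bw - bench_bw) / bench_bw * 100.0
    volume_ratio = perfctr_gb * 1e9 / bench_bytes
    results[name] = (bw_deviation_pct, volume_ratio)
    print(f"{name}: bandwidth {bw_deviation_pct:+.1f}%, data volume {volume_ratio:.2f}x")
```

The daxpy_avx bandwidth lands about 26% below the benchmark's figure and update_avx about 3% below, while the data volumes are still higher than what the kernels report.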

@TomTheBear
Member

As I said, they are not that accurate. I still have no idea why the data volume does not fit. For update* and daxpy*, the counts from likwid-bench and likwid-perfctr should match. There are other kernels where the difference is caused by cache management options (write-allocate, RFO) when storing to a stream without loading it before (store*, copy*, stream*, ...). On your system only DFC1 contains counts, maybe related to the single-rank memory, but there is no public documentation about the data fabric; only the bandwidth metric is defined, and I derived all info from that. If you want more accurate counts, send a request to AMD. I have tried to free more info from AMD on hardware counting for some years now.
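
The write-allocate effect mentioned above can be sketched as follows. This is a simplified model, assuming 8-byte elements, that every stored stream which was not loaded first costs one extra read from memory, and no non-temporal stores. The function and stream names are hypothetical.

```python
def bytes_per_element(loaded, stored, elem_bytes=8, write_allocate=True):
    """Return (bytes counted by the kernel, expected memory traffic) per element.

    loaded / stored are sets of stream names the kernel reads / writes.
    """
    counted = (len(loaded) + len(stored)) * elem_bytes
    # Streams that are stored without being loaded first are read implicitly
    # before the write (write-allocate / read for ownership).
    extra_reads = len(stored - loaded) if write_allocate else 0
    return counted, counted + extra_reads * elem_bytes

print(bytes_per_element({"a"}, {"a"}))        # update: (16, 16)
print(bytes_per_element({"x", "y"}, {"y"}))   # daxpy:  (24, 24)
print(bytes_per_element({"a"}, {"b"}))        # copy:   (16, 24)
```

In this model update and daxpy show no gap between the kernel's own count and the memory traffic, which is why their numbers would be expected to match, while copy-like kernels move more data than they count.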

@biergaizi
Author

biergaizi commented May 29, 2023

On your system only DFC1 contains counts, maybe related to the single-rank memory, but there is no public documentation about the data fabric; only the bandwidth metric is defined, and I derived all info from that.

Not single rank, just single channel. Only a single DIMM module is plugged in, so the other integrated memory controller is disabled, as LIKWID correctly reports here. And since memory traffic is calculated by summing the counters of both memory controllers, I don't expect a two-channel configuration to be more or less accurate than a one-channel one, but who knows...

If you want more accurate counts, send a request to AMD. I have tried to free more info from AMD on hardware counting for some years now.

Understandable. I didn't expect to get tech support from a community project when even AMD doesn't officially provide any, I just wanted to learn more about this situation from experienced maintainers.

@TomTheBear
Member

Well, then it's just lucky that the event name fits. They are not documented, so I made the names up.

On my Zen2 test system (2x AMD EPYC 7352), the DFC0 counter is commonly larger:

|    DRAM_CHANNEL_0    |   DFC0  |   969533443 |
|    DRAM_CHANNEL_1    |   DFC1  |     1230126 |

but also there, the counts are not really accurate and somewhat similar to your system (daxpy_avx kernel):

MByte/s:		19934.36
| Memory bandwidth [MBytes/s] | 14571.3764 |
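
To quantify "somewhat similar", the sketch below divides the likwid-perfctr bandwidth by the likwid-bench bandwidth for both systems. The EPYC values are the two lines above; the desktop values are taken from the daxpy_avx run without the scaling factor earlier in this thread.

```python
# (likwid-perfctr MBytes/s, likwid-bench MByte/s) for the daxpy_avx kernel
systems = {
    "2x EPYC 7352": (14571.3764, 19934.36),
    "Ryzen 5 3500X, factor removed": (14538.1570, 19727.56),
}

ratios = {name: perfctr / bench for name, (perfctr, bench) in systems.items()}
for name, ratio in ratios.items():
    print(f"{name}: perfctr/bench = {ratio:.3f}")
```

Both ratios come out at roughly 0.73, so the counter reads about a quarter below the benchmark's own figure on either machine.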

You could try to use the other DIMM slot, maybe that helps.

@biergaizi
Author

Well, then it's just lucky that the event name fits. They are not documented, so I made the names up.

I physically moved the DIMM from slot 2 to slot 4, and indeed, now the counter shifted from DRAM_CHANNEL_1 to DRAM_CHANNEL_0. I'm feeling lucky.

$ likwid-perfctr -g MEM likwid-bench -t daxpy_sse -w S0:1GB:1
[...]
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: daxpy_sse
--------------------------------------------------------------------------------
MFlops/s:		1690.63
Data volume (Byte):	24000000000
MByte/s:		20287.56
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: MEM
+----------------------+---------+-------------+------------+------------+------------+------------+------------+
|         Event        | Counter |  HWThread 0 | HWThread 1 | HWThread 2 | HWThread 3 | HWThread 4 | HWThread 5 |
+----------------------+---------+-------------+------------+------------+------------+------------+------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 13071323132 |   37725770 |    6585375 |   15781722 |    5079497 |   18900728 |
|     MAX_CPU_CLOCK    |  FIXC2  | 11653149528 |   48101040 |   11395116 |   20104344 |    6111756 |   20442888 |
| RETIRED_INSTRUCTIONS |   PMC0  |  6438025207 |    1339036 |       2050 |     777827 |     196215 |    3738628 |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 12256831310 |    3074610 |      11045 |     751912 |     291693 |    2770921 |
|    DRAM_CHANNEL_0    |   DFC0  |   984198619 |          0 |          0 |          0 |          0 |          0 |
|    DRAM_CHANNEL_1    |   DFC1  |           0 |          0 |          0 |          0 |          0 |          0 |
+----------------------+---------+-------------+------------+------------+------------+------------+------------+

@TomTheBear
Member

This issue seems resolved even if the DataFabric unit counts are off. If so, please close the issue.
