Unable to Access Uncore Counters on SPR #633

Open
Mockingjay1316 opened this issue Sep 4, 2024 · 5 comments

Comments

@Mockingjay1316

Hi,

I am trying to run LIKWID 5.3 on an SPR machine with a fairly recent Linux kernel (5.15 according to uname -a). I have followed the steps in the build instructions, including the boot option and capabilities. Secure Boot is off.

boot option:

$ sudo dmesg | grep allow_writes
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-119-generic root=[###] ro msr.allow_writes=on
[    1.422673] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-119-generic root=[###] ro msr.allow_writes=on

capabilities (I have set them all):

$ getcap -r .
./likwid-perfscope cap_sys_rawio,cap_sys_admin=ep
./likwid-topology cap_sys_rawio,cap_sys_admin=ep
./likwid-setFrequencies cap_sys_rawio,cap_sys_admin=ep
./likwid-pin cap_sys_rawio,cap_sys_admin=ep
./likwid-features cap_sys_rawio,cap_sys_admin=ep
./likwid-lua cap_sys_rawio,cap_sys_admin=ep
./likwid-memsweeper cap_sys_rawio,cap_sys_admin=ep
./likwid-bench cap_sys_rawio,cap_sys_admin=ep
./likwid-mpirun cap_sys_rawio,cap_sys_admin=ep
./likwid-powermeter cap_sys_rawio,cap_sys_admin=ep
./likwid-genTopoCfg cap_sys_rawio,cap_sys_admin=ep

msrs:

$ ll /dev/cpu/*/msr
crw----rw- 1 root root 202,  0 Sep  2 17:21 /dev/cpu/0/msr
crw----rw- 1 root root 202, 10 Sep  2 17:21 /dev/cpu/10/msr
crw----rw- 1 root root 202, 11 Sep  2 17:21 /dev/cpu/11/msr
crw----rw- 1 root root 202, 12 Sep  2 17:21 /dev/cpu/12/msr
crw----rw- 1 root root 202, 13 Sep  2 17:21 /dev/cpu/13/msr
crw----rw- 1 root root 202, 14 Sep  2 17:21 /dev/cpu/14/msr
crw----rw- 1 root root 202, 15 Sep  2 17:21 /dev/cpu/15/msr
crw----rw- 1 root root 202, 16 Sep  2 17:21 /dev/cpu/16/msr
crw----rw- 1 root root 202, 17 Sep  2 17:21 /dev/cpu/17/msr
crw----rw- 1 root root 202, 18 Sep  2 17:21 /dev/cpu/18/msr
crw----rw- 1 root root 202, 19 Sep  2 17:21 /dev/cpu/19/msr
crw----rw- 1 root root 202,  1 Sep  2 17:21 /dev/cpu/1/msr
crw----rw- 1 root root 202, 20 Sep  2 17:21 /dev/cpu/20/msr
crw----rw- 1 root root 202, 21 Sep  2 17:21 /dev/cpu/21/msr
crw----rw- 1 root root 202, 22 Sep  2 17:21 /dev/cpu/22/msr
crw----rw- 1 root root 202, 23 Sep  2 17:21 /dev/cpu/23/msr
crw----rw- 1 root root 202, 24 Sep  2 17:21 /dev/cpu/24/msr
crw----rw- 1 root root 202, 25 Sep  2 17:21 /dev/cpu/25/msr
crw----rw- 1 root root 202, 26 Sep  2 17:21 /dev/cpu/26/msr
crw----rw- 1 root root 202, 27 Sep  2 17:21 /dev/cpu/27/msr
crw----rw- 1 root root 202, 28 Sep  2 17:21 /dev/cpu/28/msr
crw----rw- 1 root root 202, 29 Sep  2 17:21 /dev/cpu/29/msr
crw----rw- 1 root root 202,  2 Sep  2 17:21 /dev/cpu/2/msr
crw----rw- 1 root root 202, 30 Sep  2 17:21 /dev/cpu/30/msr
crw----rw- 1 root root 202, 31 Sep  2 17:21 /dev/cpu/31/msr
crw----rw- 1 root root 202,  3 Sep  2 17:21 /dev/cpu/3/msr
crw----rw- 1 root root 202,  4 Sep  2 17:21 /dev/cpu/4/msr
crw----rw- 1 root root 202,  5 Sep  2 17:21 /dev/cpu/5/msr
crw----rw- 1 root root 202,  6 Sep  2 17:21 /dev/cpu/6/msr
crw----rw- 1 root root 202,  7 Sep  2 17:21 /dev/cpu/7/msr
crw----rw- 1 root root 202,  8 Sep  2 17:21 /dev/cpu/8/msr
crw----rw- 1 root root 202,  9 Sep  2 17:21 /dev/cpu/9/msr

I have tried both accessdaemon mode and direct mode, but likwid-perfctr -e lists only core counters and no uncore counters. As a test I also ran likwid-perfctr -C 0 -g MEM ls; the result shows no memory data and ends with an error message (in direct mode):

Group 1: MEM
+-----------------------+----------+------------+
|         Event         |  Counter | HWThread 0 |
+-----------------------+----------+------------+
|   INSTR_RETIRED_ANY   |   FIXC0  |     533551 |
| CPU_CLK_UNHALTED_CORE |   FIXC1  |     870762 |
|  CPU_CLK_UNHALTED_REF |   FIXC2  |     700756 |
|     TOPDOWN_SLOTS     |   FIXC3  |    5224572 |
|      CAS_COUNT_RD     |  MBOX0C0 |      -     |
|      CAS_COUNT_WR     |  MBOX0C1 |      -     |
|      CAS_COUNT_RD     |  MBOX1C0 |      -     |
|      CAS_COUNT_WR     |  MBOX1C1 |      -     |
|      CAS_COUNT_RD     |  MBOX2C0 |      -     |
|      CAS_COUNT_WR     |  MBOX2C1 |      -     |
|      CAS_COUNT_RD     |  MBOX3C0 |      -     |
|      CAS_COUNT_WR     |  MBOX3C1 |      -     |
|      CAS_COUNT_RD     |  MBOX4C0 |      -     |
|      CAS_COUNT_WR     |  MBOX4C1 |      -     |
|      CAS_COUNT_RD     |  MBOX5C0 |      -     |
|      CAS_COUNT_WR     |  MBOX5C1 |      -     |
|      CAS_COUNT_RD     |  MBOX6C0 |      -     |
|      CAS_COUNT_WR     |  MBOX6C1 |      -     |
|      CAS_COUNT_RD     |  MBOX7C0 |      -     |
|      CAS_COUNT_WR     |  MBOX7C1 |      -     |
|      CAS_COUNT_RD     |  MBOX8C0 |      -     |
|      CAS_COUNT_WR     |  MBOX8C1 |      -     |
|      CAS_COUNT_RD     |  MBOX9C0 |      -     |
|      CAS_COUNT_WR     |  MBOX9C1 |      -     |
|      CAS_COUNT_RD     | MBOX10C0 |      -     |
|      CAS_COUNT_WR     | MBOX10C1 |      -     |
|      CAS_COUNT_RD     | MBOX11C0 |      -     |
|      CAS_COUNT_WR     | MBOX11C1 |      -     |
|      CAS_COUNT_RD     | MBOX12C0 |      -     |
|      CAS_COUNT_WR     | MBOX12C1 |      -     |
|      CAS_COUNT_RD     | MBOX13C0 |      -     |
|      CAS_COUNT_WR     | MBOX13C1 |      -     |
|      CAS_COUNT_RD     | MBOX14C0 |      -     |
|      CAS_COUNT_WR     | MBOX14C1 |      -     |
|      CAS_COUNT_RD     | MBOX15C0 |      -     |
|      CAS_COUNT_WR     | MBOX15C1 |      -     |
+-----------------------+----------+------------+
...
ERROR - [./src/includes/perfmon_sapphirerapids.h:perfmon_finalizeCountersThread_sapphirerapids:2222] No such file or directory.
MSR read operation failed

likwid-perfctr -i gives out information like this:

$ likwid-perfctr -i
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 5415+
CPU type:       Intel SapphireRapids processor
CPU clock:      2.90 GHz
CPU family:     6
CPU model:      143
CPU short:      SPR
CPU stepping:   8
CPU features:   FP ACPI MMX SSE SSE2 HTT TM RDTSCP MONITOR VMX EIST TM2 SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 AVX512 RDSEED SSE3
CPU arch:       x86_64
--------------------------------------------------------------------------------
PERFMON version:                        5
PERFMON number of counters:             8
PERFMON width of counters:              48
PERFMON number of fixed counters:       4
--------------------------------------------------------------------------------

Can you shed some light on how to proceed? Thank you!

@TomTheBear
Member

TomTheBear commented Sep 4, 2024

I haven't tested LIKWID with capabilities for a long time, especially not on SPR. I remember that they were hard to configure correctly. Quick guess: with ACCESSMODE=direct, remove the capabilities from everything except likwid-lua. Warning: This is a security issue, as anyone using this interpreter can use the capabilities! If you use ACCESSMODE=accessdaemon, the daemons in sbin require the capabilities and there should be no security problem. Where did you get the info on how to configure the capabilities correctly? I will test it myself and update the docs.

In order to set the capabilities you probably had root privileges. Try an installation with ACCESSMODE=accessdaemon to some user-local prefix but with sudo (The user-local prefix is just to easily delete the whole installation again). Adjust the PATH and LD_LIBRARY_PATH and check again whether it still does not work.
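The trial installation described above could look roughly like the following sketch. The prefix `$HOME/likwid-test` and the helper name `use_likwid_prefix` are placeholders, and passing `ACCESSMODE` and `PREFIX` on the make command line (instead of editing `config.mk`) is an assumption about the build system.

```shell
# Hypothetical user-local prefix; pick any directory that is easy to delete.
LIKWID_PREFIX="$HOME/likwid-test"

# Build and install steps, shown as comments because they need the LIKWID
# source tree and root privileges (assumed invocation, adjust as needed):
#   make ACCESSMODE=accessdaemon PREFIX="$LIKWID_PREFIX"
#   sudo make ACCESSMODE=accessdaemon PREFIX="$LIKWID_PREFIX" install

# Prepend the trial installation to the search paths of the current shell.
use_likwid_prefix() {
    prefix="$1"
    PATH="$prefix/bin:$PATH"
    LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

use_likwid_prefix "$LIKWID_PREFIX"
command -v likwid-perfctr || echo "likwid-perfctr not found under $LIKWID_PREFIX"
```

Removing the directory behind `LIKWID_PREFIX` afterwards gets rid of the whole trial installation.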

Running with debug output (-V 3) should give you a better idea of where it fails.

@Mockingjay1316
Author

Thank you for the reply! I just rebuilt with ACCESSMODE=accessdaemon and the problem persists. With -V 3 I got more error messages (thousands of them) like:

...
DEBUG - [access_client_check:562] Device check for dev 199 on socket 0 with accessDaemon failed
...

As a result, MBOX cannot be accessed. The other errors seem to be of a similar type.

I also examined the syslog:

Sep  4 16:39:06 accessD: AccessDaemon runs with UID 1172040, eUID 0
Sep  4 16:39:08 accessD: Failed to read data from register 0x63a on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to read data from register 0x641 on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep  4 16:39:08 accessD: Input/output error

Capabilities cap_sys_rawio,cap_sys_admin=ep are given to likwid-accessD.
Am I missing something here?
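With thousands of such messages, it can help to collapse them into one line per operation and register. A minimal sketch, assuming the accessD messages end up in `/var/log/syslog` (the location varies by distribution) and follow the format shown above; `summarize_accessd_failures` is a hypothetical helper name:

```shell
# Count accessD failures per (operation, register, core) in a log file.
summarize_accessd_failures() {
    grep 'accessD: Failed to' "$1" |
        sed -E 's/.*Failed to (read|write) data (from|to) register (0x[0-9a-fA-F]+) on core ([0-9]+).*/\1 \3 core \4/' |
        sort | uniq -c | sort -rn
}

# Example call (assumed log location):
#   summarize_accessd_failures /var/log/syslog
```

Each output line is a count followed by the operation, the register address, and the core, with the most frequent failures first.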

@TomTheBear
Member

Have you tried CAP_DAC_OVERRIDE? Based on the answer at this SO entry, the two capabilities you specified are not enough to fully read /dev/mem.

@TomTheBear
Member

According to our tests, the additional CAP_DAC_OVERRIDE capability fixes the issue. We will update the docs. Can you confirm?

cap_sys_rawio,cap_sys_admin,cap_dac_override=ep
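For reference, a sketch of applying and verifying this capability set with the standard `setcap`/`getcap` tools. The path to `likwid-accessD` is an assumption (it depends on the install prefix), and `has_required_caps` is a hypothetical helper:

```shell
REQUIRED_CAPS="cap_sys_rawio cap_sys_admin cap_dac_override"

# Succeeds if the given getcap output line lists every required capability;
# otherwise prints the first missing one and fails.
has_required_caps() {
    line="$1"
    for cap in $REQUIRED_CAPS; do
        case "$line" in
            *"$cap"*) ;;
            *) echo "missing: $cap"; return 1 ;;
        esac
    done
}

# Assumed location of the daemon; adjust to your install prefix.
ACCESSD="/usr/local/sbin/likwid-accessD"

# Setting capabilities needs root, so it is shown as a comment:
#   sudo setcap cap_sys_rawio,cap_sys_admin,cap_dac_override=ep "$ACCESSD"

# Verify only if getcap is installed and the daemon exists at that path.
if command -v getcap >/dev/null 2>&1 && [ -e "$ACCESSD" ]; then
    has_required_caps "$(getcap "$ACCESSD")" && echo "capabilities look complete"
fi
```

The helper does a plain substring match, so it works regardless of the order in which `getcap` prints the capabilities.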

@ipatix
Contributor

ipatix commented Nov 8, 2024

> Capabilities cap_sys_rawio,cap_sys_admin=ep are given to likwid-accessD. Am I missing something here?

Does likwid-accessD have root-suid permissions? If not, there is not much sense in using the access daemon at all, since its purpose is to perform security checks before each access. In either the access daemon or the direct case, it is still possible to make it work: all you need is appropriate rw permissions not only on /dev/cpu/*/msr but also on /dev/mem (in addition to cap_sys_rawio=ep). However, this is about as severe a security risk as it gets, because it gives users access to ALL memory.
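The checks mentioned here can be scripted. A minimal sketch using only POSIX `test` operators; the daemon path is an assumption and `check_access` is a hypothetical helper. It only reports the current state and changes nothing:

```shell
# Report whether the daemon has the setuid bit (ownership by root is not
# checked here) and whether the current user may read and write each device.
check_access() {
    daemon="$1"
    if [ -u "$daemon" ]; then
        echo "$daemon: setuid bit set"
    else
        echo "$daemon: setuid bit NOT set"
    fi
    shift
    for dev in "$@"; do
        if [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "$dev: read/write for current user"
        else
            echo "$dev: no read/write access for current user"
        fi
    done
}

# Assumed daemon location; adjust to your install prefix.
check_access /usr/local/sbin/likwid-accessD /dev/mem /dev/cpu/0/msr
```

Passing the file permission check does not guarantee that the kernel allows the access, since capabilities and kernel configuration are evaluated separately.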
