AMD family 25, model 68 support missing (e.g. AMD Ryzen 7 PRO 6850U) #635

Open
LadnerJonas opened this issue Sep 27, 2024 · 17 comments

Comments

@LadnerJonas

Why do you need support for this specific architecture?
This CPU is a Zen3+ model, which is used in plenty of recent ThinkPads.

Which architecture model, family and further information? CPU

> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 68
model name      : AMD Ryzen 7 PRO 6850U with Radeon Graphics
stepping        : 1
microcode       : 0xa404105

> likwid-perfctr --version
likwid-perfctr -- Version 5.3.0

> uname -a
Linux 6.6.47-1-Manjaro Linux

Is the documentation of the hardware counters publicly available?
Linux perf is working fine, so I assume yes.
Are there already any usable tools (commercial or open-source)?

> perf stat /bin/true           

 Performance counter stats for '/bin/true':

              0,56 msec task-clock                       #    0,333 CPUs utilized             
                 1      context-switches                 #    1,796 K/sec                     
                 0      cpu-migrations                   #    0,000 /sec                      
                52      page-faults                      #   93,367 K/sec                     
         1.652.810      cycles                           #    2,968 GHz                       
           569.581      stalled-cycles-frontend          #   34,46% frontend cycles idle      
           935.173      instructions                     #    0,57  insn per cycle            
                                                  #    0,61  stalled cycles per insn   
           212.933      branches                         #  382,325 M/sec                     
            26.256      branch-misses                    #   12,33% of all branches           

       0,001672605 seconds time elapsed

       0,001560000 seconds user
       0,000000000 seconds sys

If you need any more information, I am happy to provide it.
Thank you!

@TomTheBear
Member

Unfortunately, I cannot find matching documentation at AMD TechDocs for family 19h model 44h. Getting the information out of the kernel is possible but time-consuming.
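
For reference, the decimal values reported by /proc/cpuinfo (family 25, model 68) correspond to AMD's hexadecimal naming (family 19h, model 44h). A minimal Python sketch of that conversion follows; the helper name `cpu_ids_hex` is made up for illustration, and it only looks at the `cpu family` and `model` fields of the first processor entry.

```python
def cpu_ids_hex(cpuinfo_text):
    """Return (family, model) of the first CPU entry as AMD-style hex strings."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # Keep only the first occurrence, i.e. the entry for processor 0.
        if key in ("cpu family", "model") and key not in fields:
            fields[key] = int(value)
    return f"{fields['cpu family']:X}h", f"{fields['model']:X}h"


# Excerpt of the /proc/cpuinfo output posted above.
sample = """\
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 68
model name      : AMD Ryzen 7 PRO 6850U with Radeon Graphics
"""
print(cpu_ids_hex(sample))  # ('19h', '44h')
```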

@tnibler

tnibler commented Sep 30, 2024

It's a huge pain, PPR docs for the last 3 (actually 4 now) generations are missing and AMD support just says they will be released "later".

In the meantime I do have some time to give it a shot myself (model 75h personally). There is the BKDG, but lots of events are missing from it, so I guess the files in perf are the reference? Is there some method, or do you just check all event numbers and bitmasks one by one? And how likely is it that big changes are needed? Just changing the model number in perfmon_zen4_counters.h does work, but I'd like to make sure the counters are not mixed up and producing wrong results.

Thank you :)

@tnibler

tnibler commented Oct 10, 2024

Update: seems like there are at least some differences between model 75h and the Zen4 counter mappings in likwid (don't know about 68h), and even a good number of e.g. cache events in perf don't work or give weird values compared to uprof. AMD support still won't send over manuals, soo yeah I guess that's that for the time being.

@TomTheBear
Member

You mean there are some wrong configs in LIKWID regarding Model 75H? Could you point out a few so I have a starting point?

As far as I know the perf_event code in the Linux kernel for AMD chips, it basically only differentiates between AMD K17 and K19. The cache events are defined in perf_event directly only for K17, and K19 uses the same list. I never compared the two K's that deeply, so I am not sure whether this is correct or a mistake.

@tnibler

tnibler commented Oct 10, 2024

That was a misleading way to phrase it, I'm not sure LIKWID is doing anything wrong actually. perf has the same L3 lookup state events as LIKWID, but they are not shown in perf list and don't work if referred to by name with perf stat.

likwid-perfctr debug-prints Cannot access counter register CPMC0 and only shows 0 counts with -g L3CACHE, but actually perf can also not read anything if given -e r04ff for instance for L3 lookup state. The rest is just baseless speculation, I don't really know how to debug much further than that.

@TomTheBear
Member

The important part in the linked JSON file is: https://github.com/torvalds/linux/blob/9852d85ec9d492ebef56dc5f229416c925758edc/tools/perf/pmu-events/arch/x86/amdzen4/cache.json#L658

The L3PMC is similar to LIKWID's CPMC (Cache Performance Monitoring Counter). It is a different unit that you have to specify explicitly. For LIKWID it is encoded in the counter name CPMC0; for perf, you have to specify it as -e amd_l3/config=0x04ff/. But if your LIKWID installation was built with ACCESSMODE=perf_event, the reason why neither LIKWID nor perf works is that the perfmon unit is not exposed by your system through perf_event (the folder /sys/devices/amd_l3 does not exist). LIKWID with ACCESSMODE=accessdaemon is capable of using these units even if they are not exposed by the kernel. But since you run on a laptop-dedicated chip, the unit might really not exist at the hardware level.
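
The sysfs check described above can be scripted. This is a minimal Python sketch, not part of LIKWID or perf; the function name `pmu_exposed` and its `sysfs_devices` parameter are made up for illustration, and it only tests whether the directory exists.

```python
from pathlib import Path


def pmu_exposed(name, sysfs_devices="/sys/devices"):
    """True if a perf_event PMU directory with this name exists in sysfs."""
    return (Path(sysfs_devices) / name).is_dir()


if __name__ == "__main__":
    # "amd_l3" is the L3 PMU name used in this thread.
    print("amd_l3", "exposed" if pmu_exposed("amd_l3") else "not exposed")
```

If this reports "not exposed", a perf_event build of LIKWID cannot program the CPMC counters, which matches the behaviour described above.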

@tnibler

tnibler commented Oct 10, 2024

/sys/devices/amd_l3 does indeed not exist, but with the msr module and Linux hardening stuff disabled everything seems to work, thank you very much!

Although when using the marker API with -m, I get Cannot access counter register CPMC0 again :/ I'll look into that, but it's still likwid-accessD doing the MSR access, right? (I've setcapped every binary involved, so I don't think it can be that.)

@TomTheBear
Member

Make sure you rebuild LIKWID completely (make distclean && make) after changes to config.mk and ensure at runtime that your application finds the right LIKWID library. I have often seen these issues with multiple LIKWID installations with a wrong pick by the linker at runtime.

If I understand the capabilities system correctly, you have to use setcap on your application. The access daemon inherits the capabilities of your application. I have not played around with capabilities much, but enough to tell most users not to use them: you have to give capabilities to the Lua interpreter (likwid-lua), so every Lua script executed by this interpreter gets the MSR access capabilities.

@tnibler

tnibler commented Oct 10, 2024

Hmm, every binary (and .so?) appearing in strace -f ... | grep exec has been setcapped (too many for comfort) and it still does it. PMC events work, CPMC without marker works so there must be something. But whatever it's fine, it's not super necessary and not worth the security implications as you said. The important stuff does work fine.

What's your policy for adding in more supported models in topology.h then? They're all checked one by one in if statements, so it might get a bit janky to add 20 models per vendor per year.

@TomTheBear
Member

As I said, I do not have much experience with capabilities. All my assumptions might be wrong. If PMC works, it sounds like a different issue. Can you please provide the output of a run with -V 3 (as a file) with the CPMC counters?

The whole topology lookup code was already there when I took over the project. In the meantime, it got quite fat, correct. For some architectures, we create a macro like ARCHGROUP(arch) (((arch) == X) || ((arch) == Y)) to simplify it in the code. Nevertheless, the topology code needs a major update, so there is an opportunity to make it better.
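
One possible direction for such an update is a table-driven lookup instead of chained comparisons. The Python sketch below only illustrates the idea and is not LIKWID code: the group names, the `ARCH_GROUPS` table, and both helper functions are hypothetical, and the two family 19h model numbers are merely the ones mentioned in this thread.

```python
# Hypothetical grouping table: names and model lists are illustrative only
# and are not taken from LIKWID's topology code.
ARCH_GROUPS = {
    "GROUP_A": frozenset({0x44}),
    "GROUP_B": frozenset({0x75}),
}


def arch_group(model):
    """Return the name of the group containing this model number, or None."""
    for name, models in ARCH_GROUPS.items():
        if model in models:
            return name
    return None


def in_group(model, name):
    """Analogue of an ARCHGROUP(arch) macro: a single membership test."""
    return model in ARCH_GROUPS.get(name, frozenset())
```

Adding a model then means extending one table entry rather than touching every comparison site.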

@tnibler

tnibler commented Oct 10, 2024

https://gist.github.com/tnibler/ffdb00f27dfdfaae4522448934053ae1

In order: CPMC, marker (broken) - CPMC no marker (works, just measures the crash b/c run without marker) - PMC with marker (works).

@TomTheBear
Member

And the library used by the application is the one you installed fresh? Or is it linked (RPATH?) with another version? The PMC MarkerAPI run uses perf_event internally (first perfmon_setupCounterThread_zen4, then perfmon_setupCountersThread_perfevent). If the same thing happens for the CPMC MarkerAPI run, the issue is then caused by the missing /sys/devices/amd_l3.

@tnibler

tnibler commented Nov 28, 2024

It's definitely not linking to anything else, there's only the freshly built one available.

"The issue is then caused by the missing /sys/devices/amd_l3"

I guess that's the problem, but there is little we can do about that, right? :/

@TomTheBear
Member

There has to be a second installation/build, because one LIKWID build supports either ACCESSMODE=accessdaemon or ACCESSMODE=perf_event (build-time configuration). In your runs, it starts with one library in accessdaemon mode, but the application is linked to a library with perf_event. Maybe you executed inside a folder where a liblikwid.so is present and, for convenience, you have . in your LD_LIBRARY_PATH? What does ldd <app> tell?
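
A quick way to test the LD_LIBRARY_PATH suspicion is to list the entries that resolve to the current directory. This is a minimal Python sketch with a made-up helper name (`cwd_entries`); it relies on the documented ld.so behaviour that a zero-length entry in a non-empty search path means the current working directory.

```python
import os


def cwd_entries(ld_library_path, cwd=None):
    """Return the LD_LIBRARY_PATH entries that resolve to the current directory."""
    cwd = os.path.abspath(cwd or os.getcwd())
    hits = []
    if not ld_library_path:
        # An unset or empty variable adds no search directories.
        return hits
    for entry in ld_library_path.split(":"):
        # A zero-length entry is treated as the current directory.
        resolved = os.path.abspath(os.path.join(cwd, entry)) if entry else cwd
        if resolved == cwd:
            hits.append(entry)
    return hits


if __name__ == "__main__":
    print(cwd_entries(os.environ.get("LD_LIBRARY_PATH", "")))
```

A non-empty result means a liblikwid.so lying in the working directory could be picked up ahead of the installed one.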

@tnibler

tnibler commented Nov 28, 2024

This is how I'm building it, changing accessdaemon to perf_event, just rebased to the newest likwid version.

ldd:

linux-vdso.so.1 (0x00007ffff7fc4000)
liblikwid.so.5.3 => /nix/store/maibj0q0f96k3k84g6h4mlzw17fmvb7h-likwid-5.3.0/lib/liblikwid.so.5.3 (0x00007ffff6400000)
libstdc++.so.6 => /nix/store/97f3gw9vpyxvwjv2i673isvg92q65mwn-gcc-13.3.0-lib/lib/libstdc++.so.6 (0x00007ffff6000000)
libgcc_s.so.1 => /nix/store/97f3gw9vpyxvwjv2i673isvg92q65mwn-gcc-13.3.0-lib/lib/libgcc_s.so.1 (0x00007ffff7f97000)
libm.so.6 => /nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libm.so.6 (0x00007ffff7eb0000)
libc.so.6 => /nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6 (0x00007ffff5007000)
/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/ld-linux-x86-64.so.2 => /nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib64/ld-linux-x86-64.so.2 (0x00007ffff7fc6000)
liblikwid-hwloc.so.5.3 => /nix/store/maibj0q0f96k3k84g6h4mlzw17fmvb7h-likwid-5.3.0/lib/liblikwid-hwloc.so.5.3 (0x00007ffff7e4f000)
liblikwid-lua.so.5.3 => /nix/store/maibj0q0f96k3k84g6h4mlzw17fmvb7h-likwid-5.3.0/lib/liblikwid-lua.so.5.3 (0x00007ffff7bb8000)
librt.so.1 => /nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/librt.so.1 (0x00007ffff7e48000)
libdl.so.2 => /nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libdl.so.2 (0x00007ffff7e43000)
libpthread.so.0 => /nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libpthread.so.0 (0x00007ffff7e3e000)
libz.so.1 => /nix/store/y6bcc1vzg315zzzpir608zkhnmvp49kc-zlib-1.3.1/lib/libz.so.1 (0x00007ffff7e20000)
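
To pick the LIKWID libraries out of a long ldd listing like the one above, a small filter helps. This is a minimal Python sketch; the function name `resolved_libs` and its `prefix` parameter are made up for illustration, and the sample lines are copied from the listing above.

```python
def resolved_libs(ldd_output, prefix="liblikwid"):
    """Map each library whose name starts with prefix to the path ldd resolved."""
    libs = {}
    for line in ldd_output.splitlines():
        name, arrow, rest = line.strip().partition(" => ")
        if arrow and name.startswith(prefix):
            # Drop the trailing load address, e.g. "(0x00007ffff6400000)".
            libs[name] = rest.rsplit(" (", 1)[0]
    return libs


sample = """\
linux-vdso.so.1 (0x00007ffff7fc4000)
liblikwid.so.5.3 => /nix/store/maibj0q0f96k3k84g6h4mlzw17fmvb7h-likwid-5.3.0/lib/liblikwid.so.5.3 (0x00007ffff6400000)
libz.so.1 => /nix/store/y6bcc1vzg315zzzpir608zkhnmvp49kc-zlib-1.3.1/lib/libz.so.1 (0x00007ffff7e20000)
"""
for name, path in resolved_libs(sample).items():
    print(name, "->", path)
```

If all reported paths share one installation prefix, the application resolves a single LIKWID build at load time.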

@TomTheBear
Member

If only this installation is present, likwid-perfctr and your application (MarkerAPI) have to use perf_event for the measurements.

For accessdaemon mode (and thus the L3 unit with the CPMC counters), there should be one line like this in the -V 3 output:
DEBUG - [access_client_startDaemon_direct:150] Starting daemon
In MarkerAPI mode, the line above should occur twice in the output: once for likwid-perfctr and once for the MarkerAPI.
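
Counting that debug line in a saved -V 3 output can be automated. This is a minimal Python sketch with a made-up helper name (`count_daemon_starts`); it matches on the function name and message only, on the assumption that the source line number in the brackets may differ between versions.

```python
def count_daemon_starts(log_text):
    """Count debug lines that report an access daemon start."""
    return sum(
        1
        for line in log_text.splitlines()
        if "access_client_startDaemon_direct" in line and "Starting daemon" in line
    )


sample = """\
DEBUG - [access_client_startDaemon_direct:150] Starting daemon
some other output
DEBUG - [access_client_startDaemon_direct:150] Starting daemon
"""
print(count_daemon_starts(sample))  # 2
```

Following the description above, a count of 2 would be expected for a MarkerAPI run in accessdaemon mode, and a count of 0 suggests the daemon was never started.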

@tnibler

tnibler commented Nov 28, 2024

Right, sorry, I forgot about that. With the accessdaemon, the trouble is accessing the MSRs. I don't think I ever managed to get it working with any combination of kernel command line, setcap and all that. Unfortunately, I don't have the time to muck around further with my kernel right now, so I'm falling back to an Intel machine for the duration of my thesis.

3 participants