Catalina enable-hdmi20 CoreDisplay patch leads to Code Signing crash of WindowServer #1335

Closed
lambdaupb opened this issue Nov 28, 2020 · 26 comments

Comments

@lambdaupb

see appleserial/DeskMini#10

DeskMini 310, i5-8500 UHD630, Catalina 10.15.7, OpenCore 0.6.3

related code (probably): https://github.com/acidanthera/WhateverGreen/blob/7d30dd8a624d0d3b2d4882fcc689b9db4964efd5/WhateverGreen/kern_cdf.cpp#L182


enable-hdmi20 patches CoreDisplay at runtime.
Under high memory pressure, the patched CoreDisplay library pages apparently get moved to swap.

When those pages are loaded back into RAM, a code-signing check is performed and fails, which crashes WindowServer.

I can reproduce this by running Prime95 > Torture Test > Large FFTs, which allocates almost all of system memory, and then using UI features that involve animations for about a minute.

Possible fixes

logs

Process:               WindowServer [5465]
Path:                  /System/Library/PrivateFrameworks/SkyLight.framework/Versions/A/Resources/WindowServer
Identifier:            WindowServer
Version:               600.00 (451.4)
Code Type:             X86-64 (Native)
Parent Process:        launchd [1]
Responsible:           WindowServer [5465]
User ID:               88

PlugIn Path:             /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
PlugIn Identifier:       com.apple.CoreDisplay
PlugIn Version:          1.0 (186.6.15)

Date/Time:             2020-11-16 19:09:29.410 +0100
OS Version:            Mac OS X 10.15.7 (19H15)
Report Version:        12
Anonymous UUID:        066D0EDF-3DB8-4976-B736-5BD0416F165D

Sleep/Wake UUID:       E94190B2-19CB-47AB-B1AE-97DCA13B6988

Time Awake Since Boot: 150000 seconds
Time Since Wake:       100000 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00007fff347d72d9
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2

kernel messages:

VM Regions Near 0x7fff347d72d9:
    __TEXT                 00007fff347b8000-00007fff347d7000 [  124K] r-x/r-x SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
--> __TEXT                 00007fff347d7000-00007fff347d8000 [    4K] r-x/rwx SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
    Submap                 00007fff347d8000-00007fff40000000 [184.2M] r--/rwx SM=PRV  process-only VM submap

Application Specific Information:
StartTime:2020-11-16 18:31:50
GPU:IG
MetalDevice for accelerator(0x312b): 0x7ff210d29038 (MTLDevice: 0x7ff1e8048000)
IOService:/AppleACPIPlatformExpert/PCI0@0/AppleACPIPCI/IGPU@2/AppleIntelFramebuffer@0
2020-11-17 01:00:58.772582+0100  localhost kernel[0]: CODE SIGNING: process 241[WindowServer]: rejecting invalid page at address 0x7fff330bf000 from offset 0xcfb7000 in file "/private/var/db/dyld/dyld_shared_cache_x86_64h" (cs_mtime:1605366281.472771946 == mtime:1605366281.472771946) (signed:0 validated:0 tainted:0 nx:0 wpmapped:0 dirty:1 depth:2)
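The kernel message above packs several page attributes into one line. A minimal sketch of pulling them apart follows; the helper name `parse_rejection` and the regular expression are illustrative assumptions, not part of any macOS tooling.

```python
import re

# Kernel message copied from the log above (timestamp and host prefix omitted).
LINE = (
    'CODE SIGNING: process 241[WindowServer]: rejecting invalid page at address '
    '0x7fff330bf000 from offset 0xcfb7000 in file '
    '"/private/var/db/dyld/dyld_shared_cache_x86_64h" '
    '(cs_mtime:1605366281.472771946 == mtime:1605366281.472771946) '
    '(signed:0 validated:0 tainted:0 nx:0 wpmapped:0 dirty:1 depth:2)'
)

# Hypothetical pattern written for this message shape only.
PATTERN = re.compile(
    r'process (?P<pid>\d+)\[(?P<name>[^\]]+)\]: rejecting invalid page '
    r'at address (?P<address>0x[0-9a-f]+) from offset (?P<offset>0x[0-9a-f]+) '
    r'in file "(?P<file>[^"]+)".*\((?P<flags>signed:[^)]*)\)'
)


def parse_rejection(line):
    """Return the fields of a 'rejecting invalid page' message, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    info["address"] = int(info["address"], 16)
    info["offset"] = int(info["offset"], 16)
    # Turn "signed:0 validated:0 ..." into a dict of integers.
    info["flags"] = {
        key: int(value)
        for key, value in (pair.split(":") for pair in info["flags"].split())
    }
    return info


info = parse_rejection(LINE)
print(info["name"], hex(info["address"]), info["flags"])
```

The parsed flags show the page as dirty and neither signed nor validated, which appears consistent with a page that was modified in memory after it was mapped.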
@lambdaupb
Author

/*
 * The MAP_RESILIENT_* flags can be used when the caller wants to map some
 * possibly unreliable memory and be able to access it safely, possibly
 * getting the wrong contents rather than raising any exception.
 * For safety reasons, such mappings have to be read-only (PROT_READ access
 * only).
 *
 * MAP_RESILIENT_CODESIGN:
 * 	accessing this mapping will not generate code-signing violations,
 *	even if the contents are tainted.
 * MAP_RESILIENT_MEDIA:
 *	accessing this mapping will not generate an exception if the contents
 *	are not available (unreachable removable or remote media, access beyond
 *	end-of-file, ...).  Missing contents will be replaced with zeroes.
 */
#define MAP_RESILIENT_CODESIGN	0x2000 /* no code-signing failures */
#define MAP_RESILIENT_MEDIA	0x4000 /* no backing-store failures */

It seems that this only works for read-only mappings.
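The restriction described in the header comment can be written as a small predicate. This is only an illustration of the stated rule, not the kernel's actual check: the helper name `resilient_mapping_allowed` is hypothetical, and the PROT_* values are the conventional POSIX ones rather than something taken from the excerpt.

```python
# Flag values copied from the header excerpt above.
MAP_RESILIENT_CODESIGN = 0x2000
MAP_RESILIENT_MEDIA = 0x4000

# Conventional POSIX protection bits (assumed, not part of the excerpt).
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4


def resilient_mapping_allowed(prot, flags):
    """Hypothetical helper: True if prot is compatible with MAP_RESILIENT_* flags."""
    resilient = flags & (MAP_RESILIENT_CODESIGN | MAP_RESILIENT_MEDIA)
    if not resilient:
        return True  # the rule only constrains resilient mappings
    return prot == PROT_READ  # "PROT_READ access only"


print(resilient_mapping_allowed(PROT_READ, MAP_RESILIENT_CODESIGN))
print(resilient_mapping_allowed(PROT_READ | PROT_EXEC, MAP_RESILIENT_CODESIGN))
```

An executable code page needs both read and execute access, so under this rule it would not qualify for a resilient mapping.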

@vit9696
Contributor

vit9696 commented Nov 30, 2020

That's very interesting, but I believe we cannot really remap things here. Instead we should adjust the codesign flags as we already do, but perhaps in a slightly different manner. It is possible that I missed some for the latest 10.15 version. Could you play with it and try setting/dropping different flags?

CC @usr-sse2 @osy86 @lvs1974 @07151129

@al3xtjames

al3xtjames commented Dec 12, 2020

Can easily reproduce on 10.14.6 here: run P95 large FFTs until some swapping occurs, and then try to open About This Mac. This should cause WindowServer to crash.

sudo sysctl vm.cs_debug=255 adds some more info:

2020-12-11 19:35:59.509 Df kernel[0:1f4918] vm_fault: signed: no validate: no tainted: no wpmapped: no prot: 0x5
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODE SIGNING: cs_invalid_page(0x7fff3ad17000): p=38037[WindowServer]
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODE SIGNING: cs_invalid_page(0x7fff3ad17000): p=38037[WindowServer] final status 0x23007b01, denying page sending SIGKILL
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODE SIGNING: process 38037[WindowServer]: rejecting invalid page at address 0x7fff3ad17000 from offset 0xb89e000 in file "/private/var/db/dyld/dyld_shared_cache_x86_64h" (cs_mtime:1605723499.64038983 == mtime:1605723499.64038983) (signed:0 validated:0 tainted:0 nx:0 wpmapped:0 dirty:1 depth:2)
2020-12-11 19:35:59.509 Df kernel[0:1f4918] CODESIGNING: vm_fault_enter(0x7fff3ad17000): *** INVALID PAGE ***

"sending SIGKILL" means that CS_KILL was set (note that cs_invalid_page hasn't changed in 10.15).
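The "final status 0x23007b01" value in the log can be split into individual flags to check this. The sketch below uses a subset of flag values as recalled from XNU's cs_blobs.h; treat the table as an assumption and verify it against the kernel source for your macOS version. The helper name `decode_cs_status` is hypothetical.

```python
# Subset of code-signing status flags, as recalled from XNU's cs_blobs.h.
# Assumption: verify these values against the kernel source you target.
CS_FLAGS = {
    0x00000001: "CS_VALID",
    0x00000100: "CS_HARD",
    0x00000200: "CS_KILL",
    0x00000800: "CS_RESTRICT",
    0x00001000: "CS_ENFORCEMENT",
    0x00002000: "CS_REQUIRE_LV",
    0x00004000: "CS_ENTITLEMENTS_VALIDATED",
    0x01000000: "CS_KILLED",
    0x02000000: "CS_DYLD_PLATFORM",
    0x20000000: "CS_SIGNED",
}


def decode_cs_status(status):
    """Return (flag names set in status, leftover bits not in the table)."""
    names = [name for bit, name in sorted(CS_FLAGS.items()) if status & bit]
    known = sum(bit for bit in CS_FLAGS if status & bit)
    return names, status & ~known


names, unknown = decode_cs_status(0x23007B01)
print(names)
print(hex(unknown))
```

With this table, both CS_KILL and CS_KILLED show up in the decoded status, which matches the SIGKILL seen in the log.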

@lvs1974

lvs1974 commented Dec 12, 2020

@al3xtjames: try adding the boot argument -liluuseroff.

@vit9696
Contributor

vit9696 commented Dec 13, 2020

@al3xtjames @lambdaupb could you check whether the offset found by UserPatcher::vmProtect is correct? Because it clearly strips CS_KILL from the process.

@lambdaupb
Author

I'm not a C programmer and have no real idea how to do that.
If I'm given step-by-step instructions, I can reproduce this though.

This machine is my daily driver at the moment so I'm reluctant to dive into it since my issue was solved by removing the enable-hdmi20 setting.

@vit9696
Contributor

vit9696 commented Dec 13, 2020

The easiest test is to enable Lilu debug logging and create a debug log in /var/log/Lilu_x.x.x.txt via the -liludbgall liludump=60 boot arguments. Upload it here, and perhaps it will shed some light on the issue.

@al3xtjames

al3xtjames commented Dec 15, 2020

Lilu is using 308 as the offset for p_csflags.
Lilu_1.5.1_18.7.txt

@stevezhengshiqi

stevezhengshiqi commented Dec 16, 2020

@al3xtjames thanks a lot for the CoreDisplay fix in WEG. Would you mind providing some more information about the max-pixel-clock-frequency value? If you have time to update the Manual in WEG, that would be great.

@zearp

zearp commented Dec 20, 2020

I tried to reproduce on my NUC but couldn't. The system becomes laggy but not unresponsive, and it doesn't crash or even overheat. CPU usage went up and down; I guess that's part of the Large FFT torture test? I left it running for about 10 minutes whilst browsing GitHub and opening/closing the About This Mac dialog every now and then. My config can be found here.

As I mentioned here, I believe these forced logouts on NUC 8th gens are due to missing ACPI patches and/or the OpenCore configuration used. But that's just my guess since I have no issues and run multiple NUCs. I stress tested them quite heavily with stress-ng a few months ago. No problems whatsoever, these Kaby Lake NUCs are rock solid with OpenCore for me.

I'm running the latest versions of OpenCore/Lilu/etc and compiling everything from source now but also had no problems when I didn't do that and just used the release versions. Are there any other ways for me to try and reproduce this?

Screenshot 2020-12-20 at 13 41 30

@lambdaupb
Author

@zearp thank you for your attempt at reproducing this issue!

I think you have SIP disabled with

<key>csr-active-config</key>
<data>/wcAAA==</data>

where /wcAAA== (Base64) decodes to ff 07 00 00 in hex, which according to Dortania
https://dortania.github.io/OpenCore-Install-Guide/troubleshooting/extended/post-issues.html#disabling-sip

disables all of SIP on Mojave / Catalina.

So code-signing enforcement would be disabled and WindowServer would not be killed.
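The Base64 decoding can be checked in a few lines. This is a minimal sketch; the helper name `decode_csr_active_config` is hypothetical, and reading the four bytes as a little-endian 32-bit bitmask is an assumption about how the value is interpreted.

```python
import base64


def decode_csr_active_config(b64_value):
    """Decode a plist <data> value into an integer bitmask (little-endian assumed)."""
    raw = base64.b64decode(b64_value)
    return int.from_bytes(raw, byteorder="little")


raw = base64.b64decode("/wcAAA==")
print(raw.hex(" "))
print(hex(decode_csr_active_config("/wcAAA==")))
```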

@zearp

zearp commented Dec 20, 2020

@lambdaupb Good point! I have it disabled cuz I use VoltageShift. I just repeated the test with SIP enabled. It did run a little hotter, but after ~10 minutes of running Prime95 and opening About This Mac and Launchpad/Notification Centre a bunch of times I didn't get any crash. The fading animation varies from smooth to choppy but nothing grinds to a halt.

I'm thinking that the logouts people experienced on the NUC may have nothing to do with this, which is why I can't reproduce. Unless it also happens to you on a NUC but it seems you're using a different mini computer. I'm only here cuz you mentioned this in a NUC issue I was still subscribed to haha. But I can't seem to reproduce it on my NUCs.

@lambdaupb
Author

lambdaupb commented Dec 20, 2020

@zearp I have little experience with that setting, but could you check whether SIP is really enabled? The Dortania guide mentions that it will not overwrite old values in NVRAM unless the property is listed in the Delete section as well.

Note: Disabling SIP with OpenCore is quite a bit different compared to Clover, specifically that NVRAM variables will not be overwritten unless explicitly told so under the Delete section. So if you've already set SIP once either via OpenCore or in macOS, you must override the variable:

NVRAM -> Block -> 7C436110-AB2A-4BBB-A880-FE41995C9F82 -> csr-active-config

@zearp

zearp commented Dec 20, 2020

@lambdaupb Yes, it was really enabled. I checked with csrutil status after rebooting and reset NVRAM in between boots for good measure. I was also prompted with a bunch of security warnings; those are due to VoltageShift, Intel Power Gadget and some other kexts I use. So my guess is that it's really turned on. Does this happen to you on a Kaby Lake NUC too or only on your DeskMini?

@lambdaupb
Author

My DeskMini has a Coffee Lake R (I think) i5-8500 CPU.

There might be something else going on as well.
The crash report of WindowServer clearly shows a code signing crash on the NUC

appleserial/NUC8I5BEH#13

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (Code Signature Invalid)
Exception Codes:       0x0000000000000032, 0x00007fff37028253
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace CODESIGNING, Code 0x2

kernel messages:

VM Regions Near 0x7fff37028253:
    __TEXT                 00007fff37009000-00007fff37028000 [  124K] r-x/r-x SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
--> __TEXT                 00007fff37028000-00007fff37029000 [    4K] r-x/rwx SM=COW  /System/Library/Frameworks/CoreDisplay.framework/Versions/A/CoreDisplay
    Submap                 00007fff37029000-00007fff40000000 [143.8M] r--/rwx SM=PRV  process-only VM submap

So the issue exists and is fixed by removing enable-hdmi20 for me on 10.15 and @al3xtjames on 10.14.

It might very well be a combination with another setting or ACPI patch that triggers it though.
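To tie the faulting address in the quoted report to its VM region, the page base can be computed directly. This is a small sketch using the numbers from the report above; the helper names are hypothetical, and the 4 KiB page size is taken from the [4K] region size shown there.

```python
PAGE_SIZE = 0x1000  # 4 KiB, matching the [4K] region in the report


def page_base(address, page_size=PAGE_SIZE):
    """Round an address down to the start of its page."""
    return address & ~(page_size - 1)


def region_containing(address, regions):
    """Return the (start, end, max_prot) tuple that contains address, if any."""
    for start, end, max_prot in regions:
        if start <= address < end:
            return (start, end, max_prot)
    return None


# CoreDisplay __TEXT regions and their maximum protections, from the report.
REGIONS = [
    (0x7FFF37009000, 0x7FFF37028000, "r-x"),
    (0x7FFF37028000, 0x7FFF37029000, "rwx"),
]
FAULT = 0x7FFF37028253  # second value in "Exception Codes"

print(hex(page_base(FAULT)))
print(region_containing(FAULT, REGIONS))
```

The fault falls in the single 4K page whose maximum protection is rwx rather than r-x, which is the region the report marks with an arrow.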

@zearp

zearp commented Dec 20, 2020

It might very well be a combination with another setting or ACPI patch that triggers it though.

@lambdaupb Yeah, that's my guess too. What I will do is try the EFI from the repo you linked and report back in a bit. When I wrote Kaby Lake I meant Coffee Lake of course. I'm a pro at messing up those Intel codenames, sorry for any confusion it may have caused.

@lambdaupb
Author

Thanks for the help. I will try to reproduce this issue with OpenCore updated to 0.6.4 and all other modules updated as well.

@vit9696
Contributor

vit9696 commented Dec 20, 2020

Let me be clear:

  • The issue does exist and is specific to Lilu user patcher
  • Disabling SIP may hide the issue, but is not recommended
  • @al3xtjames provided an alternative to CDF patches
  • Lilu user patcher is not supported on 11.x, and that is unlikely to change (so the issue is unlikely to be fixed)

@zearp

zearp commented Dec 20, 2020

@lambdaupb Just ran the same tests using the EFI from the repo you linked and again no crashes, SIP is enabled and the hdmi setting too. I'm thinking these random logouts people experienced on the NUC have nothing to do with this issue, which would explain my failure to reproduce it. But it doesn't mean there is no issue of course. I don't have a DeskMini 310 to play with but it looks like a fun little machine so I hope you can get this sorted.

The issue with the WindowServer crash you linked seems to have been solved by a comment on a linked blog, but I can't read it because the comments are not loading for me for some reason. I've not done any upgrading from 10.14.x to 10.15.x and have only ever used Catalina and Big Sur on my NUCs. Maybe those crashes were related to the upgrade or something else in their setup? I think this specific issue isn't present on the NUC Coffee Lake models, but do let me know if there's anything else I can try.

@likaci

likaci commented Jan 1, 2021

@lambdaupb Just ran the same tests using the EFI from the repo you linked and again no crashes, SIP is enabled and the hdmi setting too. I'm thinking these random logouts people experienced on the NUC have nothing to do with this issue, which would explain my failure to reproduce it. But it doesn't mean there is no issue of course. I don't have a DeskMini 310 to play with but it looks like a fun little machine so I hope you can get this sorted.

The issue with the WindowServer crash you linked seems to be solved by a comment on a blog thats linked but I can't read the comment because the comments are not loading for me for some reason. I've not done any upgrading from 10.14.x to 10.15.x and only ever used Catalina and Big Sur on my NUCs. Maybe those crashes were related to the upgrade or something else in their setup? I think this specific issue isn't present on the NUC Coffee Lake models but do let me know if there's anything else I can try.

@zearp Hi,
I can reproduce the WindowServer crash with your EFI and with the EFI from https://github.com/appleserial/NUC8I5BEH by running "Large FFTs".
My NUC was upgraded from 10.14.
Can you post the blog link?
Thank you.

@zearp

zearp commented Jan 1, 2021

@likaci You can’t follow the link I referred to and find the blog post yourself? Please don't quote an entire post to only add a sentence.

Try if you can also reproduce it on a system that wasn’t upgraded from 10.14.x because no matter how long I let it run I get no crashes and I directly installed Catalina on mine.

I don’t have a 10.14.x installer laying around to do a clean install with and then upgrade to Catalina but I might try for the fun of it and see if I get crashes that way.

@likaci

likaci commented Jan 1, 2021

@zearp Sorry for disturbing you, and for my bad English.
I have read the entire page but can't find the link that mentions that upgrading from 10.14 may cause the problem.

I have only one NUC, and it is running some services, so I can't reinstall it.
I confirmed that disabling SIP or disabling HDMI 2.0 avoids the problem.

Thank you for your help, Happy new year.

@Sher1ocks

Sher1ocks commented Mar 21, 2021

Screenshot 2021-03-21 at 23 50 30
I also had this problem in Big Sur.
On my Skylake laptop, the CPU frequency stayed at 1.5 GHz or above and the machine constantly overheated, leading to poor performance.
It was resolved by turning off the enable-hdmi20 option.
Thank you for the tip!

@vit9696
Contributor

vit9696 commented Mar 28, 2021

enable-hdmi20 is deprecated in favour of the max-pixel-clock feature (acidanthera/WhateverGreen#79). Although the issue is not exclusive to the CDF side of WEG, userspace patching is implemented differently on Big Sur and above, and is not affected by this issue. I no longer use Catalina or older, and thus decided not to address this issue. Closing.

@zearp

zearp commented Mar 30, 2021

Does this mean that enable-max-pixel-clock-override replaces the enable-hdmi20 option? Will the option stay or will it be removed in future builds?

Because at the moment removing enable-hdmi20 and replacing it with enable-max-pixel-clock-override breaks 4k on Catalina and earlier.

It seems it's not doing the same thing as the enable-hdmi20 option did. But I may have misunderstood and/or not implemented it properly.

@vit9696
Contributor

vit9696 commented Mar 30, 2021

You may need higher max-pixel-clock-frequency (in Hz, defaults to 675000000). https://github.com/acidanthera/WhateverGreen/blob/master/Manual/FAQ.IntelHD.en.md#hdmi-in-uhd-resolution-with-60fps
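For a rough idea of what value a given mode needs, the pixel clock can be estimated from the total (active plus blanking) timings. This is a sketch under stated assumptions: the helper name is hypothetical, and the 4400 x 2250 totals are the commonly cited CTA-861 timing for 3840x2160 at 60 Hz, so check the timing your display actually reports.

```python
DEFAULT_MAX_PIXEL_CLOCK_HZ = 675_000_000  # default quoted in the comment above


def pixel_clock_hz(h_total, v_total, refresh_hz):
    """Pixel clock = total pixels per frame (including blanking) x refresh rate."""
    return h_total * v_total * refresh_hz


# Assumed example timing: 3840x2160 at 60 Hz with 4400 x 2250 total pixels.
clock = pixel_clock_hz(4400, 2250, 60)
print(clock, clock <= DEFAULT_MAX_PIXEL_CLOCK_HZ)
```

Displays that use different blanking intervals or higher refresh rates will produce a different figure, so the result for this one timing says nothing about other modes.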
