Root causing our "weird" crash logs #2759
Comments
I'm genuinely considering investigating what it would take to ship a single release with all of the malloc/zombies/sanitiser debugging features enabled, just to see if the nature of these reports changes (ideally in a way that gives us more information to go on). Edit: Turns out our release builds have partial ASAN enabled, Main Thread Checker, and Zombie Checker, enabled already. |
Since we're apparently already paying the (approx 2x) performance cost of the Address Sanitiser in release builds, #2760 enables some of the additional options. I think we should carry these all for a single release that lasts 1-2 weeks and see what, if any, difference it makes to the quantity/quality of the crash reports we get, then revert the change. Any objections? |
No objections; it will take me a little longer to go through the items above and the list of crashes to see if I have any immediate insights, but as to capturing more detail in crash logs, I'm all for seeing if this helps. |
Coroutines that work with the Hammerspoon modules are a pretty new thing... and even so, I doubt more than a handful of our users are using them yet. And even coroutine lua code runs on the main thread, so until a coroutine yields, which is the proper lua way of transitioning out of a coroutine, a queued event callback can't run.
Unless they're using my Once I got co-routines working, I abandoned my testing with advancing the main event loop -- it introduces too much uncertainty, in my opinion. I'm in agreement that 3 or 4 seem most likely, though. I've also noticed in some of my own debugging that occasionally if invalid data is passed into an Objective-C method, it might be as many as 10 levels deeper in the traceback before the actual crash -- for example passing in a NULL value to a method marked non-nullable -- it's not always the called method but something it calls (or that calls, or...) before the crash occurs. It might be worth biting the bullet and addressing all of the warnings generated when building Hammerspoon... I know most are already set to cause an error during build, but not all are. Would it be worth renaming all of the |
Couple of random thoughts while doing the dishes:
|
Wrapping debug.traceback would at least make it clear if the crash occurs before the callback invokes lua or after... if it's before (or during, though during would put at least one of the |
potentially it does, yes, in the sense that we'd see function names and filenames |
So, I've checked through the codebase and there's only one place we call I did notice that when we do |
Similarly we have some inconsistencies between adding source with either |
Regarding the modes, I vaguely remember uncovering an issue with the mode used for timers when I was working on a menubar replacement (development on it stalled, because we fixed some errors in ours, but I eventually want to get back to it)... my replacement allowed for changes to the menu while it was open but timers didn't fire while the menu was showing -- eventtaps did, though, which is why I looked at the modes and noticed the difference. I'll try to dig up the specific issue number when I get back home tonight, but if I recall my finding correctly, there were no issues when I changed the timer modes during my initial tests... A long-winded way of agreeing we should standardize, but I think we want to standardize on what eventtap uses in this case -- I'll confirm tonight. |
Food for thought... as CommandPost has a consistent Lua codebase (as in, very few, if any users modify the Lua code), it might be worth having a look at the CommandPost Sentry to see if there's any similar crashes between CommandPost and Hammerspoon? CommandPost is pretty much always in-sync with the master Hammerspoon branch too, with the note that I generally update pods more regularly in CommandPost. |
@latenitefilms interestingly, you don't seem to have the same sorts of crashes in CommandPost |
I'm going to put out a release now, to get these updated sanitiser/assert things in the wild, and we can see what reports come in over the next week (assuming it doesn't all explode immediately!) |
Well, that is... odd, given it's the same codebase, and I'm using almost every Hammerspoon extension. |
I take it as a positive sign - it likely means that people are using our API in ways we didn't think of and are triggering weirdness. The challenge now is to figure out what it is. If one or several of them would file GH issues, this would go a lot quicker, I suspect! |
So there's an interesting crash report now: https://sentry.io/organizations/hammerspoon/issues/2271630958/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d It's only happening for one user so far, but the symptoms are super weird. It looks like everything goes wrong at MJLua.m:881, which is trying to save references to the completion word table and evalfn, but the check in luaRef for lua_isnil() (which is actually just lua_type()) fails with an invalid index. The index we ask for is @asmagill does that look right to you? I'm struggling to see how a malformed init.lua could cause this, but it certainly looks like it's possible somehow. |
@cmsj, can't fault your logic... I've come to the same conclusion. The only time I've seen something similar is when making updates to You could check the stack size before In I also noted in |
Agreed on checking the size and types of the stack. I'll get going on that. There are some more crash reports rolling in today, which I'll be diving into later, but I'm also starting to think about how we can encourage the (very few) people who are getting these crashes, to talk to us. I believe Sentry offers some kind of UI for users to send a message with the crash report, so I'll look into that, but if anyone else has suggestions for ways to improve the communication here, I'm all ears :) |
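To make the "check the size and types of the stack" idea concrete, here is a minimal sketch in plain C. It deliberately does not use the real Lua API: `ModelStack`, `stack_type` and `stack_top_matches` are hypothetical names, and the assumption that the two values being saved are a table followed by a function is mine, inferred from the description of the completion word table and evalfn. The one behaviour it borrows from Lua is that asking for the type at an index outside the stack yields a "none" type rather than a value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value tags; T_NONE plays the role of "no value at this index". */
typedef enum { T_NONE = -1, T_NIL, T_TABLE, T_FUNCTION } ValueType;

#define STACK_MAX 16

/* Toy model of an interpreter stack: only the type of each slot is tracked. */
typedef struct {
    ValueType items[STACK_MAX];
    int top; /* number of values currently on the stack */
} ModelStack;

static bool stack_push(ModelStack *s, ValueType t) {
    if (s->top == STACK_MAX) return false;
    s->items[s->top++] = t;
    return true;
}

/* Type at a 1-based (positive) or top-relative (negative) index.
 * Returns T_NONE when the index does not refer to an occupied slot. */
static ValueType stack_type(const ModelStack *s, int idx) {
    int pos = idx > 0 ? idx : s->top + idx + 1;
    if (idx == 0 || pos < 1 || pos > s->top) return T_NONE;
    return s->items[pos - 1];
}

/* Check that the top n values have the expected types, listed bottom-most
 * first. Intended to run before anything pops or references those values. */
static bool stack_top_matches(const ModelStack *s, const ValueType *expected, int n) {
    if (s->top < n) return false;
    for (int i = 0; i < n; i++) {
        if (stack_type(s, i - n) != expected[i]) return false;
    }
    return true;
}
```

In the real code the same shape would presumably be a stack-size check followed by type checks on the two top slots, logging and bailing out instead of handing an invalid index to the reference-taking call.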
Some random thoughts/questions:
|
Also, FYI - I can no longer access Hammerspoon's Sentry account - it just says:
I assume this is because I'm not a member of Hammerspoon on GitHub. I probably don't need access, so all good - but just wanted to let you know. |
Ah, interesting... does this mean I can double the performance of Hammerspoon/CommandPost, simply by disabling the Address Sanitiser? |
It couldn't hurt!
It's actually non-trivial - we'd essentially need to maintain a version of
Certainly our startup process has gotten very complex and fragile. I can definitely see an argument for separating out coresetup from the user's init.lua.
I just asked it to send you a link, but let me know if it still doesn't work. I also added you to the Hammerspoon org on github. I'm pretty sure you're well past the point where that is deserved :)
Turns out I was wrong, the Release scheme only has all those things enabled for Test builds. I'm going to experiment with whether I can produce a full release build with all those things enabled though - even if it's just for a week or two, I want to catch more crashers. The current release has shaken out a few more, from having Lua's assertions fully enabled. |
Ok, so now that 0.9.88 has been out for a few days, with the explicit Sentry events when The log also shows that the watchers' luaSkinUUID is empty, which suggests (although not conclusively) that its The reported events thus far have only been for The only other thing I have in mind is how we could run some tests that create a lot of watchers, cause a lot of events to happen, and reload the config a bunch of times, to try and provoke these crashes more directly, for debugging purposes. Thoughts anyone? :) |
FWIW, I personally very rarely see CommandPost crash during reloads. The main time I see it crash is after waking my laptop from sleep. It'll be running when I put the Mac to sleep, but when I wake it up, it's no longer running. Sentry tells you when the most recent sleep event happened, right? Did you end up implementing a reload counter in the Sentry logs? If not, it might be interesting to see how many times Hammerspoon is reloaded before a crash occurs. |
I don't have a counter yet, and curiously, a lot of the crashes seem to be happening long after a reload (as much as a couple of hours later, in one instance). |
Good people of the Internet, I believe I have finally root-caused at least one variant of these crashes.
I was looking through the 8 hs.audiodevice related instances of That ruled out the idea that these objects were leftovers from a previous Lua instance.
However, as I was looking around the hs.audiodevice code, I looked at The code will run on the next iteration of the Objective C runloop, but that won't be until Lua has finished doing whatever it was already doing, and that led me to the realisation that if Lua is working on stuff that happens to trigger one of these C callbacks, it's also possible that while Lua is then still doing other things, it might decide to garbage collect the object.
This leaves you in a situation where Lua has discarded an object, but Objective C is going to do something with that object on its next runloop iteration. Since Lua doesn't seem to defensively wipe the memory used by a userdata object, it's still likely to be present enough for the callback to be able to do something with the pointer, but it quickly goes off the rails.
I'm not 100% sure if this hs.audiodevice instance is identical to the ones that had been plaguing hs.timer and hs.eventtap, but I can well believe it is, and I guess we'll find out as 0.9.88 rolls out more widely and the Sentry data starts coming in.
For reference, the reproducer is:
    foo = hs.audiodevice.defaultOutputDevice()
    foo:watcherCallback(function(uid, event, selector, scope, element) print(uid, " :: ", event) end)
    foo:watcherStart()
    foo:setVolume(80)
    foo = nil
    collectgarbage()
    collectgarbage()
(It only works if your audio volume is something other than 80% to start with.)
The So, now the hard part comes: Do we switch to |
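The ordering problem described above (event fires, work is queued for the next runloop pass, Lua collects the object, then the queued work runs) can be modelled without Lua or Objective C at all. The sketch below is only an illustration under my own assumptions: `Model`, `WatcherHandle` and the generation counter are hypothetical, and this is not how Hammerspoon guards its callbacks. It shows one possible guard, where the queued work captures a handle that can be recognised as stale instead of a raw pointer into memory the collector may have released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WATCHERS 4
#define MAX_PENDING  8

/* Hypothetical stand-in for a watcher userdata. The generation is bumped on
 * every collection so that handles taken earlier no longer match. */
typedef struct {
    bool     live;
    unsigned generation;
    int      callbacks_run;
} WatcherSlot;

/* What the queued work captures instead of a raw pointer. */
typedef struct {
    size_t   slot;
    unsigned generation;
} WatcherHandle;

typedef struct {
    WatcherSlot   slots[MAX_WATCHERS];
    WatcherHandle pending[MAX_PENDING]; /* work queued for the next runloop pass */
    size_t        pending_count;
} Model;

/* Stands in for creating the userdata. */
static WatcherHandle watcher_new(Model *m, size_t slot) {
    assert(slot < MAX_WATCHERS);
    m->slots[slot].live = true;
    WatcherHandle h = { slot, m->slots[slot].generation };
    return h;
}

/* Stands in for __gc: the object is gone and old handles become stale. */
static void watcher_collect(Model *m, WatcherHandle h) {
    m->slots[h.slot].live = false;
    m->slots[h.slot].generation++;
}

/* Stands in for the C callback: it only queues work, nothing runs yet. */
static bool event_fired(Model *m, WatcherHandle h) {
    if (m->pending_count == MAX_PENDING) return false;
    m->pending[m->pending_count++] = h;
    return true;
}

/* Next runloop pass: run queued work, skipping any stale handle.
 * Returns how many queued items were skipped. */
static int runloop_drain(Model *m) {
    int skipped = 0;
    for (size_t i = 0; i < m->pending_count; i++) {
        WatcherHandle h = m->pending[i];
        WatcherSlot *s = &m->slots[h.slot];
        if (!s->live || s->generation != h.generation) {
            skipped++;
            continue;
        }
        s->callbacks_run++;
    }
    m->pending_count = 0;
    return skipped;
}
```

Calling `watcher_new`, `event_fired`, `watcher_collect` and then `runloop_drain` mirrors the reproducer: the drain sees a stale handle and skips the callback instead of touching the collected object.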
Awesome detective work! Will leave @asmagill to discuss with you best plan of attack, as this is all very above my pay grade. Curious... What's the disadvantage of using |
@latenitefilms it's coming up on half-midnight here, I am not sufficiently brained to reason about that right now, but I believe the original thought behind using |
That makes complete sense, in which case, relying on |
If my reasoning there is correct, then yes, by accident I wrote a debugging feature that ended up indirectly doing what we needed. If that is the decision we come to, I'll rework it so it doesn't log things to Sentry, because we will have agreed that this isn't really a bug, but an expected side-effect of userdata lifecycles, and then we'll need to roll it out to every C->Lua entrypoint. It'll need a new name too, maybe something like |
Don't have the code in front of me at the moment, and I'm about to head out the door, but my first question is are we checking that the callback ref hasn't been cleared by __gc in the async block? That has been sufficient in most of the cases we've run into this runloop queue vs timing issue before. In general async will allow the system to be more responsive, but sync is required if we require feedback from the Lua callback function... Switching everything to |
Ok, took a moment to look at the watcher code and it is checking, but it still creates the skin instance and issues the |
that's a really good point actually, and explains why hs.audiodevice wasn't showing up in the crash reports before, because it does check the ref. hs.timer doesn't. hs.eventtap does though, so I'm digging back through Sentry to try and find some of the crashes there to see where it was going wrong. |
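For anyone following along, the "check the ref in the async block" pattern looks roughly like the sketch below. It is a model in plain C rather than the module code: `WatcherObject` and its functions are hypothetical, and it assumes the C-side object is kept alive by the queued block so that it outlives the Lua userdata. The only real-world detail borrowed is the value of Lua's `LUA_NOREF`, which is -2.

```c
#include <stdbool.h>

#define MODEL_NOREF (-2) /* same value as Lua's LUA_NOREF */

/* Hypothetical C-side watcher object. Assumed to be retained by the queued
 * block, so reading its fields after the userdata is collected is safe. */
typedef struct {
    int callback_ref; /* registry reference to the Lua callback, or MODEL_NOREF */
    int lua_calls;    /* how many times the body went on to call into Lua */
} WatcherObject;

/* Stands in for storing the callback when the user sets it. */
static void watcher_set_callback(WatcherObject *w, int ref) {
    w->callback_ref = ref;
}

/* Stands in for __gc: release the callback and mark the object unusable. */
static void watcher_gc(WatcherObject *w) {
    w->callback_ref = MODEL_NOREF;
}

/* Body of the queued async block. The ref is checked first, before anything
 * else touches Lua state. Returns true if the callback would be invoked. */
static bool watcher_async_body(WatcherObject *w) {
    if (w->callback_ref == MODEL_NOREF) return false;
    w->lua_calls++;
    return true;
}
```

A module that skips the first line of `watcher_async_body` would carry on into Lua with a released reference, which is one plausible reading of why some modules show up in the crash reports and others do not.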
hs.hotkey does check for Edit: On second thought, I won't do that, I've changed enough stuff today that I should wait and test some more before releasing again. |
I'll merge all your changes into CommandPost today, and test it out on my machine, and let you know if anything weird pops up. |
@latenitefilms #2859 and #2860 will be relevant here - the former just renames all the |
Ok, so now that we have data from 0.9.89 it's clear that we didn't fix everything here. There are a few crash reports that started out in hs.hotkey and looking more closely at the code, I understand why and have a rough idea how to fix it. Finally, there are the lingering crashes from hs.webview.toolbar that I suspect are related to garbage collection, so we should think about a smart way to avoid crashing in those scenarios. |
Yeah, apologies for all the CommandPost crash logs. I haven't recompiled yet, so my local CommandPost is crashing every time I reload, because I'm messing around with |
#2867 is an attempt at fixing hs.hotkey. I haven't had a chance to put it through its paces yet, but I did merge some hs.hotkey tests at least. |
I've not yet been able to manually reproduce the hs.hotkey crash, in part because |
Random idea... have you looked into the use of |
Since @asmagill indicated he might look at the Sentry logs soon, I thought we could resurrect the discussion here... There are a few relatively recent crashes which seem to come from Lua's API checks failing. I'm not sure if we've regressed or if users are doing a new thing. But on the whole the crashes are coming from C callbacks jumping into Lua and then somehow a pointer is stale/wrong and things explode. hotkey and eventtap seem to be most common, but I suspect that's because those are the most used callbacks. I thought the LSGCCanary checks would fix these, and while they have avoided some crashes, they haven't had anything close to the impact I was hoping for, so I'm currently out of ideas. Edit: BTW I would recommend going to the Releases view in Sentry and picking 0.9.95 - we still have 30-40% of our users on previous versions, so looking at the raw Issues section will show you their crashes too, many of which we've fixed. |
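As a reminder of what a canary-style check can and cannot catch, here is a small self-contained model. It is not the LSGCCanary implementation: the `Canary` struct and the three functions are hypothetical, and the assumption is simply that each object records the UUID of the Lua instance it was created under and that the record is blanked on teardown.

```c
#include <stdbool.h>
#include <string.h>

#define UUID_LEN 36

/* Hypothetical canary: a copy of the owning instance's UUID. */
typedef struct {
    char uuid[UUID_LEN + 1];
} Canary;

/* Record the UUID of the interpreter instance alive at creation time. */
static Canary canary_create(const char *instance_uuid) {
    Canary c;
    memset(&c, 0, sizeof c);
    strncpy(c.uuid, instance_uuid, UUID_LEN);
    return c;
}

/* Called from the object's teardown path: an empty UUID marks it dead. */
static void canary_destroy(Canary *c) {
    memset(c->uuid, 0, sizeof c->uuid);
}

/* True only if the canary is non-empty and matches the current instance. */
static bool canary_check(const Canary *c, const char *current_uuid) {
    if (c->uuid[0] == '\0') return false;
    return strncmp(c->uuid, current_uuid, UUID_LEN) == 0;
}
```

A check like this rejects objects from a previous Lua instance and objects whose teardown has already run. It says nothing about a pointer that was wrong for some other reason, which may be part of why the impact has been smaller than hoped.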
So, this is not the first time I've decided I'm determined to get to the bottom of why we always have some "weird" crash reports, and it probably won't be the last time, but I'd like to discuss it again anyway.
Tagging @asmagill for his excellent thoughts.
Here are the "weird" crash reports received thus far for 0.9.84:
Here are some evidence points I have so far:
protectedCallAndTraceback. There are a couple of others that happen during the first load of init.lua, but we'll ignore those for now.
_lua_stackguard macros
[LuaSkin sharedWithState:NULL] which guarantees them the main Lua State, they're not coming in on a co-routine state (this kind of crash report predates co-routines anyway AFAIK, but we don't have historical data to back that up). This would also be covered by the LuaSkin UUID check, at least in hs.timer.
So, it seems relatively safe to say that these aren't happening when Lua is being torn down, and their relative consistency makes me suspect they're not stack/heap corruption, although I can't rule that out. I have some hypotheses:
My bet is on 2, 3 or 4, but I don't have any evidence to support that yet.
So....... thoughts? :)