stall-detector: Try hard not to crash while collecting backtrace #2420
static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept {
    if (sigsetjmp(stall_detector_env, 0)) {
        buf.append(" ¯\\_(ツ)_/¯\n");
+1
Sometimes the stall-detector signal arrives in the middle of exception handling. If a stall is detected, stack unwinding starts in order to collect the stalled backtrace. Since exception handling unwinds the stack as well, the two unwinders need to cooperate carefully, which is not guaranteed (spoiler: they don't cooperate carefully). In an unlucky case a segmentation fault happens and the app is killed with SEGV.
This patch lets the stall detector bail out if SEGV arrives while it is collecting the backtrace, and still print a minimal yet detailed enough stall report.
Signed-off-by: Pavel Emelyanov <[email protected]>
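The bail-out described above can be sketched roughly as follows. This is a minimal standalone illustration, not the patch itself: `bail_out_env`, `collecting`, `on_segv`, `risky_collect`, `collect_guarded` and `install_segv_handler` are hypothetical names, the fault is simulated with `raise(SIGSEGV)` instead of a real bad access, and the sketch saves the signal mask in `sigsetjmp` (second argument `1`), which differs from the `sigsetjmp(stall_detector_env, 0)` call in the diff.

```cpp
#include <setjmp.h>
#include <signal.h>

// Hypothetical names for illustration only; they are not Seastar's.
static sigjmp_buf bail_out_env;
static volatile sig_atomic_t collecting = 0;

// SIGSEGV handler: if the fault happened while we were collecting a
// backtrace, jump back to the guard point instead of dying.
extern "C" void on_segv(int sig) {
    if (collecting) {
        collecting = 0;
        siglongjmp(bail_out_env, 1);
    }
    // Not our fault window: restore the default action and re-raise.
    signal(sig, SIG_DFL);
    raise(sig);
}

// Stand-in for a backtrace walk that may fault (assumption: the real
// unwinder crash is simulated with raise()).
static void risky_collect(bool crash) {
    if (crash) {
        raise(SIGSEGV);
    }
}

// Returns true if collection completed, false if it was abandoned
// because SIGSEGV arrived in the middle of it.
static bool collect_guarded(bool crash) {
    // savemask = 1 so that siglongjmp() also restores the signal mask,
    // which unblocks SIGSEGV after we jump out of its handler.
    if (sigsetjmp(bail_out_env, 1)) {
        return false;  // arrived here via siglongjmp()
    }
    collecting = 1;
    risky_collect(crash);
    collecting = 0;
    return true;
}

static void install_segv_handler() {
    struct sigaction sa = {};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
}
```

After `install_segv_handler()`, `collect_guarded(false)` completes normally, while `collect_guarded(true)` returns `false` instead of killing the process. Jumping out of a handler this way skips whatever cleanup the interrupted code had pending, which is the concern raised later in the review.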
Doesn't solve the problem entirely, since SIGSEGV isn't the only possible symptom (you could get an infinite loop for example, why not), but I guess it prevents a crash in the cases it's enough (which is probably a great majority of cases), and doesn't hurt in the others, so why not.
        goto out;
    }
    in_stall_detector = true;
To be technically correct, we need an std::atomic_signal_fence(std::memory_order_relaxed). This prevents a magical compiler from delaying the write to memory because no one reads it.
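A rough sketch of the flag-plus-fence pattern this comment refers to, under stated assumptions: `in_collector`, `observed`, `probe_handler` and `flag_seen_by_handler` are hypothetical names, the flag is modeled as a `std::atomic<bool>`, and the signal is delivered synchronously with `raise()`. One detail worth checking against the standard: a fence with `memory_order_relaxed` is specified to have no effect, so the sketch uses `memory_order_seq_cst` for the fence instead.

```cpp
#include <atomic>
#include <signal.h>

// Hypothetical stand-in for the in_stall_detector flag.
static std::atomic<bool> in_collector{false};
// What the handler saw: -1 = handler never ran, 0 = flag clear, 1 = flag set.
static std::atomic<int> observed{-1};

extern "C" void probe_handler(int) {
    bool flag = in_collector.load(std::memory_order_relaxed);
    // Pairs with the fence in the interrupted code (same thread only).
    std::atomic_signal_fence(std::memory_order_seq_cst);
    observed.store(flag ? 1 : 0, std::memory_order_relaxed);
}

static int flag_seen_by_handler() {
    signal(SIGUSR1, probe_handler);

    in_collector.store(true, std::memory_order_relaxed);
    // Compiler-only barrier: the store above may not be moved past this
    // point, so a handler that interrupts this thread afterwards sees it.
    // No CPU fence instruction is required for same-thread signal handlers.
    std::atomic_signal_fence(std::memory_order_seq_cst);

    raise(SIGUSR1);  // the handler runs before raise() returns

    in_collector.store(false, std::memory_order_relaxed);
    return observed.load(std::memory_order_relaxed);
}
```

`atomic_signal_fence` only constrains ordering between a thread and a signal handler running in that same thread, which matches the stall-detector case where the SIGSEGV handler interrupts the code that set the flag.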
    reactor::test::set_stall_detector_crash_collecting_backtrace();
    engine().update_blocked_reactor_notify_ms(100ms);
    spin(500ms);
}
Did you also reproduce the crash during unwinding? It's not given that siglongjmp is a safe way to unwind. If the unwinder takes a lock, it will leak it (though I'm guessing it doesn't).
Did you also reproduce the crash during unwinding?
In labs -- unfortunately, no :(
It's not given that siglongjmp is a safe way to unwind.
Yes, sure, at this point the situation is already screwed up, and it's questionable whether these tricks make things even worse or not.
Perhaps we can override __cxa_throw and whatever function it uses to exit unwinding (but maybe there isn't one), and call them via RTLD_NEXT. Then we can set flags when unwinding is in progress, and just avoid going into the stall detector again (or perhaps: ask the stall detector to run on the exit path of __cxa_throw).
I don't think it will work.
Also, tracing exception throwers is important.
Perhaps we can override __cxa_throw and whatever function it uses to exit unwinding (but maybe there isn't one)
There isn't one.
Maybe have a blacklist of functions that are known to crash. Every time we see a crash, add the triggering function to the blacklist. In a few short years we'll have a robust filter.