-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[3.11] gh-106883 Fix deadlock in threaded application #117332
Conversation
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. It has been suggested to disable GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
This will break users that have disabled the gc, we need to do the dance only if it has not been disabled |
Also there are return paths for errors where this leaves the gc disabled |
@pablogsal thanks for your comments! Of course this was more an "attempt" to check if the approach was in the right direction and to check if the C API I used are the correct ones. Next week I will produce a more complete patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this is the right general approach. I left some inline comments to expand on what @pablogsal wrote.
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. By disabling the GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls fixes the issue.
@pablogsal @colesbury I've updated the PR addressing your comments and adding a test to verify the patch. Please let me know what you think. |
It is currently failing on the Address Sanitiser build, I'm investigating what the problem is. |
I have pushed a new test for it. It is slower compare to the existent tests: it takes about 10 seconds on my machine (Parallels on MBP) when it succeeds. If it fails (it shouldn't) then it's the timeout of 40 seconds. Thoughts? |
We have a thing to consider which is that 3.11.9 was the last bug fix release of 3.11 and that is already released. I could make an extra bug fix release to include a fix for this PR but technically this is too late. I will check with the other release managers to see what they think |
Sure, no problem. Whatever the outcome is, it has been entertaining debugging it :) |
I just want to give a gentle ping on this PR. What will be its fate? Merge or close? :) Joking apart, it would be nice to have a closure. |
Unfortunately unless you want to argue that this is a security fix, we cannot backport bug fixes anymore to 3.11. If this issue is something that's affecting a considerable number of users I am willing to reconsider, though. But in any case the release will be folded with the next security release of 3.11. |
Hello, I don't think I can argue that this is a security fix (it isn't :) ) but also I left a comment on dask/distributed#8616 issue as they bumped into the same problem. I requested them to comment directly here: if they have found a way to workaround the issue, we can just close the PR otherwise we need to understand what the blockers are and what's the magnitude of the issue. |
When using threaded applications, there is a high risk of a deadlock in the intepreter.
It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL.
It has been suggested to disable GC during the
_PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
sys._current_frames
#106883