Issue with Python 3.11 and dask[distributed] with high number of threads #116969
Comments
I was able to reproduce this issue locally on 3.11.8. It's a lock-ordering deadlock: Thread 1 holds the GIL and is blocked on the lock acquisition at Line 820 in db85d51, while Thread 2 holds the runtime lock and is trying to execute Line 1408 in db85d51.
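For readers unfamiliar with the pattern, here is a minimal, self-contained illustration of a lock-ordering deadlock of this shape. Plain threading.Lock objects stand in for the GIL and the runtime lock; this is not the actual CPython code involved, only the shape of the bug:

```python
import threading

gil = threading.Lock()           # stand-in for the GIL
runtime_lock = threading.Lock()  # stand-in for the runtime lock

# The barrier guarantees each thread holds its first lock before
# either tries to take the second, making the deadlock deterministic.
barrier = threading.Barrier(2)

def thread1():
    with gil:                    # holds the "GIL"
        barrier.wait()
        runtime_lock.acquire()   # blocks forever: thread2 holds it

def thread2():
    with runtime_lock:           # holds the "runtime lock"
        barrier.wait()
        gil.acquire()            # blocks forever: thread1 holds it

t1 = threading.Thread(target=thread1, daemon=True)
t2 = threading.Thread(target=thread2, daemon=True)
t1.start(); t2.start()
t1.join(timeout=1); t2.join(timeout=1)
print("deadlocked:", t1.is_alive() and t2.is_alive())  # -> deadlocked: True
```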
I think this is not an issue (or at least less of an issue) in 3.12+ because allocating objects won't directly run the GC.
I think this is the same underlying issue as #106883
@colesbury thanks for looking into it. For completeness, I tested it on AArch64 and saw the same behaviour. As suggested in the issue you mentioned, I backported @pablogsal's commit (83eb827), which runs the GC only on the eval breaker, to 3.11 and ran the tests again. I can confirm that no deadlocks happen anymore (on both x86 and AArch64).
@colesbury are you able to share the code snippet to reproduce the bug?
@diegorusso, I followed your instructions to reproduce the issue. I don't have a small reproducer.
Oops, my bad. I thought you had a reproducer with just two threads. I have made the patch, but I'd like to accompany it with some testing.
For completeness, here is the PR I created to fix the issue: #117332. There are ongoing discussions about whether the PR can be accepted, because in theory 3.11.9 was the last bugfix release.
Bug report
Bug description:
I have noticed that the dask benchmark in pyperformance hangs when run with Python 3.11 on machines with a "high" number of cores; I have seen the issue with 191 and 384 cores.
I started investigating the problem and confirmed that the issue manifests itself only on machines with a high number of cores.
The benchmark that hangs is https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_dask/run_benchmark.py
When the Worker class gets instantiated, it sets nthreads to the number of CPUs present on the system (see the Worker code in dask.distributed). When this number is relatively high, Python 3.11 hangs, with all the underlying threads deadlocked on the GIL.
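For illustration, here is a simplified sketch of the kind of setup the benchmark performs, based on the dask.distributed Scheduler/Worker/Client API; it is not the verbatim benchmark code:

```python
import asyncio
import os

from dask.distributed import Client, Scheduler, Worker

async def main():
    async with Scheduler(dashboard_address=None) as scheduler:
        # nthreads defaults to the CPU count, so on a 191- or
        # 384-core machine the worker spawns that many threads.
        async with Worker(scheduler.address, nthreads=os.cpu_count()):
            async with Client(scheduler.address, asynchronous=True) as client:
                futures = client.map(lambda x: x + 1, range(1000))
                await client.gather(futures)

asyncio.run(main())
```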
To replicate the issue:
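For example, with pyperformance installed for the affected interpreter (a sketch of the invocation; the exact flags may differ):

```
python3.11 -m pip install pyperformance
python3.11 -m pyperformance run --benchmarks dask
```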
and wait for it to hang. It hangs at a random point in time.
With the process hung, gdb shows the following on one thread (out of hundreds):
An strace of a thread continuously shows:
I tried upgrading dask[distributed] to the latest version, but I see the same effect, so I think something is going on in Python 3.11 itself.
This happens only with Python 3.11: 3.9 and 3.12 work as expected.
I've seen it on x86; aarch64 is still to be tested.
CPython versions tested on:
3.11
Operating systems tested on:
Linux