
Issue with Python 3.11 and dask[distributed] with high number of threads #116969

Open
diegorusso opened this issue Mar 18, 2024 · 7 comments
Labels: type-bug (An unexpected behavior, bug, or error)

Comments

@diegorusso
Contributor

diegorusso commented Mar 18, 2024

Bug report

Bug description:

I have noticed that the dask benchmark in pyperformance hangs when run with Python 3.11 on machines with a "high" number of cores. I have seen the issue with 191 and 384 cores.

I started investigating the problem and saw that the issue only manifests itself on machines with a high number of cores.
The benchmark that hangs is https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_dask/run_benchmark.py

When the Worker class gets instantiated, it sets nthreads to the number of CPUs present on the system (here the code).

When this number is relatively high, it causes Python 3.11 to hang, with all the underlying threads deadlocking on the GIL.

To replicate the issue:

  • make a copy of the dask benchmark file
  • set the nthreads of the Worker class to a relatively high number (e.g. 1000):
async with Worker(scheduler.address, nthreads=1000):
...
  • Create/activate a venv with Python 3.11 and install the dependencies
pip install dask[distributed]==2022.2.0 pyperf
  • Run a quick stress test
while true; do python run_benchmark.py; done 

and wait for it to hang. It happens at a random time. (A self-contained sketch of these steps is shown below.)
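For reference, here is a minimal sketch that puts the steps above together. It is loosely based on the pyperformance bm_dask benchmark rather than copied from it, so the exact Scheduler keyword arguments (port, dashboard_address) may differ between distributed releases:

import asyncio

from dask.distributed import Client, Scheduler, Worker, wait


def inc(x):
    return x + 1


async def run_once():
    # Scheduler arguments are an assumption; adjust for your distributed version.
    async with Scheduler(port=0, dashboard_address=None) as scheduler:
        # Forcing a very high thread count is what triggers the hang on 3.11.
        async with Worker(scheduler.address, nthreads=1000):
            async with Client(scheduler.address, asynchronous=True) as client:
                futures = client.map(inc, range(100))
                await wait(futures)


if __name__ == "__main__":
    asyncio.run(run_once())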

With the process hanging, gdb shows the following on one thread (out of hundreds):

 (gdb) thread 4
[Switching to thread 4 (Thread 0x7f5aeffff640 (LWP 402351))]
#0  __futex_abstimed_wait_common64 (private=-1457409528, cancel=true, abstime=0x7f5aefffde20, op=137, expected=0, futex_word=0x5640a959d354 <_PyRuntime+436>) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) py-bt
Traceback (most recent call first):
  Waiting for the GIL
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_common.py", line 788, in open_binary
    return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_pslinux.py", line 1967, in memory_info
    with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/__init__.py", line 1102, in memory_info
    return self._proc.memory_info()
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/psutil/_common.py", line 495, in wrapper
    return fun(self)
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/distributed/utils_perf.py", line 188, in _gc_callback
    rss = self._proc.memory_info().rss
  <built-in method _current_frames of module object at remote 0x7f5dc0a32ca0>
  File "/home/ent-user/venv/cpython3.11-324490c70469-compat-2d3356be745c/lib/python3.11/site-packages/distributed/profile.py", line 270, in _watch
    frame = sys._current_frames()[thread_id]
  File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/ent-user/ci-scripts/tmpdir/prefix/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()

An strace of one of the threads shows (continuously):

...
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=468031783}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=473122144}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=478228035}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=483319687}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=488417438}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=493521779}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=498608771}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=503711922}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=508813993}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=513919325}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55707e87f358, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55707e87f350, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=6498067, tv_nsec=519022166}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
...

I tried upgrading dask[distributed] to the latest version but saw the same behaviour. I think there is something going on in Python 3.11.
This happens only with Python 3.11: 3.9 and 3.12 work as expected.

I've seen it on x86; aarch64 is still to be tested.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

@diegorusso added the type-bug label Mar 18, 2024
@colesbury
Contributor

colesbury commented Mar 18, 2024

I was able to reproduce this issue locally on 3.11.8. It's a lock-ordering deadlock between HEAD_LOCK(&_PyRuntime) and the GIL.

Thread 1 holds the GIL and is blocked on HEAD_LOCK(&_PyRuntime). The other threads are blocked on GIL acquisition.

HEAD_LOCK(runtime);

Thread 2 holds the runtime lock and is trying to execute _PyThread_CurrentFrames. It's blocked trying to re-acquire the GIL while the garbage collector runs.

HEAD_LOCK(runtime);

I think this is not an issue (or less of an issue at least) in 3.12+ because allocating objects won't directly run the GC.
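To make the lock ordering concrete, here is a plain-Python illustration of the pattern described above. This is not CPython internals: gil and head_lock are just stand-in locks acquired in opposite orders, and timeouts are used so the demo reports the deadlock instead of hanging:

import threading
import time

gil = threading.Lock()        # stand-in for the GIL
head_lock = threading.Lock()  # stand-in for HEAD_LOCK(&_PyRuntime)


def thread_1():
    # Holds the "GIL", then needs the runtime lock.
    with gil:
        time.sleep(0.1)
        if not head_lock.acquire(timeout=2):
            print("thread 1: stuck on HEAD_LOCK while holding the GIL")
        else:
            head_lock.release()


def thread_2():
    # Holds the runtime lock (as _PyThread_CurrentFrames does), then an allocation
    # triggers the GC, whose callback ends up dropping and re-taking the "GIL".
    with head_lock:
        time.sleep(0.1)
        if not gil.acquire(timeout=2):
            print("thread 2: stuck re-acquiring the GIL while holding HEAD_LOCK")
        else:
            gil.release()


t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start()
t2.start()
t1.join()
t2.join()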

@colesbury
Contributor

I think this is the same underlying issue as #106883

@diegorusso
Contributor Author

diegorusso commented Mar 19, 2024

@colesbury thanks for looking into it. For completeness, I tested it on AArch64 and saw the same behaviour.

As suggested in the issue you mentioned, I backported @pablogsal's commit (83eb827), which runs the GC only on the eval breaker, to 3.11 and ran the tests again. I can confirm that no deadlocks happen anymore (on both x86 and AArch64).

@diegorusso
Contributor Author

diegorusso commented Apr 3, 2024

I was able to reproduce this issue locally on 3.11.8. It's a lock-ordering deadlock between HEAD_LOCK(&_PyRuntime) and the GIL.

@colesbury are you able to share the code snippet to reproduce the bug?

@colesbury
Contributor

@diegorusso, I followed your instructions to reproduce the issue. I don't have a small reproducer.

@diegorusso
Contributor Author

@diegorusso, I followed your instructions to reproduce the issue. I don't have a small reproducer.

Oops, my bad. I thought you had a reproducer with two threads. I have made the patch, but I'd like to accompany it with some testing.

@diegorusso
Contributor Author

For completeness, here is the PR I have created to fix the issue: #117332

There are discussions about whether the PR can be accepted, because in theory 3.11.9 was the last bugfix release.
