Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_threading.test_interrupt_main_subthread: release unlocked lock #118433

Open
colesbury opened this issue Apr 30, 2024 · 9 comments
Open

test_threading.test_interrupt_main_subthread: release unlocked lock #118433

colesbury opened this issue Apr 30, 2024 · 9 comments
Labels
tests Tests in the Lib/test dir topic-free-threading type-bug An unexpected behavior, bug, or error

Comments

@colesbury
Copy link
Contributor

colesbury commented Apr 30, 2024

Bug report

======================================================================
ERROR: test_interrupt_main_subthread (test.test_threading.InterruptMainTests.test_interrupt_main_subthread)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 659, in wait
    signaled = self._cond.wait(timeout)
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 368, in wait
    self._acquire_restore(saved_state)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 315, in _acquire_restore
    def _acquire_restore(self, x):
    
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/test/test_threading.py", line 2035, in test_interrupt_main_subthread
    t.start()
    ~~~~~~~^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 977, in start
    self._started.wait()  # Will set ident and native_id
    ~~~~~~~~~~~~~~~~~~^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 656, in wait
    with self._cond:
    ...<3 lines>...
        return signaled
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 307, in __exit__
    return self._lock.__exit__(*args)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^
RuntimeError: release unlocked lock
----------------------------------------------------------------------

Seen on the ARM64 macOS nogil refleaks buildbot:

https://buildbot.python.org/all/#/builders/1368/builds/865/steps/5/logs/stdio

Reported by encoku

Linked PRs

@colesbury colesbury added type-bug An unexpected behavior, bug, or error tests Tests in the Lib/test dir topic-free-threading labels Apr 30, 2024
@colesbury
Copy link
Contributor Author

I'm able to reproduce it locally ~3-8% of the time with a command like on Linux (x86-64) and macOS (arm64):

./python -m test test_threading -m test_interrupt_main_subthread -R 3:3

@colesbury
Copy link
Contributor Author

The bug is triggered because the KeyboardInterrupt is raised inside the finally handler before it's able to actually restore the state:

self._acquire_restore(saved_state)

I'm able to reproduce the issue reliably, including in the default build, with the following patch: https://gist.github.com/colesbury/74e2523c3622f479f7e39fc62c5dc948

@colesbury
Copy link
Contributor Author

@mpage I'd appreciate your thoughts on this since you've recently looked at related thread-safety issues. I'm not sure if there's a reliable fix without moving a lot of the Condition code into C.

@mpage
Copy link
Contributor

mpage commented Apr 30, 2024

I'm not sure what the right thing to do here is. I think the fundamental challenge is that we need to make wait() (and probably other parts of Condition) reliable in the face of asynchronous exceptions, which is really hard (impossible?) to do in managed Python code. Another alternative to moving the Condition code into C might be to add functionality to block/unblock delivery of asynchronous exceptions and use it at the appropriate points in wait(). Something like the following:

def wait(self, timeout=None):
    if not self._is_owned():
        raise RuntimeError("cannot wait on un-acquired lock")
    waiter = _allocate_lock()
    waiter.acquire()
    self._waiters.append(waiter)
    # Queue delivery of asynchronous exceptions until the body of the context
    # manager completes, or they are unblocked explicitly
    with _thread.block_async_exceptions():
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                # Allow asynchronous exceptions, queued or otherwise, to be
                # raised in the body of the context manager
                with _thread.allow_async_exceptions():
                    waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
                    with _thread.allow_async_exceptions():
                        gotit = waiter.acquire(True, timeout)
                else:
                    with _thread.allow_async_exceptions():
                        gotit = waiter.acquire(False)
            return gotit
        finally:
            self._acquire_restore(saved_state)
            if not gotit:
                try:
                    self._waiters.remove(waiter)
                except ValueError:
                    pass

@encukou
Copy link
Member

encukou commented May 1, 2024

Reminds me of trio's KeyboardInterrupt protection, with an 2017 blog post.
Marking a frame to block the delivery seems a bit neater than two context managers?

@colesbury
Copy link
Contributor Author

@encukou - thanks for the link, that sounds like the same issue. Marking the frame is tricky because the self._acquire_restore() (in the finally block) call is a call to a Python function and the KeyboardInterrupt happens on entry. So marking the Condition.wait() frame doesn't seem like it'd be sufficient.

@colesbury
Copy link
Contributor Author

colesbury commented May 1, 2024

So I think the options we've come up with are:

  1. Move things to C. For making the test pass reliably, it's enough if the Thread._started event is implemented in C. Here's an example colesbury@d0118fa I tested yesterday. There may also be a way to do this as part of the ThreadHandle that @mpage wrote.

  2. Suppress asynchronous exceptions in Condition (and maybe other synchronization primitives written in Python)

Either approach seems reasonable to me. What do both of you think?

EDIT: trio changes the signal handler, but that seems less appropriate in CPython than option 2.

mpage added a commit to mpage/cpython that referenced this issue May 1, 2024
Free-threaded builds can intermittently tickle a longstanding bug (24 years!)
in the implementation of `threading.Condition`, leading to flakiness in the
test suite. Fixing the underlying issue will require more discussion, and will
likely apply to most of the concurrency primitives in the `threading` module
that are written in Python. See pythongh-118433 for more details.

Disable the test in free-threaded builds as a temporary measure to eliminate
flakiness.
colesbury pushed a commit that referenced this issue May 1, 2024
…hreaded builds (#118485)

Free-threaded builds can intermittently tickle a longstanding bug (24 years!)
in the implementation of `threading.Condition`, leading to flakiness in the
test suite. Fixing the underlying issue will require more discussion, and will
likely apply to most of the concurrency primitives in the `threading` module
that are written in Python. See gh-118433 for more details.
@mpage
Copy link
Contributor

mpage commented May 1, 2024

Recording the gist of my conversation with @colesbury.

If we can make it work, I think option (2) is preferable, for a few reasons:

  1. The issue that's affecting Condition likely applies to the other synchronization primitives in the threading module, so we'd need to move those to C as well.
  2. As @colesbury pointed out, Condition can accept an arbitrary lock object, which could execute arbitrary Python code, leaving us open to the original problem.
  3. Other Python code outside of CPython may need to deal with the same issue and moving to C might not be a viable solution in those cases. Providing a solution in Python would be valuable for those use cases.

@encukou
Copy link
Member

encukou commented May 2, 2024

I don't think I'll have time to have a proper look at this before beta, so a few quick comments:

self._acquire_restore() (in the finally block) call is a call to a Python function

I guess, that would need to be inlined (that is, always set in __init__).
Or also guarded.

trio changes the signal handler

AFAIK, this just trio doing whatever it can from Python code -- if it could do something like disable some of the eval breaker, it'd do that instead.

Other Python code outside of CPython may need to deal with the same issue

Yup, that's why I'd look at what works for trio :)

SonicField pushed a commit to SonicField/cpython that referenced this issue May 8, 2024
…free-threaded builds (python#118485)

Free-threaded builds can intermittently tickle a longstanding bug (24 years!)
in the implementation of `threading.Condition`, leading to flakiness in the
test suite. Fixing the underlying issue will require more discussion, and will
likely apply to most of the concurrency primitives in the `threading` module
that are written in Python. See pythongh-118433 for more details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir topic-free-threading type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants