You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While working on GH-127049, I noticed that on non-Windows platforms Popen.poll can return None even for an exited child when Popen.poll races against Popen.[poll | wait] in another thread.
I think Popen.poll is violating its docs when this happens, because it returns without checking the child.
Writing a reproducer for this is tricky, but the following hits the race condition more than half the time on my machine:
from __future__ importannotationsimportosimportsubprocessfromconcurrent.futuresimportThreadPoolExecutorwithThreadPoolExecutor(1) asexecutor:
process=subprocess.Popen(
(
"python",
"-c",
"import time; time.sleep(1)",
),
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
)
executor.submit(process.wait)
try:
os.waitid(os.P_PID, process.pid, os.WEXITED|os.WNOWAIT)
exceptChildProcessError:
# P_PIDFD would avoid this ECHILD, but writing the reproducer this way for# portability.passassertprocess.poll() isnotNone
IIUC the reason that poll can't block on _waitpid_lock.acquire here is because poll must be non-blocking API, but a wait call can make a blocking call (waitpid without WNOHANG) while holding that lock.
IIUC this should be fixable with the two-step (1) wait-without-reaping and (2) hold a lock to atomically reap-without-waiting & set .returncode approach described at #82811 (comment) and #86724 (comment) and used by Trio. That approach should also be able to fix case 1 of the thread-unsafety ignored in GH-20010, but not case 21. It could be a bit of a pain though2.
To be clear about impact, though, I have only seen this poll retval bug happen while testing a fix for GH-127049. And in that situation, a very slight modification to said fix for GH-127049 can easily avoid this bug (as well as case 1 from GH-20010).
CPython versions tested on:
3.9, 3.10, 3.11, 3.12, 3.13, 3.14, CPython main branch
Operating systems tested on:
Linux
Footnotes
Tangent: Case 2 should be fixable on Linux >= 5.4 (via pidfd_open & P_PIDFD). (It's possibly also fixable on BSDs & macOS, though I am not currently familiar there. What happens to an already-existing kevent for a PID after that PID has been freed? Does it still work, like a pidfd? It looks like Trio assumes that it does.) ↩
Does the "fail gracefully when SIGCLD is set to be ignored or waiting for child processes has otherwise been disabled for our process" case need different handling than the "an external caller reaped the PID and stole the return code from the Popen" case? From Popen's perspective (assuming a platform without pidfd/kqueue), both of these cases look like an ECHILD (either from "WNOWAITwaitid (wait-without-reaping) or WNOHANGwait[p]id (reap-without-waiting)). Is it okay if we can't disambiguate between these two cases? ↩
The text was updated successfully, but these errors were encountered:
Note that we call Popen.poll/wait in the event loop thread to avoid
Popen's thread-unsafety. Without this workaround, when testing
_ThreadedChildWatcher with the reproducer shared in the linked issue on my
machine:
* Case 1 of the known race condition ignored in pythonGH-20010 is met (and an
unsafe kill call is issued) about 1 in 10 times.
* The race condition pythonGH-127050 is met (causing _process_exited's assert
returncode is not None to raise) about 1 in 30 times.
Bug report
Bug description:
While working on GH-127049, I noticed that on non-Windows platforms
Popen.poll
can returnNone
even for an exited child whenPopen.poll
races againstPopen.[poll | wait]
in another thread.I think
Popen.poll
is violating its docs when this happens, because it returns without checking the child.Writing a reproducer for this is tricky, but the following hits the race condition more than half the time on my machine:
The culprit is https://github.com/python/cpython/blob/v3.14.0a1/Lib/subprocess.py#L1989-L1992, from d65ba51.
IIUC the reason that
poll
can't block on_waitpid_lock.acquire
here is becausepoll
must be non-blocking API, but await
call can make a blocking call (waitpid
withoutWNOHANG
) while holding that lock.IIUC this should be fixable with the two-step (1) wait-without-reaping and (2) hold a lock to atomically reap-without-waiting & set
.returncode
approach described at #82811 (comment) and #86724 (comment) and used by Trio. That approach should also be able to fix case 1 of the thread-unsafety ignored in GH-20010, but not case 21. It could be a bit of a pain though2.To be clear about impact, though, I have only seen this
poll
retval bug happen while testing a fix for GH-127049. And in that situation, a very slight modification to said fix for GH-127049 can easily avoid this bug (as well as case 1 from GH-20010).CPython versions tested on:
3.9, 3.10, 3.11, 3.12, 3.13, 3.14, CPython main branch
Operating systems tested on:
Linux
Footnotes
Tangent: Case 2 should be fixable on Linux >= 5.4 (via
pidfd_open
&P_PIDFD
). (It's possibly also fixable on BSDs & macOS, though I am not currently familiar there. What happens to an already-existing kevent for a PID after that PID has been freed? Does it still work, like a pidfd? It looks like Trio assumes that it does.) ↩Does the "fail gracefully when
SIGCLD
is set to be ignored or waiting for child processes has otherwise been disabled for our process" case need different handling than the "an external caller reaped the PID and stole the return code from thePopen
" case? FromPopen
's perspective (assuming a platform without pidfd/kqueue), both of these cases look like anECHILD
(either from "WNOWAIT
waitid
(wait-without-reaping) orWNOHANG
wait[p]id
(reap-without-waiting)). Is it okay if we can't disambiguate between these two cases? ↩The text was updated successfully, but these errors were encountered: