
Intermittent CI failures in test_local.py on PyPy #379

Closed
njsmith opened this issue Dec 21, 2017 · 5 comments · Fixed by #441

Comments


njsmith commented Dec 21, 2017

As originally tracked in #200, we've been seeing an intermittent freeze in test_local.py on PyPy:

The test uses threads + blocking queue operations to control sequencing, so it's the sort of thing where you're not shocked to see some weird little freeze, but after staring at it for a while I can't see any bug in the test itself.

njsmith added a commit to njsmith/trio that referenced this issue Dec 21, 2017
Hopefully this will help us diagnose weird freezes like python-triogh-379

njsmith commented Feb 14, 2018

Here's another one, on PyPy 5.8 in test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/341367598

I'll paste the key bits below in case something happens to the Travis log:

../testenv/site-packages/trio/_core/tests/test_local.py::test_run_local_simultaneous_runs Timeout (0:01:00)!
Thread 0x7FF52834D700 (most recent call first, approximate line numbers):
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/threading.py", line 261 in wait
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/queue.py", line 147 in get
  File "/home/travis/build/python-trio/trio/testenv/site-packages/trio/_core/tests/test_local.py", line 100 in main
  File "/home/travis/build/python-trio/trio/testenv/site-packages/trio/_core/_run.py", line 1241 in run_impl
  File "/home/travis/build/python-trio/trio/testenv/site-packages/trio/_core/_run.py", line 1111 in run
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/threading.py", line 851 in run
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/threading.py", line 899 in _bootstrap_inner
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/threading.py", line 868 in _bootstrap
Thread 0x7FF536390780 (most recent call first, approximate line numbers):
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/threading.py", line 261 in wait
  File "/home/travis/build/python-trio/trio/pypy3.5-5.8-beta-linux_x86_64-portable/lib-python/3/queue.py", line 147 in get
  File "/home/travis/build/python-trio/trio/testenv/site-packages/trio/_core/tests/test_local.py", line 97 in test_run_local_simultaneous_runs
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/python.py", line 145 in pytest_pyfunc_call
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/callers.py", line 151 in _multicall
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 213 in <lambda>
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 219 in _hookexec
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 603 in __call__
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/python.py", line 1176 in runtest
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 106 in pytest_runtest_call
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/callers.py", line 151 in _multicall
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 213 in <lambda>
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 219 in _hookexec
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 603 in __call__
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 178 in <lambda>
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 186 in __init__
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 175 in call_runtest_hook
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 157 in call_and_report
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 69 in runtestprotocol
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/runner.py", line 58 in pytest_runtest_protocol
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/callers.py", line 151 in _multicall
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 213 in <lambda>
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 219 in _hookexec
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 603 in __call__
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/main.py", line 150 in pytest_runtestloop
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/callers.py", line 151 in _multicall
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 213 in <lambda>
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 219 in _hookexec
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 603 in __call__
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/main.py", line 134 in _main
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/main.py", line 89 in wrap_session
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/main.py", line 130 in pytest_cmdline_main
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/callers.py", line 151 in _multicall
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 213 in <lambda>
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 219 in _hookexec
  File "/home/travis/build/python-trio/trio/testenv/site-packages/pluggy/__init__.py", line 603 in __call__
  File "/home/travis/build/python-trio/trio/testenv/site-packages/_pytest/config.py", line 40 in main
  File "/home/travis/build/python-trio/trio/testenv/bin/pytest", line 4 in <module>
  File "<builtin>/app_main.py", line 541 in exec_
  File "<builtin>/app_main.py", line 771 in execfile
  File "<builtin>/app_main.py", line 84 in run_toplevel
  File "<builtin>/app_main.py", line 545 in run_command_line
  File "<builtin>/app_main.py", line 842 in entry_point
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

So we see two threads both blocked in get. Unfortunately the line numbers in test_local.py just refer to the def test_run_local_simultaneous_runs and the async def main lines – there's no information on which call to get is blocked.

The weird thing though is that there are only two threads there: the main thread and one other. We ought to have 3 threads: the main thread and two others. We can't tell which of the two workers is missing, but if it's the one the test calls t1, then it would explain the lockup: t2 waits for the main thread, the main thread waits for t1, and... if t1 is missing that's a deadlock.

One possibility is that t1 is somehow crashing with an exception, but pytest is hiding the traceback.
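This failure mode is easy to fall into with raw threads: an exception raised inside a `threading.Thread` is printed to stderr (which a test runner may swallow) and never propagates to the thread that joins it. A minimal sketch of the mitigation the test harness was missing — shipping the error back explicitly over a queue (hypothetical names, not the actual test code):

```python
import queue
import threading

errors = queue.Queue()

def worker():
    # Simulate the suspected crash in t1. Without the except clause,
    # this exception would die with the thread and join() would return
    # normally, leaving the other threads deadlocked on their get() calls.
    try:
        raise RuntimeError("t1 died")
    except BaseException as exc:
        errors.put(exc)  # explicitly ship the error back to the main thread

t = threading.Thread(target=worker)
t.start()
t.join()

exc = errors.get_nowait()
print(type(exc).__name__, exc)  # RuntimeError t1 died
```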


njsmith commented Feb 15, 2018

Here's another fun example – pypy nightly this time, and faulthandler segfaults instead of printing a backtrace: https://travis-ci.org/python-trio/trio/jobs/341720284

I managed to reproduce it once locally, by running

while true; do ~/pypy/pypy3.5-5.8-beta-linux_x86_64-portable/bin/pypy3.5 -m pytest trio -k test_run_local_simultaneous_runs; done

I wasn't able to get much debugging info out, but when I hit control-C, it printed a backtrace claiming that it was actually stuck at:

  File "/home/njs/trio/trio/_core/tests/test_local.py", line 122, in test_run_local_simultaneous_runs
    q2.get()

which I guess doesn't say much useful – that's what would happen e.g. if t2 somehow disappeared.

njsmith added a commit to njsmith/trio that referenced this issue Feb 15, 2018
We used to use a single queue to send messages to and from the
threads, which of course is unreliable because the main thread could
end up reading back its own message. In particular, on PyPy this
happened regularly, and occasionally it meant that the test
deadlocked. So this fixed python-triogh-379.

This patch also updates the test harness to actually pull errors back
from the child threads, so that if the test does fail then we can
detect it.
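The race the commit describes can be shown deterministically in a single thread: with one queue used for both directions, nothing stops a sender from dequeuing the very token it just enqueued. A minimal sketch (hypothetical names, not the actual test):

```python
import queue

# Buggy pattern: one queue shared for both directions.
shared = queue.Queue()
shared.put("token-for-worker")   # main thread sends a token meant for the worker...
msg = shared.get()               # ...then waits for a reply on the SAME queue
print(msg)                       # token-for-worker -- it read back its own message,
                                 # and the worker would now block on get() forever

# The fix: one queue per direction, so each side only ever
# dequeues messages addressed to it.
to_worker = queue.Queue()
from_worker = queue.Queue()
```

In the real test the interleaving depended on thread scheduling, which is why the deadlock only showed up occasionally (and more often on PyPy).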

njsmith commented Feb 15, 2018

As usual, I feel very silly after having found it.

Trick to tracking it down: for whatever reason, this time I was able to reproduce it by running pytest in a loop, specifically:

for X in $(seq 1000000); do echo $X; ~/pypy/pypy3.5-5.8-beta-linux_x86_64-portable/bin/pypy3.5 -u -m pytest -vs trio -k test_run_local_simultaneous_runs; done

which would freeze after maybe 50 iterations. And then I added a bunch of print statements, and immediately went wtf, because the threads weren't executing in the order I expected at all, on any of the executions, including the ones that passed. (This is a test where we have threads but we manually control the scheduling in a very strict way... or so I thought.) And then the issue became obvious.

In the process I noticed that the test was written in a very fragile way, because we don't have a nursery abstraction for threads. Maybe it would be worth rewriting that test using a parent trio + two run_sync_in_worker_thread calls that each invoke their own trio, just for that? Probably the work needed to control the sequencing after that is too annoying to be worthwhile though (currently it involves both threads making blocking calls, which will also mess up the nice trio properties).
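For illustration, a "nursery for threads" can be sketched in a few lines of stdlib code: start workers, join them all on exit, and re-raise the first error any of them hit. This is a hypothetical sketch, not trio's actual API or the proposed rewrite:

```python
import threading

class ThreadNursery:
    """Toy nursery: joins all started threads on exit and re-raises
    the first exception raised in any of them."""

    def __init__(self):
        self._threads = []
        self._errors = []
        self._lock = threading.Lock()

    def start(self, fn, *args):
        def runner():
            try:
                fn(*args)
            except BaseException as exc:
                with self._lock:
                    self._errors.append(exc)
        t = threading.Thread(target=runner)
        self._threads.append(t)
        t.start()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        for t in self._threads:
            t.join()
        if self._errors:
            raise self._errors[0]

def boom():
    raise ValueError("worker failed")

caught = None
try:
    with ThreadNursery() as nursery:
        nursery.start(boom)
except ValueError as exc:
    caught = exc
print("caught:", caught)  # caught: worker failed
```

With this shape, a crashing worker fails the test immediately instead of silently vanishing and leaving its siblings deadlocked on queue operations.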

@pquentin

I can't comment on the trio/run_sync_in_worker_thread rewrite, but I wanted to know: are we going to keep this test after #420 is done?

@njsmith
Copy link
Member Author

njsmith commented Feb 15, 2018

Yeah. The details will change because we'll probably replace RunLocal with some kind of RunContextVar, but the same thing still needs to be tested for.
