Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quiche CI job occasionally fails during a download abort test #13439

Open
jay opened this issue Apr 21, 2024 · 5 comments
Open

Quiche CI job occasionally fails during a download abort test #13439

jay opened this issue Apr 21, 2024 · 5 comments
Labels
CI Continuous Integration

Comments

@jay
Copy link
Member

jay commented Apr 21, 2024

For the last few days there are occasional CI failures in the quiche job. It is hard to diagnose them because the logs are so large due to millions of "curl_multi_poll() -> 0" output (from h2-download.c). The output is truncated and downloading the raw archive fails repeatedly.

The most recent commit to have the failure: "http: acknowledge a returned error code". After repeated attempts github finally gave me the whole file. I stripped out most of the excess curl_multi_poll() -> 0 to make it readable: log.txt. Excerpt:

2024-04-19T13:23:10.4146231Z =================================== FAILURES ===================================
2024-04-19T13:23:10.4147994Z _______________ TestDownload.test_02_23a_lib_abort_paused[0-h3] ________________
2024-04-19T13:23:10.4148530Z 
2024-04-19T13:23:10.4148762Z self = <test_02_download.TestDownload object at 0x7f6229f10220>
2024-04-19T13:23:10.4149272Z env = <testenv.env.Env object at 0x7f622a029de0>
2024-04-19T13:23:10.4151755Z httpd = <testenv.httpd.Httpd object at 0x7f622a0c8cd0>
2024-04-19T13:23:10.4152554Z nghttpx = <testenv.nghttpx.NghttpxQuic object at 0x7f6229e01a50>, proto = 'h3'
2024-04-19T13:23:10.4153075Z repeat = 0
2024-04-19T13:23:10.4153232Z 
2024-04-19T13:23:10.4153536Z     @pytest.mark.parametrize("proto", ['http/1.1', 'h2', 'h3'])
2024-04-19T13:23:10.4154389Z     def test_02_23a_lib_abort_paused(self, env: Env, httpd, nghttpx, proto, repeat):
2024-04-19T13:23:10.4155785Z         if proto == 'h3' and env.curl_uses_ossl_quic():
2024-04-19T13:23:10.4156581Z             pytest.skip('OpenSSL QUIC fails here')
2024-04-19T13:23:10.4157297Z         if proto in ['h2', 'h3']:
2024-04-19T13:23:10.4157859Z             count = 200
2024-04-19T13:23:10.4158352Z             max_parallel = 100
2024-04-19T13:23:10.4158884Z             pause_offset = 64 * 1024
2024-04-19T13:23:10.4159411Z         else:
2024-04-19T13:23:10.4159791Z             count = 10
2024-04-19T13:23:10.4160242Z             max_parallel = 5
2024-04-19T13:23:10.4160735Z             pause_offset = 12 * 1024
2024-04-19T13:23:10.4161363Z         docname = 'data-1m'
2024-04-19T13:23:10.4162094Z         url = f'https://localhost:{env.https_port}/{docname}'
2024-04-19T13:23:10.4162969Z         client = LocalClient(name='h2-download', env=env)
2024-04-19T13:23:10.4163644Z         if not client.exists():
2024-04-19T13:23:10.4164444Z             pytest.skip(f'example client not built: {client.name}')
2024-04-19T13:23:10.4165217Z         r = client.run(args=[
2024-04-19T13:23:10.4165944Z             '-n', f'{count}', '-m', f'{max_parallel}', '-a',
2024-04-19T13:23:10.4166849Z             '-P', f'{pause_offset}', '-V', proto, url
2024-04-19T13:23:10.4167483Z         ])
2024-04-19T13:23:10.4167901Z >       r.check_exit_code(0)
2024-04-19T13:23:10.4168254Z 
2024-04-19T13:23:10.4168463Z tests/http/test_02_download.py:352: 
2024-04-19T13:23:10.4169267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-04-19T13:23:10.4169846Z 
2024-04-19T13:23:10.4171917Z self = ExecResult[code=-1, exception=TimeoutExpired, args=['/home/runner/work/curl/curl/tests/http/clients/h2-download', '-n'... 0\n', 'curl_multi_poll() -> 0\n', 'curl_multi_poll() -> 0\n', 'curl_multi_poll() -> 0\n', 'curl_multi_poll() -> 0\n']]
2024-04-19T13:23:10.4174046Z code = 0
2024-04-19T13:23:10.4174270Z 
2024-04-19T13:23:10.4174793Z     def check_exit_code(self, code: Union[int, bool]):
2024-04-19T13:23:10.4175484Z         if code is True:
2024-04-19T13:23:10.4176298Z             assert self.exit_code == 0, f'expected exit code {code}, '\
2024-04-19T13:23:10.4177313Z                                         f'got {self.exit_code}\n{self.dump_logs()}'
2024-04-19T13:23:10.4178041Z         elif code is False:
2024-04-19T13:23:10.4178857Z             assert self.exit_code != 0, f'expected exit code {code}, '\
2024-04-19T13:23:10.4179893Z                                                 f'got {self.exit_code}\n{self.dump_logs()}'
2024-04-19T13:23:10.4180599Z         else:
2024-04-19T13:23:10.4181334Z >           assert self.exit_code == code, f'expected exit code {code}, '\
2024-04-19T13:23:10.4182380Z                                            f'got {self.exit_code}\n{self.dump_logs()}'
2024-04-19T13:23:10.4183260Z E           AssertionError: expected exit code 0, got -1

Probably related to 08d10d2 even though its quiche job passed. I'm not finding failures before then and I don't know what else it could be.

/cc @icing

@jay jay added the CI Continuous Integration label Apr 21, 2024
@bagder
Copy link
Member

bagder commented Apr 21, 2024

According to testclutch, the first quiche pytest failures started with https://github.com/curl/curl/actions/runs/8715610798, which was commit 5b94ece - which seems totally innocuous and unrelated... I'm puzzled.

@jay
Copy link
Member Author

jay commented Apr 22, 2024

According to testclutch,

neat

the first quiche pytest failures started with https://github.com/curl/curl/actions/runs/8715610798, which was commit 5b94ece - which seems totally innocuous and unrelated... I'm puzzled.

Yeah that's the first one but notice the quiche job does not always fail, so it's possible the failures have been caused by a parent whose quiche job completed successfully.

@icing
Copy link
Contributor

icing commented Apr 23, 2024

Looking at this now. What happened is that the test client changed and seems to expose a timing related bug in the quiche handling.

icing added a commit to icing/curl that referenced this issue Apr 23, 2024
- when a connection close is detected, all ongoing transfers
  need to expire bc no more POLL events are likely to happen
  for them.
- refs curl#13439
@icing
Copy link
Contributor

icing commented Apr 23, 2024

I was not able to reproduce locally, but I found in the logs that quiche

  • uses 2 connections in these tests
  • reported in a failed log that one connection was closed (by the server?)
  • an ongoing transfer then was never woken up, leading to infinite polling without further events

This maybe the cause of all the failures. Why exactly the connection is considered closed, I do not know. I added some log statement to at least see if it is related to the idle timer. But the overall log times do not suggest this being the case.

@bagder bagder closed this as completed in cfb9991 Apr 24, 2024
@bagder
Copy link
Member

bagder commented Apr 25, 2024

Unfortunately, this issue seems to remain even after #13447

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration
Development

No branches or pull requests

3 participants