test_pageserver_small_inmemory_layers is unstable #10170

Open
jcsp opened this issue Dec 17, 2024 · 2 comments
Labels: a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@jcsp
Collaborator

jcsp commented Dec 17, 2024

Something went bad on the 11th/12th of December (screenshot attached to the original issue).

It fails waiting for connections to postgres, apparently in getaddrinfo:

/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/asyncpg/connection.py:2329: in connect
    return await connect_utils._connect(
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/asyncpg/connect_utils.py:991: in _connect
    conn = await _connect_addr(
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/asyncpg/connect_utils.py:828: in _connect_addr
    return await __connect_addr(params, True, *args)
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/asyncpg/connect_utils.py:873: in __connect_addr
    tr, pr = await connector
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/asyncpg/connect_utils.py:744: in _create_ssl_connection
    tr, pr = await loop.create_connection(
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/base_events.py:1046: in create_connection
    infos = await self._ensure_resolved(
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/base_events.py:1420: in _ensure_resolved
    return await loop.getaddrinfo(host, port, family=family, type=type,
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/base_events.py:868: in getaddrinfo
    return await self.run_in_executor(
E   asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:
test_runner/regress/test_pageserver_layer_rolling.py:130: in test_pageserver_small_inmemory_layers
    last_flush_lsns = asyncio.run(workload(env, tenant_conf, TIMELINE_COUNT, ENTRIES_PER_TIMELINE))
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/runners.py:190: in run
    return runner.run(main)
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/runners.py:118: in run
    return self._loop.run_until_complete(task)
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/base_events.py:654: in run_until_complete
    return future.result()
test_runner/regress/test_pageserver_layer_rolling.py:54: in workload
    return await asyncio.gather(*workers)
test_runner/regress/test_pageserver_layer_rolling.py:46: in run_worker
    last_flush_lsn = await run_worker_for_tenant(env, entries, tenant)
test_runner/regress/test_pageserver_layer_rolling.py:31: in run_worker_for_tenant
    conn = await ep.connect_async()
test_runner/fixtures/neon_fixtures.py:276: in connect_async
    return await asyncpg.connect(**conn_options)
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/asyncpg/connection.py:2328: in connect
    async with compat.timeout(timeout):
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/asyncio/timeouts.py:115: in __aexit__
    raise TimeoutError from exc_val
E   TimeoutError

However, this could be async executor starvation: the getaddrinfo task may simply not get a chance to run because something else is blocking the executor.
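A minimal sketch of the starvation hypothesis, not taken from the test suite: one coroutine makes a synchronous call that holds the event loop thread, and a second coroutine measures how late its own short timer fires. The function name `measure_timer_overshoot` and the durations are assumptions chosen for illustration. The same effect would delay any loop-side bookkeeping, including the completion of a `getaddrinfo` that was handed to `run_in_executor` as in the traceback above.

```python
import asyncio
import time


async def measure_timer_overshoot(block_seconds: float) -> float:
    """Return how late a 50 ms asyncio timer fires while another task blocks the loop."""

    async def blocker() -> None:
        # Synchronous sleep: this holds the event loop thread, so no other
        # coroutine or timer callback can run until it returns.
        time.sleep(block_seconds)

    async def timer() -> float:
        start = time.monotonic()
        await asyncio.sleep(0.05)
        # Anything beyond the requested 50 ms is time spent waiting for the loop.
        return time.monotonic() - start - 0.05

    overshoot, _ = await asyncio.gather(timer(), blocker())
    return overshoot


if __name__ == "__main__":
    print(f"overshoot with a blocked loop: {asyncio.run(measure_timer_overshoot(0.5)):.3f}s")
    print(f"overshoot with an idle loop:   {asyncio.run(measure_timer_overshoot(0.0)):.3f}s")
```

With the loop blocked, the timer cannot fire until the synchronous call returns, so a connect attempt with a short timeout could expire even though the network operation itself was fast.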

#9994 points out historical timeouts with the same Python backtrace, but this became much more frequent 5 days ago.

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver a/test Area: related to testing labels Dec 17, 2024
@jcsp
Collaborator Author

jcsp commented Dec 17, 2024

Related to LFC-by-default changes?

@jcsp
Collaborator Author

jcsp commented Dec 17, 2024

We will monitor this after addressing #9994 to see whether it is still unstable.

@jcsp jcsp added the triaged bugs that were already triaged label Dec 17, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 19, 2024
## Problem

ref #10170
ref #9994

The psql command blocks the main thread, causing other async tasks
(e.g., HTTP connect) to time out. Therefore, we need to move it to an I/O
executor thread.

## Summary of changes

* run psql connection in a thread

---------

Signed-off-by: Alex Chi Z
Co-authored-by: John Spray
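
The idea in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the code from the referenced commit: `blocking_query` is a hypothetical stand-in for the synchronous psql call, and `asyncio.to_thread` is one standard-library way to run it on a worker thread. A heartbeat task records the largest gap between its ticks, which shows whether the event loop stayed responsive.

```python
import asyncio
import time


def blocking_query(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous psql invocation."""
    time.sleep(seconds)
    return "done"


async def heartbeat_gap_during_query(use_thread: bool) -> float:
    """Run the blocking call and return the largest gap between heartbeat ticks."""
    max_gap = 0.0
    stop = asyncio.Event()

    async def heartbeat() -> None:
        nonlocal max_gap
        last = time.monotonic()
        while not stop.is_set():
            await asyncio.sleep(0.01)
            now = time.monotonic()
            max_gap = max(max_gap, now - last)
            last = now

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # yield once so the heartbeat starts ticking

    if use_thread:
        # The blocking call runs on a worker thread; the loop keeps serving tasks.
        await asyncio.to_thread(blocking_query, 0.3)
    else:
        # The blocking call runs on the loop thread; every other task stalls.
        blocking_query(0.3)

    stop.set()
    await hb
    return max_gap


if __name__ == "__main__":
    print(f"max gap, called inline:    {asyncio.run(heartbeat_gap_during_query(False)):.3f}s")
    print(f"max gap, called in thread: {asyncio.run(heartbeat_gap_during_query(True)):.3f}s")
```

Called inline, the heartbeat misses the whole duration of the blocking call; called through a thread, the gap stays close to the 10 ms tick interval, so other tasks such as connection attempts keep making progress.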