Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider checks reporting "up" status before being executed #164

Open
cjdcordeiro opened this issue Nov 16, 2022 · 6 comments
Open

Reconsider checks reporting "up" status before being executed #164

cjdcordeiro opened this issue Nov 16, 2022 · 6 comments
Labels
Feature A feature request

Comments

@cjdcordeiro
Copy link
Collaborator

Problem

If we list the checks very early in the lifetime of the Pebble plan, the output will return a default misleading "up" status, even though we aren't yet capable of making such an assessment.

Reproduce

Use an intentionally faulty check, like:

checks:
    chk1:
        override: replace
        level: ready
        http:
            url: http://23234sdffsdfsadf.com/

Start the server: pebble run

Immediately list the checks with pebble checks, and the output will be

Check  Level  Status  Failures
chk1   ready  up      0/3

Notice the Status = up, which is misleading, and will eventually switch to "down".

@benhoyt
Copy link
Contributor

benhoyt commented Nov 17, 2022

This is a good point. It will mean before Pebble starts the /v1/health endpoint (used as a K8s probe endpoint by Juju) will return an error as it should, then when Pebble starts it'll report success (up), then if the check is actually down, it'll report down correctly after N failures. But in the meantime we have a "blip" of up. I originally designed it to default to up as kind of a "give it the benefit of the doubt".

I'm not sure we should just change the default to down, because we don't really want it to stay down for a full 10s (the default check period). We could add a third state unknown and default to that, however, that still has issues for /v1/health which returns an overall true/false binary result, and again I'm not sure you'd want that to report health=false for a 10 full seconds. Or maybe that is okay? For Juju we set the K8s probes' InitialDelaySeconds to 30s, so it would work fine for that.

We could make the check manager run the checks right away ... but that isn't right either, because the service/workload being monitored probably isn't up yet anyway.

@hpidcock Thoughts?

@hpidcock
Copy link
Member

I don't think "up" is correct initially, but I'm also not sure defaulting to "down" is correct either. My instinct is to make the default state configurable with "down" as the default.

@hpidcock
Copy link
Member

checks:
    chk1:
        override: replace
        level: ready
        initial-status: down
        http:
            url: http://23234sdffsdfsadf.com/

@hpidcock
Copy link
Member

The alternative is that we don't return any result until all checks have evaluated. I.e. the api request can long poll if its indeterminate

@cjdcordeiro
Copy link
Collaborator Author

Right, if the expected behaviour is to return a boolean, then yes, both up or down would always be misleading. If True|False are the only two options, then this is very open to interpretation and I don't think there's a correct answer. I personally would expect, in this case, the system to be considered as "up", until proven otherwise. Ideally I'd like to see either unknown or no state at all.

Just as a reference for inspiration, similar healthcheck mechanisms are also employed by Docker and Kubernetes as you know. For Docker, the accepted health status are UNHEALTHY, HEALTHY, STARTING

@benhoyt
Copy link
Contributor

benhoyt commented Nov 17, 2022

@hpidcock has some interesting suggestions (an initial-status config field, or API request waiting till checks have started), however, I think they're both a bit heavy for what we need here. The API request waiting till checks have started raises several questions -- do we wait for all checks? what if the periods are quite long? and so on.

I suggest what we do here: we add a CheckStatusUnknown (unknown) in addition to up and down, and report that status in the /v1/checks API and pebble checks CLI command for checks that haven't run yet at all.

However, I suggest we don't change the behaviour of /v1/health, which needs to return a binary success/failure response. You could argue that either way, and if we default it to down we get the issue I mentioned above: "I'm not sure you'd want that to report health=false for a 10 full seconds". So I suggest we leave the /v1/health response as it is today -- in other words, map the statuses like so:

Check status /v1/health response
unknown success
up success
down failure

@benhoyt benhoyt added Simple Nice for a quick look on a minute or two 24.10 Feature A feature request and removed 24.10 labels Mar 13, 2024
@benhoyt benhoyt changed the title [bug] checks report an "up" status before being executed Reconsider checks reporting "up" status before being executed May 17, 2024
@benhoyt benhoyt removed the Simple Nice for a quick look on a minute or two label Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature A feature request
Projects
None yet
Development

No branches or pull requests

3 participants