Reconsider checks reporting "up" status before being executed #164

cjdcordeiro · 2022-11-16T13:28:25Z

Problem

If we list the checks very early in the lifetime of the Pebble plan, the output will return a default misleading "up" status, even though we aren't yet capable of making such an assessment.

Reproduce

Use an intentionally faulty check, like:

checks:
    chk1:
        override: replace
        level: ready
        http:
            url: http://23234sdffsdfsadf.com/

Start the server: pebble run

Immediately list the checks with pebble checks, and the output will be

Check  Level  Status  Failures
chk1   ready  up      0/3

Notice the Status = up, which is misleading, and will eventually switch to "down".

The text was updated successfully, but these errors were encountered:

benhoyt · 2022-11-17T01:44:42Z

This is a good point. It will mean before Pebble starts the /v1/health endpoint (used as a K8s probe endpoint by Juju) will return an error as it should, then when Pebble starts it'll report success (up), then if the check is actually down, it'll report down correctly after N failures. But in the meantime we have a "blip" of up. I originally designed it to default to up as kind of a "give it the benefit of the doubt".

I'm not sure we should just change the default to down, because we don't really want it to stay down for a full 10s (the default check period). We could add a third state unknown and default to that, however, that still has issues for /v1/health which returns an overall true/false binary result, and again I'm not sure you'd want that to report health=false for a 10 full seconds. Or maybe that is okay? For Juju we set the K8s probes' InitialDelaySeconds to 30s, so it would work fine for that.

We could make the check manager run the checks right away ... but that isn't right either, because the service/workload being monitored probably isn't up yet anyway.

@hpidcock Thoughts?

hpidcock · 2022-11-17T03:45:02Z

I don't think "up" is correct initially, but I'm also not sure defaulting to "down" is correct either. My instinct is to make the default state configurable with "down" as the default.

hpidcock · 2022-11-17T03:46:00Z

checks:
    chk1:
        override: replace
        level: ready
        initial-status: down
        http:
            url: http://23234sdffsdfsadf.com/

hpidcock · 2022-11-17T03:47:43Z

The alternative is that we don't return any result until all checks have evaluated. I.e. the api request can long poll if its indeterminate

cjdcordeiro · 2022-11-17T08:18:22Z

Right, if the expected behaviour is to return a boolean, then yes, both up or down would always be misleading. If True|False are the only two options, then this is very open to interpretation and I don't think there's a correct answer. I personally would expect, in this case, the system to be considered as "up", until proven otherwise. Ideally I'd like to see either unknown or no state at all.

Just as a reference for inspiration, similar healthcheck mechanisms are also employed by Docker and Kubernetes as you know. For Docker, the accepted health status are UNHEALTHY, HEALTHY, STARTING

benhoyt · 2022-11-17T22:10:31Z

@hpidcock has some interesting suggestions (an initial-status config field, or API request waiting till checks have started), however, I think they're both a bit heavy for what we need here. The API request waiting till checks have started raises several questions -- do we wait for all checks? what if the periods are quite long? and so on.

I suggest what we do here: we add a CheckStatusUnknown (unknown) in addition to up and down, and report that status in the /v1/checks API and pebble checks CLI command for checks that haven't run yet at all.

However, I suggest we don't change the behaviour of /v1/health, which needs to return a binary success/failure response. You could argue that either way, and if we default it to down we get the issue I mentioned above: "I'm not sure you'd want that to report health=false for a 10 full seconds". So I suggest we leave the /v1/health response as it is today -- in other words, map the statuses like so:

Check status	/v1/health response
unknown	success
up	success
down	failure

benhoyt added Simple Nice for a quick look on a minute or two 24.10 Feature A feature request and removed 24.10 labels Mar 13, 2024

benhoyt changed the title ~~[bug] checks report an "up" status before being executed~~ Reconsider checks reporting "up" status before being executed May 17, 2024

benhoyt removed the Simple Nice for a quick look on a minute or two label Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconsider checks reporting "up" status before being executed #164

Reconsider checks reporting "up" status before being executed #164

cjdcordeiro commented Nov 16, 2022

benhoyt commented Nov 17, 2022

hpidcock commented Nov 17, 2022

hpidcock commented Nov 17, 2022

hpidcock commented Nov 17, 2022

cjdcordeiro commented Nov 17, 2022

benhoyt commented Nov 17, 2022

Reconsider checks reporting "up" status before being executed #164

Reconsider checks reporting "up" status before being executed #164

Comments

cjdcordeiro commented Nov 16, 2022

Problem

Reproduce

benhoyt commented Nov 17, 2022

hpidcock commented Nov 17, 2022

hpidcock commented Nov 17, 2022

hpidcock commented Nov 17, 2022

cjdcordeiro commented Nov 17, 2022

benhoyt commented Nov 17, 2022