[Bug]: apiserver availability 30d recording rule time scale #990

Closed
4 tasks done
edwintye opened this issue Nov 26, 2024 · 2 comments
Labels
bug Something isn't working

@edwintye

What happened?

I tried the latest recording rules and the scale of the apiserver availability rules seems off. More concretely, the rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase%s has an extra multiplication factor * 24 * %s. I believe this was introduced by #976, which correctly changed the underlying data to use the bucket at {le="+Inf"}. However, that change also removed the avg_over_time function from the query, so the rule now reads a value that is already the total increase over the period and should not be scaled any further.
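To make the scaling argument concrete, here is a minimal Python sketch with made-up numbers. It assumes exactly one hourly-increase sample per hour over a 30-day window, which is a simplification of how avg_over_time would see the recorded series; the variable names are hypothetical and not part of the rules.

```python
# Hypothetical hourly increases over a 30-day window (one sample per hour).
HOURS_PER_DAY = 24
SLO_DAYS = 30

hourly_increase = [100 + (h % 24) for h in range(HOURS_PER_DAY * SLO_DAYS)]
true_total = sum(hourly_increase)

# Shape of the earlier rule: avg_over_time(increase1h[30d]) * 24 * 30.
# Averaging the hourly increases and scaling by the number of hours
# recovers the total increase over the window.
avg_hourly = sum(hourly_increase) / len(hourly_increase)
increase_30d = avg_hourly * HOURS_PER_DAY * SLO_DAYS

# Shape of the rule described in this issue: the 30d value is already
# scaled, and then it is multiplied by 24 * 30 a second time.
double_scaled = increase_30d * HOURS_PER_DAY * SLO_DAYS

print(true_total, increase_30d, double_scaled / true_total)
```

Under these assumptions the averaged-and-scaled value matches the true total, while the double-scaled value is larger by a factor of 24 * 30 = 720.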

For the sake of copy and paste, assume the SLO window %s is 30d (the default). The recording rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d is now derived from the metric cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d with an explicit bucket label le, and that metric already includes the * 24 * 30 factor. Without any adjustment, the final rule apiserver_request:availability30d, which is composed of

      1 - (
        sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
        -
        (
          # too slow
          (
            sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
            or
            vector(0)
          )
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
        )
        +
        # errors
        sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
      )
      /
      sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})

will have the hourly-to-30-day multiplication factor * 24 * 30 applied twice for the total count, while the scoped "too slow" counts are still read directly from the bucket metric, where the factor is applied only once.
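As a rough illustration of the effect on the final ratio, the Python sketch below mirrors the shape of the read availability expression above. All counts are invented, and the helper read_availability is a hypothetical name used only for this example.

```python
SCALE = 24 * 30  # the extra factor applied to the total count


def read_availability(total_count, fast_enough, errors, request_total):
    """Mirror the shape of the read availability expression:
    1 - ((total - fast_enough) + errors) / request_total."""
    too_slow = total_count - fast_enough
    return 1 - (too_slow + errors) / request_total


# Hypothetical 30-day values, all on the same (correct) scale.
true_count = 1_000_000      # total read requests from the SLI histogram
fast_enough = 990_000       # sum of the per-scope latency buckets
errors = 1_000              # 5xx read requests
request_total = 1_000_000   # all read requests

expected = read_availability(true_count, fast_enough, errors, request_total)

# Same inputs, but the total count carries the extra * 24 * 30 factor
# while the bucket and request_total terms do not.
inflated = read_availability(true_count * SCALE, fast_enough, errors, request_total)

print(expected, inflated)
```

With these numbers the correctly scaled inputs give an availability of about 0.989, whereas the inflated total count pushes the result far below zero, which matches the "scale seems off" symptom.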

From what I can tell, everything else is correct, and the fix may simply be to remove the extra multiplication factor from the count rule.

Please provide any helpful snippets.

# previous rule
git checkout f4f0d150fb85b0eb4d57d8a74b387748f068e92f
make prometheus_rules.yaml
mv prometheus_rules.yaml old_rules.yaml

# new rule
git checkout a3affb372fc22fc7ddbf186743b2151fdad63aaf
make prometheus_rules.yaml
diff prometheus_rules.yaml old_rules.yaml

# 17a18,23
# >   - "expr": |
# >       sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count{job="kube-apiserver"}[1h]))
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# >   - "expr": |
# >       sum by (cluster, verb, scope) (avg_over_time(cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h[30d]) * 24 * 30)
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
# 24,29d29
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"} * 24 * 30)
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"

What parts of the codebase are affected?

Rules

I agree to the following terms:

  • I agree to follow this project's Code of Conduct.
  • I have filled out all the required information above to the best of my ability.
  • I have searched the issues of this repository and believe that this is not a duplicate.
  • I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.
@skl
Collaborator

skl commented Dec 17, 2024

@edwintye #998 was merged - let me know if this resolves the issue for you 👍

skl self-assigned this Dec 17, 2024
@edwintye
Author

LGTM, thanks a lot.
