
Alert not cleared when metric is back within bounds #354

Open
Arend-melissant opened this issue Mar 17, 2024 · 5 comments
Labels
documentation Improvements or additions to documentation

Comments

@Arend-melissant

Alerts are sometimes not cleared when the metric returns to a 'safe' value. Alerts with an upper bound seem to work fine, but I have an alert with only a lower bound that triggers when the value drops below the threshold and is not cleared when the value rises back above it.

I am monitoring incoming requests on Azure Service Bus through Azure Monitor (via the OpenTelemetry Collector). Alerts with an upper bound work as expected, but I have an alert with a lower bound of 1 (I need to see messages coming in every hour). It triggers when the metric drops below 1, but when messages start coming in again and the incoming message count is above 1, the alert is not cleared.

@vmihailenco
Member

What version are you using?

@arend-melissant-tnt

arend-melissant-tnt commented Mar 17, 2024 via email

@arend-melissant-tnt

I did some extra tests and found that the current implementation probably does not behave the way I expected. Currently, the alert is triggered when the last 15 minutes contain only 0 values, which is what I expect. To clear the alert, however, it seems you need 15 non-zero values, which is not what I expect; I would assume the alert is cancelled after any non-zero value in the last 15 minutes.
Azure Monitor can sometimes produce 0 values because it does not always return data points (I don't know why). Also, the metric being monitored can legitimately produce 0 values and still be fine.
I have now used the filter avg_over_time(Val[15m]), which solves the issue and correctly cancels the alarm: since it is a rolling average, it no longer contains any 0 values.
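
For illustration, here is a minimal Python sketch of the behaviour described above. It assumes a fixed 15-point (one value per minute) evaluation window and the trigger/clear rules inferred from my tests; the actual monitor implementation may differ.

```python
# Illustrative only: window length and the "all points must be non-zero
# to clear" rule are assumptions based on the observations above.

WINDOW = 15

def open_on_raw(points):
    """Assumed trigger rule: alert opens when every point in the window is 0."""
    return all(p == 0 for p in points[-WINDOW:])

def clear_on_raw(points):
    """Observed clearing rule: every point in the window must be non-zero."""
    return all(p > 0 for p in points[-WINDOW:])

def avg_over_time(points):
    """Rolling average over the window, as in avg_over_time(Val[15m])."""
    window = points[-WINDOW:]
    return sum(window) / len(window)

# 15 minutes with no messages, then messages start arriving again,
# but Azure Monitor still reports occasional 0 values.
series = [0] * 15 + [4, 0, 7, 3, 0, 5]

print(open_on_raw(series[:15]))    # True  -> alert opens after the quiet window
print(clear_on_raw(series))        # False -> raw rule never clears (zeros remain)
print(avg_over_time(series) >= 1)  # True  -> rolling average is back above 1
```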

@vmihailenco
Member

Your observations are correct.

I would assume the alert is cancelled after any non-zero value in the last 15 minutes.

That way it can be a bit noisy, so we went with the current default behavior instead. It should probably be configurable.

I have now used the filter avg_over_time(Val[15m]), which solves the issue and correctly cancels the alarm: since it is a rolling average, it no longer contains any 0 values.

This is probably not exactly the same as such a setting, since it also affects the open trigger, but it might work.
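
To make the "also affects the open trigger" point concrete, here is a small Python sketch under the same assumptions as above (15-point window, lower bound of 1); it shows a window where the rolling-average rule opens the alert even though the raw values are not all zero.

```python
# Illustrative only: shows how switching to avg_over_time(Val[15m]) can
# change when the alert opens, not just when it clears.

WINDOW = 15

def opens_on_raw(points):
    """Assumed default rule: open only when every point in the window is 0."""
    return all(p == 0 for p in points[-WINDOW:])

def opens_on_avg(points, lower_bound=1):
    """Rolling-average rule: open when the window average dips below the bound."""
    window = points[-WINDOW:]
    return sum(window) / len(window) < lower_bound

# A mostly quiet window with one small burst of messages.
window = [0] * 13 + [2, 0]

print(opens_on_raw(window))  # False -> default rule does not open (one non-zero point)
print(opens_on_avg(window))  # True  -> average (2/15) is below 1, so the alert opens
```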

@Arend-melissant
Author

Having used the described workaround for a few days, I can say it is working as expected.

@vmihailenco vmihailenco added the documentation Improvements or additions to documentation label Mar 24, 2024
3 participants