
Alert not cleared when metric is back within bounds #354

Open
Arend-melissant opened this issue Mar 17, 2024 · 5 comments
Labels
documentation Improvements or additions to documentation

Comments

@Arend-melissant

Alerts are sometimes not cleared when the metric returns to a 'safe' value. Alerts with an upper bound seem to work fine, but I have an alert with only a lower bound that triggers when the value drops below the threshold and is not cleared when the value rises back above it.

I am monitoring incoming requests on Azure Service Bus through Azure Monitor (via the OpenTelemetry Collector). Alerts with an upper bound work as expected, but I have an alert with a lower bound of 1 (I need to see messages coming in every hour). It triggers when the metric drops below 1, but when messages start coming in again and the incoming message count is above 1, the alert is not cleared.

@vmihailenco
Member

What version are you using?

@arend-melissant-tnt

arend-melissant-tnt commented Mar 17, 2024 via email

@arend-melissant-tnt

I did some extra tests and found that the current implementation probably does not behave the way I expected. Currently, the alert is triggered when the last 15 minutes contain only 0 values, which is what I expect. To clear the alert, however, it seems you need 15 non-zero values, which is not what I expect; I would assume the alert is cancelled after any non-zero value in the last 15 minutes.
Azure Monitor can sometimes produce 0 values because it does not always return data points (I don't know why). Also, the metric being monitored can legitimately produce 0 values and still be fine.
I have now used the filter avg_over_time(Val[15m]), which solves the issue and correctly cancels the alarm: since it is a rolling average, it no longer contains any 0 values.
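
For illustration, here is a minimal Python sketch of the behaviour described above. It assumes a fixed 15-point (one value per minute) evaluation window and the trigger/clear rules inferred from my tests; the actual monitor implementation may differ.

```python
# Illustrative only: window length and the "all points must be non-zero
# to clear" rule are assumptions based on the observations above.

WINDOW = 15

def open_on_raw(points):
    """Assumed trigger rule: alert opens when every point in the window is 0."""
    return all(p == 0 for p in points[-WINDOW:])

def clear_on_raw(points):
    """Observed clearing rule: every point in the window must be non-zero."""
    return all(p > 0 for p in points[-WINDOW:])

def avg_over_time(points):
    """Rolling average over the window, as in avg_over_time(Val[15m])."""
    window = points[-WINDOW:]
    return sum(window) / len(window)

# 15 minutes with no messages, then messages start arriving again,
# but Azure Monitor still reports occasional 0 values.
series = [0] * 15 + [4, 0, 7, 3, 0, 5]

print(open_on_raw(series[:15]))    # True  -> alert opens after the quiet window
print(clear_on_raw(series))        # False -> raw rule never clears (zeros remain)
print(avg_over_time(series) >= 1)  # True  -> rolling average is back above 1
```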

@vmihailenco
Member

Your observations are correct.

I would assume the alert is cancelled after any non-zero value in the last 15 minutes.

That way it can be a bit noisy, so we went with the current default behavior instead. It should probably be configurable.

I have now used the filter avg_over_time(Val[15m]), which solves the issue and correctly cancels the alarm: since it is a rolling average, it no longer contains any 0 values.

This is probably not exactly the same as such a setting, since it also affects the open trigger, but it might work.
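
To make the "also affects the open trigger" point concrete, here is a small Python sketch under the same assumptions as above (15-point window, lower bound of 1); it shows a window where the rolling-average rule opens the alert even though the raw values are not all zero.

```python
# Illustrative only: shows how switching to avg_over_time(Val[15m]) can
# change when the alert opens, not just when it clears.

WINDOW = 15

def opens_on_raw(points):
    """Assumed default rule: open only when every point in the window is 0."""
    return all(p == 0 for p in points[-WINDOW:])

def opens_on_avg(points, lower_bound=1):
    """Rolling-average rule: open when the window average dips below the bound."""
    window = points[-WINDOW:]
    return sum(window) / len(window) < lower_bound

# A mostly quiet window with one small burst of messages.
window = [0] * 13 + [2, 0]

print(opens_on_raw(window))  # False -> default rule does not open (one non-zero point)
print(opens_on_avg(window))  # True  -> average (2/15) is below 1, so the alert opens
```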

@Arend-melissant
Author

Having used the described workaround for a few days, I can say it is working as expected.

@vmihailenco vmihailenco added the documentation Improvements or additions to documentation label Mar 24, 2024
3 participants