Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark the fingerprints of all inhibiting rules #3768

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

grobinson-grafana
Copy link
Contributor

Before this change, Alertmanager would mark just the fingerprint of the first inhibition rule to inhibit an alert, rather than the fingerprints of all inhibition rules that inhibit the alert. This is different from silences which mark the IDs of all silences that silence an alert, not just the first silence.

This commit changes how inhibition rules work to match the behavior of silences, such that the fingerprints of all inhibition rules that inhibit the alert are marked.

This change is expected to increase CPU usage as the inhibiter has to do more work, matching the label set against all inhibition rules instead of stopping at the first match. However, this is no different from how silences mute alerts.

Before this change, Alertmanager would mark just the fingerprint of
the first inhibition rule to inhibit an alert, rather than the
fingerprints of all inhibition rules that inhibit the alert.
This is different from silences which mark the IDs of all silences
that silence an alert, not just the first silence.

This commit changes how inhibition rules work to match the
behavior of silences, such that the fingerprints of all inhibition
rules that inhibit the alert are marked.

This change is expected to increase CPU usage as the inhibiter has to
do more work, matching the label set against all inhibition rules
instead of stopping at the first match. However, this is no different
from how silences mute alerts.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana
Copy link
Contributor Author

@beorn7 I see the comments were added in this PR here. What were the performance issues observed that caused inhibition rules to behave different from silences?

@grobinson-grafana
Copy link
Contributor Author

I just realized this code is broken when the source_matchers for an inhibition rule matches two or more alerts. In this case, the inhibition rule cache returns the fingerprint of just one of the inhibiting alerts rather than all fingerprints (link).

@grobinson-grafana grobinson-grafana marked this pull request as draft March 20, 2024 07:47
@grobinson-grafana
Copy link
Contributor Author

Here are the relevant PRs #379 and #1774.

@beorn7
Copy link
Member

beorn7 commented Mar 20, 2024

I really don't remember the detailed motivations here, just that it was complicated. Maybe the problems you have encountered above have to do with the reason.

@grobinson-grafana grobinson-grafana force-pushed the grobinson/mark-all-inhibiting-rules branch from 9381928 to 55080f2 Compare March 20, 2024 12:01
@grobinson-grafana grobinson-grafana marked this pull request as ready for review March 20, 2024 15:24
@grobinson-grafana grobinson-grafana marked this pull request as draft March 21, 2024 16:58
@grobinson-grafana
Copy link
Contributor Author

OK so I created a number of benchmarks to better understand how performance would be affected with this change. The four benchmarks do the following:

  • x_inhibition_rule,_1_inhibiting_alert-8 benchmarks the performance of x inhibition rules, where each inhibition rule has one inhibiting alert.
  • 1_inhibition_rule,_x_inhibiting_alerts-8 benchmarks the performance of 1 inhibition rule, where the inhibition rule has x inhibiting alerts.
  • 100_inhibition_rules,_1000_inhibiting_alerts-8 benchmarks the performance of 100 inhibition rules, which 1000 inhibiting alerts each.
  • x_inhibition_rules,_last_rule_matches-8 benchmarks the performance of x inhibition rules, where just the last inhibition rule inhibits, and all other inhibition rules are no-ops.

Here are the benchmarks from main:

BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-8         	 1424942	       810.7 ns/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-8       	 1384290	       817.0 ns/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-8      	 1355450	       819.5 ns/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-8     	 1334260	       850.6 ns/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-8    	 1355574	       842.5 ns/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-8       	 1227871	       906.0 ns/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-8      	  490380	      2227 ns/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-8     	   97848	     12087 ns/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-8    	   10827	    104526 ns/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-8  	  100384	     11609 ns/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-8        	 1103008	      1023 ns/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-8       	  345433	      3303 ns/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-8      	   40462	     26343 ns/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-8     	    4449	    252129 ns/op

And here are the benchmarks from this branch:

BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-8         	 1352847	       819.4 ns/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-8       	  283840	      4123 ns/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-8      	   30540	     35556 ns/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-8     	    3244	    353088 ns/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-8    	     290	   3649391 ns/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-8       	  316680	      3682 ns/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-8      	   37304	     28697 ns/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-8     	    4122	    280268 ns/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-8    	     361	   2952918 ns/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-8  	      34	  32002771 ns/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-8        	 1054029	      1065 ns/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-8       	  342032	      3387 ns/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-8      	   38838	     27282 ns/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-8     	    4356	    262349 ns/op

I also benchmarked Mutes for silences:

BenchmarkMutes/1_silence_mutes_alert-8         	 1328642	       867.0 ns/op
BenchmarkMutes/10_silences_mute_alert-8        	  558032	      2151 ns/op
BenchmarkMutes/100_silences_mute_alert-8       	  105117	     11372 ns/op
BenchmarkMutes/1000_silences_mute_alert-8      	    8367	    135013 ns/op
BenchmarkMutes/10000_silences_mute_alert-8     	     745	   1576070 ns/op
BenchmarkQuery/100_silences-8                  	   65757	     18499 ns/op
BenchmarkQuery/1000_silences-8                 	    7033	    164574 ns/op
BenchmarkQuery/10000_silences-8                	     298	   3876756 ns/op

The benchmarks shows the following:

  1. Inhibitor.Mutes is O(N) relative to the number of inhibition rules. The more inhibition rules there are the more comparisons need to be made. No surprises here.
  2. Since in main branch inhibitor.Mutes stops at the first inhibiting rule, when there are lots of inhibition rules it's faster for an alert to be inhibited than not inhibited. This is not the case in this branch though, as we are matching all inhibition rules. No surprises here either.
  3. It seems that having lots of silences doesn't scale well either. In fact silences seem to scale the same as inhibition rules do with the changes in this branch. This is surprising and so I want to double check this further as I know Bjorn did some optimizations here.
  4. With sensible limits, I think we can have reasonable performance for both inhibition rules and silences, where we track all inhibiting alerts and silence IDs up to some sensible limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants