<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Prometheus Unit Testing</title>
<link rel="shortcut icon" href="static/favicon.ico">
<link rel="stylesheet" href="static/reveal.js-3.8.0/css/reveal.css">
<link rel="stylesheet" href="static/reveal.js-3.8.0/css/theme/beige.css" id="theme">
<link rel="stylesheet" href="static/highlight.js-9.16.2/styles/tomorrow-night-eighties.css">
<link rel="stylesheet" href="static/highlight.js-9.16.2/styles/tomorrow-night-eighties.min.css">
<style>
mark {
background-color: inherit;
border: 1px solid #ffff00;
}
.reveal .footer {
position: absolute;
bottom: 0.5em;
left: 1em;
font-size: 0.3em;
z-index: 1;
}
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<section data-markdown data-separator="^\n---\n$" data-separator-vertical="^\n--\n$">
<textarea data-template>
# Prometheus Unit Testing
Howard Burgess, Core Engineering
---
## Prometheus
### Scraping
* Apps expose their metrics at `/metrics`
```json
# TYPE http_requests counter
http_requests{path="/login",http_code="200"} 657428
http_requests{path="/login",http_code="500"} 29
# TYPE cpu_temperature gauge
cpu_temperature{cpu="3"} 47.1
```
* Prometheus scrapes metrics and ingests them
* `scrape_interval` (default `1m`)
Note:
* Prometheus is a monitoring system and time series database
* Various exporters (e.g. Kafka, Node Exporter)
* Multidimensional data model (labels)
* `/metrics` endpoint is configurable in Prometheus
* Service discovery to find targets to scrape (e.g. Kubernetes)
* No evaluation happens at scrape time
--
## Prometheus
### Rule evaluation
* Prometheus evaluates PromQL rules and alerts
* Recording rules precompute queries for reuse
* Alerting rules trigger alerts for unusual behaviour
* Alerts are sent to [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/)
* `evaluation_interval` (default `1m`)
Note:
* PromQL query language
* Recording rules generate additional time series
* Alertmanager groups and rate-limits alerts, then sends them to e.g. Slack
* We can unit test recording rules and alerts
---
## Evaluation of a simple alert
### Alert rule
```yaml
- alert: ServiceDown
  expr: up == 0
  labels:
    severity: minor
  annotations:
    description: 'Service {{.Labels.instance}} is down'
```
Note:
* `up` is a built-in time series indicating that a target was successfully scraped
* All labels from `up` are available, plus any specified in `labels:` section
* In reality you'd use `for: 5m` to guard against e.g. a missed scrape
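A minimal sketch of the same rule with a `for:` clause, as the last note suggests. The `5m` hold time is only the note's example value:

```yaml
- alert: ServiceDown
  expr: up == 0
  for: 5m   # stays pending until `up == 0` has held for 5 minutes
  labels:
    severity: minor
  annotations:
    description: 'Service {{.Labels.instance}} is down'
```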
--
## Evaluation of a simple alert
### Collected metrics at time _`t`_
```json
up{instance="10.0.0.1",job="sample-app"} 1
up{instance="10.0.0.2",job="sample-app"} 0
up{instance="10.0.0.3",job="sample-app"} 0
```
### Synthetic time series generated
```json
ALERTS{alertname="ServiceDown",alertstate="firing",instance="10.0.0.2",job="sample-app"} 1
ALERTS{alertname="ServiceDown",alertstate="firing",instance="10.0.0.3",job="sample-app"} 1
```
--
## Evaluation of a simple alert
### Prometheus sends alert to Alertmanager
```json
[
  {
    "status": "firing",
    "labels": {
      "alertname": "ServiceDown",
      "instance": "10.0.0.2",
      "job": "sample-app",
      "severity": "minor"
    },
    "annotations": {
      "description": "Service 10.0.0.2 is down"
    }
  },
  {
    "status": "firing",
    "labels": {
      "alertname": "ServiceDown",
      "instance": "10.0.0.3",
      "job": "sample-app",
      "severity": "minor"
    },
    "annotations": {
      "description": "Service 10.0.0.3 is down"
    }
  }
]
```
Note:
* Prometheus sends the alert to Alertmanager at every evaluation interval
* Alertmanager will rate-limit the delivery of alerts to e.g. Slack
* Before unit testing was officially supported, alerts were difficult to test, which left people with little confidence in them
---
## How did people test before?
### One approach
* Use WireMock to simulate scrape targets
  - Inject responses for the `/metrics` endpoint
  - Prometheus scrapes it and evaluates rules
  - Query Prometheus for `ALERTS{alertname=...}`
Note:
* Problem: can't inject historical data into Prometheus
* Prometheus needs to scrape frequently (e.g. `1s`) otherwise tests take ages
* Con: Can't test rules with `for: Xm` clauses without waiting in real time
* Con: Not very precise - depends on exactly when scrapes/evaluations happen
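For illustration, a Prometheus configuration for this approach might look roughly like the sketch below. The target `wiremock:8080` and the file name `alerts.rules.yml` are hypothetical placeholders, not taken from a real setup:

```yaml
# prometheus.yml for the WireMock approach (sketch; names are hypothetical)
global:
  scrape_interval: 1s       # scrape often so tests don't take ages
  evaluation_interval: 1s   # evaluate rules just as often
rule_files:
  - alerts.rules.yml        # hypothetical rules file under test
scrape_configs:
  - job_name: sample-app
    metrics_path: /metrics  # the default, shown for clarity
    static_configs:
      - targets: ['wiremock:8080']  # hypothetical WireMock host and port
```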
--
## How did people test before?
### Another approach
* Use another time series database
  - Insert historical metrics into InfluxDB
  - Configure Prometheus to `remote_read` from it
  - Prometheus doesn't scrape; only evaluates rules
  - Query Prometheus for `ALERTS{alertname=...}`
Note:
* Con: More moving parts and InfluxDB query language
* Con: Not very precise due to Prometheus interpolation - depends on exactly when evaluations happen
* Pro: Can test rules with `for: Xm` clauses
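A rough sketch of the Prometheus side of this approach. The URL assumes an InfluxDB 1.x instance reachable as `influxdb:8086` with a database called `prometheus`; both, like the rules file name, are assumptions for illustration:

```yaml
# prometheus.yml for the remote_read approach (sketch; URL is an assumption)
global:
  evaluation_interval: 1s
rule_files:
  - alerts.rules.yml            # hypothetical rules file under test
remote_read:
  # Assumed InfluxDB 1.x Prometheus read endpoint and database name
  - url: 'http://influxdb:8086/api/v1/prom/read?db=prometheus'
    read_recent: true           # also query the remote store for recent data
# No scrape_configs section: Prometheus only evaluates rules
```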
---
## Prometheus [2.5.0](https://github.com/prometheus/prometheus/releases/tag/v2.5.0)
Unit testing framework introduced
Add samples to the time series database
```yaml
- interval: 1m
  input_series:
    - series: up{instance="10.0.0.1"}
      values: 1 0 1 0 0 1
      # at:   0m 1m 2m 3m 4m 5m
```
Evaluate rules at a point in time
```yaml
alert_rule_test:
  - alertname: ServiceDown
    eval_time: 5m
```
Note:
* No scraping involved
* Times are precise. No interpolation
* Very fast
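Put together, a complete test file for the earlier `ServiceDown` rule could look like this sketch. It assumes the rule (without a `for:` clause) lives in a rule group in a hypothetical file `service-down.rules.yml`:

```yaml
rule_files:
  - service-down.rules.yml   # hypothetical file containing the ServiceDown rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{instance="10.0.0.1"}'
        values: '1 0 1 0 0 1'   # samples at 0m 1m 2m 3m 4m 5m
    alert_rule_test:
      # At 4m the sample is 0, so the alert is expected to fire
      - alertname: ServiceDown
        eval_time: 4m
        exp_alerts:
          - exp_labels:
              severity: minor
              instance: '10.0.0.1'
            exp_annotations:
              description: Service 10.0.0.1 is down
      # At 5m the sample is back to 1, so no alert is expected
      - alertname: ServiceDown
        eval_time: 5m
        exp_alerts: []
```

The `alertname` label does not need to be listed under `exp_labels`.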
--
## How to run
It's part of `promtool`
```bash
$ promtool test rules *.test.yml
```
Or use Docker
```bash
$ docker run -v $PWD:/data:ro --entrypoint sh prom/prometheus \
-c 'promtool test rules /data/*.test.yml'
```
Note:
* `promtool` is also used to:
  - check config and rules files for syntax
  - run queries
  - output debug information
---
## A more complex alert
<pre><code data-trim data-noescape class="stretch lang-yaml">
- alert: IceCreamStockLow
  expr: temperature_c{area="outside"} > 25
    and on (shop)
    stock_level{item="ice-cream"} < 5
  for: 5m
  labels:
    severity: major
  annotations:
    description: >
      Low on ice cream at the {{.Labels.shop}} shop and
      it is {{.Value | printf "%.1f"}}C {{.Labels.area}}
</code></pre>
We're expecting alerts like:
"Low on ice cream at the Cardiff shop and it is 26.3C outside"
Note:
* Will alert for all `shop`s
* Combines two time series with different label sets, matching on `shop` only
* Actual time series would probably have more labels
* With [logical operators](https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators) the values for the whole expression come from the left-hand side
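Rule snippets like this one need to sit inside a rule group before Prometheus or `promtool` can load them. A sketch of a full rules file, assuming the file name `ice-cream.rules.yml` and a hypothetical group name:

```yaml
# ice-cream.rules.yml (sketch): the rule wrapped in a rule group
groups:
  - name: ice-cream          # hypothetical group name
    rules:
      - alert: IceCreamStockLow
        expr: |
          temperature_c{area="outside"} > 25
          and on (shop)
          stock_level{item="ice-cream"} < 5
        for: 5m
        labels:
          severity: major
        annotations:
          # `>-` folds the lines into one and drops the trailing newline
          description: >-
            Low on ice cream at the {{.Labels.shop}} shop and
            it is {{.Value | printf "%.1f"}}C {{.Labels.area}}
```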
--
<pre><code data-trim data-noescape class="stretch lang-yaml">
evaluation_interval: 1m
rule_files:
  - ice-cream.rules.yml
tests:
  - interval: 1m
    input_series:
      - series: stock_level{item="ice-cream",shop="Birmingham"}
        values: 2 2 2 2 2 2
      - series: stock_level{item="ice-cream",shop="London"}
        values: 4 4 4 4 4 4
      - series: temperature_c{area="outside",shop="Birmingham"}
        values: 23.98 23.98 23.98 23.98 23.98 23.98
      - series: temperature_c{area="outside",shop="London"}
        values: 28.75+1x5 # 28.75 29.75 30.75 31.75 32.75 33.75
    alert_rule_test:
      - alertname: IceCreamStockLow
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              area: outside
              severity: major
              shop: London
            exp_annotations:
              description: Low on ice cream at the London shop and it is 33.8C outside
</code></pre>
Note:
* `rule_files` are relative to where `promtool` was run (not to the location of the test file)
* `interval` is the interval for the sample values. Doesn't have to match `evaluation_interval`
* Compact form to express sample values
* Gotcha: not providing enough samples to cover the `eval_time`
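As a quick illustration of the compact form, the sketch below shows how a few expressions expand. The Cardiff series is made up for the example:

```yaml
input_series:
  # 'a+bxn' expands to n+1 samples: a, a+b, ..., a+n*b
  - series: 'temperature_c{area="outside",shop="London"}'
    values: '28.75+1x5'     # 28.75 29.75 30.75 31.75 32.75 33.75 (0m to 5m)
  # 'a+0xn' repeats a constant n+1 times
  - series: 'stock_level{item="ice-cream",shop="London"}'
    values: '4+0x5'         # 4 4 4 4 4 4
  # 'a-bxn' counts down instead (hypothetical series)
  - series: 'stock_level{item="ice-cream",shop="Cardiff"}'
    values: '10-1x5'        # 10 9 8 7 6 5
```

With `interval: 1m`, six samples cover 0m to 5m, which is just enough for an `eval_time` of `5m`.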
--
## Test output
```bash
$ promtool test rules *.test.yml
Unit Testing: ice-cream.rules.test.yml
SUCCESS
```
### Errors
If rule had temperature `>20` instead of `>25`
```bash
$ promtool test rules *.test.yml
Unit Testing: ice-cream.rules.test.yml
FAILED:
alertname:IceCreamStockLow, time:5m0s,
exp:"[Labels:{alertname=\"IceCreamStockLow\", area=\"outside\", severity=\"major\", shop=\"London\"} Annotations:{description=\"Low on ice cream at the London shop and it is 33.8C outside\"}]",
got:"[Labels:{alertname=\"IceCreamStockLow\", area=\"outside\", severity=\"major\", shop=\"Birmingham\"} Annotations:{description=\"Low on ice cream at the Birmingham shop and it is 24.0C outside\"},
Labels:{alertname=\"IceCreamStockLow\", area=\"outside\", severity=\"major\", shop=\"London\"} Annotations:{description=\"Low on ice cream at the London shop and it is 33.8C outside\"}]"
```
--
## Negative tests
<pre><code data-trim data-noescape class="stretch lang-yaml">
# Doesn't fire if stock levels are high enough
- interval: 1m
  input_series:
    - series: stock_level{item="ice-cream",shop="London"}
      values: <mark>5+0x5 # 5 5 5 5 5 5</mark>
    - series: temperature_c{area="outside",shop="London"}
      values: 25.1+0x5 # 25.1 25.1 25.1 25.1 25.1 25.1
  alert_rule_test:
    - alertname: IceCreamStockLow
      eval_time: 5m
      exp_alerts: <mark>[]</mark>
</code></pre>
--
## Negative tests
<pre><code data-trim data-noescape class="stretch lang-yaml">
# Doesn't fire for other products
- interval: 1m
  input_series:
    - series: stock_level{<mark>item="hot-chocolate"</mark>,shop="London"}
      values: 4+0x5 # 4 4 4 4 4 4
    - series: temperature_c{area="outside",shop="London"}
      values: 25.1+0x5 # 25.1 25.1 25.1 25.1 25.1 25.1
  alert_rule_test:
    - alertname: IceCreamStockLow
      eval_time: 5m
      exp_alerts: <mark>[]</mark>
</code></pre>
--
## Negative tests
<pre><code data-trim data-noescape class="stretch lang-yaml">
# Doesn't fire if temperature reduces within 5m
- interval: 1m
  input_series:
    - series: stock_level{item="ice-cream",shop="London"}
      values: 4+0x5 # 4 4 4 4 4 4
    - series: temperature_c{area="outside",shop="London"}
      values: 25.1+0x4 <mark>25</mark> # 25.1 25.1 25.1 25.1 25.1 25
  alert_rule_test:
    - alertname: IceCreamStockLow
      eval_time: 5m
      exp_alerts: <mark>[]</mark>
</code></pre>
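One more negative case worth sketching, assuming the same `IceCreamStockLow` rule with `for: 5m`: the condition holds from the start, but at `4m` the alert should still be pending rather than firing, so no alerts are expected.

```yaml
# Doesn't fire before the condition has held for the full 5m
- interval: 1m
  input_series:
    - series: 'stock_level{item="ice-cream",shop="London"}'
      values: '4+0x5'     # 4 4 4 4 4 4
    - series: 'temperature_c{area="outside",shop="London"}'
      values: '25.1+0x5'  # 25.1 25.1 25.1 25.1 25.1 25.1
  alert_rule_test:
    - alertname: IceCreamStockLow
      eval_time: 4m       # still pending, so no firing alert is expected
      exp_alerts: []
```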
---
## Testing rate()
`rate()` computes the per-second rate of the metric.
```yaml
- alert: HighDiskIO
  expr: sum by (device) (rate(node_disk_reads_completed_total[2m])) > 1000
  labels:
    severity: minor
  annotations:
    description: 'High I/O on {{.Labels.device}}: {{.Value | printf "%.0f"}} reads/sec'
```
We're expecting alerts like:
"High I/O on sda: 2400 reads/sec"
Note:
* Use `printf` because with interpolation values could be fractional
--
### Sample interval `1m`
`rate()` is always per-second, so each per-minute sample must increase by 60 times the target rate
<pre><code data-trim data-noescape class="stretch lang-yaml">
evaluation_interval: 1m
rule_files:
  - disk-io.rules.yml
tests:
  - interval: <mark>1m</mark>
    input_series:
      - series: node_disk_reads_completed_total{device="sda"}
        values: <mark>0+74040x2</mark> # 1234/sec for a minute == 74040/min
    alert_rule_test:
      - alertname: HighDiskIO
        eval_time: 2m
        exp_alerts:
          - exp_labels:
              severity: minor
              device: sda
            exp_annotations:
              description: 'High I/O on sda: <mark>1234</mark> reads/sec'
</code></pre>
--
### Sample interval `1s`
Per-second rates can be easier to work with
<pre><code data-trim data-noescape class="lang-yaml">
- interval: <mark>1s</mark>
  input_series:
    - series: node_disk_reads_completed_total{device="sda"}
      # rising at 1001/sec for almost 2 mins, then at 10/sec for last second
      values: 0+1001x119 119129 # +10 to final sample in seq (119119)
    - series: node_disk_reads_completed_total{device="sdb"}
      # rising at 1005/sec for 4 mins
      values: 0+1005x240
    - series: node_disk_reads_completed_total{device="sdc"}
      # rising at 1000/sec for almost 4m, then 1200/sec for last second
      values: 0+1000x239 240200 # +1200 to final value in seq (239000)
  alert_rule_test:
    - alertname: HighDiskIO
      eval_time: 2m
      exp_alerts:
        - exp_labels:
            severity: minor
            device: sdb
          exp_annotations:
            description: 'High I/O on sdb: 1005 reads/sec'
    - alertname: HighDiskIO
      eval_time: 4m
      exp_alerts:
        - exp_labels:
            severity: minor
            device: sdb
          exp_annotations:
            description: 'High I/O on sdb: 1005 reads/sec'
        - exp_labels:
            severity: minor
            device: sdc
          exp_annotations:
            # Actual value without rounding is 1001.6666666666666
            description: 'High I/O on sdc: 1002 reads/sec'
</code></pre>
---
## Testing recording rules
Consider metrics whose values are timestamps
<pre><code data-trim data-noescape class="lang-yaml">
job_completed_at{job="job-one"} 1514321353 # Tue 26 Dec 2017 20:49:13 GMT
job_completed_at{job="job-two"} 1554321353 # Wed 3 Apr 2019 20:55:53 BST
</code></pre>
Record a new time series containing only those jobs that completed in the past two hours
<pre><code data-trim data-noescape class="lang-yaml">
rules:
  - record: job_completed_recently
    expr: job_completed_at >= (time() - (2*60*60))
</code></pre>
How can we test it when we can't control the current time?
--
Value of `time()` increases from zero throughout the test
<pre><code data-trim data-noescape class="stretch lang-yaml">
evaluation_interval: 1h
rule_files:
  - job_completion.rules.yml
tests:
  - interval: 1h
    input_series:
      - series: job_completed_at{job="job-one"}
        values: 0+0x3 # t=0h
      - series: job_completed_at{job="job-two"}
        values: 3600+0x3 # t=1h
    promql_expr_test:
      - expr: job_completed_recently
        eval_time: 2h
        exp_samples:
          - labels: job_completed_recently{job="job-one"}
            value: 0
          - labels: job_completed_recently{job="job-two"}
            value: 3600
      - expr: job_completed_recently
        eval_time: 3h
        exp_samples:
          - labels: job_completed_recently{job="job-two"}
            value: 3600
</code></pre>
---
## Test in Docker
If you're not yet using Prometheus >=2.5.0, you can use a more recent version just for testing
```dockerfile
FROM prom/prometheus:v2.9.2
WORKDIR /data
COPY *.rules.yml .
COPY *.rules.test.yml .
RUN promtool test rules *.test.yml
```
Note:
* Will only build successfully if tests pass
--
You can trigger the build from Gradle
```gradle
task testUsingRecentPrometheus(type:Exec) {
    commandLine "docker", "build", "-t", "prometheus-test", "."
}
```
---
## Summary
Advantages:
* Repeatable unit tests with precise control
* Very quick to run
* Can test alerts that only fire after an extended time period
But:
* They are only unit tests
* Only as good as the simulated metrics
* Cannot pick up on changes in the _sources_ of metrics
Note:
* e.g. when [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) makes changes to metric names and labels
</textarea>
</section>
</div>
<div class="footer">
<a href="https://howardburgess.github.io/prometheus-unit-testing/">howardburgess.github.io/prometheus-unit-testing</a>
</div>
</div>
<script src="static/reveal.js-3.8.0/js/reveal.js"></script>
<script>
Reveal.initialize({
controls: true,
controlsTutorial: false,
progress: true,
history: true,
center: false,
transition: 'fade', // none/fade/slide/convex/concave/zoom
transitionSpeed: 'fast', // default/fast/slow
width: "100%",
height: "100%",
margin: 0.05,
dependencies: [
{ src: 'static/reveal.js-3.8.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'static/reveal.js-3.8.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'static/reveal.js-3.8.0/plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'static/reveal.js-3.8.0/plugin/notes/notes.js' }
],
markdown: {
smartypants: true
}
});
</script>
</body>
</html>