Concept for monitoring OpenShift 4 #20

Closed
corvus-ch opened this issue Jul 14, 2020 · 10 comments · Fixed by #45

@corvus-ch
Contributor

corvus-ch commented Jul 14, 2020

OpenShift 4 includes cluster monitoring based on Prometheus. This ticket aims to answer the question: how do we make use of it?

Motivation

The documentation about Configuring the monitoring stack lists quite a lot of things that are explicitly not supported. These include:

  • Adding additional ServiceMonitor objects
  • Creating unexpected ConfigMap or PrometheusRule objects
  • Directly editing the resources and custom resources of the monitoring stack
  • Using resources of the stack for your own purposes
  • Stopping the Cluster Monitoring Operator from reconciling the monitoring stack
  • Adding new alert rules or editing existing ones
  • Modifying Grafana

We know from experience with OpenShift 3.11 that some tweaking will be required at some point. This includes adding ServiceMonitors for things not (yet) covered by Cluster Monitoring, adding new rules to cover additional failure scenarios, and altering rules that are noisy and/or not actionable.

Goals

Enable us to:

  • monitor things not covered by Cluster Monitoring
  • tweak existing alert rules in case they do not provide any value to us and/or are noisy without being actionable

Non-Goals

Answering the question of where alerts are sent and thus how they are acted upon.

Design Proposal

Based on all those restrictions, one could conclude that it is best to omit Cluster Monitoring and build everything ourselves. This would give full control over everything. But Cluster Monitoring is a fundamental part of an OpenShift 4 setup and will always be present; it is required for certain things to work properly. Rebuilding it all would be a huge waste of resources, both in terms of management/engineering effort and in terms of compute and storage.

For that reason, we will make use of Cluster Monitoring as much as possible. We will operate a second pair of Prometheus instances in parallel to the Cluster Monitoring ones. That second pair only takes care of the things we cannot do with Cluster Monitoring.

Those additional Prometheus instances will get the needed metrics from Cluster Monitoring. Targets are only scraped directly when Cluster Monitoring is not already doing so. Alerts will be sent to the Alertmanager instances of Cluster Monitoring.
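
As a rough sketch, assuming the Prometheus Operator and a dedicated namespace for the second pair (names, labels and the Alertmanager port are illustrative; authentication against the OpenShift Alertmanager is omitted):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: infra                        # hypothetical name
  namespace: infra-monitoring        # hypothetical namespace for the second pair
spec:
  replicas: 2                        # mirror the two-instance setup of Cluster Monitoring
  serviceAccountName: prometheus-infra
  serviceMonitorSelector: {}         # object/namespace selectors are discussed in the comments below
  ruleSelector: {}
  alerting:
    alertmanagers:
      - namespace: openshift-monitoring
        name: alertmanager-main      # the Cluster Monitoring Alertmanager
        port: web                    # port name is an assumption; TLS/token auth omitted in this sketch
```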

User Stories

Noisy and/or non-actionable alert rule

The Configuring the monitoring stack documentation explicitly prohibits changing the existing alert rules. From our experience with OpenShift 3.11, we had cases where we needed to do so: a rule just produced noise, was not actionable, and/or did not cover some edge cases.

The OpenShift 4 monitoring is based on kube-prometheus. We also have experience with this, as we use it for non-OpenShift Kubernetes clusters, and we already had to tweak some of those rules. See CPUThrottlingHigh false positives for an example.

For those cases, we can make use of Alertmanager. With its routing configuration, we route those troublesome alerts into the void. The second set of Prometheus instances then evaluates a replacement alert rule.
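
A minimal sketch of that routing idea; on OpenShift 4 the Alertmanager configuration lives in a secret in the openshift-monitoring namespace, and the receiver name here is arbitrary:

```yaml
route:
  routes:
    # Route the noisy upstream alert into the void; a replacement rule is
    # evaluated on our own Prometheus pair instead.
    - match:
        alertname: CPUThrottlingHigh
      receiver: "void"
receivers:
  - name: "void"   # a receiver without any configuration silently drops the alert
```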

Service not monitored

The Configuring the monitoring stack documentation explicitly prohibits the creation of additional ServiceMonitors within Cluster Monitoring. Instead, we will use our second set of Prometheus instances to scrape metrics from those services. Rules based on those metrics will also be evaluated there.
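
A sketch of such a ServiceMonitor, intended to be picked up only by our own Prometheus pair; names, labels and the port name are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-service
  namespace: infra-monitoring                 # hypothetical namespace watched by our own operator
  labels:
    monitoring.example.com/instance: infra    # hypothetical label, see the selector discussion in the comments
spec:
  selector:
    matchLabels:
      app: example-service
  endpoints:
    - port: metrics                           # name of the port on the target Service
      interval: 30s
```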

Failure scenario not covered by existing alert rules

The Configuring the monitoring stack documentation explicitly prohibits the creation of additional alert rules.

Additional alert rules will be configured and evaluated on our second set of Prometheus instances. The metrics will come from Cluster Monitoring and/or from directly scraped targets.
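
A sketch of such an additional rule, evaluated on the second pair; the alert itself is purely illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: additional-rules
  namespace: infra-monitoring                 # hypothetical namespace
  labels:
    monitoring.example.com/instance: infra    # hypothetical label matched by the rule selector
spec:
  groups:
    - name: additional.rules
      rules:
        - alert: ExampleServiceDown
          expr: up{job="example-service"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            message: example-service has been down for more than 5 minutes.
```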

Custom Dashboards

The Configuring the monitoring stack documentation explicitly prohibits changing the Grafana instance. In order to have custom dashboards, we can operate our own Grafana instance which uses our second set of Prometheus instances as its data source.
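
A sketch of the Grafana datasource provisioning for that instance, assuming the second pair is reachable through the prometheus-operated service created by the operator:

```yaml
# Grafana datasource provisioning file (service URL is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus (infra)
    type: prometheus
    access: proxy
    url: http://prometheus-operated.infra-monitoring.svc:9090
    isDefault: true
```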

Implementation Details/Notes/Constraints

Our own pair of Prometheus instances will use remote read to query metrics from Cluster Monitoring. This does not create additional replicas of the metrics; no additional storage is needed except for the additional targets scraped directly. Remote read is also efficient in terms of memory usage (see Remote Read Meets Streaming).
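
The corresponding part of the Prometheus custom resource could look roughly like this; the endpoint is assumed to be the Cluster Monitoring Prometheus service, which sits behind an OAuth proxy on OpenShift, so the required authentication is omitted here:

```yaml
spec:
  remoteRead:
    - url: https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/read   # assumed endpoint
      readRecent: true   # use remote read for recent data as well, not only for old samples
      # bearer token / TLS configuration for the OAuth proxy omitted in this sketch
```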

Risks and Mitigations

This setup works around all the configuration restrictions of Cluster Monitoring, and does so with no or only minimal resource overhead.

Remote read is a source of failure that is usually not present and has to be accounted for:

  • Cluster Monitoring operates two Prometheus instances configured identically
  • Thanos Querier load balances queries to those Prometheus instances
  • Two additional Prometheus instances are operated for scraping additional metrics and evaluating rules
  • Alert rules must be engineered to detect when remote read has issues (sketched below)
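
One possible way to detect a broken remote read, assuming kube_node_info (or any other metric that only reaches our pair via remote read from Cluster Monitoring) is used as a canary:

```yaml
- alert: ClusterMonitoringRemoteReadBroken
  expr: absent(kube_node_info)   # metric is only available through remote read in this setup
  for: 15m
  labels:
    severity: warning
  annotations:
    message: No metrics from Cluster Monitoring via remote read for 15 minutes.
```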

See Remote Read Meets Streaming for an in depth discussion on the subject.

Drawbacks

This setup is specific to OpenShift 4. It cannot be applied to non-OpenShift 4 setups, or at least not without major changes.

Alternatives

Remote write

The OpenShift 4 documentation does not mention it, but in the source we see that remote write targets can be configured. Prometheus itself does not provide a receive endpoint; instead, Thanos Receiver could be used.

Thanos Receiver writes the received metrics in the same format as Prometheus does. It is possible to point a Prometheus instance at the same data directory and thus "import" the data into Prometheus. While this works technically, it is probably not safe for production. Instead, remote read or Thanos Ruler must be used.

So this is less an alternative than a complement for achieving long-term storage.

Cluster Monitoring would be configured to write metrics into a Thanos Receiver. The receiver then stores those metrics in S3. With Thanos Querier, those metrics would then be made available again to Prometheus using remote read, and also to Grafana.
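
A sketch of how Cluster Monitoring could be pointed at a Thanos Receiver; the remoteWrite key is the one found in the Cluster Monitoring Operator source mentioned above, and the receiver URL is an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: http://thanos-receive.infra-monitoring.svc:19291/api/v1/receive   # hypothetical receiver endpoint
```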

Federation

Federation allows a Prometheus server to scrape selected time series from another Prometheus server. The key word here is selected.

It is possible to use the federation endpoint to scrape all metrics. This has several downsides.

  • Both the federating and the federated instance need a substantial amount of memory. This is because all metrics need to be loaded into memory for marshalling and unmarshalling to and from the transport format. This can be mitigated by splitting up the federation into smaller chunks.
  • All scraped metrics need to be stored yet again, which probably results in additional costs for the required disk space.
  • Federation requires planning in advance. Metrics are not available if they have not been scraped.

Federation is meant to build aggregated views in a hierarchical architecture. It is not built to bring most, if not all, metrics from one Prometheus instance to another.
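
For reference, pulling everything over the federation endpoint would look roughly like this; the target address is illustrative, and the catch-all match[] selector is exactly the anti-pattern described above:

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'   # pull (almost) everything
    static_configs:
      - targets:
          - prometheus-k8s.openshift-monitoring.svc:9091   # illustrative; behind an OAuth proxy on OpenShift
```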

References

corvus-ch self-assigned this Jul 14, 2020
corvus-ch added the enhancement (New feature or request) label Jul 14, 2020
@corvus-ch
Contributor Author

RFC: @srueg @tobru @bliemli @madchr1st

srueg added the RFC (Request for comments) label Jul 14, 2020
@tobru
Member

tobru commented Jul 14, 2020

Generally I really like it and I think it's pretty much what we want!

This setup is specific to OpenShift 4. It cannot be applied to non-OpenShift 4 setups, or at least not without major changes.

Why?

Some other questions:

  • How would the second Prometheus pair be installed and configured? I guess with the Prometheus Operator, but can we use the already existing one? Or do we have to bring our own Prometheus Operator instance?
  • How do we make sure that our own ServiceMonitor (and other) objects are not used by the original Prometheus Operator? I guess it's already configured to only watch selected namespaces?
  • How would we configure the original Prometheus in terms of resources and retention policy? Does this even have to be specified?

@bliemli
Contributor

bliemli commented Jul 14, 2020

* How would the second Prometheus pair be installed and configured? I guess with the Prometheus Operator, but can we use the already existing one? Or do we have to bring our own Prometheus Operator instance?

* How do we make sure that our own `ServiceMonitor` (and other) objects are not used by the original Prometheus Operator? I guess it's already configured to only watch selected namespaces?

* How would we configure the original Prometheus in terms of resources and retention policy? Does this even have to be specified?

Regarding these questions also see https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-monitor-your-own-services-tp.

@corvus-ch
Contributor Author

Regarding these questions also see https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-monitor-your-own-services-tp.

This is a technology preview; we should not use it.

@corvus-ch
Contributor Author

* How would the second Prometheus pair be installed and configured? I guess with the Prometheus Operator, but can we use the already existing one? Or do we have to bring our own Prometheus Operator instance?

The Prometheus Operator of Cluster Monitoring only takes care of its own resources. We would have to bring our own operator, which would in turn only take care of the namespace we place our Prometheus in, for the same reasons the Cluster Monitoring operator is limited to a single namespace.

We can do so by using OLM or by bringing in the operator by other means (e.g. kube-prometheus).
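
For the OLM route, a sketch of a Subscription; package name, channel and namespace are assumptions, and an OperatorGroup for that namespace would be needed as well:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus
  namespace: infra-monitoring          # hypothetical namespace
spec:
  name: prometheus                     # package name in the catalog (assumption)
  channel: beta                        # channel is an assumption
  source: community-operators
  sourceNamespace: openshift-marketplace
```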

@corvus-ch
Contributor Author

* How do we make sure that our own `ServiceMonitor` (and other) objects are not used by the original Prometheus Operator? I guess it's already configured to only watch selected namespaces?

Yes, Cluster Monitoring only watches a limited set of namespaces. I need to check which ones. We would need to do the same. Preferably we do this by labelling namespaces (need to check if that is possible).
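
A label-based selection is possible on the Prometheus custom resource itself (this controls which namespaces the Prometheus instance selects objects from, not which namespaces the operator watches); the label key is an assumption:

```yaml
spec:
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring.example.com/scrape: "true"   # hypothetical namespace label
  ruleNamespaceSelector:
    matchLabels:
      monitoring.example.com/scrape: "true"
```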

@bliemli
Contributor

bliemli commented Jul 14, 2020

IMHO it is possible to configure which namespaces a Prometheus Operator watches for ServiceMonitor resources using the --namespaces parameter. But @madchr1st knows way more about that than I do.
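
For completeness, that flag goes on the operator Deployment and limits what the operator itself watches (namespace list illustrative):

```yaml
# excerpt from a prometheus-operator Deployment
containers:
  - name: prometheus-operator
    args:
      - --namespaces=infra-monitoring   # comma-separated list of namespaces to watch
```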

@corvus-ch
Contributor Author

* How would we configure the original Prometheus in terms of resources and retention policy? Does this even have to be specified?

We will for sure configure persistent storage (the default is emptyDir). Disk size and memory requests/limits need to be defined on a per-cluster level based on actual usage.

The default retention time is 10 days. We might want to change that, but I do not think so.
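
Assuming this refers to the additional pair (for Cluster Monitoring itself, the equivalent settings go through the cluster-monitoring-config ConfigMap), the relevant fields on the Prometheus resource look roughly like this; all sizes are placeholders to be set per cluster:

```yaml
spec:
  retention: 10d                      # the default mentioned above; adjust if needed
  resources:
    requests:
      cpu: 500m                       # placeholder; derive from observed usage
      memory: 2Gi                     # placeholder
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 50Gi             # placeholder; size per cluster
```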

@corvus-ch
Contributor Author

IMHO it is possible to configure which namespaces a Prometheus Operator watches for ServiceMonitor resources using the --namespaces parameter. But @madchr1st knows way more about that than I do.

That will work for sure. If possible, I would prefer a label-based approach.

@bliemli
Contributor

bliemli commented Jul 14, 2020

IMHO it is possible to configure which namespaces a Prometheus Operator watches for ServiceMonitor resources using the --namespaces parameter. But @madchr1st knows way more about that than I do.

That will work for sure. If possible, I would prefer a label-based approach.

A specific label will have to be applied to the ServiceMonitor and PrometheusRule resources.
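
The counterpart on the Prometheus resource then only selects objects carrying that label; the label key is again an assumption:

```yaml
spec:
  serviceMonitorSelector:
    matchLabels:
      monitoring.example.com/instance: infra   # hypothetical label on our ServiceMonitor objects
  ruleSelector:
    matchLabels:
      monitoring.example.com/instance: infra   # and on our PrometheusRule objects
```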

srueg added three commits that referenced this issue Oct 5, 2020
srueg closed this as completed in #45 Oct 5, 2020