Concept for monitoring OpenShift 4 #20

Closed
corvus-ch opened this issue Jul 14, 2020 · 10 comments · Fixed by #45

@corvus-ch
Contributor

corvus-ch commented Jul 14, 2020

OpenShift 4 includes cluster monitoring based on Prometheus. This ticket aims to answer the question: how do we make use of it?

Motivation

The documentation about Configuring the monitoring stack lists quite a lot of things that are explicitly not supported. These include:

  • Adding additional ServiceMonitor objects
  • Creating unexpected ConfigMap or PrometheusRule objects
  • Directly editing the resources and custom resources of the monitoring stack
  • Using resources of the stack for your own purposes
  • Stopping the Cluster Monitoring Operator from reconciling the monitoring stack
  • Adding new alert rules or editing existing ones
  • Modifying Grafana

We know from experience with OpenShift 3.11 that some tweaking will be required at some point. This includes adding ServiceMonitors for things not (yet) covered by Cluster Monitoring, adding new rules to cover additional failure scenarios, and altering rules that are noisy and/or not actionable.

Goals

Enable us to:

  • monitor things not covered by Cluster Monitoring
  • tweak existing alert rules in case they do not provide any value to us and/or are noisy without being actionable

Non-Goals

Answering the question of where alerts are sent and thus how they are acted upon.

Design Proposal

Based on all those restrictions, one could conclude that it is best to omit Cluster Monitoring and build everything ourselves. This would give full control over everything. But Cluster Monitoring is a fundamental part of an OpenShift 4 setup and will always be present; it is required for certain things to work properly. Rebuilding it all would be a huge waste of resources, both in terms of management/engineering effort and in terms of compute and storage.

For that reason, we will make use of Cluster Monitoring as much as possible. We will operate a second pair of Prometheus instances in parallel to the Cluster Monitoring ones. That second pair only takes care of the things we cannot do with Cluster Monitoring.

Those additional Prometheus instances will get the needed metrics from Cluster Monitoring. Targets are only scraped directly when Cluster Monitoring is not already doing so. Alerts will be sent to the Alertmanager instances of Cluster Monitoring.
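
As a rough sketch, assuming the Prometheus Operator and a dedicated namespace for the second pair (names, labels and the Alertmanager port are illustrative; authentication against the OpenShift Alertmanager is omitted):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: infra                        # hypothetical name
  namespace: infra-monitoring        # hypothetical namespace for the second pair
spec:
  replicas: 2                        # mirror the two-instance setup of Cluster Monitoring
  serviceAccountName: prometheus-infra
  serviceMonitorSelector: {}         # object/namespace selectors are discussed in the comments below
  ruleSelector: {}
  alerting:
    alertmanagers:
      - namespace: openshift-monitoring
        name: alertmanager-main      # the Cluster Monitoring Alertmanager
        port: web                    # port name is an assumption; TLS/token auth omitted in this sketch
```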

User Stories

Noisy and/or non-actionable alert rule

The Configuring the monitoring stack documentation explicitly prohibits changing the existing alert rules. From our experience with OpenShift 3.11, we had cases where we needed to do so: a rule just produced noise, was not actionable, and/or did not cover some edge cases.

The OpenShift 4 monitoring is based on kube-prometheus. We also have experience with this, as we use it for non-OpenShift Kubernetes clusters, and we already had to tweak some of those rules. See CPUThrottlingHigh false positives for an example.

For those cases, we can make use of Alertmanager. With its routing configuration, we route those troublesome alerts into the void. The second set of Prometheus instances then evaluates a replacement alert rule.
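
A minimal sketch of that routing idea; on OpenShift 4 the Alertmanager configuration lives in a secret in the openshift-monitoring namespace, and the receiver name here is arbitrary:

```yaml
route:
  routes:
    # Route the noisy upstream alert into the void; a replacement rule is
    # evaluated on our own Prometheus pair instead.
    - match:
        alertname: CPUThrottlingHigh
      receiver: "void"
receivers:
  - name: "void"   # a receiver without any configuration silently drops the alert
```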

Service not monitored

The Configuring the monitoring stack documentation explicitly prohibits the creation of additional ServiceMonitors within Cluster Monitoring. Instead, we will use our second set of Prometheus instances to scrape metrics from those services. Rules based on those metrics will also be evaluated there.
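
A sketch of such a ServiceMonitor, intended to be picked up only by our own Prometheus pair; names, labels and the port name are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-service
  namespace: infra-monitoring                 # hypothetical namespace watched by our own operator
  labels:
    monitoring.example.com/instance: infra    # hypothetical label, see the selector discussion in the comments
spec:
  selector:
    matchLabels:
      app: example-service
  endpoints:
    - port: metrics                           # name of the port on the target Service
      interval: 30s
```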

Failure scenario not covered by existing alert rules

The Configuring the monitoring stack documentation explicitly prohibits the creation of additional alert rules.

Additional alert rules will be configured and evaluated on our second set of Prometheus instances. The metrics will come from Cluster Monitoring and/or from directly scraped targets.
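
A sketch of such an additional rule, evaluated on the second pair; the alert itself is purely illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: additional-rules
  namespace: infra-monitoring                 # hypothetical namespace
  labels:
    monitoring.example.com/instance: infra    # hypothetical label matched by the rule selector
spec:
  groups:
    - name: additional.rules
      rules:
        - alert: ExampleServiceDown
          expr: up{job="example-service"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            message: example-service has been down for more than 5 minutes.
```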

Custom Dashboards

The Configuring the monitoring stack documentation explicitly prohibits changing the Grafana instance. In order to have custom dashboards, we can operate our own Grafana instance which uses our second set of Prometheus instances as its data source.
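
A sketch of the Grafana datasource provisioning for that instance, assuming the second pair is reachable through the prometheus-operated service created by the operator:

```yaml
# Grafana datasource provisioning file (service URL is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus (infra)
    type: prometheus
    access: proxy
    url: http://prometheus-operated.infra-monitoring.svc:9090
    isDefault: true
```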

Implementation Details/Notes/Constraints

Our own pair of Prometheus instances will use remote read to query metrics from Cluster Monitoring. This does not create additional replicas of the metrics; no additional storage is needed except for the additional targets scraped directly. Remote read is also efficient in terms of memory usage (see Remote Read Meets Streaming).
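
The corresponding part of the Prometheus custom resource could look roughly like this; the endpoint is assumed to be the Cluster Monitoring Prometheus service, which sits behind an OAuth proxy on OpenShift, so the required authentication is omitted here:

```yaml
spec:
  remoteRead:
    - url: https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/read   # assumed endpoint
      readRecent: true   # use remote read for recent data as well, not only for old samples
      # bearer token / TLS configuration for the OAuth proxy omitted in this sketch
```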

Risks and Mitigations

This setup works around all the configuration restrictions of Cluster Monitoring, and does so with no or only minimal resource overhead.

Remote read is a source of failure that is usually not present and has to be accounted for:

  • Cluster Monitoring operates two Prometheus instances configured identically
  • Thanos Querier load balances queries to those Prometheus instances
  • Two additional Prometheus instances are operated for scraping additional metrics and evaluating rules
  • Alert rules must be engineered to detect when remote read has issues (sketched below)
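
One possible way to detect a broken remote read, assuming kube_node_info (or any other metric that only reaches our pair via remote read from Cluster Monitoring) is used as a canary:

```yaml
- alert: ClusterMonitoringRemoteReadBroken
  expr: absent(kube_node_info)   # metric is only available through remote read in this setup
  for: 15m
  labels:
    severity: warning
  annotations:
    message: No metrics from Cluster Monitoring via remote read for 15 minutes.
```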

See Remote Read Meets Streaming for an in depth discussion on the subject.

Drawbacks

This setup is specific to OpenShift 4. It cannot be applied to non-OpenShift 4 setups, or at least not without major changes.

Alternatives

Remote write

The OpenShift 4 documentation does not mention it, but in the source we see that remote write targets can be configured. Prometheus itself does not provide a receive endpoint; instead, Thanos Receiver could be used.

Thanos Receiver writes the received metrics in the same format as Prometheus does. It is possible to point a Prometheus instance at the same data directory and thus "import" the data into Prometheus. While this works technically, it is probably not safe for production. Instead, remote read or Thanos Ruler must be used.

So this is less an alternative than a complement for achieving long-term storage.

Cluster Monitoring would be configured to write metrics into a Thanos Receiver. The receiver then stores those metrics in S3. With Thanos Querier, those metrics would then be made available again to Prometheus using remote read, and also to Grafana.
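
A sketch of how Cluster Monitoring could be pointed at a Thanos Receiver; the remoteWrite key is the one found in the Cluster Monitoring Operator source mentioned above, and the receiver URL is an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: http://thanos-receive.infra-monitoring.svc:19291/api/v1/receive   # hypothetical receiver endpoint
```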

Federation

Federation allows a Prometheus server to scrape selected time series from another Prometheus server. The key word here is selected.

It is possible to use the federation endpoint to scrape all metrics. This has several downsides.

  • Both the federating and the federated instance need a substantial amount of memory. This is because all metrics need to be loaded into memory for marshalling and unmarshalling to and from the transport format. This can be mitigated by splitting up the federation into smaller chunks.
  • All scraped metrics need to be stored yet again, which probably results in additional costs for the required disk space.
  • Federation requires planning in advance. Metrics are not available if they have not been scraped.

Federation is meant to build aggregated views in a hierarchical architecture. It is not built to bring most, if not all, metrics from one Prometheus instance to another.
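
For reference, pulling everything over the federation endpoint would look roughly like this; the target address is illustrative, and the catch-all match[] selector is exactly the anti-pattern described above:

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'   # pull (almost) everything
    static_configs:
      - targets:
          - prometheus-k8s.openshift-monitoring.svc:9091   # illustrative; behind an OAuth proxy on OpenShift
```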

References

corvus-ch self-assigned this Jul 14, 2020
corvus-ch added the enhancement (New feature or request) label Jul 14, 2020
@corvus-ch
Contributor Author

RFC: @srueg @tobru @bliemli @madchr1st

srueg added the RFC (Request for comments) label Jul 14, 2020
@tobru
Member

tobru commented Jul 14, 2020

Generally I really like it and I think it's pretty much what we want!

This setup is specific to OpenShift 4. It cannot be applied to non-OpenShift 4 setups, or at least not without major changes.

Why?

Some other questions:

  • How would the second Prometheus pair be installed and configured? I guess with the Prometheus Operator, but can we use the already existing one? Or do we have to bring our own Prometheus Operator instance?
  • How do we make sure that our own ServiceMonitor (and other) objects are not used by the original Prometheus Operator? I guess it's already configured to only watch selected namespaces?
  • How would we configure the original Prometheus in terms of resources and retention policy? Does this even have to be specified?

@bliemli
Contributor

bliemli commented Jul 14, 2020

* How would the second Prometheus pair be installed and configured? I guess with the Prometheus Operator, but can we use the already existing one? Or do we have to bring our own Prometheus Operator instance?

* How do we make sure that our own `ServiceMonitor` (and other) objects are not used by the original Prometheus Operator? I guess it's already configured to only watch selected namespaces?

* How would we configure the original Prometheus in terms of resources and retention policy? Does this even have to be specified?

Regarding these questions also see https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-monitor-your-own-services-tp.

@corvus-ch
Contributor Author

Regarding these questions also see https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html#ocp-4-5-monitor-your-own-services-tp.

This is a technology preview; we should not use it.

@corvus-ch
Contributor Author

* How would the second Prometheus pair be installed and configured? I guess with the Prometheus Operator, but can we use the already existing one? Or do we have to bring our own Prometheus Operator instance?

The Prometheus Operator of Cluster Monitoring only takes care of its own resources. We would have to bring our own operator, which would in turn only take care of the namespace we place our Prometheus in, for the same reasons the Cluster Monitoring operator is limited to a single namespace.

We can do so by using OLM or by bringing in the operator by other means (e.g. kube-prometheus).
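
For the OLM route, a sketch of a Subscription; package name, channel and namespace are assumptions, and an OperatorGroup for that namespace would be needed as well:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus
  namespace: infra-monitoring          # hypothetical namespace
spec:
  name: prometheus                     # package name in the catalog (assumption)
  channel: beta                        # channel is an assumption
  source: community-operators
  sourceNamespace: openshift-marketplace
```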

@corvus-ch
Contributor Author

* How do we make sure that our own `ServiceMonitor` (and other) objects are not used by the original Prometheus Operator? I guess it's already configured to only watch selected namespaces?

Yes, Cluster Monitoring only watches a limited set of namespaces. I need to check which ones. We would need to do the same. Preferably we do this by labelling namespaces (need to check if that is possible).
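
A label-based selection is possible on the Prometheus custom resource itself (this controls which namespaces the Prometheus instance selects objects from, not which namespaces the operator watches); the label key is an assumption:

```yaml
spec:
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring.example.com/scrape: "true"   # hypothetical namespace label
  ruleNamespaceSelector:
    matchLabels:
      monitoring.example.com/scrape: "true"
```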

@bliemli
Contributor

bliemli commented Jul 14, 2020

IMHO it is possible to configure which namespaces a Prometheus Operator watches for ServiceMonitor resources using the --namespaces parameter. But @madchr1st knows way more about that than I do.
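
For completeness, that flag goes on the operator Deployment and limits what the operator itself watches (namespace list illustrative):

```yaml
# excerpt from a prometheus-operator Deployment
containers:
  - name: prometheus-operator
    args:
      - --namespaces=infra-monitoring   # comma-separated list of namespaces to watch
```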

@corvus-ch
Contributor Author

* How would we configure the original Prometheus in terms of resources and retention policy? Does this even have to be specified?

We will for sure configure persistent storage (the default is emptyDir). Disk size and memory requests/limits need to be defined on a per-cluster level based on actual usage.

The default retention time is 10 days. We might want to change that, but I do not think so.
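
Assuming this refers to the additional pair (for Cluster Monitoring itself, the equivalent settings go through the cluster-monitoring-config ConfigMap), the relevant fields on the Prometheus resource look roughly like this; all sizes are placeholders to be set per cluster:

```yaml
spec:
  retention: 10d                      # the default mentioned above; adjust if needed
  resources:
    requests:
      cpu: 500m                       # placeholder; derive from observed usage
      memory: 2Gi                     # placeholder
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 50Gi             # placeholder; size per cluster
```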

@corvus-ch
Contributor Author

IMHO it is possible to configure which namespaces a Prometheus Operator watches for ServiceMonitor resources using the --namespaces parameter. But @madchr1st knows way more about that than I do.

That will work for sure. If possible, I would prefer a label-based approach.

@bliemli
Contributor

bliemli commented Jul 14, 2020

IMHO it is possible to configure which namespaces a Prometheus Operator watches for ServiceMonitor resources using the --namespaces parameter. But @madchr1st knows way more about that than I do.

That will work for sure. If possible, I would prefer a label-based approach.

A specific label will have to be applied to the ServiceMonitor and PrometheusRule resources.
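
The counterpart on the Prometheus resource then only selects objects carrying that label; the label key is again an assumption:

```yaml
spec:
  serviceMonitorSelector:
    matchLabels:
      monitoring.example.com/instance: infra   # hypothetical label on our ServiceMonitor objects
  ruleSelector:
    matchLabels:
      monitoring.example.com/instance: infra   # and on our PrometheusRule objects
```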

srueg added three commits that referenced this issue Oct 5, 2020
srueg closed this as completed in #45 Oct 5, 2020