This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

operator: pause reboots when active alerts are detected #158

lucab opened this issue Nov 10, 2017 · 0 comments


lucab commented Nov 10, 2017

Currently, update-operator reboots nodes as soon as updates are available. #82 tracks adding support for a user-configured maintenance window. On top of that, even inside a maintenance window there can be situations where reboots should be temporarily paused (e.g. while a critical or unplanned outage is in progress).

This can currently be done by setting a reboot-paused annotation on specific nodes; however, this is a manual operation and doesn't scale well cluster-wide.

It would be nice to let CLUO know about any existing AlertManager in the cluster and check for specific active alerts before proceeding. @brancz suggested that we could:

  • take a ConfigMap with critical alerts that should cluster-wide pause reboots (and inotify-watch to hot-reload it)
  • reach the AM on its in-cluster public read-only endpoint and check for non-silenced critical alerts before setting reboot-ok
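A rough sketch of the check described in the two points above, in Go. Everything here is an assumption for illustration, not existing CLUO code: the helper names (`parseAlerts`, `shouldPause`, `fetchAlerts`) are hypothetical, the set of blocking alert names stands in for whatever the ConfigMap would contain, and the endpoint and query parameters assume an Alertmanager that serves the v2 HTTP API (`/api/v2/alerts`), which may not match the AM version deployed in a given cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// alert mirrors only the fields this sketch needs from an alert object
// as returned by the Alertmanager v2 API (assumed response shape).
type alert struct {
	Labels map[string]string `json:"labels"`
	Status struct {
		State string `json:"state"`
	} `json:"status"`
}

// parseAlerts decodes the JSON array returned by the alerts endpoint.
func parseAlerts(body []byte) ([]alert, error) {
	var alerts []alert
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, fmt.Errorf("decoding alerts: %w", err)
	}
	return alerts, nil
}

// shouldPause reports whether any active alert has an "alertname" label
// listed in blocking. The blocking set is hypothetical: it stands in for
// the alert names loaded from the ConfigMap.
func shouldPause(alerts []alert, blocking map[string]bool) bool {
	for _, a := range alerts {
		if a.Status.State == "active" && blocking[a.Labels["alertname"]] {
			return true
		}
	}
	return false
}

// fetchAlerts asks Alertmanager for active alerts that are neither
// silenced nor inhibited. baseURL is an assumed in-cluster address.
func fetchAlerts(client *http.Client, baseURL string) ([]alert, error) {
	q := url.Values{
		"active":    {"true"},
		"silenced":  {"false"},
		"inhibited": {"false"},
	}
	resp, err := client.Get(baseURL + "/api/v2/alerts?" + q.Encode())
	if err != nil {
		return nil, fmt.Errorf("querying alertmanager: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("alertmanager returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("reading response: %w", err)
	}
	return parseAlerts(body)
}
```

The operator would call `fetchAlerts` right before setting reboot-ok and skip the node for this round if `shouldPause` returns true. Whether a failed query should block reboots or let them proceed is a separate policy decision that this sketch leaves open.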

For clarity, this should be completely orthogonal to maintenance window configuration.
