operator: pause reboots when active alerts are detected #158

lucab · 2017-11-10T15:49:40Z

Currently update-operator reboots nodes as soon as updates are available. #82 tracks adding support for a user-configured maintenance window. On top of that, even inside a maintenance window there could be situations where reboots should be temporarily paused (e.g. when some critical/unplanned outage is happening).

This can currently be done by setting a reboot-paused annotation on specific nodes; however, this is a manual operation and doesn't scale well cluster-wide.

It would be nice to let CLUO know about any existing AlertManager in the cluster and check for specific active alerts before proceeding. @brancz suggested that we could:

- take a ConfigMap with critical alerts that should cluster-wide pause reboots (and inotify-watch to hot-reload it)
- reach the AM on its in-cluster public read-only endpoint and check for non-silenced critical alerts before setting reboot-ok

For clarity, this should be completely orthogonal to maintenance window configuration.
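
To make the suggested check more concrete, here is a rough sketch of the decision step only, not a proposed implementation. It assumes the alert listing is a JSON array where each alert carries a `labels` map and a `status.state` field (modeled on the Alertmanager v2 API), and that the ConfigMap contents have already been loaded into a set of alert names. The type and function names (`alert`, `blockingAlerts`, `rebootsPaused`) are hypothetical, and treating an unreadable response as "paused" is an assumption about the desired failure mode.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alert mirrors only the fields this sketch reads from an alert listing.
// The JSON shape is an assumption modeled on the Alertmanager v2 API.
type alert struct {
	Labels map[string]string `json:"labels"`
	Status struct {
		// Assumed values: "active", "suppressed" (silenced or inhibited), "unprocessed".
		State string `json:"state"`
	} `json:"status"`
}

// blockingAlerts returns the names of alerts that are both active
// (i.e. not silenced) and present in the critical set, which stands in
// for the alert names read from the ConfigMap.
func blockingAlerts(body []byte, critical map[string]bool) ([]string, error) {
	var alerts []alert
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, fmt.Errorf("decoding alerts: %w", err)
	}
	var blocking []string
	for _, a := range alerts {
		name := a.Labels["alertname"]
		if a.Status.State == "active" && critical[name] {
			blocking = append(blocking, name)
		}
	}
	return blocking, nil
}

// rebootsPaused reports whether reboot-ok should be withheld.
// Assumption: if the response cannot be decoded, the sketch fails
// closed and pauses reboots rather than proceeding blindly.
func rebootsPaused(body []byte, critical map[string]bool) bool {
	blocking, err := blockingAlerts(body, critical)
	return err != nil || len(blocking) > 0
}
```

The operator would call something like `rebootsPaused` with the body fetched from the AM endpoint right before setting reboot-ok; fetching the body and hot-reloading the ConfigMap are left out here.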
