
Shutting down operator controlled workloads with nut-client #507

Open
iamasmith opened this issue Oct 29, 2024 · 0 comments

Scenario: I have CloudNativePG (CNPG) using a PV on an iSCSI LUN.

CNPG controls the lifecycle of the database pods directly itself (not via a Deployment or StatefulSet), while the operator itself runs as a Deployment, which is also the target of a validating webhook that checks annotations; the operator, of course, also reconciles the custom resources.

When using nut-client, the host starts to shut down and appears to dispense with the Deployment, which under normal circumstances would seem fine; however, the node gets marked NoSchedule (cordoned) and then it appears to try to drain the workloads.

The CNPG operator gets terminated, as one would expect of a well-behaved Deployment, leaving the database pod not backed by anything and under nothing else's control.

An orderly shutdown is normally performed by adding the cnpg.io/hibernation annotation (set to on) to the CNPG Cluster resources, which tells the operator to perform an orderly shutdown of the database and terminate the pods; removing the annotation starts the pods/DB again.
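
For reference, the hibernation toggle is just an annotation on the Cluster resource. A sketch, assuming a Cluster named `pg-main` in namespace `databases` (both hypothetical names):

```shell
# Hibernate: the operator shuts PostgreSQL down cleanly and removes the pods.
kubectl annotate cluster pg-main -n databases cnpg.io/hibernation=on --overwrite

# Resume: removing the annotation brings the pods/DB back up.
kubectl annotate cluster pg-main -n databases cnpg.io/hibernation-
```

Both commands only work while the operator (and its validating webhook) is still running, which is exactly the ordering problem described below.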

It would appear, then, that one would need some way of running kubectl commands and checking pod statuses prior to issuing the final shutdown command, and I don't really see a mechanism for doing this without at least a means of calling an endpoint via a customisable shutdown trigger.

It may be possible if there were a means of running a script rather than just the shutdown command, or of waiting on some status from an API endpoint; at least then I could create a service/workload that performs the CNPG declarative hibernation.
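
As a sketch of what such a script hook could do (this is not an existing nut-client feature; the cluster name, namespace and timeout are all assumptions): annotate the Cluster for hibernation while the operator is still running, then poll until the database pods are gone before letting the shutdown proceed.

```shell
#!/usr/bin/env bash
# Hypothetical pre-shutdown hook: hibernate a CNPG cluster and wait for its
# pods to terminate before the final shutdown command is issued.
set -euo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-pg-main}"   # assumed Cluster resource name
NAMESPACE="${NAMESPACE:-databases}"       # assumed namespace
TIMEOUT="${TIMEOUT:-120}"                 # max seconds to wait for pods to exit

hibernate_cluster() {
  # Must run before the operator Deployment is drained, otherwise the
  # validating webhook backing the annotation has nothing to talk to.
  kubectl annotate cluster "$CLUSTER_NAME" -n "$NAMESPACE" \
    cnpg.io/hibernation=on --overwrite
}

wait_for_pods_gone() {
  # Poll until no pods labelled for this cluster remain, or give up.
  local waited=0
  while [ $(kubectl get pods -n "$NAMESPACE" \
              -l cnpg.io/cluster="$CLUSTER_NAME" -o name | wc -l) -gt 0 ]; do
    if [ "$waited" -ge "$TIMEOUT" ]; then
      echo "timed out waiting for $CLUSTER_NAME pods to terminate" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}
```

`hibernate_cluster && wait_for_pods_gone` could then gate the call to the real shutdown command.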

Actually, I must admit to being a bit confused about this, since a talosctl reboot on this node works well; I suspect a more orderly shutdown comes from the Talos controller in that case. What I ultimately get on power events is: the node set to NoSchedule, most workloads shut down (including the operator Deployment), and the DB left running because it has no recognised controller, with no means of cleanly stopping the PostgreSQL DB, since the target of the validating webhook for the annotation is the operator, so that call fails, and of course with the operator stopped the clean shutdown of the workload wouldn't happen anyway.

In situations where there is only a minor power blip but the shutdown signal still fires, the cluster is left in this state, and my only options are to talosctl reboot, or to uncordon the node to get the operator back up and running so I can hibernate the DB before a reboot.
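
For anyone hitting the same stuck state, the manual recovery described above amounts to something like this (node name, cluster name, namespace and node IP are all hypothetical):

```shell
# Let the node schedule pods again so the operator Deployment comes back.
kubectl uncordon talos-node-1

# Once the operator (and its webhook) is ready again, hibernate the
# database cleanly, then reboot the node.
kubectl annotate cluster pg-main -n databases cnpg.io/hibernation=on --overwrite
talosctl reboot --nodes 10.0.0.10
```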
