
Shutting down operator controlled workloads with nut-client #507

Open
iamasmith opened this issue Oct 29, 2024 · 0 comments

Scenario: I have CloudNativePG (CNPG) using a PV on an iSCSI LUN.

CNPG controls the lifecycle of the database pods directly itself (not via a Deployment or StatefulSet), while the operator itself runs as a Deployment, which is also the target of a validating webhook that checks annotations; the operator, of course, also reconciles the custom resources.

When using nut-client, the host starts to shut down and appears to dispense with the Deployment, which under normal circumstances would seem fine; however, the node gets marked NoSchedule (cordoned) and then it appears to try to drain the workloads.

The CNPG operator gets terminated, as one would expect of a well-behaved Deployment, leaving the database pod not backed by anything and under nothing else's control.

An orderly shutdown is normally performed by adding the cnpg.io/hibernation annotation (set to on) to the CNPG Cluster resources, which tells the operator to perform an orderly shutdown of the database and terminate the pods; removing the annotation starts the pods/DB again.
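
For reference, the hibernation toggle is just an annotation on the Cluster resource. A sketch, assuming a Cluster named `pg-main` in namespace `databases` (both hypothetical names):

```shell
# Hibernate: the operator shuts PostgreSQL down cleanly and removes the pods.
kubectl annotate cluster pg-main -n databases cnpg.io/hibernation=on --overwrite

# Resume: removing the annotation brings the pods/DB back up.
kubectl annotate cluster pg-main -n databases cnpg.io/hibernation-
```

Both commands only work while the operator (and its validating webhook) is still running, which is exactly the ordering problem described below.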

It would appear, then, that one would need some way of running kubectl commands and checking pod statuses prior to issuing the final shutdown command, and I don't really see a mechanism for doing this without at least a means of calling an endpoint via a customisable shutdown trigger.

It may be possible if there were a means of running a script rather than just the shutdown command, or of waiting on some status from an API endpoint; at least then I could create a service/workload that performs the CNPG declarative hibernation.
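
As a sketch of what such a script hook could do (this is not an existing nut-client feature; the cluster name, namespace and timeout are all assumptions): annotate the Cluster for hibernation while the operator is still running, then poll until the database pods are gone before letting the shutdown proceed.

```shell
#!/usr/bin/env bash
# Hypothetical pre-shutdown hook: hibernate a CNPG cluster and wait for its
# pods to terminate before the final shutdown command is issued.
set -euo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-pg-main}"   # assumed Cluster resource name
NAMESPACE="${NAMESPACE:-databases}"       # assumed namespace
TIMEOUT="${TIMEOUT:-120}"                 # max seconds to wait for pods to exit

hibernate_cluster() {
  # Must run before the operator Deployment is drained, otherwise the
  # validating webhook backing the annotation has nothing to talk to.
  kubectl annotate cluster "$CLUSTER_NAME" -n "$NAMESPACE" \
    cnpg.io/hibernation=on --overwrite
}

wait_for_pods_gone() {
  # Poll until no pods labelled for this cluster remain, or give up.
  local waited=0
  while [ $(kubectl get pods -n "$NAMESPACE" \
              -l cnpg.io/cluster="$CLUSTER_NAME" -o name | wc -l) -gt 0 ]; do
    if [ "$waited" -ge "$TIMEOUT" ]; then
      echo "timed out waiting for $CLUSTER_NAME pods to terminate" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}
```

`hibernate_cluster && wait_for_pods_gone` could then gate the call to the real shutdown command.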

Actually, I must admit to being a bit confused about this, since a talosctl reboot on this node works well; I suspect a more orderly shutdown comes from the Talos controller in that case. What I ultimately get on power events is: the node set to NoSchedule, most workloads shut down (including the operator Deployment), and the DB left running because it has no recognised controller, with no means of cleanly stopping the PostgreSQL DB, since the target of the validating webhook for the annotation is the operator, so that call fails, and of course with the operator stopped the clean shutdown of the workload wouldn't happen anyway.

In situations where there is only a minor power blip but the shutdown signal still fires, the cluster is left in this state, and my only options are to talosctl reboot, or to uncordon the node to get the operator back up and running so I can hibernate the DB before a reboot.
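
For anyone hitting the same stuck state, the manual recovery described above amounts to something like this (node name, cluster name, namespace and node IP are all hypothetical):

```shell
# Let the node schedule pods again so the operator Deployment comes back.
kubectl uncordon talos-node-1

# Once the operator (and its webhook) is ready again, hibernate the
# database cleanly, then reboot the node.
kubectl annotate cluster pg-main -n databases cnpg.io/hibernation=on --overwrite
talosctl reboot --nodes 10.0.0.10
```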
