deleteOrphanPvc might remove pvc of pending container #305

Open
applike-ss opened this issue Sep 21, 2022 · 6 comments

@applike-ss

We are observing an issue daily where a historical node cannot come up again because its PVC got removed.
deleteOrphanPvc is enabled and the Druid cluster is running on spot instances, which can be taken from us by the cloud provider at any point in time.
I have seen that there was an issue in the past for deleteOrphanPvc with a race condition (#150), and we might be running into a similar problem.

My assumption right now is that this happens:

  • Node gets the SchedulingDisabled condition (due to a spot termination notice)
  • Pod gets terminated
  • New pod gets created, but due to insufficient resource headroom it has to wait until a new instance is spun up (Pending state)
  • Operator sees that the pod was running but is now terminated, and removes the PVC

At least I could create the exact same scenario manually in our cluster by spawning a StatefulSet with bigger resource requests than we have headroom for and then, while the new node was being spawned, removing the PVC.

In addition to applying a fix similar to #150 here, I would suggest also killing the pod if it stays in Pending state for longer than a configurable timespan (5 minute default, maybe?).
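
To make that second suggestion concrete, something along these lines is what I have in mind (only a sketch against the controller-runtime client; the function name, the label selector and the timeout handling are made up for illustration, not the operator's actual code):

```go
// Sketch only: delete pods that have been stuck in Pending longer than a
// configurable timeout, so the StatefulSet controller can recreate them once
// a fresh node (and PVC) is available. Not the operator's actual code.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deleteLongPendingPods(ctx context.Context, c client.Client, ns string, selector map[string]string, maxPending time.Duration) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns), client.MatchingLabels(selector)); err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		// Rough "how long has it been Pending" check based on creation time.
		if time.Since(pod.CreationTimestamp.Time) < maxPending {
			continue
		}
		if err := c.Delete(ctx, pod); err != nil {
			return err
		}
	}
	return nil
}
```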

@AdheipSingh
Contributor

Thanks for checking on this.
A few things:

  1. The operator will not delete a PVC if a pod is in Pending state.
  2. The operator will not delete PVCs unless all the STS in the CR are in running state, i.e. all replicas should be available (see the sketch below).
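
Roughly, the guard works like this (just a sketch, not the operator's actual code; the controller-runtime client, the label selector and the helper name are assumptions):

```go
// Sketch only: orphan-PVC deletion is considered safe only when every
// StatefulSet has all replicas ready and no pod is still Pending.
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func safeToDeleteOrphanPVCs(ctx context.Context, c client.Client, ns string, selector map[string]string) (bool, error) {
	var stsList appsv1.StatefulSetList
	if err := c.List(ctx, &stsList, client.InNamespace(ns), client.MatchingLabels(selector)); err != nil {
		return false, err
	}
	for _, sts := range stsList.Items {
		want := int32(1)
		if sts.Spec.Replicas != nil {
			want = *sts.Spec.Replicas
		}
		if sts.Status.ReadyReplicas != want {
			return false, nil // some replica is not running yet
		}
	}

	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns), client.MatchingLabels(selector)); err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodPending {
			return false, nil // a pod is still waiting for its node/PVC
		}
	}
	return true, nil
}
```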

Also, when running on spot instances we did see this happen at times, and the #150 fix was added for that reason.

Can you list the exact steps to re-create the scenario? After fixing #150 we didn't see the issue arise.
I'll have a look to see if I can re-create the issue.

@applike-ss
Author

That is interesting; I was about to ask whether it could be a race condition when the pod is being terminated and then for a few milliseconds doesn't exist in any state (between termination and the Pending state).

Maybe I should've mentioned that we are running the operator in version 0.0.9, which seems to be the latest release. Also, that release was made two months after the fix for #150 was merged, so I assume the fix is in the binary as well.

Can you list the exact steps to re-create the scenario?

If you mean Druid-wise, I cannot. If you mean in general, the steps are like this:

  • have a Kubernetes cluster with some node autoscaling mechanism
  • create a StatefulSet with a volume claim template requesting more resources than you currently have spare on the node with the most free resources, so that autoscaling has to kick in
  • remove the PVC while the pod is still in Pending state
  • the StatefulSet's pod stays in Pending state forever, as the PVC does not get recreated (which I assume is the correct behaviour on the Kubernetes side)

The issue does not happen for us when scaling Druid pods up or down to/from a higher number (like mentioned in #150), but instead every now and then when a spot termination happens.

Btw, I can see in our logs that something removed the PVC, and I assure you that neither of us with access to the cluster did it at that time. See this screenshot (ignore that the entries are out of order):
[screenshot of log entries showing the PVC deletion]

We have 5 replicas in total for this StatefulSet, so it cannot simply be that last terminated pod or something.

@AdheipSingh
Contributor

@applike-ss how frequently do you see this issue? Did you face this on every spot node that got killed?

@applike-ss
Author

We did see it once a day, starting a few days before I reported the issue.
We do not see it on every spot node that gets killed.
Over the last 4 days the issue wasn't visible at all.

@AdheipSingh
Contributor

Hmmm, interesting. I am not sure what extra conditional check can be added to catch this race. This feature was tested and is running on a large spot infrastructure running Druid.
If you have any suggestions/improvements, feel free to point them out.

@applike-ss
Author

In fact, it happened again just now.

I would like to suggest a "kill pod" feature for pods that stay in Pending state longer than a configurable amount of time (default 5m?).

In addition to that, maybe we could have a delayed PVC removal (1m default?) which, right before removal, checks again whether the PVC is bound and aborts the pending deletion if it is?
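
As a sketch of what I mean (interpreting "bound" as "referenced by a pod again"; the helper below is made up for illustration, not the operator's actual code):

```go
// Sketch only: wait for a configurable delay, then re-check whether any pod
// references the PVC before actually deleting it. Not the operator's code.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deletePVCIfStillOrphan(ctx context.Context, c client.Client, ns, pvcName string, delay time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(delay):
	}

	// If a pod (e.g. the recreated historical) references the PVC again,
	// abort the pending deletion.
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns)); err != nil {
		return err
	}
	for _, p := range pods.Items {
		for _, v := range p.Spec.Volumes {
			if v.PersistentVolumeClaim != nil && v.PersistentVolumeClaim.ClaimName == pvcName {
				return nil
			}
		}
	}

	var pvc corev1.PersistentVolumeClaim
	if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: pvcName}, &pvc); err != nil {
		return client.IgnoreNotFound(err) // already gone, nothing to do
	}
	return c.Delete(ctx, &pvc)
}
```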
