[bug]: Leadership election faulty when network timeout issues present #784
Comments
@exextatic hey, thank you for working on this issue earlier. Could you please take a look, check whether you still see the same behaviour, and share your thoughts? 🙏
I wasn't able to reproduce this in my environment. Do you have any logs from the operators, and info on which instance the k8s lease thinks is the leader after step 5?
Okay, I have a new hypothesis around this, but I will need someone to confirm whether it makes sense. I ran the same test again with the deny network policy applied, filtering to show only the resource watcher logs from pod 1 and pod 2.
I have about 8 custom resources, and 2 of them had their corresponding `EntityRequeueBackgroundService` return a timeout/operation cancelled error. After I deleted the network policy to allow normal operations, the 6 custom resources on pod 1 continued to be processed, BUT the 2 custom resources that had hit the `EntityRequeueBackgroundService` operation cancelled errors were no longer being requeued. Soon after, pod 2 became the leader and started processing resources alongside pod 1. However, the custom resources on pod 2 that also had `EntityRequeueBackgroundService` task cancelled exceptions have likewise not resumed being processed.

My hypothesis now is that the resources continued to be processed on both pods because, even after an instance stopped leading, its reconciliation async task was still running, and when it finished it triggered another reconciliation via a requeue. This caused the instance to process resources indefinitely even though the instance itself was not leading. I wonder if we need some way to prevent requeues if the instance is not leading? @exextatic @buehler do you have any thoughts on this?

EDIT: Another thing to add: the 2 resources that hit the `EntityRequeueBackgroundService` operation cancelled exceptions have not been picked up by the operator again, even after the instance started leading again. So they stopped being processed until they were deleted and recreated, or the pod was restarted :/
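A minimal sketch of the guard being suggested here, assuming hypothetical names (`ILeadershipState`, `GuardedRequeue`) rather than the actual KubeOps API:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical types for illustration only; not the actual KubeOps API.
public interface ILeadershipState
{
    // True while this operator instance currently holds the leader lease.
    bool IsLeader { get; }
}

public class GuardedRequeue<TEntity>
{
    private readonly ILeadershipState _leadership;
    private readonly Func<TEntity, TimeSpan, CancellationToken, Task> _requeue;

    public GuardedRequeue(
        ILeadershipState leadership,
        Func<TEntity, TimeSpan, CancellationToken, Task> requeue)
    {
        _leadership = leadership;
        _requeue = requeue;
    }

    public Task RequeueAsync(TEntity entity, TimeSpan delay, CancellationToken token)
    {
        // Drop the requeue when this instance is not leading; the leading
        // instance will pick the entity up again through its own watcher.
        if (!_leadership.IsLeader)
        {
            return Task.CompletedTask;
        }

        return _requeue(entity, delay, token);
    }
}
```

The same check could equally live in the background service that drains the queue, so that entries enqueued before leadership was lost are discarded rather than processed.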
Holy cow. This could actually be the issue. I do not recall implementing such a check for requeues. As such, the bug could be that requeued entities are processed regardless of whether the current pod is leading or not.
That would make sense. Any suggestions on how to approach a fix for it?
Hmm, yes. Actually, when quitting the watcher (since the connection dropped or the controller went out of scope), we could just clear the queue.
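A rough sketch of that idea, cancelling all pending requeues when the watcher shuts down; the type and member names are illustrative, not the actual KubeOps implementation:

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Illustrative store of scheduled requeues keyed by entity UID; each entry
// owns a CancellationTokenSource driving the requeue timer.
public class PendingRequeues
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _pending = new();

    public void Add(string entityUid, CancellationTokenSource timer)
        => _pending[entityUid] = timer;

    public void Remove(string entityUid)
    {
        if (_pending.TryRemove(entityUid, out var timer))
        {
            timer.Cancel();
            timer.Dispose();
        }
    }

    // Called when the watcher stops (connection dropped, controller disposed,
    // or leadership lost): cancel every scheduled requeue so a non-leading
    // instance stops reprocessing entities.
    public void Clear()
    {
        foreach (var uid in _pending.Keys)
        {
            Remove(uid);
        }
    }
}
```

Whether clearing the queue alone also recovers the entities that stopped being requeued after the cancellation errors would still need to be verified.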
Describe the bug
Still related to the recently closed #677: it is possible to get the operator into a state where leader election with high availability does not behave as expected.
To reproduce
At this point you will see that either both pods are acting as leaders and both are processing resources, or neither pod is doing any work until a restart of one of the pods happens.
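For reference, the kind of deny-all NetworkPolicy used in this reproduction might look like the following (name and namespace are placeholders; with no ingress or egress rules listed, all traffic for the selected pods is denied):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all           # placeholder name
  namespace: operator-ns   # placeholder: namespace running the operator pods
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```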
Expected behavior
When the deny network policy is applied for 15+ minutes and then removed, only one pod should continue processing while the other pod should remain idle.
Alternatively, having the process exit with an error after a number of unsuccessful retries would also be acceptable, but this is up for a wider discussion.
Screenshots
No response
Additional Context
No response