Flannel process gets stuck after etcd outage. #1830
Comments
Can you share the systemd file you use to deploy flannel?
Sure @thomasferrandiz, here it is.
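(The attached unit file itself was not preserved in this copy of the thread. For context, a minimal sketch of the kind of flanneld systemd unit being discussed might look like the following; the binary path, etcd endpoints, and interface name are placeholders, not the reporter's actual values.)

```ini
[Unit]
Description=flanneld overlay network agent
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
# Placeholder paths and endpoints -- adjust to the actual deployment.
ExecStart=/usr/local/bin/flanneld \
  --etcd-endpoints=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379 \
  --etcd-prefix=/coreos.com/network \
  --iface=eth0
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```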
Thanks for the file. Do you have a specific use case that requires using flannel with k8s but without setting ...? If not, you could deploy flannel as a pod through the kube-flannel.yml manifest, which will avoid the issue since flannel won't use etcd.
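(For reference, kube-flannel.yml runs flanneld as a DaemonSet with the kube subnet manager, so it reads its network configuration from the Kubernetes API instead of etcd. Roughly, the relevant part of the container spec looks like the excerpt below; the image tag and exact fields vary between releases.)

```yaml
containers:
  - name: kube-flannel
    image: docker.io/flannel/flannel:v0.24.0   # tag varies by release
    command: ["/opt/bin/flanneld"]
    args:
      - --ip-masq
      - --kube-subnet-mgr   # use the Kubernetes API, not etcd, for subnet leases
```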
Expected Behavior
After etcd recovers, flannel should continue working normally.
Current Behavior
When etcd goes down (e.g., all etcd nodes are rebooted at the same time), the flannel process starts using 200% CPU and stops updating the routing table. It never recovers after etcd comes back online until flannel is restarted.
I've noticed that the issue started with v0.21+. When I tested with v0.20, flannel just crashed when etcd went down and systemd restarted it until etcd was back online, at which point it continued working. But with v0.21+, it just sits there using 200% CPU. I'm assuming whatever changed in the etcd connection logic gets stuck in an infinite loop somewhere. No logs are shown, even with v=10.
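I can't point at the exact code, but the symptom (pegged CPU, no log output, no recovery) is what you would expect from a watch/reconnect loop that retries immediately without backing off. Purely as an illustrative sketch (not flannel's actual code), using the etcd clientv3 API and assuming the default /coreos.com/network prefix:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchSubnets illustrates a watch loop against etcd. If the backoff at the
// bottom were missing, the outer loop would re-create the watch as fast as it
// can while etcd is unreachable, pegging a CPU core without logging anything.
func watchSubnets(ctx context.Context, cli *clientv3.Client, prefix string) {
	for {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix())
		for resp := range wch {
			if err := resp.Err(); err != nil {
				log.Printf("watch error: %v", err)
				break
			}
			for _, ev := range resp.Events {
				log.Printf("subnet event: %s %s", ev.Type, ev.Kv.Key)
			}
		}
		// The watch channel closed (etcd down or watch cancelled): wait
		// before retrying instead of spinning.
		select {
		case <-ctx.Done():
			return
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	watchSubnets(context.Background(), cli, "/coreos.com/network/subnets")
}
```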
Possible Solution
I tried to browse the changes, but there are just too many; I can't tell what went wrong or where.
Steps to Reproduce (for bugs)
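(The original steps were not preserved here; reconstructed from the description above, a repro on a flannel-with-etcd setup looks roughly like this. Service names are placeholders for however etcd and flanneld are managed on the hosts.)

```sh
# 1. On every etcd node, take etcd down at the same time (or reboot the nodes).
sudo systemctl stop etcd

# 2. On a worker, watch the flanneld process: with v0.21+ it climbs to ~200% CPU.
top -p "$(pidof flanneld)"

# 3. Bring etcd back and confirm flanneld never recovers on its own.
sudo systemctl start etcd        # on the etcd nodes
ip route                         # subnet routes for new nodes stay missing

# 4. Restarting flanneld is the only thing that clears the state.
sudo systemctl restart flanneld
```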
Context
If you add new nodes while flannel is stuck, the routing tables don't get updated and pods scheduled on the new nodes cannot communicate with the rest of the cluster. Also, every node shows a flat usage of two full CPU threads.
Your Environment
I am using Charmed Kubernetes from Canonical. I have not tested other distributions, but there is no reason it should not happen on them as well.