Flannel process gets stuck after etcd outage. #1830

Open

alperbas opened this issue Nov 24, 2023 · 3 comments

@alperbas

Expected Behavior

After etcd recovers, flannel should continue working normally.

Current Behavior

When etcd goes down, e.g. all etcd nodes are rebooted at the same time, the flannel process starts to use 200% CPU and stops updating the routing table. It never recovers after etcd comes back online until flannel is restarted.

I've noticed that the issue started with v0.21+. When I tested with v0.20, flannel just crashed when etcd went down and systemd restarted it until etcd was back online, at which point it continued working. But with v0.21+, it just sits there using 200% CPU. I'm assuming whatever changed in the etcd connection logic gets stuck in an infinite loop somewhere. No logs are shown even with -v=10.

Possible Solution

I tried to browse the changes, but there are just too many; I can't tell what went wrong or where.
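
Just to illustrate what I suspect (this is not flannel's actual code, only a guess at the failure mode): a reconnect/watch loop that retries immediately after every error would spin at full CPU and produce no logs, while the same loop with a backoff would not. A minimal Go sketch, where watchOnce stands in for whatever etcd call fails instantly while etcd is down:

// Illustration only -- not flannel's code, just the failure mode I suspect.
// Retrying immediately after an error pins the CPU; a backoff between
// attempts would not.
package main

import (
	"context"
	"errors"
	"time"
)

// watchOnce stands in for whatever etcd call fails instantly while etcd is down.
func watchOnce(ctx context.Context) error {
	return errors.New("etcd unreachable")
}

func watchLoop(ctx context.Context) {
	backoff := time.Second
	for ctx.Err() == nil {
		if err := watchOnce(ctx); err != nil {
			// Without this sleep the loop spins as fast as the error returns,
			// which would match the constant CPU usage and the lack of logs.
			time.Sleep(backoff)
			if backoff < 30*time.Second {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second // reset after a successful watch
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	watchLoop(ctx)
}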

Steps to Reproduce (for bugs)

  1. Set up a k8s cluster with separate etcd nodes and flannel running on the hosts under systemd. (I'm not sure whether it would change things if flannel runs as a pod.)
  2. Take etcd down for a short time. Either block connections with a firewall or just reboot at least 2 etcd nodes at the same time.
  3. Observe flannel CPU usage. Wait for etcd to recover (a quick check is sketched after this list). Flannel is still stuck on all nodes.
  4. Add a new worker node to the cluster and observe that routing tables are not updated on the nodes where flannel is stuck.
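
To confirm in step 3 that etcd itself has recovered while flannel stays stuck, a quick standalone check like the Go sketch below works (etcdctl endpoint health does the same job). The endpoint is a placeholder, and for https endpoints the client also needs the TLS config, which I've omitted:

// Quick check that etcd has recovered, independent of flannel.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoint := "https://x.x.x.x:2379" // placeholder, use a real etcd endpoint

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		// TLS: &tls.Config{...}, // required for https endpoints, omitted here
	})
	if err != nil {
		fmt.Println("client setup failed:", err)
		return
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status queries a single endpoint; an error here means etcd is still down.
	resp, err := cli.Status(ctx, endpoint)
	if err != nil {
		fmt.Println("etcd still unreachable:", err)
		return
	}
	fmt.Printf("etcd is back, version %s, leader %d\n", resp.Version, resp.Leader)
}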

Context

If you add new nodes while flannel is stuck, routing tables don't get updated and the pods scheduled on the new nodes cannot communicate with the rest of the cluster. Also, every affected node shows a flat usage of 2 CPU threads.

Your Environment

I am using Charmed Kubernetes from Canonical. I have not tested other distributions, but there is no reason it would not happen on others as well.

  • Flannel version: 0.22.1 and also tested 0.23.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.4.22
  • Kubernetes version (if used): 1.28
  • Operating System and version: Ubuntu 22.04
@thomasferrandiz
Contributor

Can you share the systemd unit file you use to deploy flannel?
In particular, do you set --kube-subnet-mgr to true or false?

@alperbas
Author

Sure @thomasferrandiz, here it is. --kube-subnet-mgr is not set at all, but I assume it defaults to false with this unit file.

[Unit]
Description=Flannel Overlay Network
Documentation=https://github.com/coreos/flannel
Wants=network-online.target
After=network.target network-online.target

[Service]
ExecStart=/usr/local/bin/flanneld -iface=ethX -etcd-endpoints=https://x.x.x.x:2379,https://x.x.x.x:2379,https://x.x.x.x:2379 -etcd-certfile=/xx/client-cert.pem -etcd-keyfile=/xx/client-key.pem  -etcd-cafile=/xx/client-ca.pem --ip-masq
TimeoutStartSec=0
Restart=on-failure
LimitNOFILE=655536

@thomasferrandiz
Contributor

Thanks for the file.
Indeed, the default value for --kube-subnet-mgr is false, which makes flannel use etcd to store its configuration.

Do you have a specific use case that requires using flannel with k8s but without setting kube-subnet-mgr to true?

If not, you could deploy flannel as a pod through the kube-flannel.yml manifest, which avoids the issue since flannel won't use etcd.
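
For reference (untested on your setup, so treat it as a sketch): the manifest from the flannel repository can be applied with something like

kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

and if you prefer to keep the systemd deployment, I believe switching your existing unit to the Kubernetes subnet manager looks roughly like the line below. Flag names can vary slightly between versions, so please double-check with flanneld --help; the kubeconfig path is just a placeholder:

ExecStart=/usr/local/bin/flanneld -iface=ethX --kube-subnet-mgr=true --kubeconfig-file=/path/to/kubeconfig --ip-masq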
