Flannel process gets stuck after etcd outage. #1830

Open

alperbas opened this issue Nov 24, 2023 · 3 comments

@alperbas

Expected Behavior

After etcd recovers, flannel should continue working normally.

Current Behavior

When etcd goes down, e.g. all etcd nodes are rebooted at the same time, the flannel process starts to use 200% CPU and stops updating the routing table. It never recovers after etcd comes back online until flannel is restarted.

I've noticed that the issue started with v0.21+. When I tested with v0.20, flannel just crashed when etcd went down and systemd restarted it until etcd was back online, at which point it continued working. But with v0.21+, it just sits there using 200% CPU. I'm assuming whatever changed in the etcd connection logic gets stuck in an infinite loop somewhere. No logs are shown even with -v=10.

Possible Solution

I tried to browse the changes, but there are just too many; I can't tell what went wrong or where.
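
Just to illustrate what I suspect (this is not flannel's actual code, only a guess at the failure mode): a reconnect/watch loop that retries immediately after every error would spin at full CPU and produce no logs, while the same loop with a backoff would not. A minimal Go sketch, where watchOnce stands in for whatever etcd call fails instantly while etcd is down:

// Illustration only -- not flannel's code, just the failure mode I suspect.
// Retrying immediately after an error pins the CPU; a backoff between
// attempts would not.
package main

import (
	"context"
	"errors"
	"time"
)

// watchOnce stands in for whatever etcd call fails instantly while etcd is down.
func watchOnce(ctx context.Context) error {
	return errors.New("etcd unreachable")
}

func watchLoop(ctx context.Context) {
	backoff := time.Second
	for ctx.Err() == nil {
		if err := watchOnce(ctx); err != nil {
			// Without this sleep the loop spins as fast as the error returns,
			// which would match the constant CPU usage and the lack of logs.
			time.Sleep(backoff)
			if backoff < 30*time.Second {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second // reset after a successful watch
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	watchLoop(ctx)
}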

Steps to Reproduce (for bugs)

  1. Set up a k8s cluster with separate etcd nodes and flannel running on the hosts under systemd. (I'm not sure whether it would change things if flannel runs as a pod.)
  2. Take etcd down for a short time. Either block connections with a firewall or just reboot at least 2 etcd nodes at the same time.
  3. Observe flannel CPU usage. Wait for etcd to recover (a quick check is sketched after this list). Flannel is still stuck on all nodes.
  4. Add a new worker node to the cluster and observe that routing tables are not updated on the nodes where flannel is stuck.
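
To confirm in step 3 that etcd itself has recovered while flannel stays stuck, a quick standalone check like the Go sketch below works (etcdctl endpoint health does the same job). The endpoint is a placeholder, and for https endpoints the client also needs the TLS config, which I've omitted:

// Quick check that etcd has recovered, independent of flannel.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoint := "https://x.x.x.x:2379" // placeholder, use a real etcd endpoint

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		// TLS: &tls.Config{...}, // required for https endpoints, omitted here
	})
	if err != nil {
		fmt.Println("client setup failed:", err)
		return
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status queries a single endpoint; an error here means etcd is still down.
	resp, err := cli.Status(ctx, endpoint)
	if err != nil {
		fmt.Println("etcd still unreachable:", err)
		return
	}
	fmt.Printf("etcd is back, version %s, leader %d\n", resp.Version, resp.Leader)
}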

Context

If you add new nodes while flannel is stuck, routing tables don't get updated and the pods scheduled on the new nodes cannot communicate with the rest of the cluster. Also, every affected node shows a flat usage of 2 CPU threads.

Your Environment

I am using Charmed Kubernetes from Canonical. I have not tested other distributions, but there is no reason it would not happen on others as well.

  • Flannel version: 0.22.1 and also tested 0.23.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.4.22
  • Kubernetes version (if used): 1.28
  • Operating System and version: Ubuntu 22.04
@thomasferrandiz
Contributor

Can you share the systemd unit file you use to deploy flannel?
In particular, do you set --kube-subnet-mgr to true or false?

@alperbas
Author

Sure @thomasferrandiz, here it is. --kube-subnet-mgr is not set at all, but I assume it defaults to false with this unit file.

[Unit]
Description=Flannel Overlay Network
Documentation=https://github.com/coreos/flannel
Wants=network-online.target
After=network.target network-online.target

[Service]
ExecStart=/usr/local/bin/flanneld -iface=ethX -etcd-endpoints=https://x.x.x.x:2379,https://x.x.x.x:2379,https://x.x.x.x:2379 -etcd-certfile=/xx/client-cert.pem -etcd-keyfile=/xx/client-key.pem  -etcd-cafile=/xx/client-ca.pem --ip-masq
TimeoutStartSec=0
Restart=on-failure
LimitNOFILE=655536

@thomasferrandiz
Contributor

Thanks for the file.
Indeed, the default value for --kube-subnet-mgr is false, which makes flannel use etcd to store its configuration.

Do you have a specific use case that requires using flannel with k8s but without setting kube-subnet-mgr to true?

If not, you could deploy flannel as a pod through the kube-flannel.yml manifest, which avoids the issue since flannel won't use etcd.
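
For reference (untested on your setup, so treat it as a sketch): the manifest from the flannel repository can be applied with something like

kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

and if you prefer to keep the systemd deployment, I believe switching your existing unit to the Kubernetes subnet manager looks roughly like the line below. Flag names can vary slightly between versions, so please double-check with flanneld --help; the kubeconfig path is just a placeholder:

ExecStart=/usr/local/bin/flanneld -iface=ethX --kube-subnet-mgr=true --kubeconfig-file=/path/to/kubeconfig --ip-masq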
