flannel: windows: host-gw: pod restart: failed to enable forwarding on host interface (not found), yet interface definitely exists #1858

Open
Zombro opened this issue Jan 17, 2024 · 4 comments


Zombro commented Jan 17, 2024

Expected Behavior

flannel (windows) starts healthy after node restart.

Current Behavior

flannel (windows) host-gw backend crashloops after windows node restart. sometimes, after repeated node reboots, flannel recovers and runs in healthy state.

flannel pod logs:

I0117 06:28:21.696487   12576 kube.go:144] Node controller sync successful
I0117 06:28:21.696752   12576 main.go:229] Created subnet manager: Kubernetes Subnet Manager - k-win-w-p28-002.abcdef.net
I0117 06:28:21.696752   12576 main.go:232] Installing signal handlers
I0117 06:28:21.696752   12576 main.go:540] Found network config - Backend type: host-gw
I0117 06:28:21.696752   12576 match.go:73] Searching for interface using 192.168.7.47
I0117 06:28:21.703333   12576 match.go:259] Using interface with name vEthernet (Ethernet0) and address 192.168.7.47
I0117 06:28:21.703333   12576 match.go:281] Defaulting external address to interface address (192.168.7.47)
I0117 06:28:21.703333   12576 hostgw_windows.go:72] HOST-GW config: {Name:cbr0 DNSServerList:}
I0117 06:28:21.717783   12576 hostgw_windows.go:123] Found existing HNSNetwork cbr0
I0117 06:28:21.729705   12576 hostgw_windows.go:200] Found existing bridge HNSEndpoint cbr0_ep
I0117 06:28:21.729705   12576 hostgw_windows.go:229] Waiting to attach bridge endpoint cbr0_ep to host
I0117 06:28:22.236323   12576 hostgw_windows.go:245] Attached bridge endpoint cbr0_ep to host successfully
I0117 06:28:22.241385   12576 hostgw_windows.go:258] Found &{Index:25 MTU:1500 Name:vEthernet (Ethernet0) HardwareAddr:00:50:56:b0:74:f5 Flags:up|broadcast|multicast|running} interface with IP 192.168.7.47
E0117 06:28:26.733747   12576 main.go:332] Error registering network: failed to enable forwarding on vEthernet (Ethernet0) index 25: Element not found.
I0117 06:28:26.733747   12576 main.go:520] Stopping shutdownHandler...

checking net adapters on the windows host, the mentioned vEthernet (Ethernet0) (index 25) is definitely present.

PS C:\Users\administrator> get-netadapter

Name                      InterfaceDescription                    ifIndex Status       MacAddress             LinkSpeed
----                      --------------------                    ------- ------       ----------             ---------
vEthernet (bcdf976673a... Hyper-V Virtual Ethernet Container...#4      12 Up           00-15-5D-74-B3-C2         1 Gbps
vEthernet (254aab5bcf9... Hyper-V Virtual Ethernet Container...#6      64 Up           00-15-5D-74-BD-1A         1 Gbps
vEthernet (8eb2034d852... Hyper-V Virtual Ethernet Container...#5      60 Up           00-15-5D-74-B2-97         1 Gbps
vEthernet (Ethernet0)     Hyper-V Virtual Ethernet Adapter             25 Up           00-50-56-B0-74-F5         1 Gbps
vEthernet (cc0710334f0... Hyper-V Virtual Ethernet Container...#3      36 Up           00-15-5D-74-B8-CF         1 Gbps
vEthernet (cbr0_ep)       Hyper-V Virtual Ethernet Container A...      18 Up           00-15-5D-74-B0-DE         1 Gbps
vEthernet (c9a1590eb25... Hyper-V Virtual Ethernet Container...#7      70 Up           00-15-5D-74-BE-F0         1 Gbps
vEthernet (2a73a339585... Hyper-V Virtual Ethernet Container...#9      79 Up           00-15-5D-74-B0-E5         1 Gbps
Ethernet0                 Intel(R) 82574L Gigabit Network Conn...      15 Up           00-50-56-B0-74-F5         1 Gbps
vEthernet (9e16c65a9e8... Hyper-V Virtual Ethernet Container...#2      32 Up           00-15-5D-74-B0-89         1 Gbps

during this time the other pods running on the node schedule and run without issues

seems like the interface identification logic gets scrambled here:

https://github.com/flannel-io/flannel/blob/v0.24.0/pkg/backend/hostgw/hostgw_windows.go#L254-L268

Possible Solution

unsure

Steps to Reproduce (for bugs)

  1. provision k8s 1.28 cluster control plane
  2. install flannel-linux helm chart v0.24.0
  3. install flannel-windows with host-gw backend. this is using hostprocess containers, see my personal pr for setup & config reference here
  4. join a windows server 2022 node. first install / deploy works without issues. everything gets created and looks OK.
  5. reboot the windows node. observe flannel fails to start (see above)

Context

trying to deliver a k8s 1.28 linux/windows cluster with flannel host-gw backend

the only way i've found to resolve this consistently is to drain the node, console into it, stop the kubelet service, disable the vEthernet (Ethernet0) adapter, and reboot the node again, which forces flannel through the not-exists && create logic. sometimes rebooting the node a few times is enough to restore flannel.

Your Environment

  • Flannel version: v0.24.0
  • Backend used (e.g. vxlan or udp): host-gw
  • Etcd version: registry.k8s.io/etcd:3.5.9-0
  • Kubernetes version (if used): 1.28
  • ContainerD version: 1.7.9
  • Operating System and version: windows server 2022 10.0.20348 Build 20348

@manuelbuil (Collaborator)

What if you try to run the PowerShell command manually? https://github.com/flannel-io/flannel/blob/master/pkg/ip/iface_windows.go#L134


Zombro commented Jan 22, 2024

manually enabling forwarding works... but this isn't an ideal solution for a self-healing environment
still testing, but not seeing this behavior in latest release
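
For anyone repeating the manual check, here is a minimal Go sketch that runs the same `Set-NetIPInterface` statement that appears in the flannel error output later in this thread. It is not flannel's code: the helper names `forwardingCommand` and `enableForwarding` are made up for illustration, and the try/catch wrapper from the logged command is left out. The index 25 is the one from the original report. The index changed between reboots in this thread, so look it up with `Get-NetAdapter` first.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// forwardingCommand builds the PowerShell statement seen in the flannel
// error output, parameterised by interface index. (Hypothetical helper.)
func forwardingCommand(ifIndex int) string {
	return fmt.Sprintf(
		"Set-NetIPInterface -ifIndex %d -AddressFamily IPv4 -Forwarding Enabled",
		ifIndex,
	)
}

// enableForwarding runs the statement through powershell.exe with the same
// switches shown in the log and returns the combined stdout/stderr.
func enableForwarding(ifIndex int) (string, error) {
	cmd := exec.Command(
		"powershell.exe",
		"-NoLogo", "-NoProfile", "-NonInteractive",
		"-Command", forwardingCommand(ifIndex),
	)
	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	const ifIndex = 25 // example value from the report; differs per node and per boot

	// Off Windows there is no powershell.exe to call, so only print the command.
	if runtime.GOOS != "windows" {
		fmt.Println("would run:", forwardingCommand(ifIndex))
		return
	}

	out, err := enableForwarding(ifIndex)
	if err != nil {
		fmt.Fprintf(os.Stderr, "enable forwarding failed: %v\noutput: %s\n", err, out)
		os.Exit(1)
	}
	fmt.Println("forwarding enabled on interface index", ifIndex)
}
```

Run it from an elevated session on the node; changing interface settings normally needs administrator rights.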

@manuelbuil (Collaborator)

> manually enabling forwarding works... but this isn't an ideal solution for a self-healing environment
> still testing, but not seeing this behavior in latest release

Agree. I just wanted to understand if it was something related to your OS


Zombro commented Jan 25, 2024

flannel pod logs:

I0125 06:16:54.576811   11888 hostgw_windows.go:258] Found &{Index:5 MTU:1500 Name:vEthernet (Ethernet0) HardwareAddr:00:50:56:b0:94:c8 Flags:up|broadcast|multicast|running} interface with IP 192.168.7.46
E0125 06:16:56.223516   11888 main.go:332] Error registering network: failed to enable forwarding on vEthernet (Ethernet0) index 5: error: exit status 1 while running the command: C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe -NoLogo -NoProfile -NonInteractive -Command $ErrorActionPreference="Stop";try { Set-NetIPInterface -ifIndex 5 -AddressFamily IPv4 -Forwarding Enabled } catch { Write-Host $_; os.Exit(-1) }. Command output: Element not found.
I0125 06:16:56.223516   11888 main.go:520] Stopping shutdownHandler...

windows host:

PS C:\Users\administrator> Get-NetIPInterface

ifIndex InterfaceAlias                  AddressFamily NlMtu(Bytes) InterfaceMetric Dhcp     ConnectionState PolicyStore
------- --------------                  ------------- ------------ --------------- ----     --------------- -----------
26      vEthernet (cbr0_ep)             IPv6                  1500              25 Enabled  Connected       ActiveStore
5       vEthernet (Ethernet0)           IPv6                  1500              25 Enabled  Connected       ActiveStore
1       Loopback Pseudo-Interface 1     IPv6            4294967295              75 Disabled Connected       ActiveStore
26      vEthernet (cbr0_ep)             IPv4                  1450              25 Disabled Connected       ActiveStore
5       vEthernet (Ethernet0)           IPv4                  1450              25 Disabled Connected       ActiveStore
1       Loopback Pseudo-Interface 1     IPv4            4294967295              75 Disabled Connected       ActiveStore
