CPEM requires Kubernetes node name to match Equinix Metal device name #533

Open
hh opened this issue Apr 17, 2024 · 6 comments

@hh

hh commented Apr 17, 2024

I'm not sure where to set providerID. I don't remember setting it in the past. Any suggestions?

CPEM daemonset

kubectl  -n kube-system  describe ds cloud-provider-equinix-metal
Name:           cloud-provider-equinix-metal
Selector:       app=cloud-provider-equinix-metal
Node-Selector:  <none>
Labels:         app=cloud-provider-equinix-metal
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 3
Number of Nodes Scheduled with Available Pods: 3
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=cloud-provider-equinix-metal
  Service Account:  cloud-provider-equinix-metal
  Containers:
   cloud-provider-equinix-metal:
    Image:      quay.io/equinix-oss/cloud-provider-equinix-metal:v3.8.0
    Port:       <none>
    Host Port:  <none>
    Command:
      ./cloud-provider-equinix-metal
      --cloud-provider=equinixmetal
      --leader-elect=true
      --authentication-skip-lookup=true
      --cloud-config=/etc/cloud-sa/cloud-sa.json
    Requests:
      cpu:        100m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /etc/cloud-sa from cloud-sa-volume (ro)
  Volumes:
   cloud-sa-volume:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         metal-cloud-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
Events:                 <none>

cloud-sa.json

kubectl  -n kube-system  get secret metal-cloud-config -o json | jq '.data["cloud-sa.json"]' -r | base64 -d | jq .
{
  "apiKey": "XXXXXXXXX",
  "projectID": "82b5c425-8dd4-429e-ae0d-d32f265c63e4",
  "metro": "sv",
  "eipTag": "eip-apiserver-sharingio",
  "eipHealthCheckUseHostIP": true,
  "loadBalancer": "metallb:///metallb-system?crdConfiguration=true"
}

CPEM logs

kubectl  -n kube-system logs ds/cloud-provider-equinix-metal | tail -10
Found 3 pods, using pod/cloud-provider-equinix-metal-bl7nh
I0417 16:37:29.152076       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.94.175:6443/healthz
E0417 16:37:29.157164       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
I0417 16:37:29.157191       1 eip_controlplane_reconciliation.go:125] handling update, node: shining-ant
I0417 16:37:29.389548       1 eip_controlplane_reconciliation.go:529] doHealthCheck(): no control plane IP assignment found, trying to assign to an available controlplane node
I0417 16:37:29.399453       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.94.167:6443/healthz
E0417 16:37:29.405800       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
I0417 16:37:29.405833       1 eip_controlplane_reconciliation.go:125] handling update, node: trusty-marmot
I0417 16:37:29.675037       1 eip_controlplane_reconciliation.go:529] doHealthCheck(): no control plane IP assignment found, trying to assign to an available controlplane node
I0417 16:37:29.683583       1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.82.49:6443/healthz
E0417 16:37:29.689076       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
@cprivitere
Member

cprivitere commented Apr 17, 2024

You shouldn't be setting providerID, that's something CPEM sets for you. Why it's not setting it here though, that's the real question. Hmm.

We had this part working in the work we did before KubeCon. Do you still have access to that config? It was probably something we had to disable on the Talos side.

@hh
Author

hh commented Apr 17, 2024

Note that CPEM is also not clearing a taint that I suspect it is responsible for:
#531

@hh
Author

hh commented Apr 17, 2024

I have another open issue related to the /healthz check: #519

@hh
Author

hh commented Apr 17, 2024

Lively conversation happening in the #support channel on the Talos / Sidero Slack: https://taloscommunity.slack.com/archives/CMARMBC4E/p1712793108556169

It seems this might be related to the deviceByName fallback, which wants the Kubernetes node names to match the Equinix Metal device names exactly.

Possibly? https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/main/metal/devices.go#L165-L167
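
To illustrate the suspected failure mode, here is a minimal Go sketch of an exact-match lookup by hostname. It is not copied from CPEM: the `device` type, the `findByName` helper, and the sample hostnames are all hypothetical, and the real deviceByName in metal/devices.go may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
)

// device is a hypothetical stand-in for an Equinix Metal device record.
type device struct {
	ID       string
	Hostname string
}

var errNotFound = errors.New("no device with matching hostname")

// findByName returns the device whose hostname equals nodeName exactly.
// There is no short-name or suffix matching, so "cp-1" does not match
// "cp-1.example.com".
func findByName(devices []device, nodeName string) (device, error) {
	for _, d := range devices {
		if d.Hostname == nodeName {
			return d, nil
		}
	}
	return device{}, fmt.Errorf("node %q: %w", nodeName, errNotFound)
}

func main() {
	// Assumed sample data: the device is registered under its FQDN.
	devices := []device{{ID: "dev-1", Hostname: "cp-1.example.com"}}

	for _, name := range []string{"cp-1", "cp-1.example.com"} {
		d, err := findByName(devices, name)
		if err != nil {
			fmt.Println("lookup failed:", err)
			continue
		}
		fmt.Printf("node %s -> device %s\n", name, d.ID)
	}
}
```

Under these assumptions, a node registered under its short name is never matched to a device, which would be consistent with the providerID staying empty.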

@hh
Author

hh commented Apr 17, 2024

Going to try setting machine.kubelet.registerWithFQDN: true in the Talos configuration.

hh added a commit to sharingio/infra that referenced this issue Apr 17, 2024
This fixes
kubernetes-sigs/cloud-provider-equinix-metal#533

// deviceByName returns an instance whose hostname matches the kubernetes node.Name
Defined here: https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/main/metal/devices.go#L165C1-L166C1

This fixes it because the logic in CPEM's deviceByName requires the
Equinix Metal device name to match the Kubernetes node name in order for
eip_controlplane_reconciliation to complete.
@hh
Author

hh commented Apr 17, 2024

I found a workaround, but it was a bit difficult to find.

sharingio/infra@96bff1f

This might be a one-off, but it might make sense to take some steps to raise visibility so others don't get stuck on this in the future:

  • the CPEM error message should clearly state the reason a match could not occur, possibly linking to documentation
  • CPEM documentation should clearly state that Kubernetes node names must match Equinix Metal device names
  • Talos documentation should probably state something similar in an updated Equinix integration page
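
As a rough idea of what the first suggestion could look like, here is a small Go sketch of a more descriptive lookup error. The `lookupError` helper and its wording are hypothetical and are not CPEM's actual error text or API; the sample names are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupError is a hypothetical helper that names both sides of the failed
// comparison: the Kubernetes node name and the device hostnames it was
// compared against.
func lookupError(nodeName string, hostnames []string) error {
	return fmt.Errorf(
		"node %q: no Equinix Metal device hostname matches exactly (hostnames seen: %s)",
		nodeName, strings.Join(hostnames, ", "))
}

func main() {
	// Assumed sample data for illustration only.
	fmt.Println(lookupError("cp-1", []string{"cp-1.example.com"}))
}
```

An error of this shape would point directly at the naming mismatch rather than at the empty providerID that results from it.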

hh changed the title from "CPEM fails to handle node health check : by failing to find providerID" to "CPEM requires Kubernetes node name to match Equinix Metal device name" on Apr 17, 2024

2 participants