Cluster Erroneously Stuck in Failed State #2146

Open
spjmurray opened this issue Jul 17, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@spjmurray
Contributor

/kind bug

What steps did you take and what happened:

Just checking the state of things in ArgoCD and noted my cluster was in the red. Boo! On further inspection I can see:

  failureMessage: >-
    Failure detected from referenced resource
    infrastructure.cluster.x-k8s.io/v1beta1, Kind=OpenStackCluster with name
    "cluster-bc3d5fc1": failed to reconcile external network: failed to get
    external network: Get
    "https://compute.sausage.cloud:9696/v2.0/networks/5617d17e-fdc1-4aa1-a14b-b9b5136c65af":
    dial tcp: lookup compute.sausage.cloud on 10.96.1.35:53: server misbehaving
  failureReason: UpdateError
  infrastructureReady: true
  observedGeneration: 2
  phase: Failed

but there is no such failure message attached to the OpenStackCluster (OSC) resource, so I'm figuring CAPO did sort itself out eventually. I'll just edit the resource, says I, and set the phase (didn't Kubernetes deem such things in the API a total fail?) back to Provisioned and huzzah. But that didn't work: the old value magically re-appeared from somewhere. I have no idea how this is even possible, but I digress...

According to kubernetes-sigs/cluster-api#10847, CAPO should only ever set these fields if the failure is terminal, and a DNS failure quite frankly isn't, especially if you are a road warrior living Mad Max style like some Antipodean Adonis, where Wi-Fi is always up and down.

What did you expect to happen:

Treat this error as transient.

Anything else you would like to add:

I'm just reaching out for discussion before I delve into the code; this may already be known about or fixed. As always, you may have opinions on how it could be fixed. Logically:

var derr *net.DNSError

if errors.As(err, &derr) {
  // handle gracefully
}

should be the simple solution, depending on how well errors are propagated from Gophercloud, which is another story entirely.
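
For illustration, here is a minimal, self-contained sketch of that check. It assumes the DNS failure reaches the caller as a standard library error chain (a *url.Error wrapping a *net.OpError wrapping a *net.DNSError, which is typically what net/http returns); whether Gophercloud preserves that chain has not been verified here. The names isDNSError and sampleLookupFailure are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// isDNSError reports whether err, or any error it wraps, is a *net.DNSError.
// Hypothetical helper, not part of CAPO or Gophercloud.
func isDNSError(err error) bool {
	var derr *net.DNSError
	return errors.As(err, &derr)
}

// sampleLookupFailure builds an error chain that resembles the failure in
// this report. The nesting is an assumption about how the error arrives.
func sampleLookupFailure() error {
	derr := &net.DNSError{
		Err:    "server misbehaving",
		Name:   "compute.sausage.cloud",
		Server: "10.96.1.35:53",
	}
	return &url.Error{
		Op:  "Get",
		URL: "https://compute.sausage.cloud:9696/v2.0/networks",
		Err: &net.OpError{Op: "dial", Net: "tcp", Err: derr},
	}
}

func main() {
	fmt.Println(isDNSError(sampleLookupFailure()))         // true
	fmt.Println(isDNSError(errors.New("quota exceeded"))) // false
}
```

If the client library flattens the error into a plain string before returning it, errors.As has nothing to unwrap and this check would silently return false, which is the propagation concern raised above.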

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): 0.10.3
  • Cluster-API version: 1.7.2
  • OpenStack version: n/a
  • Minikube/KIND version: n/a
  • Kubernetes version (use kubectl version): n/a
  • OS (e.g. from /etc/os-release): n/a
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 17, 2024
@spjmurray
Contributor Author

Ah, subresources... you can work around this with:

kubectl --kubeconfig kc patch clusters.cluster.x-k8s.io -n f766b888-7bc3-414b-9ca3-c5b4fc080c1b cluster-bc3d5fc1 --subresource status --type=json -p '[{"op":"replace","path":"/status/phase","value":"Provisioned"},{"op":"remove","path":"/status/failureReason"},{"op":"remove","path":"/status/failureMessage"}]'
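
One thing to watch with the patch above: under JSON Patch (RFC 6902) a "remove" operation fails if the target path does not exist, so the command is rejected when either failure field is already absent. The sketch below builds the same patch document but only emits "remove" operations for fields the caller says are present. The function name buildResetPatch and its boolean parameters are hypothetical; it only produces the JSON and does not talk to a cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value any    `json:"value,omitempty"`
}

// buildResetPatch returns a JSON Patch that sets the phase back to
// Provisioned and removes only the failure fields that are present.
// Hypothetical helper for illustration.
func buildResetPatch(hasReason, hasMessage bool) (string, error) {
	ops := []patchOp{{Op: "replace", Path: "/status/phase", Value: "Provisioned"}}
	if hasReason {
		ops = append(ops, patchOp{Op: "remove", Path: "/status/failureReason"})
	}
	if hasMessage {
		ops = append(ops, patchOp{Op: "remove", Path: "/status/failureMessage"})
	}
	b, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	patch, err := buildResetPatch(true, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(patch)
}
```

With both flags set, the output is the same document passed to -p in the kubectl command above.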

@cwrau
Contributor

cwrau commented Aug 5, 2024

Same thing is happening to us, kubernetes-sigs/cluster-api#10991 (comment)

Small transient problems with the OpenStack API resulting in permanently failed clusters are quite annoying; CAPO shouldn't set these fields if the errors aren't terminal.

And, to be honest, what kind of failures are terminal? Maybe "couldn't (re)-allocate specified loadbalancer IP", but I can't think of anything more.

@spjmurray
Contributor Author

I'm seeing similar problems with

    {"NeutronError": {"type": "IpAddressGenerationFailure", "message": "No more
    IP addresses available on network cc8c67a4-83a5-420d-93dd-34bba415f433.",
    "detail": ""}}

The cluster comes up eventually, so CAPO correctly treats it as transient, but it stays permanently broken on the CAPI side.

@cwrau
Contributor

cwrau commented Sep 26, 2024

As we're running our own operator on top of this, we're patching this ourselves: if the CAPI cluster has these fields but the CAPO one doesn't, we remove them from the CAPI cluster's status.

But it would be great if this were addressed.
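
The rule described in this comment can be sketched as a small predicate. This is only an illustration of the decision, not the operator's actual code: failureFields is a hypothetical stand-in for the two status fields discussed in this thread, not the real CAPI or CAPO API types, and staleFailure and strPtr are made-up names.

```go
package main

import "fmt"

// failureFields mirrors only failureReason and failureMessage.
// Hypothetical type; the real status structs carry many more fields.
type failureFields struct {
	Reason  *string
	Message *string
}

// isSet reports whether either failure field is populated.
func (f failureFields) isSet() bool {
	return f.Reason != nil || f.Message != nil
}

// staleFailure reports whether the CAPI Cluster still carries a failure
// that the CAPO OpenStackCluster no longer reports.
func staleFailure(cluster, infra failureFields) bool {
	return cluster.isSet() && !infra.isSet()
}

func strPtr(s string) *string { return &s }

func main() {
	failed := failureFields{
		Reason:  strPtr("UpdateError"),
		Message: strPtr("failed to reconcile external network"),
	}
	healthy := failureFields{}

	fmt.Println(staleFailure(failed, healthy))  // true: safe to clear
	fmt.Println(staleFailure(failed, failed))   // false: CAPO still failing
	fmt.Println(staleFailure(healthy, healthy)) // false: nothing to clear
}
```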

@yankcrime

Similar to OP, various transient network errors result in this state:

kubectl get cluster cluster-13ade6bf -n e0881433-b558-4535-bcf5-bb668dd33382 -o jsonpath='{.status.failureMessage}{"\n"}{.status.failureReason}'

Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1beta1, Kind=OpenStackCluster with name "cluster-13ade6bf": failed to reconcile security groups: Post "https://compute.xxx.com:9696/v2.0/security-group-rules": read tcp 10.0.7.162:49868->193.143.123.34:9696: read: connection reset by peer
UpdateError

With CAPI 1.7.4, patching the subresource doesn't remove the error and the cluster remains in a 'Failed' phase.
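
The DNS check proposed at the top of the thread would not catch a connection reset like this one. Below is a rough sketch of a broader classifier using only the Go standard library. Which errors count as transient is an assumption made for this example (DNS failures, timeouts, connection reset or refused), and the names isTransientNetworkError, sampleConnReset and sampleTerminal are hypothetical; none of this is existing CAPO code.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isTransientNetworkError reports whether err looks like a network blip
// that is worth retrying rather than a terminal failure.
func isTransientNetworkError(err error) bool {
	var derr *net.DNSError
	if errors.As(err, &derr) {
		return true
	}
	var nerr net.Error
	if errors.As(err, &nerr) && nerr.Timeout() {
		return true
	}
	return errors.Is(err, syscall.ECONNRESET) || errors.Is(err, syscall.ECONNREFUSED)
}

// sampleConnReset builds an error chain resembling "read: connection reset
// by peer". The exact nesting is an assumption for this example.
func sampleConnReset() error {
	opErr := &net.OpError{
		Op:  "read",
		Net: "tcp",
		Err: os.NewSyscallError("read", syscall.ECONNRESET),
	}
	return fmt.Errorf("failed to reconcile security groups: %w", opErr)
}

// sampleTerminal is a plain error with no network cause in its chain.
func sampleTerminal() error {
	return errors.New("invalid flavor specified")
}

func main() {
	fmt.Println(isTransientNetworkError(sampleConnReset())) // true
	fmt.Println(isTransientNetworkError(sampleTerminal()))  // false
}
```

As with the DNS check, this only works if the underlying error types survive wrapping on the way up from the HTTP client.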

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2024
@spjmurray
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2025
@roehlc

roehlc commented Jan 21, 2025

We also ran into this bug. The easy workaround to patch the status subresource works for us.

I think this bug will probably get fixed once the new conditions are fully adopted and the fields are removed (CAPI proposal kubernetes-sigs/cluster-api#10897 and the issue for CAPO #2374).

As this might be a few releases in the future, do you think we should also address this with the old conditions?
