Flux deadlocked, all resources stalled but everything fine in cluster #4752
Comments
Flux by default doesn't do that; the fault here is probably in the way you configured it. Unless you provide your whole configuration and explain when the timeout occurs, I don't see how anyone can help you.
I can't copy-paste the configs for a few reasons, but I'll prepare something as soon as possible. The issue is definitely real.
Here's the infrastructure. I'll let you be the judge, but I really don't think there is anything wrong with the configuration; I believe I'm hitting an edge case that isn't covered yet. It runs fine on 2 clusters with 40+ services and only causes this issue when the image is not available "in time" (within 5-10 minutes). Once the "context deadline" is exceeded, that's it: the release is stuck until I remove the service from the repository and re-add it, which causes 1-2 minutes of downtime for our clients.
In hindsight I have to apologize for the wording above: I have since noticed that I am still able to update and deploy other things, and Flux still picks up new releases of the "broken" service. But it never notices that the service is running fine in the cluster, regardless of how long I wait.
I have followed the guide on how to set up the Flux config repository. Directory overview:
apps/base/service/kustomization.yaml
apps/base/service/repository.yaml
apps/base/service/release.yaml
apps/base/service/git-deploy-key.yaml
apps/production/kustomization.yaml
apps/production/patches.yaml
clusters/production/flux-system -> standard 2 autogenerated files from flux
clusters/production/apps.yaml
clusters/production/infrastructure.yaml
infrastructure/controllers/production/kustomization.yaml
infrastructure/controllers/production/weave-gitops.yaml
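For context, here is a minimal sketch of what a release.yaml in this kind of layout often looks like, including the install/upgrade remediation retries that control whether a timed-out release is retried. All names, versions, and intervals below are hypothetical placeholders, not taken from the reporter's actual configuration:

```yaml
# Hypothetical apps/base/service/release.yaml (HelmRelease v2beta2, matching
# the helm-controller version from the flux check output below).
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: my-service          # placeholder name
  namespace: my-namespace   # placeholder namespace
spec:
  interval: 10m
  timeout: 5m               # Helm install/upgrade timeout; "context deadline exceeded" is reported when this elapses
  chart:
    spec:
      chart: my-service
      version: "1.2.3"      # bumped by the build workflow
      sourceRef:
        kind: HelmRepository
        name: my-service
  install:
    remediation:
      retries: 3            # retry a failed install instead of leaving the release in a failed state
  upgrade:
    remediation:
      retries: 3            # likewise for upgrades that time out waiting for the image
```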
What is a service in this context? Is it some Deployment inside a Helm chart?
Describe the bug
Our build workflow first increments the version of our services, then builds the new version. This works fine for over 30 services, except for one that takes a bit longer to build.
Steps to reproduce
Update a chart version.
Wait until "context deadline exceeded"
Upload the docker-image for that chart version.
Kubernetes Cluster catches on and uses the new image
Flux gets stuck and refuses to apply any new changes
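For anyone reproducing this, a stuck release can usually be inspected and manually retriggered with the Flux CLI; this is a sketch only, and the release name and namespace are placeholders:

```sh
# Show HelmRelease status; a release stuck on the timeout typically shows
# Ready=False with the "context deadline exceeded" message.
flux get helmreleases -n my-namespace

# Ask the helm-controller to reconcile the release again once the image exists.
flux reconcile helmrelease my-service -n my-namespace

# If the remediation retries are already exhausted, suspending and resuming
# usually clears the failed state and triggers a fresh reconciliation.
flux suspend helmrelease my-service -n my-namespace
flux resume helmrelease my-service -n my-namespace
```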
Expected behavior
Flux should notice that the new version is available and already running on the cluster.
Flux should never go into a complete deadlock just because one thing doesn't work.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
flux: v2.2.2
Flux check
► checking prerequisites
✗ flux 2.2.2 <2.2.3 (new CLI version is available, please upgrade)
✔ Kubernetes 1.27.11-gke.1062000 >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.2.2
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.2.3
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed
Git provider
GitHub
Container Registry provider
Google Cloud
Additional context
No response