Cluster-state drift detection #643

hiddeco · 2023-03-10T16:38:48Z

📣 Announcement

We are excited to announce a new (long requested) feature in the helm-controller - drift detection! This is now available in >=v0.37.0, and can be enabled by configuring a HelmRelease with .spec.driftDetection.mode set to enabled.

To enable drift detection without correction, set .mode to warn.

🔎 What is drift detection?

Drift detection allows you to detect any unintentional changes to a resource in your Kubernetes cluster that may have occurred outside of your Helm release process. The feature uses the same approach as kustomize-controller to detect drift by performing a dry-run Server Side Apply of the rendered manifests of a release. When drift is detected, the controller will emit an Event and recreate and/or patch the Kubernetes resources.

💥 Current limitations

The detected diff is only logged in the controller logs when --log-level is set to debug. In the Kubernetes Event, only the creation or change of a Kubernetes resource is reported with a brief summary of the changes.
There is no flux diff command available (yet) to manually inspect or detect drift.

📚 References

Drift detection: https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection
Drift correction: https://fluxcd.io/flux/components/helm/helmreleases/#drift-correction
Ignore rules: https://fluxcd.io/flux/components/helm/helmreleases/#ignore-rules

☎️ Request for Feedback

Please note that this feature is still in its early stages and lacks certain UX features. However, we encourage you to try it out and provide feedback on your experience with it, including any issues you encounter or suggestions for improvements.

Thank you for your help in making the controller even more powerful and reliable!

The text was updated successfully, but these errors were encountered:

antaloala · 2023-03-11T11:47:09Z

Thanks for so smart implementation of this "so desired" feature ;-)
To be sure I fully understood it from https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection description

When a HelmRelease is in-sync with the Helm release object in the storage, the controller will compare the manifests from the Helm storage with the current state of the cluster using a server-side dry-run apply. If this comparison detects a drift (either due resource being created or modified during the dry-run), the controller will perform an upgrade for the release, restoring the desired state

So helm-controller, for each reconciled HelmRelease object, will periodically (driven by .spec.interval setting):

Read the yaml manifests (having been rendered with those applied Helm param settings and saved by Helm in the k8s Secret object for the corresponding Helm release revision) that were applied by Helm in the latest succeed revision of the involved Helm release.
Perform a dry-run SSA for each
Compare the result with the current "content/intent" of each of the API objects for those manifests used to run the dry-run SSA step (including also the detection of missing API objects)

If any drift is detected -> a new Helm upgrade procedure is performed using the referred Helm chart.

Did I get it right? If so, some few additional questions:

Q1: I understand that Helm upgrade procedure triggered to correct the drift follows the .spec.upgrade settings in the involved HelmRelease object (so, as an example, triggering again the pre/post-upgrade hooks if .spec.upgrade.disableHooks not set or set to false), right?
Q2: Is the step 1 above ignoring the manifests annotated as pre/post-install/upgrade/rollback/test hooks?
(I understand it should in order to avoid possible mistakes, e.g. TTL having been set in the involved hook Jobs or annotated hook policy asking the hook API object to be deleted if succeed).
Q2: All this only happens if the latest version of the involved Helm release succeed (i.e. we did not get any Helm error/failure in the previous Helm install/upgrade/rollback procedure for the involved HelmRelease object), right?
(I understand this what you mean when saying When a HelmRelease is in-sync with the Helm release object in the storage ... just to confirm it).

Thanks in advance.

hiddeco · 2023-03-11T13:32:52Z

Compare the result with the current "content/intent" of each of the API objects for those manifests used to run the dry-run SSA step (including also the detection of missing API objects)

We look at if the object was created or changed during the dry-run apply. If it changed, we look at the comparison of "old" (the resource as present on the cluster) and "new" (the resource after the resource as defined in the release has been dry-run applied) to provide a diff in the logs. But the "created" or "changed" is already sufficient to determine if we should trigger a new release, as it means the resource in the cluster has either been deleted or mutated.

[..] as an example, triggering again the pre/post-upgrade hooks if .spec.upgrade.disableHooks not set or set to false

This is correct.

Is the step 1 above ignoring the manifests annotated as pre/post-install/upgrade/rollback/test hooks?

This is correct, as hooks are not part of the manifest blob in a Helm release storage object, but stored separately.

All this only happens if the latest version of the involved Helm release succeed (i.e. we did not get any Helm error/failure in the previous Helm install/upgrade/rollback procedure for the involved HelmRelease object), right?

This is correct.

h4wkmoon · 2023-03-12T08:53:34Z

Hi, I've read the prometheus example around drift-detection. If I understand well, PrometheusRules are excluded from drift-detection. So, when someone deletes/modifies one of the mix-in rules, the change is not reverted by flux ?

cwrau · 2023-03-13T08:50:37Z

I tried this out for a little bit and it seems to be working, correctly recreated a deployment I deleted 👍️

hiddeco · 2023-03-13T12:04:00Z

If I understand well, PrometheusRules are excluded from drift-detection.

This is correct, as PrometheusRules objects are annotated by a mutating webhook which (at present) does not get triggered by a dry-run Server Side Apply, causing the controller to detect spurious drift:

PrometheusRule/monitoring/kube-prometheus-stack-etcd diff:
.metadata.annotations.prometheus-operator-validated
+true
PrometheusRule/monitoring/kube-prometheus-stack-general.rules diff:
.metadata.annotations.prometheus-operator-validated
+true
...

We are looking into options to e.g. define a YAML path to ignore to circumvent this issue, rather than ignoring the resources in full.

h4wkmoon · 2023-03-13T12:30:23Z

If I understand well, PrometheusRules are excluded from drift-detection.

This is correct, as PrometheusRules objects are annotated by a mutating webhook which (at present) does not get triggered by a dry-run Server Side Apply, causing the controller to detect spurious drift:
PrometheusRule/monitoring/kube-prometheus-stack-etcd diff:
.metadata.annotations.prometheus-operator-validated
+true
PrometheusRule/monitoring/kube-prometheus-stack-general.rules diff:
.metadata.annotations.prometheus-operator-validated
+true
...
We are looking into options to e.g. define a YAML path to ignore to circumvent this issue, rather than ignoring the resources in full.

Sounds really promising. Thank you.

in preparation of fluxcd/helm-controller#643 we raise the reconcile times, as the reconciliation might now really do some work, finally!

avthart · 2023-05-16T09:10:48Z

Thanks for this feature!

@hiddeco We have two questions regarding drift detection:

Would be great to have support for flux diff helmrelease xyz. Is this already planned?
Any plans to add new prometheus metrics so that we can have alerting on drift detection?

hiddeco · 2023-05-17T09:38:12Z

Would be great to have support for flux diff helmrelease xyz. Is this already planned?

This is indeed planned, but not expected to land before end Q2 / sometime in Q3.

Any plans to add new prometheus metrics so that we can have alerting on drift detection?

Can you please create a separate issue for this?

benarnold42 · 2023-06-16T14:24:12Z

Would be great to have support for flux diff helmrelease xyz. Is this already planned?

This is indeed planned, but not expected to land before end Q2 / sometime in Q3.

Is there an issue or something I could follow?

kingdonb · 2023-07-07T16:58:07Z

@benarnold42 I think this is the same issue in principle (flux build helmrelease implies flux diff as well):

fluxcd/flux2#2808

prologic · 2023-10-23T23:23:50Z

What does this error mean?

ProgressingWithRetry Detecting drift for revision xxx with a timeout of 4m30s

hiddeco · 2023-12-12T17:52:09Z

Have updated the issue to reflect the changes since the v0.37.0 release, included in Flux v2.2.0. For more information, refer to this blog post or the documentation.

siegenthalerroger · 2023-12-13T15:28:55Z

Is it possible to set the default drift detection to be mode: enabled or mode: warn without just applying a global kustomize patch? If not that's definitely a feature we'd be interested in.

hiddeco · 2023-12-13T15:57:09Z

I think a better solution than controller global defaults is to make use of a policy agent which enforces certain default configurations through validation, an example of this are Kyverno policies (https://kyverno.io/docs/writing-policies/). This ensures you stick to a pure declarative model which can easily be reapplied elsewhere while functioning the same.

The reason the feature was previously available as a global configuration option was due to it being experimental, and us having questions around how we wanted to introduce it in the API.

siegenthalerroger · 2023-12-14T09:45:04Z

Alright, I understand the reasoning. Thanks for the feedback.

cwrau · 2023-12-14T11:35:52Z

I think a better solution than controller global defaults is to make use of a policy agent which enforces certain default configurations through validation, an example of this are Kyverno policies (kyverno.io/docs/writing-policies). This ensures you stick to a pure declarative model which can easily be reapplied elsewhere while functioning the same.

The reason the feature was previously available as a global configuration option was due to it being experimental, and us having questions around how we wanted to introduce it in the API.

I don't really understand this, isn't this way less declarative than just specifying a global default?

We'd also be interested in enabling this feature by default without having to touch every HelmRelease and definitely without having to write random mutating hooks in random policy agents.

Especially since a global default is much more portable than "you have to install another component you didn't want and depending on what policy agent you decide on you have to define this value differently. Oh, and don't forget that this doesn't apply to already existing resources, so, yeah, should've thought about this earlier"

fspaniol · 2023-12-21T12:55:43Z

Thanks for this cool feature :)

One question: I see that in both the debug logs and in the events, we print something like:

X/a changed (0 additions, 1 changes, 1 removals)
Y/B changed (0 additions, 1 changes, 0 removals)

I would be interested in what are the actual changes, additions and removals before changing the mode from "warn" to "enabled". Is there any way to do so? Thanks!

hiddeco · 2023-12-22T09:40:44Z

If you set -–log-level to debug, the controller will log the calculated JSON Patch for the object.

In the new year, we are planning to add a command to flux to allow getting this information forward without having to resort to the logs.

cwrau · 2024-01-10T09:38:51Z

If flux detects some kind of drift, e.g. I delete a resource, does flux execute the hooks? A bit of testing gave me the impression that the hooks won't be executed, in particular https://github.com/teutonet/teutonet-helm-charts/blob/main/charts/base-cluster/templates/flux/_create-authentication-key-secret-job.yaml

Seeing the secrets, or the helm history, no upgrade has been run

Is the step 1 above ignoring the manifests annotated as pre/post-install/upgrade/rollback/test hooks?

This is correct, as hooks are not part of the manifest blob in a Helm release storage object, but stored separately.

Could this be implemented? We rely on hook for various dynamic things in our helm charts

cwrau · 2024-01-31T09:24:56Z

Btw, the current implementation, with .spec.driftDetection.mode=disabled being the default is actually a breaking change, as before I could flux reconcile a HelmRelease to recreate missing resources and now this does nothing.

So creating a flux-global flag or just enabling this by default should be done.

stefanprodan · 2024-01-31T10:10:03Z

as before I could flux reconcile a HelmRelease to recreate missing resources and now this does nothing.

Before helm-controller had no drift detection capability so it couldn't recreate missing resources.

Legion2 · 2024-01-31T10:23:35Z

I used flux suspend hr followed by flux resume hr to reconcile HelmRelease before

stefanprodan · 2024-01-31T10:34:34Z

@Legion2 to achieve the same thing with Flux 2.2 do flux reconcile hr --force.

cwrau · 2024-02-05T09:33:35Z

@Legion2 to achieve the same thing with Flux 2.2 do flux reconcile hr --force.

Nice, but still, a global flag would be amazing, because I see no other good way of setting this globally. And setting this on all of my dozens of HelmReleases is quite annoying and error prone

gelato · 2024-02-09T09:23:13Z

+1 for global setting, this would be amazing, otherwize this is unusable

stefanprodan · 2024-02-09T09:33:01Z

HelmReleases are applied on the cluster by Flux Kustomizations, adding a patch to enable drift detection for all ca be done with:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec: 
  patches:
    - patch: |
        apiVersion: helm.toolkit.fluxcd.io/v2beta2
        kind: HelmRelease
        metadata:
          name: all
        spec:
          driftDetection:
            mode: enabled
      target:
        kind: HelmRelease

cwrau · 2024-02-09T11:35:41Z

HelmReleases are applied on the cluster by Flux Kustomizations, adding a patch to enable drift detection for all ca be done with:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec: 
  patches:
    - patch: |
        apiVersion: helm.toolkit.fluxcd.io/v2beta2
        kind: HelmRelease
        metadata:
          name: all
        spec:
          driftDetection:
            mode: enabled
      target:
        kind: HelmRelease

Mh, that would be an annoying but semi workable solution, although I'm not happy with that

kingdonb · 2024-02-09T14:24:16Z

The drift detection is not enabled by default because it introduces risk. It would be a goal of the v2beta2 -> v2 update to make this safe enough to enable by default. Enabling this globally introduces hazards that you should be monitoring for, and mitigating. The hazards are not uncommon, they occur in highly popular charts like kube-prometheus-stack and other common places, like anything that uses an HPA.

https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection

which gets a reference from:

https://fluxcd.io/flux/installation/configuration/helm-drift-detection/

covers the hazard, how to monitor for it, and the mitigation.

This is important, users who enable drift detection may have to read it all and take action to prevent Flux from wasting a lot of resources dry-cycling and upgrading forever. Not every configuration will be affected, but when we introduce global configuration it brings the risk that some combination of defaults we have not considered is not good, and in this case, we've advised many users to do retries: -1 – that could result in major issues in combination with the flag enabled. And it could be very difficult to diagnose because you don't only need to consider the content of the HelmRelease.spec, you also need to consider global configuration – which the users may not even know about or have access to flux-system.

When we can condense all that into a single action that Flux can take on the user's behalf, then I think it will be possible to do global enable again. Ideally it's not even a feature flag, we just turn it on, since this is what users expect from the definition of GitOps. But as long as the users expectations may be violated by the hazards, certain HelmReleases are probably going to need some careful handling by an admin, and I would expect the global enable to generate more confused support inquiries than it resolves. We understand the current state isn't optimal, that's why it's v2beta2 and not yet v2.

Thanks for your patience as we make progress on this.

stefanprodan · 2024-02-09T15:28:46Z

It would be a goal of the v2beta2 -> v2 update to make this safe enough to enable by default.

We can’t make this safe as it’s not in our control. There are lots of charts out there which mutate the state from jobs running as Helm hooks, not to mention HPA/VPA. People need to opt-in and set ignore rules.

gelato · 2024-02-09T15:52:28Z

Why then ArgoCD manages to enable this globally somehow and they are just fine with that? Every global chart out there has some instructions to make it work with Argo properly (the same thing with exclude rules) and you for some reason encourage people to waste their time building circumventions for poor design...

cwrau · 2024-02-09T16:12:41Z

Why then ArgoCD manages to enable this globally somehow and they are just fine with that? Every global chart out there has some instructions to make it work with Argo properly (the same thing with exclude rules) and you for some reason encourage people to waste their time building circumventions for poor design...

Because ArgoCD doesn't really support Helm Charts, they only use it for simple templating and don't do hooks or API access. 😥

While flux does it correctly, allowing API calls, hooks, ... 🥳

Which is also why ArgoCD and flux can't really be compared 🤷

kingdonb · 2024-02-09T16:19:14Z

They do hooks, in a subtly incompatible way that will only bite you when you're getting deeply invested in Helm charts, you won't even notice it's a problem at first:

https://blog.teamhephy.info/blog/posts/tutorials/argocd-sentry-apps.html

It has been the story for years, ArgoCD maintainers recommend Flux if you need "perfect helm installations"

argoproj/argo-cd#4288 (comment)

This is one of the leading reasons why create Flamingo

We can’t make this safe as it’s not in our control.

I think the jury is out, there is some mechanism we can add to make it safe, but letting users opt in to the behavior and read the manual is a fine solution IMO. It may be better if the mechanism comes along that lets us short-circuit failures so we don't spin out of control. (Or the appropriate controls may have already made it in, I don't have a test that I can point at that says there is still any problem here...)

I agree that it's not safe to enable globally, if I haven't made that clear, and I won't recommend it because of the danger. This is because we use Helm SDK from upstream, and that's why you won't find this report under ArgoCD anywhere.

kdeyko · 2024-10-17T16:19:43Z

If you set -–log-level to debug, the controller will log the calculated JSON Patch for the object.

In the new year, we are planning to add a command to flux to allow getting this information forward without having to resort to the logs.

Hey @hiddeco,

Any update on this command?

hiddeco · 2024-10-17T17:26:39Z

Unfortunately, Weaveworks who used to pay me for my work on Flux went out of business.

Because of this, my contributions to Flux this year have been minimal — and I do not think I will be able to contribute the functionality we had planned back then this year. This does however not mean someone else would not be able to pick this up (and also doesn't implicate Flux is no longer maintained).

If someone were to pick this up, a couple of notes:

There is logic in internal/reconcile which is the driving force behind the drift functionality, this logic can be consumed by a potential new command in https://github.com/fluxcd/flux2 to print a diff.

It should also be noted that the bits in internal/action have been prepped in such a way that you can add e.g. --debug options to for example mimic a helm upgrade --debug using a HelmRelease. Which is another (command) use case I had in mind while working on this.

stefanprodan · 2024-10-17T23:26:01Z

I'm all for implementing the flux diff helmrelease feature, but this would require a lot of my time and also for a 2nd maintainer to review all the code changes across Flux packages, controllers and CLI. My focus for this year is in other areas of the project, but if this feature request would come from a Flux Enterprise customer, then my priorities would change.

hiddeco pinned this issue Mar 10, 2023

hiddeco mentioned this issue Mar 10, 2023

Cluster-state manifest drift correction #186

Closed

hiddeco added enhancement New feature or request request for feedback Feedback is requested from users labels Mar 10, 2023

cwrau added a commit to teutonet/teutonet-helm-charts that referenced this issue Mar 21, 2023

feat(base-cluster): drift detection

8cb8792

in preparation of fluxcd/helm-controller#643 we raise the reconcile times, as the reconciliation might now really do some work, finally!

cwrau added a commit to teutonet/teutonet-helm-charts that referenced this issue Mar 21, 2023

feat(base-cluster): drift detection

84b44f9

in preparation of fluxcd/helm-controller#643 we raise the reconcile times, as the reconciliation might now really do some work, finally!

cwrau mentioned this issue Mar 21, 2023

feat(base-cluster): drift detection teutonet/teutonet-helm-charts#300

Merged

cwrau added a commit to teutonet/teutonet-helm-charts that referenced this issue Mar 21, 2023

feat(t8s-cluster): drift detection

b7e654a

in preparation of fluxcd/helm-controller#643 we raise the reconcile times, as the reconciliation might now really do some work, finally!

cwrau mentioned this issue Mar 21, 2023

feat(t8s-cluster): drift detection teutonet/teutonet-helm-charts#301

Merged

cwrau added a commit to teutonet/teutonet-helm-charts that referenced this issue Mar 23, 2023

feat(base-cluster): drift detection (#300)

8d27167

in preparation of fluxcd/helm-controller#643 we raise the reconcile times, as the reconciliation might now really do some work, finally!

cwrau added a commit to teutonet/teutonet-helm-charts that referenced this issue Mar 23, 2023

feat(t8s-cluster): drift detection (#301)

8e07e94

in preparation of fluxcd/helm-controller#643 we raise the reconcile times, as the reconciliation might now really do some work, finally!

cwrau mentioned this issue Apr 11, 2023

feat(t8s-cluster): prepare for flux drift-detection teutonet/teutonet-helm-charts#322

Merged

kingdonb mentioned this issue Jun 26, 2023

helmrelease "upgrade retries exhausted" regression #454

Closed

1 task

jack1902 mentioned this issue Oct 18, 2023

helmRelease dry-run failed #796

Closed

kingdonb mentioned this issue Dec 5, 2023

(Pre-)Release v2.2.0 Blog fluxcd/website#1733

Closed

hiddeco changed the title ~~Experimental cluster-state drift detection~~ Cluster-state drift detection Dec 12, 2023

hiddeco mentioned this issue Dec 13, 2023

helmrelase Unable to perceive changes in resources #847

Closed

stefanprodan mentioned this issue Dec 22, 2023

Helm 2.2.2 doesn't recover K8s deployments that were deleted when the HelmRelease was suspended and later resumed fluxcd/flux2#4511

Closed

1 task

cwrau mentioned this issue Jan 31, 2024

upgrade all HelmRelease to v2beta2 teutonet/teutonet-helm-charts#740

Closed

Cluster-state drift detection #643

Cluster-state drift detection #643

Comments

hiddeco commented Mar 10, 2023 • edited Loading

📣 Announcement

🔎 What is drift detection?

💥 Current limitations

📚 References

☎️ Request for Feedback

antaloala commented Mar 11, 2023

hiddeco commented Mar 11, 2023

h4wkmoon commented Mar 12, 2023

cwrau commented Mar 13, 2023

hiddeco commented Mar 13, 2023

h4wkmoon commented Mar 13, 2023

avthart commented May 16, 2023

hiddeco commented May 17, 2023

benarnold42 commented Jun 16, 2023

kingdonb commented Jul 7, 2023

prologic commented Oct 23, 2023

hiddeco commented Dec 12, 2023 • edited Loading

siegenthalerroger commented Dec 13, 2023

hiddeco commented Dec 13, 2023

siegenthalerroger commented Dec 14, 2023

cwrau commented Dec 14, 2023

fspaniol commented Dec 21, 2023

hiddeco commented Dec 22, 2023

cwrau commented Jan 10, 2024 • edited Loading

cwrau commented Jan 31, 2024

stefanprodan commented Jan 31, 2024

Legion2 commented Jan 31, 2024

stefanprodan commented Jan 31, 2024

cwrau commented Feb 5, 2024

gelato commented Feb 9, 2024

stefanprodan commented Feb 9, 2024 • edited Loading

cwrau commented Feb 9, 2024

kingdonb commented Feb 9, 2024 • edited Loading

stefanprodan commented Feb 9, 2024

gelato commented Feb 9, 2024

cwrau commented Feb 9, 2024

kingdonb commented Feb 9, 2024 • edited Loading

kdeyko commented Oct 17, 2024

hiddeco commented Oct 17, 2024 • edited Loading

stefanprodan commented Oct 17, 2024

hiddeco commented Mar 10, 2023 •

edited

Loading

hiddeco commented Dec 12, 2023 •

edited

Loading

cwrau commented Jan 10, 2024 •

edited

Loading

stefanprodan commented Feb 9, 2024 •

edited

Loading

kingdonb commented Feb 9, 2024 •

edited

Loading

kingdonb commented Feb 9, 2024 •

edited

Loading

hiddeco commented Oct 17, 2024 •

edited

Loading