Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(helm-controller): add ADR on helm controller replacement #22

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
157 changes: 157 additions & 0 deletions architecture-decision-records/008-greenhouse-helm-controller-ng.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,157 @@
# 008-greenhouse-helm-controller-ng

- Status: [accepted] <!-- optional -->
- Deciders: [Abhijith R., David G., Ivo G.] <!-- optional -->
- Date: [2024-12-16] <!-- optional. To customize the ordering without relying on Git creation dates and filenames -->
- Tags: [greenhouse / cloudoperators] <!-- optional -->
- Technical Story: [plugins] <!-- optional -->

## Context and Problem Statement

Greenhouse uses [helm.sh](https://helm.sh/docs/) to deploy the Plugins to Kubernetes Clusters. The Helm lifecycle is implemented in the Helm Controller. The Helm Controller is a simple controller that deploys the Chart and rolls it back if it fails. It does not handle releases stuck in Pending, releases constantly failing, or rollback to previous versions.

There have been various occasions where this controller has not been able to handle the lifecycle of the Helm releases. This has led to manual intervention to fix the releases.

Examples of issues:

- Helm releases stuck in Pending
- Helm releases constantly failing
- Helm Charts always producing a new `helm diff` between two deployments
- Helm Charts not rolling back to the previous version, in case the initial install fails

There are two options to solve this problem:

1. Invest more development time into the existing Helm Controller to make it more robust and support all the above features.
2. Replace the Helm Controller with a more robust solution that can handle the above features.

## Decision Drivers <!-- optional -->

- Simplicity:
- The solution should work without any enduser facing changes.
- Control:
- The solution should allow to control the lifecycle of the Helm releases.
- Stability:
- The solution should be stable and reliable.
- The lifecycle of the Helm releases should be handled correctly.
- Development cost:
- Time spend on the development of the solution.

## Considered Options

- Helm Controller with more development time
- FluxCD Helm Operator
- ArgoCD

## Decision Outcome

Chosen option: "FluxCD Helm Operator",
because the migration from the current Helm controller to Helm releases by Flux can be done seamlessly. The CRDs of Flux support the same features as the Plugin and PluginDefinition CRDs. With the additional benefit that FluxCD can be run in a Hub-Spoke model, where a central FluxCD instance inside the Greenohuse cluster can manage multiple clusters.
The development cost is expected to be less than the continued development of the Helm Controller. This is due to the fact that the FluxCD Helm Operator is a proven solution that is widely adopted in the CNCF landscape. There will be continuous effort to operate the Flux deployment but this would also be the case for the Helm Controller.

### Positive Consequences <!-- optional -->

- Close to a drop-in replacement for the current Helm Controller
- Proven solution that is widely adopted in the CNCF landscape
- Community support
- No changes for the enduser

### Negative Consequences <!-- optional -->

- Continuous effort to operate the Flux deployment
- Initial effort to implement a new controller, that translates from Greenhouse CRDs into Flux CRDs
- Initial effort to onboard the team with FluxCD

### Helm Controller with more development time

The Helm controller supports the basic lifecycle around Helm actions. The Helm actions such as `install`, `upgrade`, `delete`, `rollback`, `diff` are already provided by upstream Helm. This means the controller manages the install, upgrade, and delete of a Plugin's Helm releases. Furthermore, it can diff the Helm release and rollback in case the prevoius upgrade fails.

Currently there is a list of features that is not supported:

- handling releases stuck with a pending status
- releases constantly failing (exponential backoff, but no circuit breaker/ limit of retries)
- rollback to previous versions on-demand
- circuit breaker for releases that constantly create a new diff
- ... anything else that we might need in the future

| Decision Driver | Rating | Reason |
|---------------------|--------|-------------------------------|
| Simplicity | +++ | Good, because virtually no changes for the enduser. | |
| Control | ++ | Good, because we have the whole source code, besides upstream helm, under our control. |
| Stability | o | Neutral, because the current state of the controller is lacking some key features. We only have limited time and a limited amount of hands to keep working on the Helm Controller. |
| Development cost | -- | Negative, because there are a multitude of features missing to make the controller work reliable and to support the full plugin lifecycle. |

### FluxCD Helm Operator

[FluxCD](https://github.com/fluxcd/flux2) provides a suite of controllers around the lifecycle of Helm Releases. These controllers are proven to be stable and reliable.

In order to replace the HelmController which Greenhouse uses, it is necessary to create Flux CRDs from the Plugin. This can be done by implementing a new controller that works with the Plugin CRDs.

The flow for the interaction between the Greenhouse and the FluxCD Helm Operator can look as follows:

```mermaid
flowchart LR
subgraph Greenhouse
subgraph "Namespace [greenhouse]"
pc[PluginController]
fc[FluxCD HelmController]
end
pd[PluginDefinition]
subgraph "Namespace [my-org]"
p[Plugin]
c[Cluster]
s["Secret [type: greenhouse.sap/kubeconfig]"]

hrepo["HelmRepository(source.toolkit.fluxcd.io/v2)"]
hr["HelmRelease(helm.toolkit.fluxcd.io/v2)"]
end
end

subgraph "Kubernetes Cluster"
subgraph "namespace"
rmr["Helm Release"]
end
end

c --"owns"--> s
p -..-> c
p -..-> pd

pc --"reconciles"-->p
pr --"creates"--> hrepo
pr --"creates"--> hr
hr -..-> s

fc --"reconciles"--> hr
fc --> rmr
```

The enduser facing changes are none. All the data required by the FluxCD Helm Operator is already present in the Plugin, PluginDefinition and Cluster CRDs.
The flux resources in the org namespace can even be hidden from the enduser, as they should be managed only by Greenhouse. The enduser should only see the status of Flux's HelmRelease resource reflected in the Plugin's status.

| Decision Driver | Rating | Reason |
|---------------------|--------|-------------------------------|
| Simplicity | +++ | Good, because there are no changes for the enduser | |
| Control | - | Negative, while FluxCD is open-source there is no guarantee to have proposed a feature request or bug fix merged. |
| Stability | +++ | Good, because this is a proven solution that is widely adopted in the CNCF landscape. |
| Development cost | o/- | Neutral/Negative, because there is development effort to implement the controller that translates from Greenhouse CRDs into Flux CRDs. A benefit is that the Flux CRDs map with the current features of the Plugin & PluginDefinition CRDs. Also the Status must be reliably be transferred back to the Plugin. Other existing controllers (HelmTest, WorkloadStatus) may need to be adjusted. Another factor is the operations of the Flux controller(s) which is a continuous effort. |

### ArgoCD

ArgoCD is a GitOps continuous delivery tool for Kubernetes. It follows the GitOps pattern of using Git repositories as the source of truth for defining the desired application state. It also supports Helm Charts as the source for Kubernetes Manifests but does not use Helm directly. All manifests are rendered and managed by ArgoCD itself. This means there are no Helm releases in the Kubernetes cluster.

It also supports a Hub-Spoke model where one ArgoCD instance can manage multiple clusters. Furthermore, there are CRDs available to specify the desired state of the application. Secrets cannot be referenced easily from Kubernetes Secrets.

Helm Charts that automatically generate values such as passwords or certificates do not work well out of the box with ArgoCD. The problem is that Argo will template the Helm Chart every time and compare with the deployed resources. This will always result in a diff. Therefore, it is necessary to adjust the values in such a way that they are stable between multiple executions of helm templates.

| Decision Driver | Rating | Reason |
|---------------------|--------|-------------------------------|
| Simplicity | - | Negative, because it requires reworking of how the Helm Charts are configured. Helm Charts with generated values need to be abjusted. | |
| Control | - | Negative, while ArgoCD is open-source there is no guarantee to have proposed a feature request or bug fix merged. |
| Stability | +++ | Good, because this is a proven solution that is widely adopted in the CNCF landscape. |
| Development cost | -- | Negative, the CRDs from ArgoCD do not map directly to what the Plugins specify for the release. Currently the Plugins are strongly tied with Helm releases, Argo manages the resources without Helm. |

## Related Decision Records <!-- optional -->

none.

## Links <!-- optional -->
Loading