Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opt-in allowing cascading deletes of namespaces #119

Open
4 tasks
erikgb opened this issue Feb 23, 2024 · 3 comments
Open
4 tasks

Opt-in allowing cascading deletes of namespaces #119

erikgb opened this issue Feb 23, 2024 · 3 comments

Comments

@erikgb
Copy link
Contributor

erikgb commented Feb 23, 2024

What

As a cluster-admin I want to onboard and eventually terminate tenants in our cluster by creating and eventually deleting a root namespace for the tenant. The onboarding process works like a charm, but I am not too happy about the termination process - which we had an opportunity to try out this week. Let me explain:

We use a GitOps process where tenants are onboarded via a pull request to our "tenants" project. A tenant is in this project defined by some simple resources:

  • An Accurate root namespace with some templated resources defined in our tenant-template Accurate template namespace.
  • A Flux gitops-reconciler service account granted admin permissions in the tenant root namespace. Both the SA and admin role binding are configured with propagation to sub-namespaces. This allows the tenant to use Flux to provision most resources using a GitOps process in their namespace tree.
  • Flux GitRepository and Kustomization resources pointing to a Git project controlled by the tenant - allowing the tenant to bootstrap their resources.

So far, so good. But this week we received a request for a tenant termination. Using a modern GitOps tool like Flux, with pruning enabled, we thought that it was as simple as reverting the onboarding PR in Git. So we did that, after getting the PR approved by the tenant responsible. What we forgot to do, was to check if there were sub-namespaces defined under the tenant root namespace. After merging the tenant termination PR, Flux immediately reported an error: It got (correctly) blocked by the Accurate namespace webhook:

delete failed, errors: Namespace/blnc delete failed: admission webhook "namespace.accurate.cybozu.io" denied the request: child namespaces exist;
kustomization/flux-tenants.flux-tenants

But this error was reported only once, which kind of surprised me - as Flux is usually constantly reconciling until the actual state equals the desired state. And as I suspected, the tenant namespace was still present - including the child namespaces that we did not think about. However the Flux controller resource had no knowledge of the resources it used to control anymore, which left the tenant root namespace (including children) as orphans in our cluster. This is something we are trying hard to avoid.

After cleaning up manually, I reached out to the Flux maintainers on Slack, and you can read all the details in the CNCF Slack thread - if you are interested. TL;DR: Flux maintainers think this is an issue with Accurate, and I tend to agree now.

To fix this for future tenant terminations in our clusters, I suggest adding an opt-in allowing us to configure the Accurate namespace webhook allowing cascade namespace DELETE requests.

How

I have some ideas after looking at the code, but please let me know what you think first! I will update this description once we agree on a solution. I'll be happy to submit a PR fixing this.

Checklist

  • Finish implementation of the issue
  • Test all functions
  • Have enough logs to trace activities
  • Notify developers of necessary actions
@erikgb
Copy link
Contributor Author

erikgb commented Feb 25, 2024

The easiest way to implement this would be to configure an Accurate installation to allow cascading deletes. It would be better if it could be expressed in some declarative way per namespace/sub-namespace, but I fear the complexities in an implementation of this. Looking at the docs for HNC in this area, it's not clear to me how this works - even after reading it multiple times. 😄

My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default. WDYT? Would it be possible to implement without adding a lot of complexity?

@yamatcha
Copy link
Contributor

My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default

I don't think it is a good idea to cascade deletions by default. Because it is breaking changes and very dangerous.
It would not be complicated if the implementation were to allow cascade deletion only for the cluster administrator.

@erikgb
Copy link
Contributor Author

erikgb commented Oct 29, 2024

My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default

I don't think it is a good idea to cascade deletions by default. Because it is breaking changes and very dangerous. It would not be complicated if the implementation were to allow cascade deletion only for the cluster administrator.

@yamatcha I agree allowing cascade deletes by default could represent a risk for some users. But I would not consider it a breaking change: Anything that worked before will still work. That said, you rule in this project, and adding this enhancement as an opt-in is fine with me. 😸

Our most common problem with the present validation is that removing a tenant from our clusters requires manual intervention. We promote GitOps as much as possible, and our tenants are onboarded with a pull request to our tenants-repo. When PR is merged, our GitOps agent will create an Accurate root namespace for the tenant and bootstrap the tenant's GitOps agent with default user-facing RBAC in their newly created root namespace. This will allow the tenant to self-provision anything granted by RBAC in their root namespace, which also includes creating sub-namespaces. And if the tenant decides to create one or more sub-namespaces, our GitOps agent suddenly loses the ability to delete the tenant's root namespace. This is not right IMO. If our desired state (in Git) specifies that the tenant root namespace should cease to exist, our GitOps agent should be able to delete it! Right now, we have to clean this up manually, which is something we would like to avoid.

I suggest starting with a controller config allowing the namespace DELETE validating webhook to be disabled. It could make sense to optionally disable the subnamespaces DELETE validating webhook also, but let's leave that for an eventual follow-up PR. Allowing us, as responsible for our clusters, the ability to manage namespaces is the most urgent issue. Note: I want to allow any user with RBAC present to manage namespaces freely, and not only "the cluster administrator". This is because we have least-privileged agents configured with elevated privileges, but not cluster-admin RBAC.

WDYT? Is there a chance I can get something as described eventually merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants