
Allow partial consolidation of nodes with blocking PDBs #1176

Closed
hintofbasil opened this issue Apr 8, 2024 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@hintofbasil

hintofbasil commented Apr 8, 2024

Description

What problem are you trying to solve?

We run clusters with relatively large nodes (~100 pods), significant autoscaling of deployments, and some slow pod start up/termination times.
After we update a NodePool, we monitor the status of the nodes and wait for all drifted nodes to be replaced as part of our continuous delivery (CD) pipeline. This makes the speed at which Karpenter removes drifted nodes from our clusters significant, as it can block our CD pipelines for long periods of time.

Currently, Karpenter will only consolidate a drifted node if all of its pods are evictable (no PDBs are blocking) (source).
Karpenter also runs one disruption command at a time, which can mean many minutes pass between drift reconciliations.
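
To make that current behaviour concrete, here is a minimal sketch of the all-or-nothing check described above, using illustrative names and standard Kubernetes client types rather than Karpenter's actual code:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// allPodsEvictable reports whether every pod on a node could be evicted right
// now without violating a PodDisruptionBudget. Under the current behaviour, a
// drifted node is only a consolidation candidate when this returns true.
// Names are illustrative, not Karpenter's.
func allPodsEvictable(pods []corev1.Pod, pdbs []policyv1.PodDisruptionBudget) bool {
	for _, pod := range pods {
		for _, pdb := range pdbs {
			if pdb.Namespace != pod.Namespace {
				continue
			}
			selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
			if err != nil || !selector.Matches(labels.Set(pod.Labels)) {
				continue
			}
			// A matching PDB with no disruptions allowed blocks this pod,
			// which blocks consolidation of the entire node.
			if pdb.Status.DisruptionsAllowed <= 0 {
				return false
			}
		}
	}
	return true
}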

This often results in a situation where Karpenter checks so infrequently, and the odds of a single PDB blocking consolidation are so high, that it takes many hours for large nodes to become valid consolidation candidates. It would be preferable for us to drain all the pods that are currently evictable and then wait for the PDBs to allow eviction of the remaining pods.

This will require an API change to configure. The sensible place would be the disruption section of the NodePool spec, keeping it in line with the Disruption Controls design.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    evictionPolicy: FullNodeOnly || BestEffort  # FullNodeOnly requires all pods to be evictable
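
On the API side, the addition could be a small enum-valued field on the disruption spec. The type and value names below are purely a hypothetical sketch of the YAML above, not Karpenter's actual types:

package sketch

// EvictionPolicy is a hypothetical enum matching the proposed YAML field.
type EvictionPolicy string

const (
	// EvictionPolicyFullNodeOnly keeps today's behaviour: a node is only
	// disrupted once every pod on it is evictable.
	EvictionPolicyFullNodeOnly EvictionPolicy = "FullNodeOnly"
	// EvictionPolicyBestEffort drains the pods that are currently evictable
	// and waits for PDBs to permit eviction of the rest.
	EvictionPolicyBestEffort EvictionPolicy = "BestEffort"
)

// Disruption sketches where the field could sit, alongside the existing
// disruption controls in the NodePool spec.
type Disruption struct {
	// +kubebuilder:validation:Enum:={FullNodeOnly,BestEffort}
	// +optional
	EvictionPolicy EvictionPolicy `json:"evictionPolicy,omitempty"`
}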

How important is this feature to you?

We currently have a hacky workaround where we kubectl drain drifted nodes, since kubectl drain evicts pods as they become evictable. We would prefer a proper implementation in the controller.

I am interested in working on this if there is agreement that the feature is useful and we can agree on the API.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@hintofbasil hintofbasil added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2024
@jmdeal
Member

jmdeal commented Apr 9, 2024

I think this ties into some of the discussion that's been taking place in #1047. I definitely see the use case here and think this falls under a broader story we've been discussing: ensuring cluster operators have mechanisms to disrupt nodes regardless of pod level configuration (PDBs, the karpenter.sh/do-not-disrupt annotation, etc).

Making sure I understand your use case, you want drifted nodes to be immediately eligible to begin draining. The drain should occur without violating PDBs. It is acceptable for nodes to remain on the cluster partially drained for an indefinite period of time to ensure this. Does this sound about right?

I want to draw particular attention to that last point: that it is acceptable for nodes to remain partially drained indefinitely. This is something we've been hesitant to enable, hence the current behavior where we wait until we know the node can be disrupted successfully. One option I posed over in #1047 is enabling this immediate drain behavior if and only if the corresponding NodeClaim has a terminationGracePeriod set (#834). This would ensure we don't get into a state where a node drains indefinitely, since the terminationGracePeriod sets an upper bound on node lifetime once the drain begins. Does something like this address your use case, or is it important for PDBs to never be violated, even if this could result in partially drained nodes remaining on the cluster indefinitely?
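
For illustration, the upper bound described here could be as simple as the check below; the function and parameter names are assumptions for the sketch, not Karpenter's implementation:

package sketch

import "time"

// pastDrainDeadline reports whether a draining node has exceeded its
// terminationGracePeriod. Until the deadline passes, PDBs are respected and
// blocked pods simply delay the drain; once it passes, the node is terminated
// regardless, so a partially drained node cannot linger forever.
func pastDrainDeadline(drainStarted time.Time, terminationGracePeriod time.Duration, now time.Time) bool {
	return now.Sub(drainStarted) >= terminationGracePeriod
}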

@hintofbasil
Author

Thanks for the response, Jason.

Making sure I understand your use case, you want drifted nodes to be immediately eligible to begin draining. The drain should occur without violating PDBs. It is acceptable for nodes to remain on the cluster partially drained for an indefinite period of time to ensure this. Does this sound about right?

Yeah, this is correct.

Does something like this address your use case or is it important for PDBs to never be violated, even if this could result in partially drained nodes remaining on the cluster indefinitely?

It is important for us that PDBs are never violated. We are happy with nodes left in a partially drained state indefinitely. We have alerting which will highlight this to us so that we can investigate why the PDB is blocking the full drain.

We could theoretically set the terminationGracePeriod to a very large duration to give us plenty of time to investigate manually, but we would prefer an option where forced termination is never a possibility.

@jonathan-innis jonathan-innis removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 22, 2024
@jonathan-innis
Member

It is important for us that PDBs are never violated. We are happy with nodes left in a partially drained state indefinitely. We have alerting which will highlight this to us so that we can investigate why the PDB is blocking the full drain

I'm actually curious if #623 would help you as well here, and whether you even need a disruptionToleration. In reality, the disruption toleration duration is there to force removal of nodes by a given time, but it sounds like you do still want some form of graceful termination, so you are really looking for us to stop scheduling pods (or at least weight away from scheduling onto the nodes that are drifted). FWIW, if we go with the PreferNoSchedule taint here (which is what we are leaning toward), I'm not sure how much this will help if your entire NodePool gets drifted at once, since all nodes will be treated the same.

Karpenter also runs one disruption command at a time which can lead to the duration between drift reconciliations being many minutes

Which version are you running? With the changes introduced in v0.34.0, this should no longer be the case. We should be able to run at a much higher parallelism and you should see us react much quicker to drift events. Have you tried the disruption budget changes? Do you think that these changes might help mitigate your existing problem?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Sep 19, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
