
Ability to Scale Karpenter Provisioned Nodes To 0 On Demand Or By Schedule During Off Hours #1177

Open
ronberna opened this issue Apr 9, 2024 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments


ronberna commented Apr 9, 2024

Description

What problem are you trying to solve?
We've recently begun migrating from ASGs (Auto Scaling groups) and CAS (Cluster Autoscaler) to Karpenter. With ASGs, as part of our cost-saving measures, our EKS clusters in lower environments are scaled down during off hours and weekends, and then scaled back up during office hours. This was done by a lambda that runs at a scheduled time and sets the ASG's min/max/desired settings to 0. The current min/max/desired values are captured and stored in SSM before the update to 0. For the scale-up, the lambda reads this SSM parameter and restores the ASG's min/max/desired values. With Karpenter, this is not possible.
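
As a rough illustration of that setup, the scale-down/scale-up the lambda performs looks like the following (a minimal sketch; the ASG name and SSM parameter name are illustrative):

# Off hours: capture the current ASG sizing in SSM, then scale the ASG to 0.
current=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-eks-asg \
  --query 'AutoScalingGroups[0].[MinSize,MaxSize,DesiredCapacity]' --output text)
aws ssm put-parameter --name /eks/my-eks-asg/sizing --value "$current" --type String --overwrite
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-eks-asg \
  --min-size 0 --max-size 0 --desired-capacity 0

# Office hours: read the stored values back from SSM and restore the ASG.
read -r min max desired <<< "$(aws ssm get-parameter --name /eks/my-eks-asg/sizing --query 'Parameter.Value' --output text)"
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-eks-asg \
  --min-size "$min" --max-size "$max" --desired-capacity "$desired"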

As a workaround, we have a lambda that will patch the cpu limit of the nodepool and set it to 0 so that no new Karpenter nodes will be provisioned. The lambda will then take care of deleting the previously provisioned Karpenter nodes. We have a mix of workloads running in the cluster with some using HPA and some not, so trying to scale down all of the deployments to remove the Karpenter provisioned nodes will not work. It has also been suggested to delete the nodepool and reapply it via a cronjob. This option will also not work since some of our clusters are in a controlled environment.
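
The core of that workaround can be sketched with kubectl (assuming a NodePool named "default"; the lambda effectively does the same through the Kubernetes API):

# Stop new provisioning by patching the NodePool's cpu limit to 0...
kubectl patch nodepool default --type merge -p '{"spec":{"limits":{"cpu":"0"}}}'
# ...then delete the existing Karpenter-provisioned nodes; Karpenter's finalizer
# drains each node and terminates the backing instance.
kubectl delete nodes -l karpenter.sh/nodepool=default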

The ask here is to introduce a feature in Karpenter that will handle scaling all Karpenter-provisioned nodes down/up on demand, via a flag or possibly via an update of the cpu limit: Karpenter will not provision any new nodes and will also clean up previously provisioned nodes without having to introduce additional cronjobs or lambdas, or delete nodepools.

How important is this feature to you?
This feature is important as it will help with AWS cost savings by not having EC2 instances running during off hours and by not having to add additional components (lambdas, cronjobs, etc.) to aid with scaling Karpenter-provisioned instances.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ronberna ronberna added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 9, 2024
@jonathan-innis jonathan-innis removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 22, 2024
@jonathan-innis
Member

Karpenter will not provision any new nodes and will also clean up previously provisioned nodes without having to introduce additional cronjobs or lambdas, or delete nodepools

We've had some conversation about this among the maintainers. IMO, this feature basically comes down to -- should we consolidate based on limits? If you apply a more restrictive limit to your NodePool, does that mean that you are implying that the NodePool should deprovision nodes until it gets back to complying with its limits?

IMO: This strikes me as an intuitive desired-state mechanism -- you have set a new desired state on your NodePool, implying that you no longer support a given capacity. Now comes the more difficult question: Should Karpenter force application pods off of your nodes unsafely if you have enforced stricter limits on your NodePool and those pods have nowhere else to schedule? This breaks current assumptions that we have around the safety of disruption -- that is, when we disrupt a node (unless it is due to a spot interruption), we assume that we can reschedule the existing pods on the node onto some other capacity (either existing or new). This feature would have us force-delete pods regardless of whether they can schedule or not -- which starts to look a bit scary.

This option will also not work since some of our clusters are in a controlled environment

I know you mentioned that you can't delete the NodePool to spin down nodes but I'm curious what you mean by "controlled environment". Wouldn't updating the limits also cause similar changes to your cluster that, I assume, would also be subject to this "controlled environment"?

@ronberna
Author

If you apply a more restrictive limit to your NodePool, does that mean that you are implying that the NodePool should deprovision nodes until it gets back to complying with its limits?

Yes, I believe this is what is being implied. If the cpu limit is set to 0, that would mean that we want to deprovision existing nodes, similar to setting the min/max/desired values to 0 for an ASG. Even something similar to an ASG Scheduled Action would work, where I could create a configuration inside the NodePool that deprovisions existing nodes and prevents any additional nodes from spinning up.
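
For comparison, the ASG-side primitive I'm referring to is a Scheduled Action, e.g. (a sketch; the ASG name, times, and sizes are illustrative):

# Scale the ASG to 0 every weekday evening...
aws autoscaling put-scheduled-update-group-action --auto-scaling-group-name my-eks-asg \
  --scheduled-action-name off-hours-scale-down --recurrence "0 18 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0
# ...and back up every weekday morning.
aws autoscaling put-scheduled-update-group-action --auto-scaling-group-name my-eks-asg \
  --scheduled-action-name work-hours-scale-up --recurrence "0 6 * * 1-5" \
  --min-size 1 --max-size 10 --desired-capacity 3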

A flaw that we've uncovered with our current approach of using a lambda to patch the cpu limit to 0 and then delete existing Karpenter-provisioned nodes is that if a node was provisioned right before the cpu limit was set and is now in the "NotReady" state, it will not get cleaned up: it is not yet recognized as an active node, so it keeps running. We're having to come up with a way to rerun the lambda multiple times to make sure such nodes get cleaned up when this happens. We not only have to remove the finalizer from the node before deleting it from the cluster, we also have to terminate the instance in AWS, since a kubectl delete node removes the node from the cluster but does not terminate it in AWS. As long as the node still exists in AWS, Karpenter will not provision a new node.
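
For reference, the extra cleanup for such a stuck node ends up looking roughly like this (a sketch; the node name is illustrative, and clearing finalizers bypasses Karpenter's normal drain-and-terminate flow, which is exactly why the instance then has to be terminated separately):

NODE=ip-10-0-1-23.ec2.internal   # example stuck NotReady node
# Resolve the backing EC2 instance from the node's providerID (aws:///<az>/<instance-id>).
INSTANCE_ID=$(kubectl get node "$NODE" -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
# Remove the finalizers so the Node object can actually be deleted from the cluster...
kubectl patch node "$NODE" --type merge -p '{"metadata":{"finalizers":null}}'
kubectl delete node "$NODE"
# ...then terminate the instance in AWS, since deleting the Node object alone leaves it running.
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"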

Should Karpenter force application pods off of your nodes unsafely if you have enforced stricter limits on your NodePool and those pods have nowhere else to schedule?

Yes. This is the behavior that currently happens with ASGs. Our pods stay in a Pending state until the next workday, when the ASG min/max/desired settings are updated back to their work-hour values. With no nodes running during non-work hours, our savings are pretty significant.

I know you mentioned that you can't delete the NodePool to spin down nodes but I'm curious what you mean by "controlled environment".

By controlled environment we mean that certain changes to the environment have to go through change control (testing the change, creating a change request, verifying test results, getting approvals to implement said request, implementing the change, verifying the change). Doing this daily is not feasible, IMO. Yes, technically patching the limit is subject to the "controlled environment", but based on our current process it's easier to patch the cpu limit with a scheduled lambda function than to delete an entire k8s resource and then go through the steps mentioned above to kick off a pipeline to get the resource re-applied. That's why the ask here is to have this feature built into Karpenter. If designed properly, IMO, this would be a huge win.


cp1408 commented May 24, 2024

You can use the YAML below to delete and re-create the Karpenter NodePool. The logic is to delete the NodePool on Friday and re-create it on Sunday, so its nodes are removed over the weekend.

---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: karpenter-cron
  name: karpenter-cron
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: karpenter-cron
  name: karpenter-cron
  namespace: karpenter-cron
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: karpenter-cron
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["get", "list", "create", "delete", "patch"]  # get/list/patch are needed for the kubectl get/apply calls in the CronJobs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: karpenter-cron
subjects:
- kind: ServiceAccount
  name: karpenter-cron
  namespace: karpenter-cron
roleRef:
  kind: ClusterRole
  name: karpenter-cron
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-cron-cm
  namespace: karpenter-cron
data:
  karpenter-nodepool.yaml: |
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      disruption:
        budgets:
        - nodes: 10%
        consolidationPolicy: WhenUnderutilized
        expireAfter: 720h
      limits:
        cpu: 1000
      template:
        spec:
          nodeClassRef:
            name: default
          requirements:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: karpenter.k8s.aws/instance-category
            operator: In
            values:
            - t
            - r
            - m
            - c
          - key: karpenter.k8s.aws/instance-generation
            operator: Gt
            values:
            - "2"
          - key: karpenter.sh/capacity-type
            operator: In
            values:
            - on-demand
          - key: karpenter.k8s.aws/instance-cpu
            operator: In
            values:
            - "4"
            - "8"
            - "16"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: karpenter-nodepool-delete-cron
  namespace: karpenter-cron
spec:
  schedule: "55 17 * * FRI"
  startingDeadlineSeconds: 20
  successfulJobsHistoryLimit: 1
  suspend: false
  jobTemplate:
    spec:
      completions: 1
      ttlSecondsAfterFinished: 10
      parallelism: 1
      template:
        spec:
          containers:
          - name: karpenter-scale
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "List all the karpneter nodes"
              kubectl get nodes -l karpenter.sh/nodepool
              echo "List nodepool"
              kubectl get nodepool
              echo "Deleting NodePool"
              kubectl delete nodepool default
              sleep 5s
              echo "List all the karpneter nodes"
              kubectl get nodepool -A
              kubectl get nodes -l karpenter.sh/nodepool
              echo "script executed"
              echo "completed"
          restartPolicy: OnFailure
          serviceAccountName: karpenter-cron
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: karpenter-nodepool-create-cron
  namespace: karpenter-cron
spec:
  schedule: "55 17 * * SUN"
  startingDeadlineSeconds: 20
  successfulJobsHistoryLimit: 1
  suspend: false
  jobTemplate:
    spec:
      completions: 1
      ttlSecondsAfterFinished: 10
      parallelism: 1
      template:
        spec:
          containers:
          - name: karpenter-scale
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "creating nodepool"
              kubectl apply -f /home/karpenter-nodepool.yaml
              echo "nodepool created"
              kubectl get nodepool -o yaml
              sleep 5s
            volumeMounts:
            - name: karpenter-nodepool
              mountPath: /home
          restartPolicy: OnFailure
          serviceAccountName: karpenter-cron
          volumes:
            - name: karpenter-nodepool
              configMap:
                name: karpenter-cron-cm

@ronberna
Copy link
Author

Unfortunately, deleting and re-applying NodePool resources is not an option for us. What would be ideal, IMO, would be to have something like the disruption budget schedule that we could set to scale down all instances provisioned by a given NodePool.
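
For context, NodePool disruption budgets can already be scheduled (v1beta1 excerpt below), but a budget only limits how many nodes Karpenter may voluntarily disrupt during a window. What we're after is effectively the inverse: a scheduled setting that actively drains the NodePool to zero and lets it scale back up afterwards.

spec:
  disruption:
    budgets:
    # Existing syntax: block voluntary disruption during business hours.
    - nodes: "0"
      schedule: "0 9 * * mon-fri"
      duration: 8h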
