
The batchIdleDuration value is ignored #1173

Open
MauroSoli opened this issue Apr 8, 2024 · 6 comments · May be fixed by #1244

@MauroSoli

Description

Observed Behavior:

Karpenter always waits for the full batchMaxDuration and ignores the batchIdleDuration value when several pods are in pending status.

I changed the values to batchIdleDuration=10s and batchMaxDuration=90s to make the behaviour clearer.

As you can see in the following image, the pods had been in the pending state for more than 60 seconds and Karpenter had still not created a new NodeClaim.

(screenshot: pods still in Pending state after more than 60 seconds)

Here are the controller logs:

{"level":"DEBUG","time":"2024-04-08T17:36:48.408Z","logger":"controller.nodeclass","message":"discovered security groups","commit":"17dd42b","ec2nodeclass":"default","security-groups":["sg-*******"]}
{"level":"DEBUG","time":"2024-04-08T17:36:48.417Z","logger":"controller.nodeclass","message":"discovered kubernetes version","commit":"17dd42b","ec2nodeclass":"default","version":"1.29"}
{"level":"DEBUG","time":"2024-04-08T17:36:48.421Z","logger":"controller.pricing","message":"updated spot pricing with instance types and offerings","commit":"17dd42b","instance-type-count":762,"offering-count":1927}
{"level":"DEBUG","time":"2024-04-08T17:36:48.605Z","logger":"controller.nodeclass","message":"discovered amis","commit":"17dd42b","ec2nodeclass":"default","ids":"ami-00a591fc812caa612, ami-0fba9088684638922, ami-0fba9088684638922, ami-0ec163aec4a900dc0","count":4}
{"level":"DEBUG","time":"2024-04-08T17:36:49.553Z","logger":"controller.pricing","message":"updated on-demand pricing","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.692Z","logger":"controller.disruption","message":"discovered instance types","commit":"17dd42b","count":704}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered offerings for instance types","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered zones","commit":"17dd42b","zones":["eu-central-1c","eu-central-1b","eu-central-1a"]}
{"level":"DEBUG","time":"2024-04-08T17:38:51.395Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}
{"level":"DEBUG","time":"2024-04-08T17:45:31.914Z","logger":"controller.provisioner","message":"73 out of 704 instance types were excluded because they would breach limits","commit":"17dd42b","nodepool":"nodepool-eu-central-1a"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"17dd42b","nodeclaims":1,"pods":10}
{"level":"INFO","time":"2024-04-08T17:45:31.952Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17dd42b","nodepool":"nodepool-eu-central-1a","nodeclaim":"nodepool-eu-central-1a-8rzln","requests":{"cpu":"210m","memory":"41200Mi","pods":"14"},"instance-types":"c6a.12xlarge, c6a.16xlarge, c6a.24xlarge, c6a.32xlarge, c6a.8xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-04-08T17:45:32.461Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","launch-template-name":"karpenter.k8s.aws/12721094165069015774","id":"lt-0b41adc259308f38f"}
{"level":"DEBUG","time":"2024-04-08T17:45:34.326Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"17dd42b"}
{"level":"INFO","time":"2024-04-08T17:45:34.478Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","instance-type":"z1d.3xlarge","zone":"eu-central-1a","capacity-type":"spot","allocatable":{"cpu":"11900m","ephemeral-storage":"224Gi","memory":"88002Mi","pods":"110","vpc.amazonaws.com/pod-eni":"54"}}
{"level":"DEBUG","time":"2024-04-08T17:45:55.545Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1a)","subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)"]}
{"level":"INFO","time":"2024-04-08T17:46:11.252Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-04-08T17:46:24.385Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal","allocatable":{"cpu":"11900m","ephemeral-storage":"240506825124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"94711256Ki","pods":"110"}}
{"level":"DEBUG","time":"2024-04-08T17:46:57.299Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}

Expected Behavior:

The NodeClaim should be created once the batchIdleDuration has passed without any new pending pods appearing on the cluster.

Versions:

  • Chart Version: 0.35.4
  • Kubernetes Version (kubectl version): v1.29.1-eks-508b6b3
@MauroSoli added the kind/bug (Categorizes issue or PR as related to a bug.) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Apr 8, 2024
@jigisha620 (Contributor)

@MauroSoli Just to be sure, would you mind verifying whether BATCH_MAX_DURATION and BATCH_IDLE_DURATION are being set properly? You can do this by running kubectl describe pod <karpenter pod>. Also, there is not much we can see from the logs you have shared. Can you share logs with the message found provisionable pod(s) so we can see when the actual scheduling of the pods happened?

What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments?
You have mentioned that the pods are in the pending state for more than 60 seconds. Given that the node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?

@MauroSoli (Author)

Just to be sure, would you mind verifying whether BATCH_MAX_DURATION and BATCH_IDLE_DURATION are being set properly? You can do this by running kubectl describe pod <karpenter pod>

$ kubectl describe pod karpenter-678d69d4d5-6rpgw -n karpenter | grep BATCH_
      BATCH_MAX_DURATION:  90s
      BATCH_IDLE_DURATION: 10s

Also, there is not much we can see from the logs you have shared. Can you share logs with the message found provisionable pod(s) so we can see when the actual scheduling of the pods happened?

The logs I've shared already contain the line you are looking for:

{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}

What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments?

We use Karpenter only to manage some specific workloads, such as builds or cron jobs.
The cluster-autoscaler manages all other workloads.

You have mentioned that the pods are in the pending state for more than 60 seconds. Given that the node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?

That's not what I meant.
After 60 seconds, while the pods were still pending, Karpenter had not created a new NodeClaim yet; it only did so after 90 seconds.

@jigisha620 (Contributor)

Let's say you schedule a single pod and no new pod arrives within 10 seconds; Karpenter will then begin scheduling a new NodeClaim after those 10 seconds. However, if another pod comes up before the 10 seconds elapse, the batching window will be extended up to the maxDuration.

From the logs that you have shared it seems like there were 10 pods, and in that case Karpenter would wait until the batchMaxDuration of 90s before it begins scheduling a NodeClaim.
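
To illustrate, the batching window described in the docs works roughly like the following sketch. This is a minimal, invented Go example: the batch function, the channel of pod names, and the timings are made up here, and this is not Karpenter's actual batcher code.

package main

import (
	"fmt"
	"time"
)

// batch drains pod names from the channel into a single batch. The window
// closes when either no new pod has arrived for idleDuration, or maxDuration
// has elapsed since the first pod arrived.
func batch(pods <-chan string, idleDuration, maxDuration time.Duration) []string {
	first := <-pods // the first pending pod opens the window
	collected := []string{first}

	idle := time.NewTimer(idleDuration)
	hardCap := time.NewTimer(maxDuration)
	defer idle.Stop()
	defer hardCap.Stop()

	for {
		select {
		case p := <-pods:
			collected = append(collected, p)
			// Each new pod restarts the idle timer; the hard cap keeps
			// counting from the first pod and bounds the whole window.
			if !idle.Stop() {
				<-idle.C
			}
			idle.Reset(idleDuration)
		case <-idle.C:
			return collected // quiet for idleDuration: close the window
		case <-hardCap.C:
			return collected // maxDuration reached: close the window
		}
	}
}

func main() {
	pods := make(chan string, 10)
	go func() {
		for i := 1; i <= 3; i++ {
			pods <- fmt.Sprintf("pause%d", i)
			time.Sleep(200 * time.Millisecond)
		}
	}()
	// With idle=1s and max=5s, the window closes ~1s after pause3 arrives.
	fmt.Println(batch(pods, time.Second, 5*time.Second))
}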

@lesinigo commented Apr 9, 2024

I've been able to reproduce this using the latest Karpenter v0.35.4 and the latest EKS v1.29.1-eks-508b6b3, and it seems to me there could be some misunderstanding between what is stated above, what is stated in the documentation, and the actual Karpenter behavior.

If I set IDLE_DURATION and MAX_DURATION high and far enough apart (say 10 seconds and 300 seconds) and run a script like for I in $(seq 10); do kubectl run <single_pod>; sleep 2; done, the following happens:

  1. new pods are created
  2. the "nothing happens" interval between one pod and the next is always less than 10 seconds
  3. creating all the pods takes longer than IDLE_DURATION
  4. after the last pod is created, Karpenter waits until 300 seconds have passed since the creation of the first pod
  5. Karpenter creates new nodeclaim(s) to fit all the pending pods

In other words: the batch window gets immediately extended to BATCH_MAX_DURATION as soon as the creation of the second pod interrupts the first BATCH_IDLE_DURATION interval.

Personally, I understood the current documentation in a different way. It says "BATCH_IDLE_DURATION The maximum amount of time with no new pending pods that if exceeded ends the current batching window. If pods arrive faster than this time, the batching window will be extended up to the maxDuration. If they arrive slower, the pods will be batched separately.". I understood it as follows:

  1. one or more new pods are created
  2. Karpenter notices and starts waiting the IDLE duration
  3. if nothing new happens, the batch window is immediately closed and node claims are computed
  4. otherwise, if new pods are created before BATCH_IDLE_DURATION elapses, jump to 2 (starting a new wait for the IDLE duration)
  5. if pods keep getting created before an IDLE_DURATION interval ends, Karpenter keeps extending the batch window in blocks of IDLE_DURATION, until the whole window (measured from the creation of the first pod) exceeds BATCH_MAX_DURATION
  6. if the window extends beyond BATCH_MAX_DURATION, it is immediately closed, node claims are computed, and the whole process starts again from 1

In other words: BATCH_IDLE_DURATION would always be the maximum "inactivity time" after which Karpenter starts computing node claims. To me, this makes much more sense because it allows shorter latency between workload demand (pod creation) and its execution (node claims computed, nodes created, pods running), while MAX_DURATION still guarantees an upper bound on that latency by closing the batch window even if new pods keep arriving.

To me, the documentation seems correct and describes the behavior I would expect, but Karpenter is misbehaving: it honors IDLE_DURATION only the first time, then skips directly to MAX_DURATION, causing higher latency between pod creation and the nodes starting.
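
To make the difference concrete, here is a tiny Go simulation of the repro above (10 pods created 2 seconds apart, IDLE_DURATION=10s, MAX_DURATION=300s). The closeTime helper and the two policies are invented purely for illustration; none of this is taken from Karpenter's source.

package main

import (
	"fmt"
	"time"
)

// closeTime returns when a batching window closes, given pod arrival times
// relative to the first pod. resetIdle selects the interpretation:
//   true  = every new pod restarts the idle timer (documented behavior)
//   false = once the first idle interval is interrupted, the window only
//           closes at maxDur (observed behavior)
func closeTime(arrivals []time.Duration, idle, maxDur time.Duration, resetIdle bool) time.Duration {
	end := arrivals[0] + idle
	for _, a := range arrivals[1:] {
		if a >= end {
			break // the window had already closed before this pod arrived
		}
		if resetIdle {
			end = a + idle
		} else {
			end = maxDur
		}
	}
	if end > maxDur {
		end = maxDur
	}
	return end
}

func main() {
	// 10 pods, one every 2 seconds, as in the kubectl run loop above.
	var arrivals []time.Duration
	for i := 0; i < 10; i++ {
		arrivals = append(arrivals, time.Duration(i)*2*time.Second)
	}
	idle, maxDur := 10*time.Second, 300*time.Second
	fmt.Println("documented:", closeTime(arrivals, idle, maxDur, true))  // 28s after the first pod
	fmt.Println("observed:  ", closeTime(arrivals, idle, maxDur, false)) // 5m0s after the first pod
}

Under the documented reading the window closes 28 seconds after the first pod (10 seconds of quiet after the last one); under the observed behavior it only closes at the 300-second cap, which matches the higher latency reported here.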

@jigisha620 (Contributor)

I had probably misunderstood what was being implied earlier. I was able to reproduce this and am looking into a fix. Thanks.

@jonathan-innis removed the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) label on Apr 15, 2024
@jonathan-innis (Member)

/assign @jigisha620
