
The batchIdleDuration value is ignored #1173

Open
MauroSoli opened this issue Apr 8, 2024 · 6 comments · May be fixed by #1244

@MauroSoli

Description

Observed Behavior:

Karpenter always waits for the full batchMaxDuration and ignores the batchIdleDuration value when several pods are in pending status.

I changed the values to batchIdleDuration=10s and batchMaxDuration=90s to make the behaviour clearer.

As you can see in the following image, the pods had been in the pending state for more than 60 seconds and Karpenter had still not created a new NodeClaim.

(screenshot: pods still in Pending state after more than 60 seconds)

Here are the controller logs:

{"level":"DEBUG","time":"2024-04-08T17:36:48.408Z","logger":"controller.nodeclass","message":"discovered security groups","commit":"17dd42b","ec2nodeclass":"default","security-groups":["sg-*******"]}
{"level":"DEBUG","time":"2024-04-08T17:36:48.417Z","logger":"controller.nodeclass","message":"discovered kubernetes version","commit":"17dd42b","ec2nodeclass":"default","version":"1.29"}
{"level":"DEBUG","time":"2024-04-08T17:36:48.421Z","logger":"controller.pricing","message":"updated spot pricing with instance types and offerings","commit":"17dd42b","instance-type-count":762,"offering-count":1927}
{"level":"DEBUG","time":"2024-04-08T17:36:48.605Z","logger":"controller.nodeclass","message":"discovered amis","commit":"17dd42b","ec2nodeclass":"default","ids":"ami-00a591fc812caa612, ami-0fba9088684638922, ami-0fba9088684638922, ami-0ec163aec4a900dc0","count":4}
{"level":"DEBUG","time":"2024-04-08T17:36:49.553Z","logger":"controller.pricing","message":"updated on-demand pricing","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.692Z","logger":"controller.disruption","message":"discovered instance types","commit":"17dd42b","count":704}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered offerings for instance types","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered zones","commit":"17dd42b","zones":["eu-central-1c","eu-central-1b","eu-central-1a"]}
{"level":"DEBUG","time":"2024-04-08T17:38:51.395Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}
{"level":"DEBUG","time":"2024-04-08T17:45:31.914Z","logger":"controller.provisioner","message":"73 out of 704 instance types were excluded because they would breach limits","commit":"17dd42b","nodepool":"nodepool-eu-central-1a"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"17dd42b","nodeclaims":1,"pods":10}
{"level":"INFO","time":"2024-04-08T17:45:31.952Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17dd42b","nodepool":"nodepool-eu-central-1a","nodeclaim":"nodepool-eu-central-1a-8rzln","requests":{"cpu":"210m","memory":"41200Mi","pods":"14"},"instance-types":"c6a.12xlarge, c6a.16xlarge, c6a.24xlarge, c6a.32xlarge, c6a.8xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-04-08T17:45:32.461Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","launch-template-name":"karpenter.k8s.aws/12721094165069015774","id":"lt-0b41adc259308f38f"}
{"level":"DEBUG","time":"2024-04-08T17:45:34.326Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"17dd42b"}
{"level":"INFO","time":"2024-04-08T17:45:34.478Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","instance-type":"z1d.3xlarge","zone":"eu-central-1a","capacity-type":"spot","allocatable":{"cpu":"11900m","ephemeral-storage":"224Gi","memory":"88002Mi","pods":"110","vpc.amazonaws.com/pod-eni":"54"}}
{"level":"DEBUG","time":"2024-04-08T17:45:55.545Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1a)","subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)"]}
{"level":"INFO","time":"2024-04-08T17:46:11.252Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-04-08T17:46:24.385Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal","allocatable":{"cpu":"11900m","ephemeral-storage":"240506825124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"94711256Ki","pods":"110"}}
{"level":"DEBUG","time":"2024-04-08T17:46:57.299Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}

Expected Behavior:

The NodeClaim should be created once the batchIdleDuration has passed without any new pending pods appearing on the cluster.

Versions:

  • Chart Version: 0.35.4
  • Kubernetes Version (kubectl version): v1.29.1-eks-508b6b3
@MauroSoli added the kind/bug (Categorizes issue or PR as related to a bug.) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Apr 8, 2024
@jigisha620 (Contributor)

@MauroSoli Just to be sure, would you mind verifying whether BATCH_MAX_DURATION and BATCH_IDLE_DURATION are being set properly? You can do this by running kubectl describe pod <karpenter pod>. Also, there is not much we can see from the logs you have shared. Can you share logs with the message found provisionable pod(s) so we can see when the actual scheduling of the pods happened?

What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments?
You have mentioned that the pods are in the pending state for more than 60 seconds. Given that the node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?

@MauroSoli (Author)

Just to be sure, would you mind verifying whether BATCH_MAX_DURATION and BATCH_IDLE_DURATION are being set properly? You can do this by running kubectl describe pod <karpenter pod>

$ kubectl describe pod karpenter-678d69d4d5-6rpgw -n karpenter | grep BATCH_
      BATCH_MAX_DURATION:  90s
      BATCH_IDLE_DURATION: 10s

Also, there is not much we can see from the logs you have shared. Can you share logs with the message found provisionable pod(s) so we can see when the actual scheduling of the pods happened?

The logs I've shared already contain the line you are looking for:

{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}

What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments?

We use Karpenter only to manage some specific workloads, such as builds or cron jobs.
The cluster-autoscaler manages all other workloads.

You have mentioned that the pods are in the pending state for more than 60 seconds. Given that the node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?

That's not what I meant.
After 60 seconds, while the pods were still pending, Karpenter had not created a new NodeClaim yet; it only did so after 90 seconds.

@jigisha620 (Contributor)

Let's say you schedule a single pod and no new pod arrives within 10 seconds; Karpenter will then begin scheduling a new NodeClaim after those 10 seconds. However, if another pod comes up before the 10 seconds elapse, the batching window will be extended up to the maxDuration.

From the logs that you have shared it seems like there were 10 pods, and in that case Karpenter would wait until the batchMaxDuration of 90s before it begins scheduling a NodeClaim.
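
To illustrate, the batching window described in the docs works roughly like the following sketch. This is a minimal, invented Go example: the batch function, the channel of pod names, and the timings are made up here, and this is not Karpenter's actual batcher code.

package main

import (
	"fmt"
	"time"
)

// batch drains pod names from the channel into a single batch. The window
// closes when either no new pod has arrived for idleDuration, or maxDuration
// has elapsed since the first pod arrived.
func batch(pods <-chan string, idleDuration, maxDuration time.Duration) []string {
	first := <-pods // the first pending pod opens the window
	collected := []string{first}

	idle := time.NewTimer(idleDuration)
	hardCap := time.NewTimer(maxDuration)
	defer idle.Stop()
	defer hardCap.Stop()

	for {
		select {
		case p := <-pods:
			collected = append(collected, p)
			// Each new pod restarts the idle timer; the hard cap keeps
			// counting from the first pod and bounds the whole window.
			if !idle.Stop() {
				<-idle.C
			}
			idle.Reset(idleDuration)
		case <-idle.C:
			return collected // quiet for idleDuration: close the window
		case <-hardCap.C:
			return collected // maxDuration reached: close the window
		}
	}
}

func main() {
	pods := make(chan string, 10)
	go func() {
		for i := 1; i <= 3; i++ {
			pods <- fmt.Sprintf("pause%d", i)
			time.Sleep(200 * time.Millisecond)
		}
	}()
	// With idle=1s and max=5s, the window closes ~1s after pause3 arrives.
	fmt.Println(batch(pods, time.Second, 5*time.Second))
}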

@lesinigo commented Apr 9, 2024

I've been able to reproduce this using the latest Karpenter v0.35.4 and the latest EKS v1.29.1-eks-508b6b3, and it seems to me there could be some misunderstanding between what is stated above, what is stated in the documentation, and the actual Karpenter behavior.

If I set IDLE_DURATION and MAX_DURATION high and far enough apart (say 10 seconds and 300 seconds) and run a script like for I in $(seq 10); do kubectl run <single_pod>; sleep 2; done, the following happens:

  1. new pods are created
  2. the "nothing happens" interval between one pod and the next is always less than 10 seconds
  3. creating all the pods takes longer than IDLE_DURATION
  4. after the last pod is created, Karpenter waits until 300 seconds have passed since the creation of the first pod
  5. Karpenter creates new nodeclaim(s) to fit all the pending pods

In other words: the batch window gets immediately extended to BATCH_MAX_DURATION as soon as the creation of the second pod interrupts the first BATCH_IDLE_DURATION interval.

Personally, I understood the current documentation in a different way. It says "BATCH_IDLE_DURATION The maximum amount of time with no new pending pods that if exceeded ends the current batching window. If pods arrive faster than this time, the batching window will be extended up to the maxDuration. If they arrive slower, the pods will be batched separately.". I understood it as follows:

  1. one or more new pods are created
  2. Karpenter notices and starts waiting the IDLE duration
  3. if nothing new happens, the batch window is immediately closed and node claims are computed
  4. otherwise, if new pods are created before BATCH_IDLE_DURATION elapses, jump to 2 (starting a new wait for the IDLE duration)
  5. if pods keep getting created before an IDLE_DURATION interval ends, Karpenter keeps extending the batch window in blocks of IDLE_DURATION, until the whole window (measured from the creation of the first pod) exceeds BATCH_MAX_DURATION
  6. if the window extends beyond BATCH_MAX_DURATION, it is immediately closed, node claims are computed, and the whole process starts again from 1

In other words: BATCH_IDLE_DURATION would always be the maximum "inactivity time" after which Karpenter starts computing node claims. To me, this makes much more sense because it allows shorter latency between workload demand (pod creation) and its execution (node claims computed, nodes created, pods running), while MAX_DURATION still guarantees an upper bound on that latency by closing the batch window even if new pods keep arriving.

To me, the documentation seems correct and describes the behavior I would expect, but Karpenter is misbehaving: it honors IDLE_DURATION only the first time, then skips directly to MAX_DURATION, causing higher latency between pod creation and the nodes starting.
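
To make the difference concrete, here is a tiny Go simulation of the repro above (10 pods created 2 seconds apart, IDLE_DURATION=10s, MAX_DURATION=300s). The closeTime helper and the two policies are invented purely for illustration; none of this is taken from Karpenter's source.

package main

import (
	"fmt"
	"time"
)

// closeTime returns when a batching window closes, given pod arrival times
// relative to the first pod. resetIdle selects the interpretation:
//   true  = every new pod restarts the idle timer (documented behavior)
//   false = once the first idle interval is interrupted, the window only
//           closes at maxDur (observed behavior)
func closeTime(arrivals []time.Duration, idle, maxDur time.Duration, resetIdle bool) time.Duration {
	end := arrivals[0] + idle
	for _, a := range arrivals[1:] {
		if a >= end {
			break // the window had already closed before this pod arrived
		}
		if resetIdle {
			end = a + idle
		} else {
			end = maxDur
		}
	}
	if end > maxDur {
		end = maxDur
	}
	return end
}

func main() {
	// 10 pods, one every 2 seconds, as in the kubectl run loop above.
	var arrivals []time.Duration
	for i := 0; i < 10; i++ {
		arrivals = append(arrivals, time.Duration(i)*2*time.Second)
	}
	idle, maxDur := 10*time.Second, 300*time.Second
	fmt.Println("documented:", closeTime(arrivals, idle, maxDur, true))  // 28s after the first pod
	fmt.Println("observed:  ", closeTime(arrivals, idle, maxDur, false)) // 5m0s after the first pod
}

Under the documented reading the window closes 28 seconds after the first pod (10 seconds of quiet after the last one); under the observed behavior it only closes at the 300-second cap, which matches the higher latency reported here.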

@jigisha620 (Contributor)

I had probably misunderstood what was being implied earlier. I was able to reproduce this and am looking into a fix. Thanks.

@jonathan-innis removed the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) label on Apr 15, 2024
@jonathan-innis (Member)

/assign @jigisha620
