Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent #5161

Open
jonathan-innis opened this issue Nov 25, 2023 · 13 comments · May be fixed by kubernetes-sigs/karpenter#1420 or #7004
Labels
feature New feature or request

Comments

@jonathan-innis
Contributor

jonathan-innis commented Nov 25, 2023

Description

Tell us about your request

We could consider a few options to discover the expected capacity overhead for a given instance type:

  1. Store the instance type's actual capacity in memory once an instance of that type has been launched, and use that observed value for subsequent launches rather than basing our calculations on a heuristic (see the sketch below).
  2. Launch instance types ahead of time, check their capacity, and record the diff between the EC2-reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so we always have accurate measurements of the instance type overhead.
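
A minimal Go sketch of what option 1 could look like, assuming a hypothetical capacityCache held by the instance-type provider (type names, method names, and the fallback math are all illustrative, not Karpenter's actual API):

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/api/resource"
)

// capacityCache remembers the memory capacity that kubelet actually reported
// for an instance type once a node of that type has been launched.
type capacityCache struct {
	mu       sync.RWMutex
	observed map[string]resource.Quantity // instance type -> observed capacity
}

func newCapacityCache() *capacityCache {
	return &capacityCache{observed: map[string]resource.Quantity{}}
}

// Record stores the kubelet-reported capacity after a node of this type registers.
func (c *capacityCache) Record(instanceType string, capacity resource.Quantity) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.observed[instanceType] = capacity
}

// Memory returns the observed capacity if we have already seen this instance
// type; otherwise it falls back to the vmMemoryOverheadPercent heuristic
// applied to the EC2-reported memory.
func (c *capacityCache) Memory(instanceType string, ec2ReportedMiB int64, overheadPercent float64) resource.Quantity {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if q, ok := c.observed[instanceType]; ok {
		return q
	}
	estimatedMiB := int64(float64(ec2ReportedMiB) * (1 - overheadPercent))
	return *resource.NewQuantity(estimatedMiB*1024*1024, resource.BinarySI)
}

func main() {
	cache := newCapacityCache()

	// Before the first launch we only have the heuristic estimate.
	before := cache.Memory("m5.large", 8192, 0.075)
	fmt.Println(before.String()) // 7577Mi

	// After the first node registers, record what kubelet actually reported
	// (example value) and use it from then on.
	cache.Record("m5.large", resource.MustParse("7734Mi"))
	after := cache.Memory("m5.large", 8192, 0.075)
	fmt.Println(after.String()) // 7734Mi
}
```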

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Calculating the difference between the EC2-reported memory capacity and the actual capacity of the instance as reported by kubelet.

Are you currently working around this issue?

We currently use a heuristic vmMemoryOverheadPercent value that is tunable by users and passed through karpenter-global-settings.
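
For context, a simplified sketch of how that heuristic behaves; the real implementation and its rounding live in the instance-type provider, so treat this as an illustration only:

```go
package main

import (
	"fmt"
	"math"
)

// estimatedMemoryMiB shows the shape of the current workaround: trim a fixed
// percentage off the EC2-reported memory before making scheduling decisions.
// This is a simplified illustration, not the exact rounding Karpenter uses.
func estimatedMemoryMiB(ec2ReportedMiB int64, vmMemoryOverheadPercent float64) int64 {
	overhead := int64(math.Ceil(float64(ec2ReportedMiB) * vmMemoryOverheadPercent))
	return ec2ReportedMiB - overhead
}

func main() {
	// An instance type reporting 8192 MiB with vmMemoryOverheadPercent=0.075
	// is treated as having 7577 MiB, regardless of what kubelet later reports.
	fmt.Println(estimatedMemoryMiB(8192, 0.075))
}
```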

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the feature New feature or request label Nov 25, 2023
@jonathan-innis
Contributor Author

This is a "transferred" version of the issue in kubernetes-sigs/karpenter: kubernetes-sigs/karpenter#716

@sosheskaz
Contributor

Could a simple mitigation be to add a static setting, e.g. vmMemoryOverheadOffset, that subtracts a fixed amount of memory?

For fleets that are highly heterogeneous but composed of relatively large instances, this would simplify managing the setting: a sufficiently large static offset would outweigh the percentage and allow for less waste.

@BEvgeniyS

BEvgeniyS commented Jun 27, 2024

The vmMemoryOverheadPercent value is clearly not one-size-fits-all, so making it per-nodepool would be a great start.

I prefer approach #1 because it doesn't tie Karpenter to cloud provider updates and can work even when new instance types appear before an update.

Here's how I imagine this could be implemented:
a) Convert it to something like initVmMemoryOverheadPercent and make it a per-NodePool setting.
b) As soon as we discover or launch a single node for any given NodePool/instance-type pair, dynamically discover the node's allocatable and store it in memory (or optionally in some form of persistent storage). If Karpenter restarts, it can rediscover this information. systemReserved, kubeReserved, and Bottlerocket's control/admin containers all subtract from allocatable, and should already be reflected in the allocatable exposed by kubelet.
Karpenter can already calculate DaemonSet overhead, so that can be factored in as well.

The remaining issue is that if the value is too high, Karpenter won't be able to launch the node even though the capacity is there. #2 can then be used as an initial hint for known types.

Addendum: I just dug into cluster-autoscaler, wondering why it doesn't seem to hit this issue.
I did some testing: I tried to schedule a pod whose memory request equals the total instance memory minus the DaemonSet overhead, and scale from zero.

It turns out the issue is there too: CA launches an instance for each matching ASG, the pod stays pending, and CA then terminates the instance.

So the issue was just hidden and much less likely to be encountered, because:
a) cluster-autoscaler is smart enough to cache the "true allocatable" as soon as it sees the first node from a node group's ASG, and it doesn't retry if the pod doesn't fit;
b) CA relies on every node in a node group having the same resources, so seeing one node tells it the configuration for the whole node group (not something Karpenter can rely on, hence the need to discover and cache NodePool/instance-type pairs).

@Balraj06

Balraj06 commented Aug 7, 2024

@jonathan-innis Any updates on this issue?

@youwalther65
Contributor

I agree that CA has an easier job, but even CA includes some delta to accommodate slight differences between instance types within the same or similar node group/ASG (something only visible in main.go and not yet described in the CA FAQ).
Keying on the combination of NodePool and instance type might be reasonably safe, because a NodePool is tied to an EC2NodeClass. But the underlying AMI version of a NodePool can change over time, and the AMI and/or custom user data can tweak kubeReserved and systemReserved in different ways, leading to new allocatable values. So we should think about re-evaluating the cached value.
Another point worth mentioning is how DaemonSet overhead is factored in: some DaemonSets apply only to a subset of nodes via taints/tolerations, so the overhead can differ from node to node.

@BEvgeniyS

BEvgeniyS commented Aug 21, 2024

So we should think about re-evaluating the cached value.

The cached value shouldn't be static; it could be updated during node registration. The cache can be flushed if the NodePool hash changes. DaemonSet overhead is already factored in, so I don't think we need to change anything there.

@jukie

jukie commented Sep 3, 2024

I'd like to start an RFC on this, does this sound like a reasonable start?

  • Use an in-memory cache to store per-instance-type memory overhead values
  • Pre-populate overhead values for all known instance types via an implementation similar to [DRAFT] Add capacity memory overhead generation #4517. These values can change, so they still need to be accounted for, but they provide a better initial value than vmMemoryOverheadPercent
  • Unknown instance types (for example, when a new instance type becomes available) would fall back to the vmMemoryOverheadPercent calculation
  • Every time a NodeClaim with a particular instance type is registered, the cached value gets updated (see the sketch below)
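
A rough sketch of that last bullet, assuming a hypothetical observedOverheadMiB helper called when the NodeClaim's Node registers (names and wiring are assumptions, not the actual controller code):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// observedOverheadMiB compares the memory EC2 reports for an instance type
// with the memory capacity kubelet reports on the registered Node. The delta
// is the real overhead for this instance type (and AMI) and can replace the
// heuristic for future launches.
func observedOverheadMiB(node *corev1.Node, ec2ReportedMiB int64) int64 {
	kubeletCapacity := node.Status.Capacity[corev1.ResourceMemory]
	kubeletMiB := kubeletCapacity.Value() / (1024 * 1024)
	return ec2ReportedMiB - kubeletMiB
}

func main() {
	// A Node as it might look once the NodeClaim's instance has registered.
	node := &corev1.Node{}
	node.Labels = map[string]string{corev1.LabelInstanceTypeStable: "m5.large"}
	node.Status.Capacity = corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("7577Mi"),
	}
	// On registration: look up the EC2-reported size for the labeled instance
	// type (8192 MiB here), record the delta, and reuse it for future launches.
	fmt.Printf("observed overhead for %s: %dMi\n",
		node.Labels[corev1.LabelInstanceTypeStable],
		observedOverheadMiB(node, 8192))
}
```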

If it hasn't been mentioned or discussed elsewhere, could a feature request be opened with the EC2 team and SDK maintainers to provide this information in the DescribeInstanceTypes API?

@youwalther65
Contributor

youwalther65 commented Sep 4, 2024

@jukie I generally agree with your proposal.
But I have to disagree with the idea of providing this information in the EC2 DescribeInstanceTypes API. The overhead depends on many factors like AMI type and version, kernel version, loaded modules, and more, so it does not make sense to put a single "reasonable overhead" value into the API.
I would even go one step further and key the overhead cache not only on instance type but on at least the combination of instance type and AMI version.

@jukie

jukie commented Sep 4, 2024

Thanks for explaining that, I hadn't given it enough thought! What you've mentioned for a cache mapping of instance-type + ami version to overhead makes sense.

@jukie

jukie commented Sep 12, 2024

Caching per-AMI is proving to be difficult, as it would introduce some coupling.
Specifically, around https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cloudprovider/cloudprovider.go#L286-L288 I'm not sure where to store, or how to compare, an instance type that maps to a particular AMI version.

I've opened #7004, which uses the nodeClass hash as the key and always takes the latest version. However, I don't think AMI changes trigger a hash change, so I'm wondering what makes the most sense to use instead.

@jukie jukie linked a pull request Sep 12, 2024 that will close this issue
@jukie

jukie commented Sep 14, 2024

@youwalther65 would it be better to use a cache key of instance type name + a hash of the AMI names in the NodeClass status?

That might be the safest route, but it would mean the first node launched after an AMI change falls back to vmMemoryOverheadPercent.

Edit: It would actually be every node launched until the controller requeues (12 hours), so I'll adjust again (done).
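
To illustrate, a minimal sketch of that cache key, hashing the AMI IDs from the EC2NodeClass status so an AMI change naturally produces a new key (field names and the hashing scheme are assumptions):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// overheadCacheKey builds a cache key from the instance type name and a hash
// of the AMIs currently resolved in the EC2NodeClass status, so that an AMI
// change invalidates the cached overhead for that instance type.
func overheadCacheKey(instanceType string, amiIDs []string) string {
	sorted := append([]string(nil), amiIDs...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return fmt.Sprintf("%s/%x", instanceType, sum[:8])
}

func main() {
	// AMI IDs as they might appear in the status of an EC2NodeClass.
	fmt.Println(overheadCacheKey("m5.large", []string{"ami-0abc123", "ami-0def456"}))
}
```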

@youwalther65
Contributor

@jonathan-innis What do you think about the idea from @jukie ?

@jukie

jukie commented Sep 24, 2024

Would it make sense to bring this up as a topic during the working group meeting?
