Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent #5161

Open
jonathan-innis opened this issue Nov 25, 2023 · 13 comments · May be fixed by kubernetes-sigs/karpenter#1420 or #7004
Labels
feature New feature or request

Comments

@jonathan-innis
Contributor

jonathan-innis commented Nov 25, 2023

Description

Tell us about your request

We could consider a few options to discover the expected capacity overhead for a given instance type:

  1. Store the instance type's actual capacity in memory once an instance of that type has been launched, and use that observed value for subsequent launches rather than basing our calculations on a heuristic (see the sketch below).
  2. Launch instance types ahead of time, check their capacity, and record the diff between the EC2-reported capacity and the actual capacity in a generated file shipped with each Karpenter release, so we always have accurate measurements of the instance type overhead.
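
A minimal Go sketch of what option 1 could look like, assuming a hypothetical capacityCache held by the instance-type provider (type names, method names, and the fallback math are all illustrative, not Karpenter's actual API):

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/api/resource"
)

// capacityCache remembers the memory capacity that kubelet actually reported
// for an instance type once a node of that type has been launched.
type capacityCache struct {
	mu       sync.RWMutex
	observed map[string]resource.Quantity // instance type -> observed capacity
}

func newCapacityCache() *capacityCache {
	return &capacityCache{observed: map[string]resource.Quantity{}}
}

// Record stores the kubelet-reported capacity after a node of this type registers.
func (c *capacityCache) Record(instanceType string, capacity resource.Quantity) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.observed[instanceType] = capacity
}

// Memory returns the observed capacity if we have already seen this instance
// type; otherwise it falls back to the vmMemoryOverheadPercent heuristic
// applied to the EC2-reported memory.
func (c *capacityCache) Memory(instanceType string, ec2ReportedMiB int64, overheadPercent float64) resource.Quantity {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if q, ok := c.observed[instanceType]; ok {
		return q
	}
	estimatedMiB := int64(float64(ec2ReportedMiB) * (1 - overheadPercent))
	return *resource.NewQuantity(estimatedMiB*1024*1024, resource.BinarySI)
}

func main() {
	cache := newCapacityCache()

	// Before the first launch we only have the heuristic estimate.
	before := cache.Memory("m5.large", 8192, 0.075)
	fmt.Println(before.String()) // 7577Mi

	// After the first node registers, record what kubelet actually reported
	// (example value) and use it from then on.
	cache.Record("m5.large", resource.MustParse("7734Mi"))
	after := cache.Memory("m5.large", 8192, 0.075)
	fmt.Println(after.String()) // 7734Mi
}
```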

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Calculating the difference between the EC2-reported memory capacity and the actual capacity of the instance as reported by kubelet.

Are you currently working around this issue?

We currently use a heuristic vmMemoryOverheadPercent value that is tunable by users and passed through karpenter-global-settings.
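
For context, a simplified sketch of how that heuristic behaves; the real implementation and its rounding live in the instance-type provider, so treat this as an illustration only:

```go
package main

import (
	"fmt"
	"math"
)

// estimatedMemoryMiB shows the shape of the current workaround: trim a fixed
// percentage off the EC2-reported memory before making scheduling decisions.
// This is a simplified illustration, not the exact rounding Karpenter uses.
func estimatedMemoryMiB(ec2ReportedMiB int64, vmMemoryOverheadPercent float64) int64 {
	overhead := int64(math.Ceil(float64(ec2ReportedMiB) * vmMemoryOverheadPercent))
	return ec2ReportedMiB - overhead
}

func main() {
	// An instance type reporting 8192 MiB with vmMemoryOverheadPercent=0.075
	// is treated as having 7577 MiB, regardless of what kubelet later reports.
	fmt.Println(estimatedMemoryMiB(8192, 0.075))
}
```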

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added the feature New feature or request label Nov 25, 2023
@jonathan-innis
Contributor Author

This is a "transferred" version of the issue in kubernetes-sigs/karpenter: kubernetes-sigs/karpenter#716

@sosheskaz
Contributor

Could a simple mitigation be to add a static setting, e.g. vmMemoryOverheadOffset, that subtracts a fixed amount of memory?

For fleets that are highly heterogeneous but composed of relatively large instances, this would simplify managing the setting: a sufficiently large static offset would outweigh the percentage and allow for less waste.

@BEvgeniyS

BEvgeniyS commented Jun 27, 2024

The vmMemoryOverheadPercent value is clearly not one-size-fits-all, so making it per-nodepool would be a great start.

I prefer approach #1 because it doesn't tie Karpenter to cloud provider updates and can work even when new instance types appear before an update.

Here's how I imagine this could be implemented:
a) Convert it to something like initVmMemoryOverheadPercent and make it a per-NodePool setting.
b) As soon as we discover or launch a single node for any given NodePool/instance-type pair, dynamically discover the node's allocatable and store it in memory (or optionally in some form of persistent storage). If Karpenter restarts, it can rediscover this information. systemReserved, kubeReserved, and Bottlerocket's control/admin containers all subtract from allocatable, and should already be reflected in the allocatable exposed by kubelet.
Karpenter can already calculate DaemonSet overhead, so that can be factored in as well.

The remaining issue is that if the value is too high, Karpenter won't be able to launch the node even though the capacity is there. #2 can then be used as an initial hint for known types.

Addendum: I just dug into cluster-autoscaler, wondering why it doesn't seem to hit this issue.
I did some testing: I tried to schedule a pod whose memory request equals the total instance memory minus the DaemonSet overhead, and scale from zero.

It turns out the issue is there too: CA launches an instance for each matching ASG, the pod stays pending, and CA then terminates the instance.

So the issue was just hidden and much less likely to be encountered, because:
a) cluster-autoscaler is smart enough to cache the "true allocatable" as soon as it sees the first node from a node group's ASG, and it doesn't retry if the pod doesn't fit;
b) CA relies on every node in a node group having the same resources, so seeing one node tells it the configuration for the whole node group (not something Karpenter can rely on, hence the need to discover and cache NodePool/instance-type pairs).

@Balraj06

Balraj06 commented Aug 7, 2024

@jonathan-innis Any updates on this issue?

@youwalther65
Contributor

I agree that CA has an easier job, but even CA includes some delta to accommodate slight differences between instance types within the same or similar node group/ASG (something only visible in main.go and not yet described in the CA FAQ).
Keying on the combination of NodePool and instance type might be reasonably safe, because a NodePool is tied to an EC2NodeClass. But the underlying AMI version of a NodePool can change over time, and the AMI and/or custom user data can tweak kubeReserved and systemReserved in different ways, leading to new allocatable values. So we should think about re-evaluating the cached value.
Another point worth mentioning is how DaemonSet overhead is factored in: some DaemonSets apply only to a subset of nodes via taints/tolerations, so the overhead can differ from node to node.

@BEvgeniyS

BEvgeniyS commented Aug 21, 2024

So we should think about re-evaluating the cached value.

The cached value shouldn't be static; it could be updated during node registration. The cache can be flushed if the NodePool hash changes. DaemonSet overhead is already factored in, so I don't think we need to change anything there.

@jukie

jukie commented Sep 3, 2024

I'd like to start an RFC on this, does this sound like a reasonable start?

  • Use an in-memory cache to store per-instance-type memory overhead values
  • Pre-populate overhead values for all known instance types via an implementation similar to [DRAFT] Add capacity memory overhead generation #4517. These values can change, so they still need to be accounted for, but they provide a better initial value than vmMemoryOverheadPercent
  • Unknown instance types (for example, when a new instance type becomes available) would fall back to the vmMemoryOverheadPercent calculation
  • Every time a NodeClaim with a particular instance type is registered, the cached value gets updated (see the sketch below)
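
A rough sketch of that last bullet, assuming a hypothetical observedOverheadMiB helper called when the NodeClaim's Node registers (names and wiring are assumptions, not the actual controller code):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// observedOverheadMiB compares the memory EC2 reports for an instance type
// with the memory capacity kubelet reports on the registered Node. The delta
// is the real overhead for this instance type (and AMI) and can replace the
// heuristic for future launches.
func observedOverheadMiB(node *corev1.Node, ec2ReportedMiB int64) int64 {
	kubeletCapacity := node.Status.Capacity[corev1.ResourceMemory]
	kubeletMiB := kubeletCapacity.Value() / (1024 * 1024)
	return ec2ReportedMiB - kubeletMiB
}

func main() {
	// A Node as it might look once the NodeClaim's instance has registered.
	node := &corev1.Node{}
	node.Labels = map[string]string{corev1.LabelInstanceTypeStable: "m5.large"}
	node.Status.Capacity = corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("7577Mi"),
	}
	// On registration: look up the EC2-reported size for the labeled instance
	// type (8192 MiB here), record the delta, and reuse it for future launches.
	fmt.Printf("observed overhead for %s: %dMi\n",
		node.Labels[corev1.LabelInstanceTypeStable],
		observedOverheadMiB(node, 8192))
}
```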

If it hasn't been mentioned or discussed elsewhere, could a feature request be opened with the EC2 team and SDK maintainers to provide this information in the DescribeInstanceTypes API?

@youwalther65
Contributor

youwalther65 commented Sep 4, 2024

@jukie I generally agree with your proposal.
But I have to disagree with the idea of providing this information in the EC2 DescribeInstanceTypes API. The overhead depends on many factors like AMI type and version, kernel version, loaded modules, and more, so it does not make sense to put a single "reasonable overhead" value into the API.
I would even go one step further and key the overhead cache not only on instance type but on at least the combination of instance type and AMI version.

@jukie

jukie commented Sep 4, 2024

Thanks for explaining that, I hadn't given it enough thought! What you've mentioned for a cache mapping of instance-type + ami version to overhead makes sense.

@jukie

jukie commented Sep 12, 2024

Caching per-AMI is proving to be difficult, as it would introduce some coupling.
Specifically, around https://github.com/aws/karpenter-provider-aws/blob/main/pkg/cloudprovider/cloudprovider.go#L286-L288 I'm not sure where to store, or how to compare, an instance type that maps to a particular AMI version.

I've opened #7004, which uses the nodeClass hash as the key and always takes the latest version. However, I don't think AMI changes trigger a hash change, so I'm wondering what makes the most sense to use instead.

@jukie jukie linked a pull request Sep 12, 2024 that will close this issue
@jukie

jukie commented Sep 14, 2024

@youwalther65 would it be better to use a cache key of instance type name + a hash of the AMI names in the NodeClass status?

That might be the safest route, but it would mean the first node launched after an AMI change falls back to vmMemoryOverheadPercent.

Edit: It would actually be every node launched until the controller requeues (12 hours), so I'll adjust again (done).
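
To illustrate, a minimal sketch of that cache key, hashing the AMI IDs from the EC2NodeClass status so an AMI change naturally produces a new key (field names and the hashing scheme are assumptions):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// overheadCacheKey builds a cache key from the instance type name and a hash
// of the AMIs currently resolved in the EC2NodeClass status, so that an AMI
// change invalidates the cached overhead for that instance type.
func overheadCacheKey(instanceType string, amiIDs []string) string {
	sorted := append([]string(nil), amiIDs...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return fmt.Sprintf("%s/%x", instanceType, sum[:8])
}

func main() {
	// AMI IDs as they might appear in the status of an EC2NodeClass.
	fmt.Println(overheadCacheKey("m5.large", []string{"ami-0abc123", "ami-0def456"}))
}
```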

@youwalther65
Contributor

@jonathan-innis What do you think about the idea from @jukie ?

@jukie

jukie commented Sep 24, 2024

Would it make sense to bring this up as a topic during the working group meeting?
