Support Nvidia GPU Feature Discovery #1219
Comments
What kind of feature discovery are you talking about here? Is it related to the properties of the instance type we are launching?
GFD adds labels after the nodes have already been created.
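For reference, after GFD runs on a node, it carries labels along these lines (the label keys are the ones GPU Feature Discovery publishes; the values shown are illustrative for a single-GPU instance):

```yaml
# Labels GPU Feature Discovery applies to a node after it joins the cluster.
# Values are illustrative, not taken from a specific instance.
nvidia.com/gpu.product: Tesla-T4
nvidia.com/gpu.count: "1"
nvidia.com/gpu.memory: "15360"        # MiB
nvidia.com/cuda.driver.major: "535"
nvidia.com/cuda.runtime.major: "12"
```

None of these exist on the node at the moment Karpenter makes its provisioning decision, which is the gap this issue describes.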
Basically, you are requesting a workload that requires a node with those labels, and we create a node for it, but the NodePool is not aware of these labels, so Karpenter won't be aware of them either. They aren't added until GFD goes and adds them, after the GPU nodes are provisioned. How can Karpenter know these traits ahead of time?

This seems relevant to per-instance-type overrides: if you know particular instance types will have particular traits, we could use a ConfigMap to declare that these instance types have these values. But do the values differ from node to node? The CUDA runtime seems to depend on the GPU drivers installed on the node, so we can't just cache them directly.
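The per-instance-type override idea might look something like this; note this is a purely hypothetical schema (the ConfigMap name and key format are not an existing Karpenter feature), sketched only to make the shape of the proposal concrete:

```yaml
# Hypothetical: a ConfigMap mapping instance types to the GFD labels
# Karpenter should assume they will carry. Not an existing Karpenter API.
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-instance-type-overrides   # hypothetical name
  namespace: kube-system
data:
  g4dn.xlarge: |
    nvidia.com/gpu.product: Tesla-T4
    nvidia.com/gpu.memory: "15360"
  p4d.24xlarge: |
    nvidia.com/gpu.product: A100-SXM4-40GB
    nvidia.com/gpu.memory: "40960"
```

The concern raised above applies directly: driver-dependent values such as the CUDA runtime version vary with what is installed on the node, so a static map like this could silently drift from what GFD actually discovers.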
DRA (#1231) probably solves the "knowing before" part, since third-party drivers would publish NodeResourceSlices when running on the cluster. I'm not sure about its flexibility, though: we are still assuming that something is there beforehand, and it is constrained to resources only.
Also, Node Feature Discovery adds labels to nodes in the same way, e.g. for CPU capabilities.
I think the ideal state here is defining what the different configurations for the GPU Feature Discovery operator can be, and then seeing if we can surface first-class support for these in Karpenter directly. Like you mentioned, having to statically configure all of these values is going to be a huge pain; ideally Karpenter can auto-discover them by matching its logic up with what Nvidia tells us should be on these instance types. I'm wondering if it makes sense to retitle this issue to be more specific to the use-case. Something like: "Support Nvidia GPU Feature Discovery". @p53 What do you think?
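Until first-class support exists, a stopgap is to declare the labels statically on the NodePool template so Karpenter treats them as known at provisioning time. This is a sketch, and it carries exactly the maintenance burden discussed above: the declared values must be kept in sync with what GFD will actually apply.

```yaml
# Sketch: statically declaring a GFD label on a GPU NodePool so Karpenter
# can match pods that select on it. The value must match what GFD will
# discover on the chosen instance type.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.product: Tesla-T4
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]
```

This only works when the NodePool is pinned tightly enough (here to one instance type) that the label value is unambiguous.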
/triage accepted
@jonathan-innis renamed |
Description
Original Title: Ignore node selector labels for provisioning
What problem are you trying to solve?
We have the NVIDIA operator, which installs the NVIDIA runtime etc. on Karpenter nodes after they are provisioned. The operator runs feature discovery and applies the appropriate NVIDIA labels, and we need to place pods on these Karpenter nodes depending on those labels. The problem is that when I put NVIDIA labels in a pod's nodeSelector that are not in the NodePool (because they are only applied to nodes at runtime by the NVIDIA operator), Karpenter will fail to provision nodes. A solution might be placing an annotation on the pod, e.g.
karpenter.sh/ignore-label=somelabel
so that Karpenter ignores this label during provisioning.
How important is this feature to you?
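Put together, the proposal from the description might look like this; the annotation is the one proposed in this issue and does not exist in Karpenter, and the label value is illustrative:

```yaml
# Sketch of the proposal: the pod selects on a GFD-applied label, and the
# proposed (not existing) karpenter.sh/ignore-label annotation tells
# Karpenter to skip that label when checking NodePool compatibility.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload
  annotations:
    karpenter.sh/ignore-label: nvidia.com/gpu.product
spec:
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4   # applied by GFD after provisioning
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: "1"
```

The kubelet would still enforce the nodeSelector at scheduling time, so the pod only lands on the node once GFD has applied the label.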