GPU machine type selection #3796
5 comments · 7 replies
-
This use case is the exact reason why we were very eager for the new pod template feature. On GKE, for instance, nodes have these labels depending on the GPU type. (One question to be answered in an RFC would be whether all cloud providers handle selection of GPU types via node labels, or whether other cloud providers use different resource names.) As a platform engineer, I not only want to give my platform users the ability to select GPU types, I also want to be able to prevent accidental usage of an expensive A100 node, for instance, by a task that requests any GPU (i.e. doesn't specify the GPU type). This is why we set taints on our GPU node pools with the GPU type (and actually also the GPU count). This is not done by default by GKE.

If selection of GPU types becomes a feature in Flyte itself, as a platform engineer I probably wouldn't want to have the mapping from GPU type to taints, tolerations, etc. in Python. Instead, to give our platform users the ability to switch between e.g. T4, V100, and A100 on GKE, I could imagine the following configuration on the platform side (helm values):

```yaml
gpu_types:
  - name: A100  # This can be any string
    nodeselector:
      cloud.google.com/gke-accelerator: nvidia-a100-80gb
    toleration:
      # Toleration for some custom taint which in the case of GKE does not
      # exist by default but which I as a platform engineer would want to
      # create. This is optional.
    # resource_name: On GKE this field wouldn't have to be set since all GPUs
    # share the resource name which is already configured in the helm values
```

In case other cloud providers have different resource names for different GPU types (instead of node labels), platform engineers would configure:

```yaml
gpu_types:
  - name:
    resource_name: ...
```

TL;DR: I would put the responsibility of coming up with a mapping from arbitrary GPU type names to resource names/node selectors/tolerations etc. on the platform engineers. This would allow for great flexibility in how GPU types are controlled across different cloud providers, bare-metal clusters, etc.
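To make the taint side of this concrete, here is a sketch of the pairing between a custom node-pool taint and the toleration/node selector such a platform-side mapping would inject. The taint key name is illustrative (not from this thread); only the GKE accelerator label is a real convention.

```yaml
# Hypothetical custom taint the platform engineer applies to the A100 node
# pool (e.g. via `kubectl taint` or node-pool config; GKE does not set this
# by default). The key "example.com/gpu-type" is illustrative.
taints:
  - key: example.com/gpu-type
    value: nvidia-a100-80gb
    effect: NoSchedule
---
# Matching scheduling constraints the mapping would inject into a pod that
# requests the "A100" gpu type:
tolerations:
  - key: example.com/gpu-type
    operator: Equal
    value: nvidia-a100-80gb
    effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-a100-80gb
```

With this pairing, tasks that request a generic GPU cannot land on the tainted A100 pool, while tasks that explicitly select `A100` get both the toleration and the selector.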
-
WIP comment
node affinity and tolerations For fractional
For mixed partitioning
So even for
-
09-14-2023 contributors' meeting notes: OK to move to RFC
-
Jeev has been working on this, and the Python UX has been updated slightly:

```python
from flytekit import task
# Accelerator constants as proposed in this thread; the final import path
# may differ:
# from flytekit.extras.accelerators import NvidiaTeslaT4, NvidiaTeslaA100

# Specify T4 if your cluster doesn't have a default / has multiple GPU types.
@task(accelerator=NvidiaTeslaT4)
def needs_t4(a: int):
    pass

# Same with an A100.
@task(accelerator=NvidiaTeslaA100)
def needs_a100(a: int):
    pass

# Specify that you want a whole A100
# (if you have some A100s partitioned and some not).
@task(accelerator=NvidiaTeslaA100.with_partition_size(None))
def needs_unpartitioned_a100(a: int):
    pass

# Specify a specific A100 partition size (if you have multiple).
@task(accelerator=NvidiaTeslaA100.with_partition_size(NvidiaTeslaA100.partition_sizes.PARTITION_1G_5GB))
def needs_partitioned_a100(a: int):
    pass
```
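For reference, on GKE a T4 request like `needs_t4` above would ultimately need to resolve to pod spec fields along these lines. The `cloud.google.com/gke-accelerator` label and the `nvidia.com/gpu` extended resource are standard GKE/Kubernetes conventions; exactly which fields Flyte sets is an implementation detail still in flight, so treat this as a sketch:

```yaml
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
    - name: task           # container name is illustrative
      resources:
        limits:
          nvidia.com/gpu: "1"   # GPUs are requested via the extended resource
```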
-
2023-11-09 contributors' meetup notes: implementation of this idea is already in progress.
-
Use Case:
As an ML engineer, I want to specify exactly what kind of GPU type I want (e.g. `A10G`, `A100`, `V100`, `T4`) so that I can target a machine that suits my model training / inference workload. E.g. certain data types like `bfloat16` are only supported on certain GPU types.

Example flytekit api:

Flyte should be able to provision the correct instance in the underlying cloud (e.g. AWS, GCP) to fulfill the request.
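A minimal, self-contained sketch (plain Python, not actual Flyte code) of how a platform-side table could resolve these GPU type names into scheduling constraints, in the spirit of the helm-values mapping proposed earlier in this thread. The label keys and values are assumptions modeled on GKE and AWS conventions:

```python
# Illustrative platform-side mapping from GPU type names to node selectors.
# In a real deployment this table would come from platform configuration
# (e.g. helm values), not be hard-coded.
GPU_TYPES = {
    "T4": {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"},
    "V100": {"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"},
    "A100": {"cloud.google.com/gke-accelerator": "nvidia-tesla-a100"},
    "A10G": {"node.kubernetes.io/instance-type": "g5.xlarge"},  # e.g. on AWS
}

def node_selector_for(gpu_type: str) -> dict:
    """Return the node selector for a requested GPU type, or raise if unknown.

    Raising on unknown types mirrors the goal above: a task cannot
    accidentally land on an arbitrary (expensive) GPU node.
    """
    try:
        return dict(GPU_TYPES[gpu_type])
    except KeyError:
        raise ValueError(
            f"unknown GPU type {gpu_type!r}; add it to the platform config"
        )
```

Keeping the name-to-constraint mapping on the platform side means users only ever see the short names (`T4`, `A100`, ...), while operators stay free to retarget them at different labels, taints, or resource names per cluster.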