Add GPU support to run only DAG's #722

antikilahdjs · 2023-04-12T02:45:35Z

Hello everybody.

I am new to Airflow, and I have many machines with different labels: some are CPU-only, some have GPUs, and so on. I would like to deploy Airflow itself on the CPU machines, but when DAGs run I need their tasks to run on a GPU machine. Is there a solution for this? I can add an NVIDIA label in the resources spec, but I don't know whether it belongs on the scheduler or on the worker. In short, I only need a GPU while a DAG is running, and I would like the rest of the components to be deployed on the CPU machines.

Thank you

Answered by thesuperzapper

thesuperzapper · 2023-04-13T00:06:39Z

Maintainer

@antikilahdjs I actually have a solution for this coming up: the task-aware auto-scaler feature, which will automatically scale the celery workers up and down (with some clever logic to avoid scaling down workers that are actively running tasks, unless you label the task as "safe to interrupt").

We will support multiple "queues" of celery workers. For example, you might have a "default" queue with CPUs only and a "gpu" queue with GPUs. The auto-scaler will then scale up the "gpu" queue only when tasks are waiting in that queue, and scale it down when it is no longer needed.
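
To make the idea concrete, here is a minimal sketch of that scaling decision for a single queue. It is only an illustration of the behaviour described above, not the auto-scaler's actual implementation (which this thread does not show); `Worker`, `removable`, `plan_scaling`, and all of their parameters are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Worker:
    """One celery worker in a queue (hypothetical model)."""
    name: str
    running_tasks: int = 0
    # True only when every running task is labelled "safe to interrupt".
    interruptible: bool = False


def removable(workers: List[Worker]) -> List[str]:
    """Names of workers that are idle or only run interruptible tasks."""
    return [w.name for w in workers if w.running_tasks == 0 or w.interruptible]


def plan_scaling(
    workers: List[Worker],
    queued_tasks: int,
    tasks_per_worker: int,
    max_workers: int,
) -> Tuple[int, List[str]]:
    """Return (number of workers to add, names of workers to remove)."""
    if queued_tasks > 0:
        # Tasks are waiting, so add enough workers to absorb them,
        # without exceeding the configured maximum for this queue.
        needed = -(-queued_tasks // tasks_per_worker)  # ceiling division
        room = max(0, max_workers - len(workers))
        return min(needed, room), []
    # Nothing is waiting: release only the workers that are safe to stop.
    return 0, removable(workers)
```

On the DAG side, Airflow operators already accept a `queue` argument (for example `queue="gpu"`), which decides which celery queue a task is sent to when the CeleryExecutor is in use.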

Before the new auto-scaler is finished, you can achieve GPU support in a less elegant way by either:

1. Using the KubernetesExecutor mode of airflow, with pod_override to apply specific tolerations/requests to some of your tasks, so that they get scheduled on nodes with GPUs:
   - Personally, I don't like how KubernetesExecutor works: each task in your dags becomes its own Pod on Kubernetes, which is massively inefficient and does not scale if you have many tasks (Kubernetes can't handle more than a few thousand concurrent pods) or short tasks (which take less time to run than the Pod takes to spawn).
   - Some of these problems could be mitigated by using the CeleryKubernetesExecutor, which lets you choose either a celery worker OR a Kubernetes pod on a per-task basis, but this is very messy and requires your users to really understand what they are doing.
2. Using the KubernetesPodOperator task in your dags to run Pods with the needed requirements/tolerations, so they get scheduled on nodes with GPUs:
   - But this approach means you have to build a docker image for each task (or find a way to mount/download your code into the pod), so you won't get the benefit of airflow's operators. It might be fine if you only want to run a Python script.
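
As a rough sketch of the first option (KubernetesExecutor with pod_override), the override for a GPU task might be built like this. It assumes the `kubernetes` Python client is installed, that the NVIDIA device plugin exposes GPUs as the `nvidia.com/gpu` resource, and that the GPU nodes are tainted with that same key; the helper names are made up for illustration, so adjust everything to match your cluster.

```python
from typing import Dict

# Assumptions: the NVIDIA device plugin exposes GPUs under this resource name,
# and the GPU nodes are tainted with the same key. Adjust both for your cluster.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_limits(gpu_count: int = 1) -> Dict[str, str]:
    """Resource limits that make Kubernetes place the pod on a GPU node."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {GPU_RESOURCE: str(gpu_count)}


def gpu_toleration() -> Dict[str, str]:
    """Toleration matching the (assumed) taint on the GPU nodes."""
    return {"key": GPU_RESOURCE, "operator": "Exists", "effect": "NoSchedule"}


def gpu_pod_override(gpu_count: int = 1):
    """Build the V1Pod to pass as executor_config={"pod_override": ...}."""
    # Imported here so the helpers above work without the kubernetes package.
    from kubernetes.client import models as k8s

    return k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow's task container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        limits=gpu_limits(gpu_count)
                    ),
                )
            ],
            tolerations=[k8s.V1Toleration(**gpu_toleration())],
        )
    )
```

A task would then opt in with `executor_config={"pod_override": gpu_pod_override()}`, while every other task keeps the default pod and stays on the CPU nodes. For the second option, KubernetesPodOperator takes the same kind of toleration and resource objects through its own arguments (`tolerations` and, in recent provider versions, `container_resources`), alongside the `image` to run.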

This is a lot of information. If you are doing this for a company and are interested, I do offer consulting services!

1 reply

antikilahdjs Apr 13, 2023
Author

Hi @thesuperzapper, I really appreciate your help and your explanation of those features. I will look into how to use them and wait for the upcoming solution. All the best to you.
