feat(dask): make dask autoscaler configurable #835
Conversation
Force-pushed from e3ce610 to 128e7de.
Force-pushed from 128e7de to c91a810.
Force-pushed from c91a810 to b513ddb.
@@ -101,6 +101,8 @@ spec:
- name: DASK_ENABLED
  value: {{ .Values.dask.enabled | quote }}
{{- if .Values.dask.enabled }}
- name: DASK_AUTOSCALER_ENABLED
  value: {{ .Values.dask.autoscaler_enabled | quote }}
BTW, when testing this autoscaler PR with the autoscaler turned off:
autoscaler_enabled: false
cluster_default_number_of_workers: 4
I'm seeing a problem with stopping workflows:
$ rc run -w calver
...
$ rc stop -w calver --force
==> ERROR: Cannot stop workflow calver:
(404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '4c08dda6-899b-4da2-bdf7-2c2c76006ce6', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '6b060355-5047-4214-899c-623e82a0c569', 'X-Kubernetes-Pf-Prioritylevel-Uid': '498afda6-b2ba-4561-9313-9bda4f75865d', 'Date': 'Fri, 01 Nov 2024 14:13:25 GMT', 'Content-Length': '374'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskautoscalers.kubernetes.dask.org \"dask-autoscaler-reana-run-dask-ce1f4c15-36ff-42fc-b320-b5f277322db7\" not found","reason":"NotFound","details":{"name":"dask-autoscaler-reana-run-dask-ce1f4c15-36ff-42fc-b320-b5f277322db7","group":"kubernetes.dask.org","kind":"daskautoscalers"},"code":404}
The workflow is stopped, the Dask workers are stopped, but the job pod fails with:
$ k logs reana-run-job-42c23954-93d7-4524-8e9b-daa28e8c1fcd-5p4fg
...
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 167, in submit
raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
Exception ignored in: <function Client.__del__ at 0x7ff1dc223100>
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/distributed/client.py", line 1736, in __del__
File "/usr/local/lib/python3.11/site-packages/distributed/client.py", line 1991, in close
File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 434, in sync
TimeoutError: timed out after 60 s.
$ k logs reana-run-batch-ce1f4c15-36ff-42fc-b320-b5f277322db7-qhzq2
2024-11-01 14:14:31,967 | root | MainThread | INFO | Workflow ce1f4c15-36ff-42fc-b320-b5f277322db7 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/ce1f4c15-36ff-42fc-b320-b5f277322db7.
P.S. When the autoscaler is enabled, stopping works perfectly well.
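The 404 above suggests that the stop code path always tries to delete the DaskAutoscaler custom resource, even when autoscaler_enabled is false and the resource was never created. A minimal sketch of a tolerant deletion, assuming the Kubernetes Python client's CustomObjectsApi is used here (the function name and arguments are illustrative, not the actual reana-workflow-controller code):

from kubernetes import client
from kubernetes.client.rest import ApiException


def delete_dask_autoscaler(name: str, namespace: str = "default") -> None:
    """Delete the DaskAutoscaler custom resource, tolerating its absence."""
    api = client.CustomObjectsApi()
    try:
        api.delete_namespaced_custom_object(
            group="kubernetes.dask.org",
            version="v1",
            namespace=namespace,
            plural="daskautoscalers",
            name=name,
        )
    except ApiException as exc:
        if exc.status == 404:
            # Autoscaler was disabled, so the resource was never created; nothing to clean up.
            return
        raise

Alternatively, the deletion could simply be skipped whenever DASK_AUTOSCALER_ENABLED is false.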
@@ -69,6 +69,7 @@ This Helm automatically prefixes all names using the release name to avoid colli
| `components.reana_workflow_engine_snakemake.image` | [REANA-Workflow-Engine-Snakemake image](https://hub.docker.com/r/reanahub/reana-workflow-engine-snakemake) to use | `docker.io/reanahub/reana-workflow-engine-snakemake:<chart-release-version>` |
| `compute_backends` | List of supported compute backends (kubernetes, htcondorcern, slurmcern) | "kubernetes" |
| `dask.enabled` | Enable support for running Dask workflows | false |
| `dask.autoscaler_enabled` | Enable Dask autoscaler | true |
Is one boolean variable enough?
Looking at the Dask autoscaler spec, a few parameters are available:
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: example
spec:
  cluster: "name of DaskCluster to autoscale"
  minimum: 0   # minimum number of workers to create
  maximum: 10  # maximum number of workers to create
Perhaps instead of using only one autoscaler_enabled Helm value we should use two, for min and max, such as:
dask.cluster_autoscale_workers_min: 1
dask.cluster_autoscale_workers_max: 4
And if both min and max were set to the same value, this would indicate that autoscaling is disabled.
This would kind of "override" the meaning of cluster_default_number_of_workers, so we could get rid of that one...
We can think about this IRL.
I am a bit skeptical about these new parameters. If they are equal, for example, it not only means that autoscaling is disabled, but also that users are not allowed to override the number of workers for their Dask cluster.
I also think that we should be able to somehow restrict users from asking for an unbounded number of workers (e.g. 1000 workers with 1Mi memory each), regardless of whether autoscaling is enabled or disabled. We might specify the max number of workers in another way, by setting a minimum memory limit for one worker. This would implicitly indicate that max_workers = max_dask_cluster_memory // min_worker_memory (see the sketch below). I think this approach is better. We should talk about how we approach this one IRL.
And for the minimum number of workers, I do not see any case where we would want to force users to have at least X workers where X > 1. So maybe we can simply check if users request fewer than 1 worker and then reject such requests.
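A minimal sketch of that implicit cap, together with the suggested lower-bound check (function names, units, and the example numbers are illustrative assumptions, not values from this PR):

GIB = 1024 ** 3  # bytes in one GiB


def max_workers_allowed(max_dask_cluster_memory: int, min_worker_memory: int) -> int:
    """Implicit worker cap: the cluster memory budget divided by the minimum memory per worker."""
    return max_dask_cluster_memory // min_worker_memory


def validate_worker_request(requested_workers: int, max_workers: int) -> None:
    """Reject requests for fewer than one worker or for more than the implicit cap."""
    if requested_workers < 1:
        raise ValueError("At least one Dask worker must be requested.")
    if requested_workers > max_workers:
        raise ValueError(f"At most {max_workers} Dask workers are allowed.")


# Example: a 16 GiB cluster budget with a 2 GiB minimum per worker caps the cluster at 8 workers.
validate_worker_request(requested_workers=4,
                        max_workers=max_workers_allowed(16 * GIB, 2 * GIB))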
OK, I see the point. We could keep the autoscaling setting independent. However, allowing users to specify 1000 workers with low memory per worker would not be good, since users could generate thousands of pods, which would overload the Kubernetes master and slow everything down... So we may want to keep the memory setting independent. Let's muse IRL.
Force-pushed from b513ddb to 26d5078.
Commit log cosmetics: the conventional commit scope should represent which component was touched by the change, so we should use "helm" here:
feat(helm): make Dask autoscaler configurable (#835)
Also, please put Closes #NNN in the commit message body, so that we have the reference to the issue which this PR closes locally as well, not only on GitHub.
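For example, assuming #834 is the issue being closed (it is referenced at the end of this conversation), the full commit message would read:

feat(helm): make Dask autoscaler configurable (#835)

Closes #834.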
Force-pushed from 26d5078 to aeebee1.
Force-pushed from aeebee1 to 2ebee54.
Force-pushed from 2ebee54 to b90e25f.
Force-pushed from b90e25f to debb354.
closes #834