feat(dask): make dask autoscaler configurable #835
Conversation
Force-pushed from e3ce610 to 128e7de.
Force-pushed from 128e7de to c91a810.
Force-pushed from c91a810 to b513ddb.
@@ -101,6 +101,8 @@ spec:
- name: DASK_ENABLED
  value: {{ .Values.dask.enabled | quote }}
{{- if .Values.dask.enabled }}
- name: DASK_AUTOSCALER_ENABLED
  value: {{ .Values.dask.autoscaler_enabled | quote }}
BTW, when testing this autoscaler PR with the autoscaler turned off:
autoscaler_enabled: false
cluster_default_number_of_workers: 4
I'm seeing a problem with stopping workflows:
$ rc run -w calver
...
$ rc stop -w calver --force
==> ERROR: Cannot stop workflow calver:
(404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '4c08dda6-899b-4da2-bdf7-2c2c76006ce6', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '6b060355-5047-4214-899c-623e82a0c569', 'X-Kubernetes-Pf-Prioritylevel-Uid': '498afda6-b2ba-4561-9313-9bda4f75865d', 'Date': 'Fri, 01 Nov 2024 14:13:25 GMT', 'Content-Length': '374'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskautoscalers.kubernetes.dask.org \"dask-autoscaler-reana-run-dask-ce1f4c15-36ff-42fc-b320-b5f277322db7\" not found","reason":"NotFound","details":{"name":"dask-autoscaler-reana-run-dask-ce1f4c15-36ff-42fc-b320-b5f277322db7","group":"kubernetes.dask.org","kind":"daskautoscalers"},"code":404}
The workflow is stopped, the Dask workers are stopped, but the job pod fails with:
$ k logs reana-run-job-42c23954-93d7-4524-8e9b-daa28e8c1fcd-5p4fg
...
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 167, in submit
raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
Exception ignored in: <function Client.__del__ at 0x7ff1dc223100>
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/distributed/client.py", line 1736, in __del__
File "/usr/local/lib/python3.11/site-packages/distributed/client.py", line 1991, in close
File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 434, in sync
TimeoutError: timed out after 60 s.
$ k logs reana-run-batch-ce1f4c15-36ff-42fc-b320-b5f277322db7-qhzq2
2024-11-01 14:14:31,967 | root | MainThread | INFO | Workflow ce1f4c15-36ff-42fc-b320-b5f277322db7 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/ce1f4c15-36ff-42fc-b320-b5f277322db7.
P.S. When the autoscaler is enabled, stopping works perfectly well.
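The 404 above suggests that the stop code path always tries to delete the DaskAutoscaler custom resource, even when autoscaler_enabled is false and the resource was never created. A minimal sketch of a tolerant deletion, assuming the Kubernetes Python client's CustomObjectsApi is used here (the function name and arguments are illustrative, not the actual reana-workflow-controller code):

from kubernetes import client
from kubernetes.client.rest import ApiException


def delete_dask_autoscaler(name: str, namespace: str = "default") -> None:
    """Delete the DaskAutoscaler custom resource, tolerating its absence."""
    api = client.CustomObjectsApi()
    try:
        api.delete_namespaced_custom_object(
            group="kubernetes.dask.org",
            version="v1",
            namespace=namespace,
            plural="daskautoscalers",
            name=name,
        )
    except ApiException as exc:
        if exc.status == 404:
            # Autoscaler was disabled, so the resource was never created; nothing to clean up.
            return
        raise

Alternatively, the deletion could simply be skipped whenever DASK_AUTOSCALER_ENABLED is false.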
@@ -69,6 +69,7 @@ This Helm automatically prefixes all names using the release name to avoid colli
| `components.reana_workflow_engine_snakemake.image` | [REANA-Workflow-Engine-Snakemake image](https://hub.docker.com/r/reanahub/reana-workflow-engine-snakemake) to use | `docker.io/reanahub/reana-workflow-engine-snakemake:<chart-release-version>` |
| `compute_backends` | List of supported compute backends (kubernetes, htcondorcern, slurmcern) | "kubernetes" |
| `dask.enabled` | Enable support for running Dask workflows | false |
| `dask.autoscaler_enabled` | Enable Dask autoscaler | true |
Is one boolean variable enough?
Looking at the Dask autoscaler spec, a few parameters are available:
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: example
spec:
  cluster: "name of DaskCluster to autoscale"
  minimum: 0   # minimum number of workers to create
  maximum: 10  # maximum number of workers to create
Perhaps instead of using only one autoscaler_enabled Helm value we should use two, for min and max, such as:
dask.cluster_autoscale_workers_min: 1
dask.cluster_autoscale_workers_max: 4
And if both min and max were set to the same value, this would indicate that autoscaling is disabled.
This would kind of "override" the meaning of cluster_default_number_of_workers, so we could get rid of that one...
We can think about this IRL.
I am a bit skeptical about these new parameters. If they are equal, for example, it not only means that autoscaling is disabled, but also that users are not allowed to override the number of workers for their Dask cluster.
I also think that we should be able to somehow restrict users from asking for an unbounded number of workers (e.g. 1000 workers with 1Mi memory each), regardless of whether autoscaling is enabled or disabled. We might specify the max number of workers in another way, by setting a minimum memory limit for one worker. This would implicitly indicate that max_workers = max_dask_cluster_memory // min_worker_memory (see the sketch below). I think this approach is better. We should talk about how we approach this one IRL.
And for the minimum number of workers, I do not see any case where we would want to force users to have at least X workers where X > 1. So maybe we can simply check if users request fewer than 1 worker and then reject such requests.
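A minimal sketch of that implicit cap, together with the suggested lower-bound check (function names, units, and the example numbers are illustrative assumptions, not values from this PR):

GIB = 1024 ** 3  # bytes in one GiB


def max_workers_allowed(max_dask_cluster_memory: int, min_worker_memory: int) -> int:
    """Implicit worker cap: the cluster memory budget divided by the minimum memory per worker."""
    return max_dask_cluster_memory // min_worker_memory


def validate_worker_request(requested_workers: int, max_workers: int) -> None:
    """Reject requests for fewer than one worker or for more than the implicit cap."""
    if requested_workers < 1:
        raise ValueError("At least one Dask worker must be requested.")
    if requested_workers > max_workers:
        raise ValueError(f"At most {max_workers} Dask workers are allowed.")


# Example: a 16 GiB cluster budget with a 2 GiB minimum per worker caps the cluster at 8 workers.
validate_worker_request(requested_workers=4,
                        max_workers=max_workers_allowed(16 * GIB, 2 * GIB))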
OK, I see the point. We could keep the autoscaling setting independent. However, allowing users to specify 1000 workers with low memory per worker would not be good, since users could generate thousands of pods, which would overload the Kubernetes master and slow everything down... So we may want to keep the memory setting independent. Let's muse IRL.
Force-pushed from b513ddb to 26d5078.
Commit log cosmetics: the conventional commit scope should represent which component was touched by the change, so we should use "helm" here:
feat(helm): make Dask autoscaler configurable (#835)
Also, please put Closes #NNN in the commit message body, so that we have the reference to the issue which this PR closes locally as well, not only on GitHub.
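For example, assuming #834 is the issue being closed (it is referenced at the end of this conversation), the full commit message would read:

feat(helm): make Dask autoscaler configurable (#835)

Closes #834.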
Force-pushed from 26d5078 to aeebee1.
Force-pushed from aeebee1 to 2ebee54.
Force-pushed from 2ebee54 to b90e25f.
Force-pushed from b90e25f to debb354.
closes #834