First check
I used the GitHub search to find a similar issue and didn't find it.
I searched the Prefect documentation for this issue.
I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
I have a flow run that triggers about 2,000 tasks. I am using the DaskTaskRunner to handle running this many tasks at scale. After about 200 tasks are scheduled, scheduling hangs. From that point the flow continues for about 5 minutes: the already-scheduled tasks run fine, but no more are scheduled. All tasks that managed to be scheduled eventually complete, and the flow then hangs in a Running state, never finishing and showing no new tasks starting. After a seemingly variable amount of time (0-5 minutes?), the flow crashes with a PoolTimeout while trying to reach <Private-Prefect-Server-IP>/api/task_runs/
Detailed sequential walkthrough of the error
1. The Prefect server is running on a medium-sized machine.
2. A flow is triggered. The flow starts triggering 2,208 tasks.
3. These tasks are handled by the DaskTaskRunner, which is configured to point at a Dask cluster (on a third, separate VM) for distributed computation of the tasks.
4. A single Dask worker (another VM, with 2 CPUs) is connected to the Dask cluster.
5. The Dask worker starts processing tasks, at most 2 at a time.
6. The logs of the prefect-worker show tasks 0-180 being scheduled quickly, then scheduling slows and stops at about 200. From here there are no more logs until the crash.
7. The Prefect UI and Dask UI show the existing tasks completing as normal. Neither UI hangs on reload or becomes unresponsive.
8. The ~200 scheduled tasks complete normally; the Prefect UI then shows the flow as running, but with no tasks pending or in progress.
9. After about 5 minutes, the flow is reported as crashed with the console UI error below.
All logs are fine apart from eventually reporting the crash in various ways. The Docker logs on the prefect-worker are the only logs with a bit more detail, as shown in the stack trace provided below.
There is clearly a bottleneck here, but I am struggling to find where it is. CPU and memory stay at normal levels on all VMs. Thinking the bottleneck might be the Prefect server's database (after reading similar issues), I moved to a Cloud SQL managed PostgreSQL database. That has not fixed it: the database shows low load and the issue still happens.
Reproduction
I managed to reproduce the problem in a simpler way with the code below (running from prefect.serve). This seems to result in a similar error, though about 900 tasks are scheduled before grinding to a halt. Maybe because it is simpler code / not run through Docker, or because my laptop is handling the scheduling and is higher spec? But there is still no out-of-the-ordinary resource usage at the point it hangs.

```python
import os

from prefect import task, flow, get_run_logger, serve
from prefect.runtime import task_run
from prefect_dask import DaskTaskRunner


def generate_task_name():
    parameters = task_run.parameters
    task_number = parameters["task_number"]
    return str(task_number)


@task(name="example_task", task_run_name=generate_task_name, log_prints=True)
def example_task(task_number):
    logger = get_run_logger()
    logger.info(f'I am task {task_number}')


@flow(name="example_flow", log_prints=True,
      task_runner=DaskTaskRunner(address="tcp://10.128.15.193:8786"))
def example_flow():
    logger = get_run_logger()
    args = list(range(2000))
    logger.info(f'Triggering {len(args)} tasks')
    example_task.map(args)


if __name__ == '__main__':
    repro_deploy = example_flow.to_deployment(
        name='reproduce_error',
        work_pool_name='default-agent-pool')
    serve(repro_deploy)
```
Versions

```
# for server and worker the output is the same
Version:             2.16.6
API version:         0.8.4
Python version:      3.10.12
Git commit:          3fecd435
Built:               Sat, Mar 23, 2024 4:06 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         server
```
Additional context
This may be a terrible architecture, so if the solution is to use Prefect in a different way then I would also welcome some advice here. A post to the Prefect Discourse on how to do something like this did not lead to any advice on good architecture patterns. For context, I tried another approach of using Prefect workers to distribute tasks out to other VMs (i.e. instead of 1 prefect-server => 1 prefect-worker => 1 dask-cluster => many dask-workers, I tried 1 prefect-server => many prefect-workers), but this crashed in a similar way (failed to reach the API).
I could also simply use Prefect to trigger the flow and call Dask directly from within the flow. Since it seems to be Prefect struggling to keep track of all the tasks, this would likely help; however, I would lose the greatly increased visibility and logging that Prefect provides at the task level.
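For reference, a minimal sketch of that workaround, assuming the same Dask scheduler address as the reproduction above (the function and flow names here are hypothetical): the work is submitted to the Dask client directly from inside the flow, so Prefect tracks only the flow run rather than each of the 2,000 task runs.

```python
# Hypothetical sketch: call Dask directly from within a Prefect flow,
# bypassing the @task decorator so Prefect does not track each task.
from dask.distributed import Client

from prefect import flow, get_run_logger


def example_task(task_number):
    # Plain Python function; with no @task decorator, no task runs are
    # registered against the Prefect API.
    return task_number


@flow(name="example_flow_dask_native", log_prints=True)
def example_flow():
    logger = get_run_logger()
    client = Client("tcp://10.128.15.193:8786")  # same scheduler address as above
    futures = client.map(example_task, list(range(2000)))  # native Dask scheduling
    client.gather(futures)  # block until all Dask futures complete
    logger.info('All tasks complete')
```

This is essentially the approach the follow-up comment below confirms scales well, at the cost of per-task visibility in the Prefect UI.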
For some additional context: I removed the Prefect @task decorator from the task to let Dask handle scaling the tasks natively, and I was easily able to scale to 1,000,000+ tasks. I think I can therefore also rule out the Dask scheduler as being a bottleneck here.