performance: optimise "blue spread" runtime and NotReady pod termination statuses #411
For reference, here is the full DB and K8S status snapshot throughout the run:
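A rough way to capture such a combined DB/K8s view periodically during a run (a sketch, not the actual tooling used for these snapshots) could be:

# Every 30 seconds, log how many workflows the DB reports as running and
# how many run-batch pods K8s has in each phase.
$ while true; do
    date '+%F %T'
    reana-client list | grep -c ' running '
    kubectl get pods --no-headers | grep '^reana-run-batch' | awk '{print $3}' | sort | uniq -c
    sleep 30
  done | tee run-snapshot.log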
suggestion: One possible solution, which was already mentioned during internal meetings, is to give a higher priority to […]. One thing to keep in mind is that progress tracking should also be sent with […].
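As an illustration only (not something from this issue), RabbitMQ supports per-message priorities once a queue is declared with an x-max-priority argument; the queue name and priority values below are made up for the sketch and are not the actual REANA settings:

# Declare a priority-enabled queue and publish a message with a high priority;
# when messages accumulate, consumers receive higher-priority messages
# (e.g. terminations) before lower-priority ones (e.g. periodic progress updates).
$ rabbitmqadmin declare queue name=example-status-queue durable=true \
    arguments='{"x-max-priority": 10}'
$ rabbitmqadmin publish routing_key=example-status-queue \
    payload='{"status": "finished"}' properties='{"priority": 9}'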
One example of a NotReady pod termination. Whilst the pod logs were showing "finished":

$ kubectl logs reana-run-batch-fb8113f6-e80d-4b5b-9f23-84d76e39c79b-ksgz7 workflow-engine
...
2021-11-01 14:23:57,952 | adage | MainThread | INFO | adage state loop done.
2021-11-01 14:23:57,952 | adage | MainThread | INFO | execution valid. (in terms of execution order)
2021-11-01 14:23:57,952 | adage | MainThread | INFO | workflow completed successfully.
2021-11-01 14:23:57,952 | yadage.steering_api | MainThread | INFO | done. dumping workflow to disk.
2021-11-01 14:23:57,960 | yadage.steering_api | MainThread | INFO | visualizing workflow.
2021-11-01 14:23:58,593 | reana-workflow-engine-yadage | MainThread | INFO | Workflow finished. Publishing...
2021-11-01 14:23:58,595 | root | MainThread | INFO | Workflow fb8113f6-e80d-4b5b-9f23-84d76e39c79b finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/fb8113f6-e80d-4b5b-9f23-84d76e39c79b.

The DB status was:

$ reana-client status -w fb8113f6-e80d-4b5b-9f23-84d76e39c79b
NAME RUN_NUMBER CREATED STARTED STATUS PROGRESS
pMSSM-NoSys-1200Mi-10nodeb-56nodej-250max-21k8s-mqprio-1k-235 1 2021-11-01T13:47:51 2021-11-01T14:16:00 running 3/6

Note that the workflow was reported as "finished" only much later:

$ reana-client status -w fb8113f6-e80d-4b5b-9f23-84d76e39c79b
NAME RUN_NUMBER CREATED STARTED ENDED STATUS PROGRESS
pMSSM-NoSys-1200Mi-10nodeb-56nodej-250max-21k8s-mqprio-1k-235 1 2021-11-01T13:47:51 2021-11-01T14:16:00 2021-11-01T14:28:35 finished 6/6

Note the total time difference: the workflow pod logs indicate that yadage finished at 14:23:58, whilst the DB indicates that the workflow finished only at 14:28:35. So the pod was "unnecessarily blue" for almost five minutes. Note also that the DB progress was 3/6 when the pod said it had finished... quite out of sync? (This is with the r-commons 314 and r-w-controller 413 PRs.)
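A rough way to observe this lag for a single workflow (a sketch, not from the issue; the workflow used below is the one from the example above) is to poll the pod status and the DB status side by side:

# Every 10 seconds, print the run-batch pod status next to the DB-reported
# workflow status, so the gap between "pod done" and "DB finished" is visible.
$ while true; do
    date '+%F %T'
    kubectl get pods --no-headers | grep reana-run-batch-fb8113f6 || echo 'no run-batch pod'
    reana-client status -w fb8113f6-e80d-4b5b-9f23-84d76e39c79b | tail -n 1
    sleep 10
  done | tee blue-spread-probe.log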
note: load on the k8s cluster might play a role here. suggestion: disable the removal of […]
WRT 1, this is with the ATLAS pMSSM 1k workflow test on a 66-node cluster, see the […].

WRT 2, we are doing a lot of stuff in […]:

$ kubectl logs deployment/reana-workflow-controller job-status-consumer | grep -c '2021-11-01 14:28'
293
$ kubectl logs deployment/reana-workflow-controller job-status-consumer | grep '2021-11-01 14:28:52'
2021-11-01 14:28:52,004 | root | MainThread | INFO | [x] Received workflow_uuid: 3e1ed0f7-5da7-4bd9-a8b5-21fa198f586a status: RunStatus.running
2021-11-01 14:28:52,037 | root | MainThread | INFO | [x] Received workflow_uuid: 4a4db77a-6177-4ba8-b836-ea9aa882fdd5 status: RunStatus.running
2021-11-01 14:28:52,071 | root | MainThread | INFO | [x] Received workflow_uuid: e0ee4593-2b88-4b45-b6ff-72c3f66042d1 status: RunStatus.running
2021-11-01 14:28:52,094 | root | MainThread | INFO | [x] Received workflow_uuid: adb8a0a5-79e7-4c6d-aa2e-758d35e10084 status: RunStatus.running
2021-11-01 14:28:52,129 | root | MainThread | INFO | [x] Received workflow_uuid: a46bae97-ebce-451d-8b2f-fbab2499e092 status: RunStatus.running
2021-11-01 14:28:52,150 | root | MainThread | INFO | [x] Received workflow_uuid: e32b2682-de4f-47fe-847b-aa7aab9472ab status: RunStatus.running
2021-11-01 14:28:52,167 | root | MainThread | INFO | [x] Received workflow_uuid: e0132e06-4bec-4385-9752-a6e74c1f74c4 status: RunStatus.running
2021-11-01 14:28:52,203 | root | MainThread | INFO | [x] Received workflow_uuid: ef250fd8-49d9-4802-bb30-286a25bfacb5 status: RunStatus.running
2021-11-01 14:28:52,238 | root | MainThread | INFO | [x] Received workflow_uuid: e2124a59-5c64-4b56-83d0-44e96ba3b6cf status: RunStatus.running
2021-11-01 14:28:52,260 | root | MainThread | INFO | [x] Received workflow_uuid: 80702651-bc28-4eb2-8e66-c6c891e8f61e status: RunStatus.running
2021-11-01 14:28:52,282 | root | MainThread | INFO | [x] Received workflow_uuid: 1b337af1-9e51-49cd-b8ee-7e4c8f6b8d50 status: RunStatus.running
2021-11-01 14:28:52,314 | root | MainThread | INFO | [x] Received workflow_uuid: 68b0ad92-6bf9-4338-a11a-2d45bf8a3cbe status: RunStatus.running
2021-11-01 14:28:52,337 | root | MainThread | INFO | [x] Received workflow_uuid: c516b730-6fea-46fb-888a-0e9b4c309959 status: RunStatus.running
2021-11-01 14:28:52,375 | root | MainThread | INFO | [x] Received workflow_uuid: 8e482984-f63d-4b5f-9588-07c276f890b1 status: RunStatus.finished

It is much easier to profile "live" locally to see where the time is spent.

WRT 3, one could call it "severely delayed"; I have seen workflows which were finished, i.e. the batch pod reported done and the output files were present, whilst the DB still indicated them as not finished.
WRT 4, I ran with the exact versions of r-w-controller@pr-413 and r-commons@pr-314.
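For the consumer-side question above (WRT 2), one quick sanity check (a sketch, not from the issue) is to histogram how many status messages the job-status-consumer handles per second, which gives a rough idea of its throughput:

# Count job-status-consumer log lines per second; the first 19 characters of
# each log line are the 'YYYY-MM-DD HH:MM:SS' timestamp.
$ kubectl logs deployment/reana-workflow-controller job-status-consumer \
    | grep 'Received workflow_uuid' \
    | cut -c1-19 | sort | uniq -c | tail -n 20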
@tiborsimko what should we do with this issue? Do we still have the same problem with benchmarks on a bigger cluster? If not, we can close this issue. P.S. We had a PR that optimizes the deletion of batch pods, and it looks like it solved some of the pain points (comment).
The performance test launching 200 simple workflows, defined as […], on a system with a 1000 concurrent workflow limit, 14 batch nodes, and 12 job nodes, gives the following:
One can see that there is a lot of "pending time" when starting workflows; see the sister issue reanahub/reana-server#410.
This issue focuses on optimising the "blue spread", where some workflow runs are terminated fast whilst others take a long time.
Here is the timing information about this test run:
Here is a snapshot of the cluster situation just a few seconds after the start of the test run:
Here is the situation some 20 seconds later where some of the workflows start to get finished:
Here is the situation around the "shoulder" of the plot, some 110 seconds after the start of the test run:
Here is the situation near the completion of the test run:
and ten seconds later:
We can see here that even though all workflows have apparently finished successfully, some 39 are still being reported as "running" in the database.
This may indicate an issue with the run-batch termination tasks. When a workflow finishes, its pod first flips into a "NotReady" state before passing to the "Terminating" state, and the DB state does not seem to fully correlate with those: some of them the DB will still see as "Running".
This means that if someone uses a concurrent limit on how many workflows can run in parallel, the scheduler may not start new workflows as fast as it could. This could contribute to the "orange hill" issue, even though it is not the case for the run at hand, since the concurrent limit was set very high and all 200 workflows could have started in parallel. Issue reanahub/reana-server#410 is dedicated to investigating that.
In the issue at hand, we are concerned with the apparent difference between what the DB sees as running workflows and what K8s has as running pods. The present observations may mean that even if some workflows finished successfully, produced their outputs, and are basically over, they remain "stuck" in the NotReady -> Terminating stages, leading to slower DB updates and thus contributing to the "blue spread" factor, somewhat "artificially" inflating it.
We should investigate the run-batch termination tasks again and, if possible, speed up the termination procedures, so that workflows are marked as "finished" in the DB as fast as possible.
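As a starting point for that investigation, here is a rough sketch (not from the issue; everything beyond the standard reana-run-batch- pod name prefix is an assumption) of how one could sample how long finished run-batch pods linger in a not-fully-ready state:

# Every 5 seconds, record run-batch pods whose containers are not all ready
# (the NotReady/Terminating window of interest) together with a timestamp,
# so the lingering time can be read off the log afterwards.
$ while true; do
    kubectl get pods --no-headers 2>/dev/null \
      | awk -v ts="$(date '+%F %T')" \
          '/^reana-run-batch/ { split($2, r, "/"); if (r[1] != r[2]) print ts, $1, $2, $3 }'
    sleep 5
  done | tee notready-run-batch.log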
(Note that this test was run with REANA_WORKFLOW_TERMINATION_QUOTA_UPDATE_POLICY set to none, so that no termination time was lost in the CPU and Disk quota update parts. This was verified via quota reporting before and after the test: both zero.)

P.S. On other news, there are only a few cases of "Pending" pods, meaning that jobs could be allocated mostly well across the cluster, so these should contribute less to the overall reported "blue" runtimes in the plot.
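For reference, one possible way to apply this policy on a running deployment (an assumption for illustration; the variable name is the one quoted above, but the usual way of configuring it, e.g. via Helm values, is not covered here) would be:

# Set the termination quota update policy on the workflow-controller
# deployment; kubectl restarts its pods so the new value takes effect.
$ kubectl set env deployment/reana-workflow-controller \
    REANA_WORKFLOW_TERMINATION_QUOTA_UPDATE_POLICY=none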