consumer: remove pods for deleted workflows #445

Draft · wants to merge 1 commit into master
Conversation

@VMois commented Apr 13, 2022

closes #437

@codecov-commenter

Codecov Report

Merging #445 (38dafac) into master (ae32eb8) will decrease coverage by 0.66%.
The diff coverage is 6.66%.


@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
- Coverage   68.22%   67.56%   -0.67%     
==========================================
  Files          15       15              
  Lines        1328     1341      +13     
==========================================
  Hits          906      906              
- Misses        422      435      +13     
Impacted Files                            Coverage Δ
reana_workflow_controller/consumer.py     35.75% <6.66%> (-2.81%) ⬇️

f"Could not clean up not alive workflow {workflow.id_} batch pod for workflow."
f" Error: {exception}"
)
_delete_workflow_jobs(workflow)
@VMois (Author) commented Apr 19, 2022

I have a problem deciding on this line whether to delete workflow jobs or not, for two reasons:

(1) Theoretically, when a workflow is deleted while there are a lot of messages in the jobs-status queue, its cleanup will be delayed, and during this delay the workflow pod can start executing jobs. But I have a hard time coming up with a test scenario (unit or manual).

(2) In addition, I am not sure this is the right place to delete jobs. It would make more sense to delete them on shutdown of the job-controller pod, but that would require investigating how to implement it in job-controller.

WDYT?

Member
If you are only addressing the NotReady situation for the workflow engine pod, then I have not seen many situations in production where we also needed to kill the workflow step jobs... So I think we should be mostly okay here already?

The situation of surviving workflow step pods happened in the past in other kinds of trouble that were not related to NotReady. I recall one recent example where workflow step pods survived in the Completed status but there was no workflow batch pod anymore. That is, the workflow batch pod apparently terminated well, but some of the workflow step pods still remained in Completed status. Have you tried to look at this situation too?

FWIW, I tend to agree with you about the danger of deleting things for all statuses outside ALIVE_STATUSES, because that could perhaps include ones we don't want to touch.

@VMois (Author)

Good point about not deleting everything outside ALIVE_STATUSES. I should only clean up workflows that have the deleted status.
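
A minimal sketch of that narrower check, assuming the Workflow and RunStatus models from reana-db; the helper name below is illustrative and not necessarily what this PR implements:

```python
# Sketch, not the PR's actual code: restrict cleanup to workflows whose
# status is explicitly "deleted", instead of everything outside ALIVE_STATUSES.
from reana_db.models import RunStatus, Workflow


def _should_clean_up(workflow: Workflow) -> bool:
    """Return True only for workflows explicitly marked as deleted."""
    return workflow.status == RunStatus.deleted
```

The consumer could then call such a check before invoking _delete_workflow_batch_pod() and _delete_workflow_jobs().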

@VMois (Author)

> Have you tried to look at this situation too?

Do we have an issue describing in detail this situation? I couldn't find it.

> I recall one recent example where workflow step pods survived in the Completed status but there was no workflow batch pod anymore.

Step jobs are removed by job-controller when they finish, so this probably means something failed in job-controller. There is also a possibility to automate step job deletion in Kubernetes via the TTL-after-finished controller: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#ttl-after-finished-controller. But this is a discussion for a new issue.
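
For reference, a minimal sketch of how a Kubernetes Job could opt into that controller using the official Python client; the job name, image, and namespace are placeholders, and this is not how job-controller currently creates jobs:

```python
# Sketch: let Kubernetes garbage-collect a finished Job (and its pods)
# automatically via the TTL-after-finished controller.
from kubernetes import client

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="reana-run-job-example"),  # placeholder name
    spec=client.V1JobSpec(
        # Delete the Job and its pods 60 seconds after it finishes.
        ttl_seconds_after_finished=60,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(name="step", image="busybox", command=["true"])
                ],
            )
        ),
    ),
)

# client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```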

Member

> Do we have an issue describing in detail this situation? I couldn't find it.

E.g. the same as reanahub/reana-job-controller#326. That issue is closed, but I saw the same problem also recently.

@VMois (Author)

> If you are only addressing the NotReady situation for the workflow engine pod, then I have not seen many situations in production where we also needed to kill the workflow step jobs... So I think we should be mostly okay here already?

This is a tricky question. First of all, the issue with hanging step jobs is unlikely, so we are talking about an edge case here.

In the situation above, if you wait long enough, the only thing left hanging is the workflow engine pod. In reality, before that, all step jobs would still run, finish, and get deleted, so they waste cluster resources in the meantime.

@@ -281,7 +291,7 @@ def _update_job_cache(msg):
     Session.add(cached_job)


-def _delete_workflow_job(workflow: Workflow) -> None:
+def _delete_workflow_batch_pod(workflow: Workflow) -> None:
Member

Please be careful with the terminology when renaming. We are deleting a Kubernetes "job" in this function, not only a "pod", since we call delete_namespaced_job(). So this rename wouldn't be fully justified.

When a workflow starts, it creates a Kubernetes job named e.g. `reana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edb`, which in turn creates a Kubernetes pod named e.g. `reana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edb-hz46p` where the workflow engine runs.

It is the latter that can get into the NotReady situation. When this happens, the manual solution is to delete the former (the job), which then automatically deletes the latter (the pod) as well. (If we deleted only the batch pod, the Kubernetes job would start a new batch pod.)

Perhaps the terminology is confusing, since we have both a Kubernetes job and pod and a workflow engine job and pod, and the workflow engine pod can itself generate new workflow step jobs and pods... Needs some renaming care.
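
As an illustration of that cascade, a minimal sketch using the official Python Kubernetes client; the job name and namespace are placeholders, and the propagation policy shown is an assumption rather than what the controller necessarily uses:

```python
# Sketch: delete the Kubernetes Job (not just its pod) so that the pod is
# garbage-collected with it and no replacement pod gets started.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

client.BatchV1Api().delete_namespaced_job(
    name="reana-run-batch-0bf3cf48-247a-4cbf-bd5f-727cf41e5edb",  # placeholder
    namespace="default",  # placeholder
    # "Background" cascades the deletion to the pods owned by the Job.
    propagation_policy="Background",
)
```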

@VMois (Author) Apr 19, 2022

Yeah, this is not the final naming. _delete_workflow_batch_job would be a better name, or even _delete_workflow_engine_batch_job.

And I agree, the terminology is confusing.

@VMois (Author) commented Apr 20, 2022

I managed to reproduce a scenario where step jobs get left hanging even though the workflow engine pod is removed (when the workflow is deleted).

With this PR applied, you will need to:

  1. Add a sleep at the end of the on_message function, inside the try/except:
import time
time.sleep(5)
  2. In the root6-roofit example (for instance), add a sleep at the beginning of the gendata step:
sleep 60 && ...
  3. Start one workflow to keep the job-status queue busy:
reana-client run -w w-1
  4. After 4-5 seconds, start a second workflow; its step jobs are the ones that will hang:
reana-client run -w w-2
  5. After the second workflow starts, wait 5-10 seconds (or until you see that its first job has started), then delete the workflow:
reana-client delete -w w-2
  6. Monitor kubectl get pods; you will see one job left hanging.

Why does this happen?

Imagine that we only delete workflow engine Jobs when the consumer detects a deleted workflow, and that the job-status queue is busy with a lot of messages.

A user starts a workflow, but it gets stuck in "pending" mode. After some time, the user deletes the workflow, so it now has the "deleted" status in the DB. But, in the background, the workflow engine Job is not deleted.

When ready, the workflow engine Job will start and send its first message to the "jobs-status" queue. The consumer will not process this message anytime soon because the queue is busy, which also means we cannot delete the workflow engine pod soon. Because of that, the workflow engine pod will start the step jobs belonging to the workflow. Let's assume those steps run for a long time.

Later, when the first message finally reaches the consumer, the consumer deletes the workflow engine Job. But the step job(s) are still running, and no one will delete them.

This is a textual explanation for now.

@VMois (Author) commented Apr 20, 2022

I have just realized that deleting step jobs in the consumer is a bad idea. Reason: job-controller supports more compute_backends than just Kubernetes, such as HTCondor and Slurm. If we want to delete step jobs in the consumer, I assume we would need to handle those backends too, which creates a mess. WDYT? cc: @tiborsimko

A better idea could be to make job-controller responsible for cleaning up its jobs on shutdown.
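
A hedged sketch of what such a shutdown hook could look like; the handler and the cleanup helper below are hypothetical and not part of job-controller's actual code:

```python
# Hypothetical sketch: clean up remaining step jobs when the job-controller
# process receives SIGTERM (e.g. when its pod is being deleted).
import signal
import sys


def delete_remaining_jobs() -> None:
    """Placeholder for backend-specific cleanup (Kubernetes, HTCondor, Slurm)."""
    print("cleaning up remaining step jobs...")


def _on_sigterm(signum, frame):
    delete_remaining_jobs()
    sys.exit(0)


signal.signal(signal.SIGTERM, _on_sigterm)
```

Kubernetes sends SIGTERM to a pod's containers before deleting it, so a handler like this would get a chance to run within the pod's termination grace period.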

Merging this pull request may close issue #437: "job-status-consumer: improve handling of 'not alive' workflows".