job-status-consumer: improve handling of "not alive" workflows #437
Notes from the discussion:

- This is easy to do, and I have already opened a PR for it.
- This logic is trickier and will require more time; it may even need to be handled differently, so it will be done in a separate PR.
- One easy way of solving the problem would be to forbid deleting a workflow in a `pending` state. Update: allowing to delete workflows in a `pending` state …
Some workflows that have a "not alive" status (finished, failed, deleted, stopped, etc.) in the DB can continue running on Kubernetes, and can even start reporting their status to `job-status-consumer`.

Example of such a workflow from the `job-status-consumer` logs: workflow `9b67170b-33ce-4dc3-8150-99e490afcade` is reported as `deleted` in the DB, but, according to those logs, it even finished its execution. A user reported that some of the workflows were stuck in "pending".

Looking into the code:
`reana-workflow-controller/reana_workflow_controller/rest/utils.py`, lines 202 to 211 at commit `2e539ec`
In `delete_workflow`, it is possible to delete a "pending" workflow, or, more precisely, to mark the workflow as deleted in the DB. This probably means that the workflow was deleted between reaching the "pending" status and the actual start of the workflow-engine pod (and the sending of the first message to the `job-status` queue). Some optional questions arise from this.
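A minimal sketch of how the consumer could guard against and log such messages, assuming hypothetical `Session`, `Workflow`, and `RunStatus` names (they stand in for the reana-db models and are not verified against the actual code):

```python
import logging

# The imports and model names below are assumptions standing in for the
# reana-db models; they are illustrative, not the actual API.
from reana_db.database import Session
from reana_db.models import Workflow, RunStatus

# Statuses in which a workflow should no longer produce job-status messages.
NOT_ALIVE_STATUSES = {
    RunStatus.finished,
    RunStatus.failed,
    RunStatus.stopped,
    RunStatus.deleted,
}


def on_job_status_message(workflow_uuid: str, body: dict) -> None:
    """Handle one message from the job-status queue (simplified sketch)."""
    workflow = Session.query(Workflow).filter_by(id_=workflow_uuid).one_or_none()
    if workflow is None:
        logging.warning("Received status for unknown workflow %s; ignoring.", workflow_uuid)
        return
    if workflow.status in NOT_ALIVE_STATUSES:
        # Log the concrete DB status instead of a generic "not alive" message,
        # so the logs make it clear why the update is being skipped.
        logging.warning(
            "Workflow %s is not alive (DB status: %s) but is still reporting "
            "job status %s; skipping update.",
            workflow_uuid,
            workflow.status.name,
            body.get("status"),
        )
        return
    # ... normal processing of the status update would continue here ...
```

The point is only that the consumer looks up the DB status first and logs the concrete state before skipping the update.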
How to reproduce

1. Start a workflow: `reana-client run -w test`
2. Catch it in the `pending` state and delete it: `reana-client delete -w test`
3. Run `reana-client list`; it will show you that the `test` workflow is deleted.
4. Run `kubectl get pods`; you will find the batch pod in a `NotReady` state (and it will stay like this; a programmatic version of this check is sketched after the list).
5. Run `kubectl logs deployment/reana-workflow-controller job-status-consumer`; it will show you that the workflow was not in an alive state but still continued to execute.
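A sketch of the `kubectl get pods` check done with the official Kubernetes Python client; the `reana-run-batch-` pod name prefix and the `default` namespace are assumptions for illustration, not confirmed REANA conventions:

```python
from kubernetes import client, config


def find_stuck_batch_pods(workflow_id: str, namespace: str = "default"):
    """Report batch pods of a workflow and whether their containers are ready.

    The "reana-run-batch-" name prefix and the namespace are assumptions
    used for illustration; adjust them to the actual deployment.
    """
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    core_v1 = client.CoreV1Api()
    stuck = []
    for pod in core_v1.list_namespaced_pod(namespace).items:
        if not pod.metadata.name.startswith(f"reana-run-batch-{workflow_id}"):
            continue
        ready = all(cs.ready for cs in (pod.status.container_statuses or []))
        print(f"{pod.metadata.name}: phase={pod.status.phase}, ready={ready}")
        if not ready:
            stuck.append(pod.metadata.name)
    return stuck
```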
Next actions

- Set up better logging for "not alive" workflows, so that it is clear from the logs which state a workflow is in when it is not alive (job-status-consumer: improve logging of "not alive" workflows #443).
- If a workflow is deleted but is still running on Kubernetes, this somehow needs to be detected and fixed so that we do not have hanging workflows (a possible cleanup sketch follows below).
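One possible shape for that second action is a periodic reconciliation that removes leftover run-batch jobs whose workflow is no longer alive in the DB. This is only a sketch under the same assumptions as above (hypothetical model names, assumed `reana-run-batch-<workflow-id>` job naming, assumed namespace), not an agreed design:

```python
from kubernetes import client, config
from reana_db.database import Session
from reana_db.models import Workflow  # model names are assumptions, as above

# Same "not alive" set as in the consumer sketch, compared by status name
# to keep this snippet self-contained.
NOT_ALIVE_STATUS_NAMES = {"finished", "failed", "stopped", "deleted"}


def reconcile_not_alive_workflows(namespace: str = "default") -> None:
    """Delete leftover run-batch jobs whose workflow is no longer alive in the DB."""
    config.load_incluster_config()
    batch_v1 = client.BatchV1Api()
    for job in batch_v1.list_namespaced_job(namespace).items:
        name = job.metadata.name
        # Assumed naming convention: one job per workflow, "reana-run-batch-<workflow-id>".
        if not name.startswith("reana-run-batch-"):
            continue
        workflow_id = name[len("reana-run-batch-"):]
        workflow = Session.query(Workflow).filter_by(id_=workflow_id).one_or_none()
        if workflow is not None and workflow.status.name in NOT_ALIVE_STATUS_NAMES:
            # "Background" propagation removes the pods created by the job as well,
            # so nothing keeps running for an already-deleted workflow.
            batch_v1.delete_namespaced_job(
                name, namespace, propagation_policy="Background"
            )
```

Such a loop could run periodically or be triggered directly by the consumer when it receives a message for a not-alive workflow.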