job-status-consumer: improve handling of "not alive" workflows #437
Notes from the discussion:

- This is easy to do, and I have already opened a PR for it.
- This logic is trickier and will require more time; it may even need to be handled differently, so it will be done in a separate PR.
- One easy way of solving the problem would be to forbid deleting a workflow in a `pending` state. Update: allowing to delete workflows in a `pending` state …
Some workflows that have a "not alive" status (finished, failed, deleted, stopped, etc.) in the DB can continue running on Kubernetes, and can even start reporting their status to `job-status-consumer`.

Example of such a workflow from the `job-status-consumer` logs: workflow `9b67170b-33ce-4dc3-8150-99e490afcade` is reported as `deleted` in the DB, but, according to those logs, it even finished its execution. A user reported that some of the workflows were stuck in "pending".

Looking into the code:
`reana-workflow-controller/reana_workflow_controller/rest/utils.py`, lines 202 to 211 at commit `2e539ec`
In `delete_workflow`, it is possible to delete a "pending" workflow, or, more precisely, to mark the workflow as deleted in the DB. This probably means that the workflow was deleted between reaching the "pending" status and the actual start of the workflow-engine pod (and the sending of the first message to the `job-status` queue). Some optional questions arise from this.
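A minimal sketch of how the consumer could guard against and log such messages, assuming hypothetical `Session`, `Workflow`, and `RunStatus` names (they stand in for the reana-db models and are not verified against the actual code):

```python
import logging

# The imports and model names below are assumptions standing in for the
# reana-db models; they are illustrative, not the actual API.
from reana_db.database import Session
from reana_db.models import Workflow, RunStatus

# Statuses in which a workflow should no longer produce job-status messages.
NOT_ALIVE_STATUSES = {
    RunStatus.finished,
    RunStatus.failed,
    RunStatus.stopped,
    RunStatus.deleted,
}


def on_job_status_message(workflow_uuid: str, body: dict) -> None:
    """Handle one message from the job-status queue (simplified sketch)."""
    workflow = Session.query(Workflow).filter_by(id_=workflow_uuid).one_or_none()
    if workflow is None:
        logging.warning("Received status for unknown workflow %s; ignoring.", workflow_uuid)
        return
    if workflow.status in NOT_ALIVE_STATUSES:
        # Log the concrete DB status instead of a generic "not alive" message,
        # so the logs make it clear why the update is being skipped.
        logging.warning(
            "Workflow %s is not alive (DB status: %s) but is still reporting "
            "job status %s; skipping update.",
            workflow_uuid,
            workflow.status.name,
            body.get("status"),
        )
        return
    # ... normal processing of the status update would continue here ...
```

The point is only that the consumer looks up the DB status first and logs the concrete state before skipping the update.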
How to reproduce

1. Start a workflow: `reana-client run -w test`
2. Catch it in the `pending` state and delete it: `reana-client delete -w test`
3. Run `reana-client list`; it will show you that the `test` workflow is deleted.
4. Run `kubectl get pods`; you will find the batch pod in a `NotReady` state (and it will stay like this; a programmatic version of this check is sketched after the list).
5. Run `kubectl logs deployment/reana-workflow-controller job-status-consumer`; it will show you that the workflow was not in an alive state but still continued to execute.
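A sketch of the `kubectl get pods` check done with the official Kubernetes Python client; the `reana-run-batch-` pod name prefix and the `default` namespace are assumptions for illustration, not confirmed REANA conventions:

```python
from kubernetes import client, config


def find_stuck_batch_pods(workflow_id: str, namespace: str = "default"):
    """Report batch pods of a workflow and whether their containers are ready.

    The "reana-run-batch-" name prefix and the namespace are assumptions
    used for illustration; adjust them to the actual deployment.
    """
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    core_v1 = client.CoreV1Api()
    stuck = []
    for pod in core_v1.list_namespaced_pod(namespace).items:
        if not pod.metadata.name.startswith(f"reana-run-batch-{workflow_id}"):
            continue
        ready = all(cs.ready for cs in (pod.status.container_statuses or []))
        print(f"{pod.metadata.name}: phase={pod.status.phase}, ready={ready}")
        if not ready:
            stuck.append(pod.metadata.name)
    return stuck
```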
Next actions

- Set up better logging for "not alive" workflows, so that it is clear from the logs which state a workflow is in when it is not alive (job-status-consumer: improve logging of "not alive" workflows #443).
- If a workflow is deleted but is still running on Kubernetes, this somehow needs to be detected and fixed so that we do not have hanging workflows (a possible cleanup sketch follows below).
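One possible shape for that second action is a periodic reconciliation that removes leftover run-batch jobs whose workflow is no longer alive in the DB. This is only a sketch under the same assumptions as above (hypothetical model names, assumed `reana-run-batch-<workflow-id>` job naming, assumed namespace), not an agreed design:

```python
from kubernetes import client, config
from reana_db.database import Session
from reana_db.models import Workflow  # model names are assumptions, as above

# Same "not alive" set as in the consumer sketch, compared by status name
# to keep this snippet self-contained.
NOT_ALIVE_STATUS_NAMES = {"finished", "failed", "stopped", "deleted"}


def reconcile_not_alive_workflows(namespace: str = "default") -> None:
    """Delete leftover run-batch jobs whose workflow is no longer alive in the DB."""
    config.load_incluster_config()
    batch_v1 = client.BatchV1Api()
    for job in batch_v1.list_namespaced_job(namespace).items:
        name = job.metadata.name
        # Assumed naming convention: one job per workflow, "reana-run-batch-<workflow-id>".
        if not name.startswith("reana-run-batch-"):
            continue
        workflow_id = name[len("reana-run-batch-"):]
        workflow = Session.query(Workflow).filter_by(id_=workflow_id).one_or_none()
        if workflow is not None and workflow.status.name in NOT_ALIVE_STATUS_NAMES:
            # "Background" propagation removes the pods created by the job as well,
            # so nothing keeps running for an already-deleted workflow.
            batch_v1.delete_namespaced_job(
                name, namespace, propagation_policy="Background"
            )
```

Such a loop could run periodically or be triggered directly by the consumer when it receives a message for a not-alive workflow.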