monitor: evicted pods are sometimes not considered as failed #438

Open
mdonadoni opened this issue Feb 27, 2024 · 0 comments

mdonadoni commented Feb 27, 2024

Seen on DEV

The workflow is stuck waiting for a job to complete, as can be seen from the repeated requests made to job-controller:

...
2024-02-26 08:05:16,862 | werkzeug | Thread-791 | INFO | 127.0.0.1 - - [26/Feb/2024 08:05:16] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:05:37,486 | werkzeug | Thread-792 | INFO | 127.0.0.1 - - [26/Feb/2024 08:05:37] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:05:57,667 | werkzeug | Thread-793 | INFO | 127.0.0.1 - - [26/Feb/2024 08:05:57] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:06:17,899 | werkzeug | Thread-794 | INFO | 127.0.0.1 - - [26/Feb/2024 08:06:17] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:06:38,127 | werkzeug | Thread-795 | INFO | 127.0.0.1 - - [26/Feb/2024 08:06:38] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:06:58,570 | werkzeug | Thread-796 | INFO | 127.0.0.1 - - [26/Feb/2024 08:06:58] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:07:18,733 | werkzeug | Thread-797 | INFO | 127.0.0.1 - - [26/Feb/2024 08:07:18] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
...

However, k8s reports the job's pod as evicted at around the same time it was created:

status:
  message: 'Pod The node had condition: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: "2024-02-26T05:19:55Z"

The monitor has correctly received the update event, and it has correctly identified the pod as failed:

2024-02-26 05:19:55,185 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-... failed.
2024-02-26 05:19:55,193 | root | kubernetes_job_monitor | INFO | New Pod event received: MODIFIED
2024-02-26 05:19:55,193 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-... failed.

However, the job status and the logs are not updated. This is the condition used to determine whether to update the status/logs:

remaining_jobs = self._get_remaining_jobs(
    statuses_to_skip=[
        JobStatus.finished.name,
        JobStatus.failed.name,
        JobStatus.stopped.name,
    ]
)
backend_job_id = self.get_backend_job_id(job_pod)
is_job_in_remaining_jobs = backend_job_id in remaining_jobs
job_status = self.get_job_status(job_pod)
is_job_completed = job_status in [
    JobStatus.finished.name,
    JobStatus.failed.name,
]
return is_job_in_remaining_jobs and is_job_completed

job-controller's API still reports the job as started. A possible cause is that the pod is created and evicted before the job is saved to the database, so the job is not yet part of remaining_jobs when the eviction event is handled, and the status/logs update is skipped.
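
To illustrate the suspected race, here is a minimal sketch that mirrors the condition with plain data structures. It is not the monitor's actual code: the `JobStatus` enum below is a hypothetical stand-in, `should_process_job` and `jobs_in_db` are names made up for this example, and the status stored for the saved job is an assumption.

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical stand-in for the real JobStatus enum; only the names matter.
    created = 0
    running = 1
    finished = 2
    failed = 3
    stopped = 4


def should_process_job(backend_job_id, pod_status, jobs_in_db):
    """Mirror the condition quoted above.

    jobs_in_db maps backend job id -> status name and stands in for the database.
    """
    statuses_to_skip = {
        JobStatus.finished.name,
        JobStatus.failed.name,
        JobStatus.stopped.name,
    }
    remaining_jobs = {
        job_id
        for job_id, status in jobs_in_db.items()
        if status not in statuses_to_skip
    }
    is_job_in_remaining_jobs = backend_job_id in remaining_jobs
    is_job_completed = pod_status in (
        JobStatus.finished.name,
        JobStatus.failed.name,
    )
    return is_job_in_remaining_jobs and is_job_completed


# The eviction event arrives before the job has been saved: it is ignored.
jobs_in_db = {}
early_event = should_process_job("job-1", JobStatus.failed.name, jobs_in_db)

# Once the job is saved (status assumed here), the same event would be
# processed, but no new pod event arrives to trigger the check again.
jobs_in_db["job-1"] = JobStatus.running.name
late_event = should_process_job("job-1", JobStatus.failed.name, jobs_in_db)

print(early_event, late_event)  # False True
```

If this is indeed the cause, the failed event is logged by the monitor but dropped by the condition, which would match the logs above.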

Manually deleting the evicted pod makes the monitor update the job status and logs, after which the workflow terminates on its own.
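
As a rough aid for applying this workaround, the sketch below lists pods that failed with reason `Evicted`. It assumes an object exposing `list_namespaced_pod` as the Kubernetes Python client's `CoreV1Api` does. The `find_evicted_pods` helper, the namespace and the pod names are hypothetical, and the fake API object exists only so the example runs without a cluster.

```python
from types import SimpleNamespace


def find_evicted_pods(core_api, namespace):
    """Return the names of pods in `namespace` that failed with reason Evicted."""
    pods = core_api.list_namespaced_pod(
        namespace, field_selector="status.phase=Failed"
    )
    return [
        pod.metadata.name
        for pod in pods.items
        if pod.status.reason == "Evicted"
    ]


class FakeCoreApi:
    """Stand-in for kubernetes.client.CoreV1Api, for a dry run only."""

    def list_namespaced_pod(self, namespace, field_selector=None):
        def make_pod(name, reason):
            return SimpleNamespace(
                metadata=SimpleNamespace(name=name),
                status=SimpleNamespace(phase="Failed", reason=reason),
            )

        # Hypothetical pods: one evicted, one failed for another reason.
        return SimpleNamespace(
            items=[make_pod("pod-a", "Evicted"), make_pod("pod-b", "Error")]
        )


evicted = find_evicted_pods(FakeCoreApi(), "default")
print(evicted)  # ['pod-a']

# Against a real cluster (assumption: kubeconfig access is available):
# from kubernetes import client, config
# config.load_kube_config()
# api = client.CoreV1Api()
# for name in find_evicted_pods(api, "default"):
#     api.delete_namespaced_pod(name, "default")
```

Deleting pods this way only works around the symptom; it does not address why the eviction event was skipped in the first place.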

@mdonadoni mdonadoni self-assigned this Feb 28, 2024
@mdonadoni mdonadoni removed their assignment Mar 14, 2024
@mdonadoni mdonadoni added this to 0.95.0 Aug 8, 2024
@mdonadoni mdonadoni moved this to Backlog in 0.95.0 Aug 8, 2024