monitor: evicted pods are sometimes not considered as failed #438

mdonadoni · 2024-02-27T09:28:04Z

Seen on DEV

The workflow is stuck waiting for a job to complete, as can be seen from the repeated requests made to job-controller:

...
2024-02-26 08:05:16,862 | werkzeug | Thread-791 | INFO | 127.0.0.1 - - [26/Feb/2024 08:05:16] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:05:37,486 | werkzeug | Thread-792 | INFO | 127.0.0.1 - - [26/Feb/2024 08:05:37] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:05:57,667 | werkzeug | Thread-793 | INFO | 127.0.0.1 - - [26/Feb/2024 08:05:57] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:06:17,899 | werkzeug | Thread-794 | INFO | 127.0.0.1 - - [26/Feb/2024 08:06:17] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:06:38,127 | werkzeug | Thread-795 | INFO | 127.0.0.1 - - [26/Feb/2024 08:06:38] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:06:58,570 | werkzeug | Thread-796 | INFO | 127.0.0.1 - - [26/Feb/2024 08:06:58] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
2024-02-26 08:07:18,733 | werkzeug | Thread-797 | INFO | 127.0.0.1 - - [26/Feb/2024 08:07:18] "GET /jobs/<uuid-of-job> HTTP/1.1" 200 -
...

However, the job is reported as evicted on k8s at around the same time it was created:

status:
  message: 'Pod The node had condition: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: "2024-02-26T05:19:55Z"
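
For reference, an eviction like this one can be recognized from the pod's status fields alone. Below is a minimal sketch that works on a plain dictionary shaped like the status shown above; the helper name `is_evicted` is hypothetical and is not part of reana-job-controller.

```python
def is_evicted(pod_status: dict) -> bool:
    """Return True if the status describes a failed pod whose reason is 'Evicted'."""
    return (
        pod_status.get("phase") == "Failed"
        and pod_status.get("reason") == "Evicted"
    )


# Status copied from the pod described in this issue.
status = {
    "message": "Pod The node had condition: [DiskPressure]. ",
    "phase": "Failed",
    "reason": "Evicted",
    "startTime": "2024-02-26T05:19:55Z",
}
print(is_evicted(status))
```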

The monitor has correctly received the update event, and it has correctly identified the pod as failed:

2024-02-26 05:19:55,185 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-... failed.
2024-02-26 05:19:55,193 | root | kubernetes_job_monitor | INFO | New Pod event received: MODIFIED
2024-02-26 05:19:55,193 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-... failed.

However, the job status and the logs are not updated. This is the condition used to determine whether to update the status/logs:

reana-job-controller/reana_job_controller/job_monitor.py

Lines 115 to 131 in b9f8364

    
    remaining_jobs = self._get_remaining_jobs(
        statuses_to_skip=[
            JobStatus.finished.name,
            JobStatus.failed.name,
            JobStatus.stopped.name,
        ]
    )
    backend_job_id = self.get_backend_job_id(job_pod)
    is_job_in_remaining_jobs = backend_job_id in remaining_jobs
    job_status = self.get_job_status(job_pod)
    is_job_completed = job_status in [
        JobStatus.finished.name,
        JobStatus.failed.name,
    ]
    return is_job_in_remaining_jobs and is_job_completed

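job-controller's API still reports the job as started. A possible cause is that the job is created and evicted before it is saved to the database: in that case it is not part of remaining_jobs when the event arrives, so the condition above evaluates to false and the event is ignored.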
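
The suspected race can be illustrated with a small self-contained model of the condition quoted above. This is only a sketch under the assumption made in this issue (the pod event arrives before the job is stored in the database). The names `should_process`, `db_jobs` and the job ID `job-1` are hypothetical stand-ins, and the `JobStatus` enum here is a simplified replacement for the real one, not the actual job-controller code.

```python
from enum import Enum


class JobStatus(Enum):
    # Simplified stand-in for the real JobStatus enum; only the names matter.
    started = 1
    finished = 2
    failed = 3
    stopped = 4


def should_process(backend_job_id, job_status, db_jobs):
    """Mirror the quoted condition using an in-memory 'database'.

    db_jobs maps backend job IDs to the status name stored for that job.
    """
    statuses_to_skip = [
        JobStatus.finished.name,
        JobStatus.failed.name,
        JobStatus.stopped.name,
    ]
    remaining_jobs = {
        job_id
        for job_id, status in db_jobs.items()
        if status not in statuses_to_skip
    }
    is_job_in_remaining_jobs = backend_job_id in remaining_jobs
    is_job_completed = job_status in [
        JobStatus.finished.name,
        JobStatus.failed.name,
    ]
    return is_job_in_remaining_jobs and is_job_completed


db_jobs = {}  # assumption: the job has not been saved to the database yet

# 1. The eviction event arrives first: the job is unknown, so it is skipped.
early = should_process("job-1", JobStatus.failed.name, db_jobs)

# 2. The job is saved afterwards and is recorded as started.
db_jobs["job-1"] = JobStatus.started.name

# 3. A later event for the same failed pod would pass the check.
late = should_process("job-1", JobStatus.failed.name, db_jobs)

print(early, late)
```

In this model the early event is dropped and nothing re-triggers the check, so the job stays in the started state until another event for the same pod is received.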

Manually deleting the evicted pod makes the monitor update the job status/logs, and the workflow then terminates on its own.

mdonadoni added type/bug priority/soon labels Feb 27, 2024

mdonadoni self-assigned this Feb 28, 2024

mdonadoni removed their assignment Mar 14, 2024

mdonadoni added this to 0.95.0 Aug 8, 2024

mdonadoni moved this to Backlog in 0.95.0 Aug 8, 2024

mdonadoni added the size/m label Nov 14, 2024
