runaway workflow execution due to invalid image specs #364

Open
tiborsimko opened this issue May 2, 2022 · 1 comment

@tiborsimko
Member

Current behaviour

Consider a workflow definition whose image specification is invalid because of trailing whitespace, such as the following helloworld serial demo:

diff --git a/reana.yaml b/reana.yaml
index 50d1bd8..dfb498a 100644
--- a/reana.yaml
+++ b/reana.yaml
@@ -12,7 +12,7 @@ workflow:
   type: serial
   specification:
     steps:
-      - environment: 'python:2.7-slim'
+      - environment: 'python:2.7-slim '
         commands:
           - python "${helloworld}"
               --inputfile "${inputfile}"

or the following helloworld snakemake demo:

diff --git a/workflow/snakemake/Snakefile b/workflow/snakemake/Snakefile
index e7344f2..d2b1187 100644
--- a/workflow/snakemake/Snakefile
+++ b/workflow/snakemake/Snakefile
@@ -26,7 +26,7 @@ rule helloworld:
     output:
         "results/greetings.txt"
     container:
-        "docker://python:2.7-slim"
+        "docker://python:2.7-slim "
     shell:
         "python {input.helloworld} "
         "--inputfile {input.inputfile} "

The validation of such a workflow succeeds:

$ reana-client validate
==> Verifying REANA specification file... reana-demo-helloworld/reana.yaml
  -> SUCCESS: Valid REANA specification file.
==> Verifying REANA specification parameters...
  -> SUCCESS: REANA specification parameters appear valid.
==> Verifying workflow parameters and commands...
  -> SUCCESS: Workflow parameters and commands appear valid.
==> Verifying dangerous workflow operations...
  -> SUCCESS: Workflow operations appear valid.

However, when the workflow is submitted for execution, it stays in the running state indefinitely and never finishes.

Behind the scenes, the reana-run-batch-... pod starts fine, but the reana-run-job-... pod can never be created, because Kubernetes rejects its image name:

$ kubectl describe job reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812
...
Events:
  Type     Reason        Age    From            Message
  ----     ------        ----   ----            -------
  Warning  FailedCreate  7m58s  job-controller  Error creating: Pod "reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812-tl2gj" is invalid: spec.containers[0].image: Invalid value: "python:2.7-slim ": must not have leading or trailing whitespace
  Warning  FailedCreate  7m48s  job-controller  Error creating: Pod "reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812-lwjkz" is invalid: spec.containers[0].image: Invalid value: "python:2.7-slim ": must not have leading or trailing whitespace
  Warning  FailedCreate  7m28s  job-controller  Error creating: Pod "reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812-k2mv4" is invalid: spec.containers[0].image: Invalid value: "python:2.7-slim ": must not have leading or trailing whitespace
  Warning  FailedCreate  6m48s  job-controller  Error creating: Pod "reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812-dqdbh" is invalid: spec.containers[0].image: Invalid value: "python:2.7-slim ": must not have leading or trailing whitespace
  Warning  FailedCreate  5m28s  job-controller  Error creating: Pod "reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812-rwzx2" is invalid: spec.containers[0].image: Invalid value: "python:2.7-slim ": must not have leading or trailing whitespace
  Warning  FailedCreate  2m48s  job-controller  Error creating: Pod "reana-run-job-5ff44876-5778-4ecf-b985-b8ffdce56812-k62l4" is invalid: spec.containers[0].image: Invalid value: "python:2.7-slim ": must not have leading or trailing whitespace

The consequence is that the workflow stays running "forever" in the user's workflow list.

Expected behaviour

Using the onion principle, there are several ways to address this issue:

  • We can improve the validation on the client side so that leading/trailing whitespace in the image specification is caught early and the workflow is not even started. This will solve the individual problem at hand.
  • We can improve job scheduling on the cluster side so that FailedCreate events would be handled properly. This will solve any similar problems we might be having when starting ill-specified jobs.
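
A minimal sketch of what the two layers could look like, purely for illustration. The function names, the `max_failed_creates` threshold and the shape of the `events` list are assumptions made for this example; they are not taken from the existing reana-client or job controller code.

```python
import re


def find_image_whitespace_problems(images):
    """Return one message per image name that contains whitespace.

    Hypothetical client-side check; `images` is a list of the image
    strings collected from the workflow specification.
    """
    problems = []
    for image in images:
        if image != image.strip():
            problems.append(f"Image {image!r} has leading or trailing whitespace.")
        elif re.search(r"\s", image):
            problems.append(f"Image {image!r} contains whitespace.")
    return problems


def has_unrecoverable_create_failure(events, max_failed_creates=3):
    """Decide whether a job should be marked as failed instead of retried.

    Hypothetical cluster-side check; `events` is assumed to be a list of
    dicts with "reason" and "message" keys, mirroring the columns shown
    by `kubectl describe job`.
    """
    failed = [e for e in events if e.get("reason") == "FailedCreate"]
    # A pod spec rejected as invalid will not succeed on a later retry.
    invalid = any("is invalid" in e.get("message", "") for e in failed)
    return invalid or len(failed) >= max_failed_creates
```

With the serial demo above, `find_image_whitespace_problems(["python:2.7-slim "])` would report one problem, so validation could stop before submission. The second function treats an "is invalid" pod rejection as final, whereas other FailedCreate reasons (for example a temporary quota limit) would only count as fatal after several attempts.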
@tiborsimko
Member Author

The core of the issue was fixed. We can check later on whether there are any other similar FailedCreate situations.
