Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflows that failed before upgrade to 3.5.6 fail to retry #13003

Closed
3 of 4 tasks
sarialalem1 opened this issue May 2, 2024 · 10 comments · Fixed by #13004
Closed
3 of 4 tasks

Workflows that failed before upgrade to 3.5.6 fail to retry #13003

sarialalem1 opened this issue May 2, 2024 · 10 comments · Fixed by #13004
Assignees
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/retryStrategy Template-level retryStrategy P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@sarialalem1
Copy link

sarialalem1 commented May 2, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

  1. We were on 3.5.5

  2. Some workflows failed

  3. Upgraded to 3.5.6

  4. Retried some of the workflows
    Result:

    • The failed tasks within the workflows disappeared, keeping only the successful ones
    • The workflows got stuck in a running state
    • Still The workflows are shown in the failed list
    • Unable to retry again, because it's running
    • Unable to Stop

Reproducible on any workflow

Version

v3.5.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

any workflow reproduces it

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@Joibel
Copy link
Member

Joibel commented May 2, 2024

I tested this purely on 3.5.6, and it fails if you attempt to retry this deliberately broken dag diamond.

apiVersion: argoproj.io/v1alpha1                                                                                                                                                                                                                
kind: Workflow   
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]

A link to the slack discussion: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1714641906410049

@Joibel Joibel added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority labels May 2, 2024
@sarialalem1
Copy link
Author

Another piece of info:
After rolling back to 3.5.5, retrying runs the workflow propperly, but by the end it gets stuck without changing status to Finished

@Joibel
Copy link
Member

Joibel commented May 2, 2024

For me, the workflow above that reproduces the issue on 3.5.6 doesn't reproduce it on 3.5.5.
It may be that retrying workflows which have been touched by 3.5.6 is part of the problem, so recreate them fresh instead.

@Joibel
Copy link
Member

Joibel commented May 2, 2024

Ignore that last comment, it doesn't go wrong for me in a really basic workflows installation at all. 3.5.6 will retry happily there. I'll try and determine what the difference is with our production 3.5.6 and why it only fails there.

@Joibel
Copy link
Member

Joibel commented May 2, 2024

Our production has a workflowDefaults which enables retryStrategy for everything.
The following reproduces the error - note the retryStrategy at the top level template. Remove that or place retries elsewhere and it will work.

metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    retryStrategy:
      limit: 2
      retryPolicy: OnError
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: eacho
        arguments:
          parameters: [{name: message, value: D}]

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: eacho
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [eacho, "{{inputs.parameters.message}}"]

This works correctly with 3.5.5

@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) area/retryStrategy Template-level retryStrategy labels May 2, 2024
@Joibel
Copy link
Member

Joibel commented May 2, 2024

This was broken by #12817.

@agilgur5
Copy link
Member

agilgur5 commented May 2, 2024

I feel like this has got to be related to the root cause I mentioned in #12817 (review). Although the PR itself did not touch (automated) retry nodes. The manual retry logic needs a refactor in general.

We should also add all these failing test cases

@shuangkun
Copy link
Member

I do think the retry node needs to be skipped when checking if the descendants have success nodes since it is virtual.

@agilgur5
Copy link
Member

agilgur5 commented May 2, 2024

The following reproduces the error - note the retryStrategy at the top level template.

To clarify, this will happen even when no retry was needed, correct? or does it only occur if a retry is triggered?
As in, the existence of a retryStrategy (with any configuration) on a template invocator (i.e. DAG or steps) causes it, not whether it were actually retried or not based on its retryPolicy or expression.

@Joibel
Copy link
Member

Joibel commented May 2, 2024

The following reproduces the error - note the retryStrategy at the top level template.

To clarify, this will happen even when no retry was needed, correct? or does it only occur if a retry is triggered?
As in, the existence of a retryStrategy (with any configuration) on a template invocator (i.e. DAG or steps) causes it, not whether it were actually retried or not based on its retryPolicy or expression.

This requires both a retryStrategy and a manual retry attempt, but the retryStrategy does not need to have been used, we just need the retry virtual node to be present. I don't believe the actual retryStrategy matters at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/retryStrategy Template-level retryStrategy P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug type/regression Regression from previous behavior (a specific type of bug)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants