Parent Node Should Be Marked as Success With Recursive Template Success #14237

Open
kevinc3n opened this issue Feb 28, 2025 · 0 comments
Labels
type/feature Feature request

Comments

kevinc3n commented Feb 28, 2025

Summary

  • When a task within a DAG recursively calls its own template, the parent node’s status should reflect the last iteration of the recursion (i.e., the result of the last recursive template call should propagate back up to the parent node).
  • I use the recursive template to retry a specific section of the workflow.
  • Therefore, it is more informative for the parent’s status to match the outcome of the final recursive call than to be marked as failed because of an earlier attempt, when a later retry ultimately succeeded.
  • Setting continueOn and failFast has not changed this behavior.
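
The requested rule can be stated as a small model. This is only a sketch for discussion, not Argo's controller code: the `Node` class and both functions are hypothetical names, and `observed_status` encodes the behavior as described in this issue rather than anything read from the Argo source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One invocation of the recursive template (hypothetical model, not an Argo type)."""
    own_status: str                 # outcome of this iteration's own work: "Succeeded" or "Failed"
    retry: Optional["Node"] = None  # the recursive call made after a failure, if any


def observed_status(node: Node) -> str:
    """Behavior reported in this issue: a failed attempt marks its parent as failed."""
    return node.own_status


def requested_status(node: Node) -> str:
    """Behavior requested: the parent takes the outcome of the last call in the chain."""
    while node.retry is not None:
        node = node.retry
    return node.own_status


# Two failed attempts followed by a successful third one.
chain = Node("Failed", Node("Failed", Node("Succeeded")))
print(observed_status(chain), requested_status(chain))
```

With this chain, the two functions disagree at the top-level node, which is the gap this request is about.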

Use Cases

  • I am currently using a recursive template call to dynamically increase the memory allocated to a template whenever it detects an exit code of 137 (OOM error).
  • Implementing this change means that if the third attempt with increased memory succeeds, the parent node will be marked as successful despite OOM failures in the previous two retry attempts.
  • NOTE: retryStrategy does not currently support increasing memory dynamically on each retry while also retrying on transient errors.

Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.


Example

Here is an example workflow that handles OOM errors by increasing the memory allocation through recursive template calls.
This example uses Argo v3.6.4.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: oom-test-recursive
  namespace: argo-workflows
spec:
  entrypoint: start
  serviceAccountName: s3-argo
  templates:
    - name: run-sb-stage
      retryStrategy:
        limit: "10"
        retryPolicy: "Always"
        expression: "lastRetry.status == 'Error'"
        backoff:
          duration: "1"
          cap: "1m"
          factor: 2
      inputs:
        parameters:
          - name: memory
          - name: cpu
            value: 4
      container:
        image: python:3.9
        command: ["python", "-c"]
        args: ["s = ' ' * (1024 * 1024 * 1024 * 2)"] # Tries to allocate a 2 GiB string
      podSpecPatch: '{"containers":[{"name":"main", "resources":{"limits":{"cpu": "{{inputs.parameters.cpu}}", "memory": "{{inputs.parameters.memory}}Mi"}}}]}'

    - name: single-band-template
      inputs:
        parameters:
          - name: multiplier
      dag:
        failFast: false
        tasks:
          - name: run-stage
            template: run-sb-stage
            arguments:
              parameters:
                - name: memory
                  value: "{{= 1000 * int(inputs.parameters.multiplier)}}" # 1000 Mi per multiplier step, so the first attempt gets 1000 Mi
          # This task runs only if run-stage failed with exit code 137 (OOM);
          # it calls single-band-template again, which triggers the recursion.
          - name: oom-retry
            template: single-band-template
            arguments:
              parameters:
                - name: multiplier
                  value: "{{= int(inputs.parameters.multiplier) + 1}}"
            depends: "run-stage.Failed"
            when: "\"{{tasks.run-stage.exitCode}}\" == '137' && {{inputs.parameters.multiplier}} < 4"
          # This task runs only if run-stage succeeded
          - name: process-next-stage
            template: hello
            depends: "run-stage.Succeeded"

    - name: start
      steps:
        - - name: call-single-band
            template: single-band-template
            arguments:
              parameters:
                - name: multiplier
                  value: 1
    # Placeholder template to simulate another template being called after the run-stage template
    - name: hello
      container:
        image: alpine:3.6
        command: [sh, -c]
        args: ["echo \"Hello World\""]
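
To make the recursion in the manifest above easier to follow, here is a rough trace of the memory limits it would request. It is a sketch under stated assumptions: `memory_schedule` and `required_mib` are hypothetical names, the pod is assumed to be OOM-killed exactly when its limit is below the amount it needs, and the 2100 Mi figure is an illustrative value slightly above the 2 GiB string, not a measurement.

```python
def memory_schedule(required_mib: int, start: int = 1, max_multiplier: int = 4):
    """Return one (multiplier, memory_mib, outcome) tuple per attempt of run-stage."""
    attempts = []
    multiplier = start
    while True:
        # Mirrors "{{= 1000 * int(inputs.parameters.multiplier)}}" in single-band-template.
        memory_mib = 1000 * multiplier
        # Assumption: the container exits with 137 if and only if the limit is too small.
        succeeded = memory_mib >= required_mib
        attempts.append((multiplier, memory_mib, "Succeeded" if succeeded else "OOMKilled (137)"))
        # Mirrors the `when` guard on oom-retry: recurse only on exit code 137
        # and only while multiplier < 4.
        if succeeded or multiplier >= max_multiplier:
            return attempts
        multiplier += 1


# Hypothetical requirement of 2100 Mi for the 2 GiB allocation plus overhead.
for attempt in memory_schedule(2100):
    print(attempt)
```

Under these assumptions the first two attempts are OOM-killed and the third succeeds, which is the scenario described under Use Cases. If the requirement exceeds 4000 Mi, the guard stops the recursion after the fourth attempt.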

Here is the result of the above workflow:

[Image: screenshot of the workflow result]
Ideally, the parent node should have a status of Success since the child recursive template succeeded in the end.

kevinc3n added the type/feature (Feature request) label on Feb 28, 2025