Summary
When a task within a DAG recursively calls its own template, the parent node's status should reflect the last iteration in the recursive stack (i.e. the result of the last recursive template call should propagate back up to the parent node).
I use a recursive template to retry a specific section of the workflow.
Therefore, it is more informative for the parent's status to match the outcome of the final recursive template call than for the parent to be marked as failed because of an earlier attempt when a later attempt ultimately succeeded.
Using continueOn and failFast has not changed the behavior.
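The difference between the behavior described above and the requested behavior can be sketched in a few lines of Python. This is only an illustrative model: `Attempt`, `current_phase`, and `requested_phase` are hypothetical names, not Argo types or controller code, and the real controller's status aggregation is more involved than "any failed attempt fails the parent".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    """One invocation of the recursive template (hypothetical model, not an Argo type)."""
    phase: str                          # outcome of this attempt's own work
    retry: Optional["Attempt"] = None   # the recursive call it triggered, if any


def current_phase(node: Optional[Attempt]) -> str:
    """Simplified model of the reported behavior: any failed attempt fails the parent."""
    while node is not None:
        if node.phase == "Failed":
            return "Failed"
        node = node.retry
    return "Succeeded"


def requested_phase(node: Attempt) -> str:
    """Requested behavior: the parent takes the phase of the last attempt in the chain."""
    while node.retry is not None:
        node = node.retry
    return node.phase


# Two OOM failures followed by a successful third attempt.
chain = Attempt("Failed", Attempt("Failed", Attempt("Succeeded")))
print(current_phase(chain))
print(requested_phase(chain))
```

Under this model the two functions disagree only when an early attempt fails and a later one succeeds, which is the case this request is about. If the final attempt also fails, both report a failure.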
Use Cases
I am currently using a recursive template call to dynamically increase the memory allocated to a template whenever it detects an exit code of 137 (OOM error).
Implementing this change means that if the third attempt with increased memory succeeds, the parent node will be marked as successful despite OOM failures in the previous two retry attempts.
NOTE: The use of retryStrategy currently does not support dynamic memory increase with each retry while simultaneously retrying for transient errors.
Message from the maintainers:
Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.
Example
Here is an example workflow that handles OOM errors by increasing memory allocation using recursive templates.
This is using Argo v3.6.4.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: oom-test-recursive
  namespace: argo-workflows
spec:
  entrypoint: start
  serviceAccountName: s3-argo
  templates:
    - name: run-sb-stage
      retryStrategy:
        limit: "10"
        retryPolicy: "Always"
        expression: "lastRetry.status == 'Error'"
        backoff:
          duration: "1"
          cap: "1m"
          factor: 2
      inputs:
        parameters:
          - name: memory
          - name: cpu
            value: 4
      container:
        image: python:3.9
        command: ["python", "-c"]
        args: ["s = ' ' * (1024 * 1024 * 1024 * 2)"] # Attempts to make a string of size 2 GB
      podSpecPatch: '{"containers":[{"name":"main", "resources":{"limits":{"cpu": "{{inputs.parameters.cpu}}", "memory": "{{inputs.parameters.memory}}Mi"}}}]}'
    - name: single-band-template
      inputs:
        parameters:
          - name: multiplier
      dag:
        failFast: false
        tasks:
          - name: run-stage
            template: run-sb-stage
            arguments:
              parameters:
                - name: memory
                  value: "{{= 1000 * int(inputs.parameters.multiplier)}}" # Start with 1 GB (1000 MB)
          # This template is only run if the run-stage template threw an OOM error
          # This triggers the recursion
          - name: oom-retry
            template: single-band-template
            arguments:
              parameters:
                - name: multiplier
                  value: "{{= int(inputs.parameters.multiplier) + 1}}"
            depends: "run-stage.Failed"
            when: "\"{{tasks.run-stage.exitCode}}\" == '137' && {{inputs.parameters.multiplier}} < 4"
          # This template is only called if the run-stage template was successful
          - name: process-next-stage
            template: hello
            depends: "run-stage.Succeeded"
    - name: start
      steps:
        - - name: call-single-band
            template: single-band-template
            arguments:
              parameters:
                - name: multiplier
                  value: 1
    # Placeholder template to simulate another template being called after the run-stage template
    - name: hello
      container:
        image: alpine:3.6
        command: [sh, -c]
        args: ["echo \"Hello World\""]
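To make the recursion in the manifest above easier to follow, here is a small Python sketch of the memory limits it would request if every attempt were OOM-killed. It only mirrors the two expressions in the manifest (`1000 * multiplier` for the limit and `multiplier < 4` as the recursion guard). The function name `memory_schedule` and its parameters are hypothetical, introduced purely for illustration.

```python
def memory_schedule(start_multiplier: int = 1,
                    recursion_guard: int = 4,
                    step_mi: int = 1000) -> list[int]:
    """Memory limits (in Mi) requested by successive attempts,
    assuming every attempt exits with code 137."""
    limits = []
    multiplier = start_multiplier
    while True:
        # run-stage: memory = 1000 * multiplier
        limits.append(step_mi * multiplier)
        # oom-retry only recurses while multiplier < 4
        if not multiplier < recursion_guard:
            break
        multiplier += 1
    return limits


print(memory_schedule())  # [1000, 2000, 3000, 4000]
```

The test payload allocates a 2 GiB string (2048 MiB), so the 1000Mi and 2000Mi attempts are expected to be OOM-killed, while the 3000Mi attempt should fit, assuming the interpreter's own overhead stays well under the remaining headroom.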
Here is the result of the above workflow:
Ideally, the parent node should have a status of Succeeded, since the recursive child template succeeded in the end.