Question (from a Slack conversation): Hello there, we have an issue and we were wondering if there is a way to work around it. Here is what we got:
We have a namespace with a 3 GB memory limit and a default task memory request of 1 GB.
We ran 3 workflows, each starting with a first task where we got:
```
[1/1] currentAttempt done. Last Error: USER::task execution timeout [5m0s] expired error. Unable to attach or mount volumes: unmounted volumes=[onxg542gnrqwwzk6], unattached volumes=[kube-api-access-8cmfw onxg542gnrqwwzk6 aws-iam-token]: timed out waiting for the condition MountVolume.SetUp failed for volume "onxg542gnrqwwzk6" : references non-existent secret key: password
```
The reason is that the pod was trying to mount a secret volume that didn't exist (there was a typo in the secret name).
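For context, here is roughly how a secret is requested in flytekit (a minimal sketch; the group and key names are hypothetical stand-ins). A typo in either name makes the pod reference a non-existent secret key, which produces exactly the mount failure above:

```python
from flytekit import Secret, current_context, task

# Hypothetical group/key names for illustration. If the key requested here
# does not match a key in the backing K8s Secret (e.g. "pasword" instead
# of "password"), the pod references a non-existent secret key and the
# volume mount times out, as in the log above.
@task(secret_requests=[Secret(group="db-creds", key="password")])
def fetch_data() -> str:
    # Read the mounted secret at runtime.
    return current_context().secrets.get("db-creds", "password")
```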
The problem is that the K8s deployments for those tasks were still present after those 5 minutes, and remained for hours, holding on to the 3 GB and making the other tasks wait.
Ultimately, after those deployments were "removed", the other tasks were picked up.
I would expect Flyte to terminate the deployments right after the first error, no? Freeing up the resources for other tasks?
Any clue?
Response: Flyte does not terminate pods immediately after failure by default; the default is to keep the state around to help with debugging and log retrieval. In most cases the pod will not consume resources unless it is in a back-off error state or similar, and K8s does not do a great job of surfacing these errors. The timeout itself is initiated by Flyte. This PR changes the default failure-handling behavior to abort, which deletes the K8s Pod, rather than finalize, which requires the delete-resource-on-finalize flag to be set in order to delete the resource. So now no configuration changes are necessary to make sure resources are cleaned up when there is a failure. When a K8s Pod execution is successful, we delete the Pod rather than let the K8s garbage collector clean it up.
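On the "timeout is initiated by Flyte" point: the [5m0s] in the log is a task-level timeout enforced by Flyte, not by K8s. A minimal sketch of setting it per task in flytekit (the value and task name are illustrative):

```python
from datetime import timedelta

from flytekit import task

# The timeout window is enforced by Flyte: if the attempt is not done
# within it (including time spent stuck on a failed volume mount), the
# attempt fails with the "task execution timeout ... expired" error above.
@task(timeout=timedelta(minutes=5))
def first_task() -> str:
    return "done"
```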