Question (from a Slack conversation): Hello there, we have an issue and we were wondering if there is a way to work around it. Here is what we got:
We have a namespace with a 3 GB memory limit and a default task memory request of 1 GB.
We ran 3 workflows, each starting with a first task where we got:
```
[1/1] currentAttempt done. Last Error: USER::task execution timeout [5m0s] expired error. Unable to attach or mount volumes: unmounted volumes=[onxg542gnrqwwzk6], unattached volumes=[kube-api-access-8cmfw onxg542gnrqwwzk6 aws-iam-token]: timed out waiting for the condition MountVolume.SetUp failed for volume "onxg542gnrqwwzk6" : references non-existent secret key: password
```
The reason is that the pod was trying to mount a secret volume that didn't exist (there was a typo in the secret name).
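For context, here is roughly how a secret is requested in flytekit (a minimal sketch; the group and key names are hypothetical stand-ins). A typo in either name makes the pod reference a non-existent secret key, which produces exactly the mount failure above:

```python
from flytekit import Secret, current_context, task

# Hypothetical group/key names for illustration. If the key requested here
# does not match a key in the backing K8s Secret (e.g. "pasword" instead
# of "password"), the pod references a non-existent secret key and the
# volume mount times out, as in the log above.
@task(secret_requests=[Secret(group="db-creds", key="password")])
def fetch_data() -> str:
    # Read the mounted secret at runtime.
    return current_context().secrets.get("db-creds", "password")
```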
The problem is that the K8s deployments for those tasks were still present after those 5 minutes, and remained for hours, holding on to the 3 GB and making the other tasks wait.
Ultimately, after those deployments were "removed", the other tasks were picked up.
I would expect Flyte to terminate the deployments right after the first error, no? Freeing up the resources for other tasks?
Any clue?
Response: Flyte does not terminate pods immediately after failure by default; the default is to keep the state around to help with debugging and log retrieval. In most cases the pod will not consume resources unless it is in a back-off error state or similar, and K8s does not do a great job of surfacing these errors. The timeout itself is initiated by Flyte. This PR changes the default failure-handling behavior to abort, which deletes the K8s Pod, rather than finalize, which requires the delete-resource-on-finalize flag to be set in order to delete the resource. So now no configuration changes are necessary to make sure resources are cleaned up when there is a failure. When a K8s Pod execution is successful, we delete the Pod rather than let the K8s garbage collector clean it up.
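On the "timeout is initiated by Flyte" point: the [5m0s] in the log is a task-level timeout enforced by Flyte, not by K8s. A minimal sketch of setting it per task in flytekit (the value and task name are illustrative):

```python
from datetime import timedelta

from flytekit import task

# The timeout window is enforced by Flyte: if the attempt is not done
# within it (including time spent stuck on a failed volume mount), the
# attempt fails with the "task execution timeout ... expired" error above.
@task(timeout=timedelta(minutes=5))
def first_task() -> str:
    return "done"
```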