[e2e] Ignore cluster delete failures #2745
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: trasc

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
hack/e2e-common.sh

```diff
@@ -34,7 +34,7 @@ function cluster_cleanup {
   $KIND export logs $ARTIFACTS --name $1 || true
   kubectl describe pods -n kueue-system > $ARTIFACTS/$1-kueue-system-pods.log || true
   kubectl describe pods > $ARTIFACTS/$1-default-pods.log || true
-  $KIND delete cluster --name $1
+  $KIND delete cluster -v=3 --name $1 || echo "Ignoring cluster delete failure."
```
Can you point out another example where we ignore such a failure?
The three lines above.
No strong view, but maybe it makes sense to only merge the `-v=3` part, so that we get more input if this repeats? Otherwise we will not know whether it happened.
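For context, the verbosity-only variant would look something like this minimal sketch (the failure stays fatal; the exact verbosity level is still under discussion in this thread):

```bash
# Sketch: raise kind's log verbosity but still fail the cleanup on error,
# so the flake stays visible while producing more logs to debug with.
$KIND delete cluster -v=3 --name $1
```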
Unfortunately, I doubt that `-v=3` will make much of a difference (based on how I see the implementation in kind).
So I see two paths forward:
1. Only increase the log verbosity now (the level TBD, maybe `v5` is better?), and wait for the flake to repeat to see if we can understand it.
2. Ignore the errors, as in the current version of the PR.

I would be slightly leaning towards 1, as we are not under any pressure (the flake is rare), and it would be nice to understand what is going on. OTOH, I'm OK to accept it as is, if this is @tenzen-y's preference.
The linked kind issue is really old; do we have current examples?
> Cluster delete failure should not be happening.

Yeah, but they do, as shown in the example. I'm not sure about the frequency.

> Cluster deletion is reentrant and will succeed if there is nothing to do.

So we may expect there is a good chance a retry would help? That might be safer, but it is a code complication, so I'm not convinced either. Maybe we just park it and only proceed with ignoring if this repeats, so that we see the frequency? Did you see this before, @trasc or @tenzen-y?
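If the retry path were chosen, a minimal sketch could look like the following (the attempt count and sleep are arbitrary assumptions, not anything agreed on in this thread):

```bash
# Hypothetical retry loop: kind cluster deletion is reentrant, so a
# follow-up attempt should succeed once there is nothing left to delete.
deleted=false
for attempt in 1 2 3; do
  if $KIND delete cluster --name "$1"; then
    deleted=true
    break
  fi
  echo "kind delete cluster failed (attempt ${attempt}), retrying..."
  sleep 5
done
$deleted || echo "Giving up on cluster delete after 3 attempts."
```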
> Kind is just dropping any output of the docker delete command and only prints the exit code.

That's not accurate: in kind, exec errors contain the command output (which is always captured), and it is printed in the error message ...
> > Kind is just dropping any output of the docker delete command and only prints the exit code.
>
> That's not accurate: in kind, exec errors contain the command output (which is always captured), and it is printed in the error message ...

Indeed, I missed that.

> Cluster delete failure should not be happening. Cluster deletion is reentrant and will succeed if there is nothing to do.

It is, more or less, happening in our case, as `cleanup_dind()` is able to delete all the containers:

```
Waiting 30 seconds for pods stopped with terminationGracePeriod:30
Cleaning up after docker
ed68da3fb667
6beb571d417e
bf442dfcffc1
...
```
No
@BenTheElder I think we could make the `DeleteNodes` a two-step op: `docker stop` (this is able to send two signals with a timeout between them) and `docker rm` after that. I can open a PR in kind for this, just let me know.

/cc @tenzen-y
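As a rough illustration of the proposed two-step deletion (the actual change would live in kind's Go code, so this bash is only a stand-in; the container name and timeout below are made-up placeholders):

```bash
# Hypothetical stand-in for a two-step DeleteNodes: `docker stop` sends
# SIGTERM, then SIGKILL after the timeout expires, and `docker rm` runs
# only once the container has actually stopped.
node="kind-worker"   # placeholder: a kind node container name
docker stop --time 30 "$node"
docker rm "$node"
```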
(The flaky-e2e-delete-cluster branch was force-pushed from a95d1b0 to e4bb871.)
/close

For now, I'm open to exploring solutions on another layer, possibly as suggested in this comment, but I'll leave the decision to @BenTheElder. I'm also open to revisiting the idea of ignoring errors in Kueue if this issue becomes more frequent, we rule out that the errors are related to Kueue, and we've exhausted options for resolving it at other layers. We can keep the issue open for further discussion.
@mimowo: Closed this PR.
/reopen
@mimowo: Failed to re-open PR: state cannot be changed. The flaky-e2e-delete-cluster branch was force-pushed or recreated.
/reopen
@trasc: Reopened this PR.
(The flaky-e2e-delete-cluster branch was force-pushed from e4bb871 to 8fcd637.)
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
Ignore the failure of cluster deletion in the e2e test's cleanup.
Which issue(s) this PR fixes:
Fixes #2738
Special notes for your reviewer:
Does this PR introduce a user-facing change?