-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Fix e2e test for dockermachinePool #11440
base: main
Are you sure you want to change the base?
Conversation
This PR is currently missing an area label, which is used to identify the modified component when generating release notes. Area labels can be added by org members by writing Please see the labels list for possible areas. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @serngawy. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test /assign @AndiDog |
// Providers should iterate through their infrastructure instances and ensure that each instance has a corresponding InfraMachine. | ||
for _, machine := range externalMachines { | ||
if existingMachine, ok := dockerMachineMap[machine.Name()]; ok { | ||
log.V(2).Info("Patching existing DockerMachine", "DockerMachine", klog.KObj(&existingMachine)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we losing this entire branch?
I don't follow how this change solves the problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the creation of the DockerMachine with the Container creation here to avoid the loop for creating DockerMachine CR based on the previously created container.
Based on my investigation ex logs blow, for some reason while the DockerMachinePool cleaning all dockerMachine a delay happen for deleting the container which make the DockerMachine get re-created here at same time the DockerMachinePool is been deleted
logs ex; the dockerMachine is failed to patch as it is deleted , next log is the dockerMachine created (I think dockerMachinePool go to delete same time the patch is happening to create missing dockerMachine) and then we stuck waiting for the machine get created but never happen as the dockermachinePool and machinePool are gone.
I1119 02:08:36.160435 1 dockermachinepool_controller_phases.go:147] "Patching existing DockerMachine" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc" namespace="self-hosted-jqke01" name="self-hosted-r77ad9-mp-0-8xsxc" reconcileID="904d6ace-3baa-48e1-b000-97ed711f9802" MachinePool="self-hosted-r77ad9-mp-0-s8xbz" Cluster="self-hosted-jqke01/self-hosted-r77ad9" DockerMachine="self-hosted-jqke01/worker-1tpa4a"
E1119 02:08:36.181156 1 controller.go:316] "Reconciler error" err="failed to patch DockerMachine self-hosted-jqke01/worker-1tpa4a: dockermachines.infrastructure.cluster.x-k8s.io \"worker-1tpa4a\" not found" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-jqke01/worker-1tpa4a" namespace="self-hosted-jqke01" name="worker-1tpa4a" reconcileID="f3f06dcd-f11a-4c3f-ae30-8bbef471ca00"
E1119 02:08:36.183408 1 controller.go:316] "Reconciler error" err="failed to update DockerMachine \"self-hosted-jqke01/worker-1tpa4a\": failed to apply DockerMachine self-hosted-jqke01/worker-1tpa4a: Operation cannot be fulfilled on dockermachines.infrastructure.cluster.x-k8s.io \"worker-1tpa4a\": uid mismatch: the provided object specified uid fb561549-45d7-4a09-af27-6931b56bd020, and no existing object was found" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc" namespace="self-hosted-jqke01" name="self-hosted-r77ad9-mp-0-8xsxc" reconcileID="904d6ace-3baa-48e1-b000-97ed711f9802"
I1119 02:08:36.186594 1 dockermachinepool_controller_phases.go:158] "Creating a new DockerMachine for Docker container" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc" namespace="self-hosted-jqke01" name="self-hosted-r77ad9-mp-0-8xsxc" reconcileID="9cc3622a-5a30-41c7-8b0a-6f0b212d0bdb" MachinePool="self-hosted-r77ad9-mp-0-s8xbz" Cluster="self-hosted-jqke01/self-hosted-r77ad9" container="worker-1tpa4a"
I1119 02:08:36.253031 1 dockermachine_controller.go:104] "Waiting for Machine Controller to set OwnerRef on DockerMachine" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-jqke01/worker-1tpa4a" namespace="self-hosted-jqke01" name="worker-1tpa4a" reconcileID="9799d2ef-c1a7-4924-b347-875155993995" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason while the DockerMachinePool cleaning all dockerMachine a delay happen for deleting the container which make the DockerMachine get re-created here at same time the DockerMachinePool is been deleted
have you considered to prevent creation of new machines in this func when the DockerMachinePool has a deletion timestamp? (it should probably be an if around L155-L161)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the code logic shouldn't allow this to happen (it should go through the delete reconcile). In any case, tying the creation of dockerMachine after the container creation is better implementation to avoid such random execution to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the code logic shouldn't allow this to happen (it should go through the delete reconcile).
This is confusing me a little bit (and my lack of knowledge in MP doesn't help)
If I got it right, we are trying to fix a race condition that happens when deleting the MP.
But, to fix this error on deletion, we are shuffling machine creation from reconcileDockerMachines to reconcileDockerContainers, and both are not called on reconcile delete 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for confusion, this changes fix the race condition. here during reconcile the dockerMachines if the container has no dockerMachine the dockerMachine get created, while the delete could happen at sametime. Before this changes; all containers get created first then all dockerMachines get created (in between race condition could happen). So this changes create every container with its dockerMachine at the sametime to avoid race condition. Hope that make it clear :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bearing with me with the explanation.
Ok, we are probably getting to to the point that confuses me.
When I review the PR, I start from the assumption that, if we consider a single MP, controller runtime ensures us that only one concurrent reconcile can happen anytime.
That's means, that if we are running the code here we are in reconcile normal, and reconcileDelete won't start until the operation completes or fails.
This rules out most of the race conditions. Also, assuming normal completes, the order of operations doesn't matter much, because at the end both dockerMachine and container should exist.
However, what might happen is that it takes some time for the cache to see the newly created dockerMachines, so it will seems it is missing when reconcileDelete starts, even if it it actually exists on the API server (but this requires a different fix).
Does this match with what are you observing?
To check this you can look e.g. at reconcile IDs in logs + look at errors when creating machines
@@ -107,15 +113,13 @@ func createDockerContainer(ctx context.Context, name string, cluster *clusterv1. | |||
|
|||
// reconcileDockerMachines creates and deletes DockerMachines to match the MachinePool's desired number of replicas and infrastructure spec. | |||
// It is responsible for | |||
// - Ensuring each Docker container has an associated DockerMachine by creating one if it doesn't already exist. | |||
// - Ensuring that deletion for Docker container happens by calling delete on the associated Machine so that the node is cordoned/drained and the infrastructure is cleaned up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delete call seems unchanged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the delete cause if the container get deleted out side of capi. It has to reflect what is exist in the infrastructure. Also it handle the replica scale down use-case if something wrong happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we probably shouldn't drop the comment mentioning the deletion?
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
return ctrl.Result{}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool) | ||
if !dockerMachinePool.DeletionTimestamp.IsZero() { | ||
// perform dockerMachinePool delete and requeue in 30s giving time for containers to be deleted. | ||
return ctrl.Result{RequeueAfter: 30 * time.Second}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CR will complain if we return a non zero result and an error, so we should split this into:
return ctrl.Result{RequeueAfter: 30 * time.Second}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool) | |
if err:= r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool); err != nil { | |
return ctrl.Result{}, err | |
} | |
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil |
Also, for my better understanding, could you kindly expand a little bit on why do you want to always reconcile every 30s instead of relying on watches/events or error backoff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, The e2e-Blocking job was failing as the docker cluster delete-wait-time is 3m here and it takes longer than this. Testing it locally, it takes up to 20min to reconcile after deleting the container to let the CR be deleted. So to avoid increasing the wait-time force reconcile and fail fast if there is an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for details. Considering we usually resync every 10m, 20m is something unusual.
If the 20m are due to race between creating machines and deleting them, and with this PR we are going to prevent this race to happen, I would prefer to avoid the requeue given that we have a proper fix in place.
If instead the 20m are due to some unknown reason, then we should probably try to root cause (requeing will probably make the test pass, but not fix the underlying issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no race condition, this how the code logic work check here in the reconcileDelete, We list the dockerMachines delete them and return (I believe to give time for containers & machines) then next reconcile if there is no dockerMachine we remove the dockerMachinePool finalizer. So as you said if we rysync in 10m and in the docker.yaml->wait-delete-cluster we wait 3m for delete so its normal to fail. Either we increase the docker.yaml->wait-delete-cluster to 15-20m (long time to fail) OR can we set the global resync to 1m during the e2e test ?
// Providers should iterate through their infrastructure instances and ensure that each instance has a corresponding InfraMachine. | ||
for _, machine := range externalMachines { | ||
if existingMachine, ok := dockerMachineMap[machine.Name()]; ok { | ||
log.V(2).Info("Patching existing DockerMachine", "DockerMachine", klog.KObj(&existingMachine)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason while the DockerMachinePool cleaning all dockerMachine a delay happen for deleting the container which make the DockerMachine get re-created here at same time the DockerMachinePool is been deleted
have you considered to prevent creation of new machines in this func when the DockerMachinePool has a deletion timestamp? (it should probably be an if around L155-L161)
Signed-off-by: serngawy <[email protected]>
@fabriziopandini @sbueringer , Do you have any other comments ? can we merge this ? |
What this PR does / why we need it:
This PR fix the missing return error for non exist machine and add more logs
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #11162