🐛 Fix e2e test for dockermachinePool #11440

serngawy · 2024-11-18T22:51:25Z

What this PR does / why we need it:
This PR fix the missing return error for non exist machine and add more logs

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11162

k8s-ci-robot · 2024-11-18T22:51:34Z

This PR is currently missing an area label, which is used to identify the modified component when generating release notes.

Area labels can be added by org members by writing /area ${COMPONENT} in a comment

Please see the labels list for possible areas.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2024-11-18T22:51:35Z

Hi @serngawy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

sbueringer · 2024-11-19T07:42:59Z

/ok-to-test

/assign @AndiDog
/assign @fabriziopandini (for the parts affecting "regular non-MP CAPD"

sbueringer · 2024-11-19T07:47:07Z

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go

-	// Providers should iterate through their infrastructure instances and ensure that each instance has a corresponding InfraMachine.
-	for _, machine := range externalMachines {
-		if existingMachine, ok := dockerMachineMap[machine.Name()]; ok {
-			log.V(2).Info("Patching existing DockerMachine", "DockerMachine", klog.KObj(&existingMachine))


Are we losing this entire branch?

I don't follow how this change solves the problem

I moved the creation of the DockerMachine with the Container creation here to avoid the loop for creating DockerMachine CR based on the previously created container.
Based on my investigation ex logs blow, for some reason while the DockerMachinePool cleaning all dockerMachine a delay happen for deleting the container which make the DockerMachine get re-created here at same time the DockerMachinePool is been deleted

logs ex; the dockerMachine is failed to patch as it is deleted , next log is the dockerMachine created (I think dockerMachinePool go to delete same time the patch is happening to create missing dockerMachine) and then we stuck waiting for the machine get created but never happen as the dockermachinePool and machinePool are gone.

I1119 02:08:36.160435 1 dockermachinepool_controller_phases.go:147] "Patching existing DockerMachine" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc" namespace="self-hosted-jqke01" name="self-hosted-r77ad9-mp-0-8xsxc" reconcileID="904d6ace-3baa-48e1-b000-97ed711f9802" MachinePool="self-hosted-r77ad9-mp-0-s8xbz" Cluster="self-hosted-jqke01/self-hosted-r77ad9" DockerMachine="self-hosted-jqke01/worker-1tpa4a" E1119 02:08:36.181156 1 controller.go:316] "Reconciler error" err="failed to patch DockerMachine self-hosted-jqke01/worker-1tpa4a: dockermachines.infrastructure.cluster.x-k8s.io \"worker-1tpa4a\" not found" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-jqke01/worker-1tpa4a" namespace="self-hosted-jqke01" name="worker-1tpa4a" reconcileID="f3f06dcd-f11a-4c3f-ae30-8bbef471ca00" E1119 02:08:36.183408 1 controller.go:316] "Reconciler error" err="failed to update DockerMachine \"self-hosted-jqke01/worker-1tpa4a\": failed to apply DockerMachine self-hosted-jqke01/worker-1tpa4a: Operation cannot be fulfilled on dockermachines.infrastructure.cluster.x-k8s.io \"worker-1tpa4a\": uid mismatch: the provided object specified uid fb561549-45d7-4a09-af27-6931b56bd020, and no existing object was found" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc" namespace="self-hosted-jqke01" name="self-hosted-r77ad9-mp-0-8xsxc" reconcileID="904d6ace-3baa-48e1-b000-97ed711f9802" I1119 02:08:36.186594 1 dockermachinepool_controller_phases.go:158] "Creating a new DockerMachine for Docker container" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc" namespace="self-hosted-jqke01" name="self-hosted-r77ad9-mp-0-8xsxc" reconcileID="9cc3622a-5a30-41c7-8b0a-6f0b212d0bdb" MachinePool="self-hosted-r77ad9-mp-0-s8xbz" Cluster="self-hosted-jqke01/self-hosted-r77ad9" container="worker-1tpa4a" I1119 02:08:36.253031 1 dockermachine_controller.go:104] "Waiting for Machine Controller to set OwnerRef on DockerMachine" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-jqke01/worker-1tpa4a" namespace="self-hosted-jqke01" name="worker-1tpa4a" reconcileID="9799d2ef-c1a7-4924-b347-875155993995" DockerMachinePool="self-hosted-jqke01/self-hosted-r77ad9-mp-0-8xsxc"

for some reason while the DockerMachinePool cleaning all dockerMachine a delay happen for deleting the container which make the DockerMachine get re-created here at same time the DockerMachinePool is been deleted

have you considered to prevent creation of new machines in this func when the DockerMachinePool has a deletion timestamp? (it should probably be an if around L155-L161)

Well, the code logic shouldn't allow this to happen (it should go through the delete reconcile). In any case, tying the creation of dockerMachine after the container creation is better implementation to avoid such random execution to happen.

Well, the code logic shouldn't allow this to happen (it should go through the delete reconcile).

This is confusing me a little bit (and my lack of knowledge in MP doesn't help)
If I got it right, we are trying to fix a race condition that happens when deleting the MP.
But, to fix this error on deletion, we are shuffling machine creation from reconcileDockerMachines to reconcileDockerContainers, and both are not called on reconcile delete 🤔

I'm too new to CAPD for having a strong opinion, but the concern I have is that now creation only happens if the controller creates a Docker container. If one is created in another way, for instance manually, it won't be reconciled into a DockerMachine. Not sure if that's a case to consider at all. Just a general slightly bad feeling after seeing hundreds of reconciliation code bugs.

If we're fine about this thread, the rest = LGTM.

It's strange to me as well, why do we even create dockerMachines ? as far as I understand we use machinePool to not manage individual machines. If there is a special handling for the machinePool machines then we create machienPoolMachine to perform this special handling (not the case for dockerMachienPool and dockerMachine).
I'm might be over thinking but this is another discussion over what this PR doing :)

why do we even create dockerMachines ?

You can find context in https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220209-machinepool-machines.md

the concern I have is that now creation only happens if the controller creates a Docker container. If one is created in another way, for instance manually, it won't be reconciled into a DockerMachine.

Considering this is a test provider, I think the real question is if the current implementation is enough to validate MP implementation in core CAPI (In other words I think it is ok if it isn't perfect, but it should serve to a goal).
I assume the answer is yes, given that the current implementation has been contributed by the authors of the proposal linked above

I'm following the patch func here but not sure how the cache is delayed

it goes down into cache implementation in controller runtime, informers etc, rif https://github.com/kubernetes-sigs/controller-runtime/tree/v0.19.3/pkg/cache

In order to get this moving, what I suggest is

To drop this change set (both the change of operation order in reconcile normal, both the requeue after.

To add a loop after DockerMachine create forcing the controller to wait for the cache to be updated (*).

To test if this solve the issue.

(*) Change should be implemented here, and an this is an example of the wait loop we should implement (replace MS with DockerMachines, fixup error management)

Thanks @fabriziopandini for referring to cache implementation. However, what are you referring to is not helping, you can reproduce it by using this repo here just build e2e and then run $ make test-e2e SKIP_RESOURCE_CLEANUP=true GINKGO_FOCUS="self-host" the test may pass however check the DockerMachine CR in the created kind containers. You will find dockerMachine CR is created while the DockerMachinePool is deleted and those dockerMachine stuck waiting for its corresponding machine CR get created.
In any case; what this PR is trying to do is attaching the creation of the container with the dockerMachine which is better implementation from old one to avoid race conditions from happening.
Can we focus on this PR changes, Is creating the dockerMachine with the container raise an alarm ?

Thanks for testing my theory

You will find dockerMachine CR is created while the DockerMachinePool is deleted and those dockerMachine stuck waiting for its corresponding machine CR get created.

I don't have yet a good explanation about why this happen, but I would prefer to take time to investigate this properly because CR should prevent those race conditions to happen, and if not, we should root cause and get a fix there.

I will try to reproduce locally, but I can't guarantee when I will get to it with the upcoming release and usual EOY rush.

cc @mboersma @AndiDog who might be interested in work this area as well

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go

fabriziopandini · 2024-11-21T21:41:26Z

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller.go

-		return ctrl.Result{}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool)
+	if !dockerMachinePool.DeletionTimestamp.IsZero() {
+		// perform dockerMachinePool delete and requeue in 30s giving time for containers to be deleted.
+		return ctrl.Result{RequeueAfter: 30 * time.Second}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool)


CR will complain if we return a non zero result and an error, so we should split this into:

Suggested change

return ctrl.Result{RequeueAfter: 30 * time.Second}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool)

if err:= r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool); err != nil {

return ctrl.Result{}, err

}

return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

Also, for my better understanding, could you kindly expand a little bit on why do you want to always reconcile every 30s instead of relying on watches/events or error backoff?

Okay, The e2e-Blocking job was failing as the docker cluster delete-wait-time is 3m here and it takes longer than this. Testing it locally, it takes up to 20min to reconcile after deleting the container to let the CR be deleted. So to avoid increasing the wait-time force reconcile and fail fast if there is an error.

Thanks for details. Considering we usually resync every 10m, 20m is something unusual.

If the 20m are due to race between creating machines and deleting them, and with this PR we are going to prevent this race to happen, I would prefer to avoid the requeue given that we have a proper fix in place.

If instead the 20m are due to some unknown reason, then we should probably try to root cause (requeing will probably make the test pass, but not fix the underlying issue)

There is no race condition, this how the code logic work check here in the reconcileDelete, We list the dockerMachines delete them and return (I believe to give time for containers & machines) then next reconcile if there is no dockerMachine we remove the dockerMachinePool finalizer. So as you said if we rysync in 10m and in the docker.yaml->wait-delete-cluster we wait 3m for delete so its normal to fail. Either we increase the docker.yaml->wait-delete-cluster to 15-20m (long time to fail) OR can we set the global resync to 1m during the e2e test ?

fabriziopandini · 2024-11-21T22:05:36Z

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go

-	// Providers should iterate through their infrastructure instances and ensure that each instance has a corresponding InfraMachine.
-	for _, machine := range externalMachines {
-		if existingMachine, ok := dockerMachineMap[machine.Name()]; ok {
-			log.V(2).Info("Patching existing DockerMachine", "DockerMachine", klog.KObj(&existingMachine))


for some reason while the DockerMachinePool cleaning all dockerMachine a delay happen for deleting the container which make the DockerMachine get re-created here at same time the DockerMachinePool is been deleted

have you considered to prevent creation of new machines in this func when the DockerMachinePool has a deletion timestamp? (it should probably be an if around L155-L161)

serngawy · 2024-12-02T15:23:40Z

@fabriziopandini @sbueringer , Do you have any other comments ? can we merge this ?

chrischdi · 2024-12-09T16:08:28Z

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go

 // - Ensuring that deletion for Docker container happens by calling delete on the associated Machine so that the node is cordoned/drained and the infrastructure is cleaned up.
 // - Deleting DockerMachines referencing a container whose Kubernetes version or custom image no longer matches the spec.
 // - Deleting DockerMachines that correspond to a deleted/non-existent Docker container.
 // - Deleting DockerMachines when scaling down such that DockerMachines whose owner Machine has the clusterv1.DeleteMachineAnnotation is given priority.
 func (r *DockerMachinePoolReconciler) reconcileDockerMachines(ctx context.Context, cluster *clusterv1.Cluster, machinePool *expv1.MachinePool, dockerMachinePool *infraexpv1.DockerMachinePool) error {
 	log := ctrl.LoggerFrom(ctx)

-	log.V(2).Info("Reconciling DockerMachines", "DockerMachinePool", klog.KObj(dockerMachinePool))
+	log.Info("Reconciling DockerMachines", "DockerMachinePool", klog.KObj(dockerMachinePool))


Let's please keep using log levels (also for all other places in this PR)

k8s-ci-robot · 2025-01-14T19:33:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from fabriziopandini. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chrischdi · 2025-01-15T10:13:11Z

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go

+		log.Info("Creating a new DockerMachine for dockerMachinePool", "DockerMachinePool", klog.KObj(dockerMachinePool))
+		dockerMachine := computeDesiredDockerMachine(name, cluster, machinePool, dockerMachinePool, nil)
+		if err := ssa.Patch(ctx, r.Client, dockerMachinePoolControllerName, dockerMachine); err != nil {
+			return errors.Wrap(err, "failed to create a new docker machine")
+		}


If this creation fails here, the next reconcile won't try to create the DockerMachine right? Se we may leak dockermachines?

Just noticed, this is also on the other thread :-)

I have change the logic in reconcileNormal to update the dockerMachine OR create Docker container if the dockerMachine has no container

Signed-off-by: serngawy <[email protected]>

serngawy · 2025-01-22T21:19:53Z

I get back to this, I have change the logic to consider the DockerMachine CR is the source of truth not the created Docker container. So the order of logic become; 1- create DockerMachine CR 2- create Docker Container.
If there is a dockerMachine without container, a docker container will be created for it. The rest of the logic scale up/down OR delete outdated containers remain the same. By doing so I don't see any left over DockerMachines as previous logic was doing. In order to make this move forward, @fabriziopandini let me know if you still have concern.

chrischdi · 2025-01-23T10:53:02Z

1- create DockerMachine CR 2- create Docker Container

This sounds like were then doing the same as MachineDeployments now 🤔
Isn't the nature of MachinePool implementations the other way around?
Or is it okay for CAPD MachinePools to be different here compared to real infra providers for MachinePools?

sbueringer · 2025-01-23T17:21:04Z

Isn't the nature of MachinePool implementations the other way around?
Or is it okay for CAPD MachinePools to be different here compared to real infra providers for MachinePools?

I think the reason for implementing MachinePool Machines in CAPD was to have a realistic test provider

k8s-ci-robot · 2025-02-19T13:55:16Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Nov 18, 2024

k8s-ci-robot requested review from jackfrancis and JoelSpeed November 18, 2024 22:51

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 18, 2024

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 18, 2024

serngawy force-pushed the e2efix branch from 0213ff1 to 49502cd Compare November 19, 2024 03:31

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 19, 2024

serngawy changed the title ~~🐛 (WIP) Fix e2e test for dockermachinePool~~ 🐛 Fix e2e test for dockermachinePool Nov 19, 2024

k8s-ci-robot assigned AndiDog and fabriziopandini Nov 19, 2024

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 19, 2024

sbueringer reviewed Nov 19, 2024

View reviewed changes

test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go Show resolved Hide resolved

serngawy force-pushed the e2efix branch 2 times, most recently from fa64f09 to 75b147f Compare November 20, 2024 15:09

fabriziopandini reviewed Nov 21, 2024

View reviewed changes

serngawy force-pushed the e2efix branch from 75b147f to a1865cd Compare November 22, 2024 18:16

chrischdi reviewed Dec 9, 2024

View reviewed changes

serngawy force-pushed the e2efix branch from e03f98a to a1865cd Compare January 14, 2025 20:30

chrischdi reviewed Jan 15, 2025

View reviewed changes

serngawy force-pushed the e2efix branch from a1865cd to 462be4a Compare January 22, 2025 20:37

k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 22, 2025

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 22, 2025

Fix e2e test for dockermachinePool

3ede182

Signed-off-by: serngawy <[email protected]>

serngawy force-pushed the e2efix branch from 462be4a to 3ede182 Compare January 22, 2025 21:02

This was referenced Jan 30, 2025

✨ Remediate unhealthy MachinePool machines #11392

Open

Timed out after 180.001s. waiting for cluster deletion timed out #11162

Open

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Fix e2e test for dockermachinePool #11440

🐛 Fix e2e test for dockermachinePool #11440

serngawy commented Nov 18, 2024 •

edited

Loading

k8s-ci-robot commented Nov 18, 2024

k8s-ci-robot commented Nov 18, 2024

sbueringer commented Nov 19, 2024 •

edited

Loading

sbueringer Nov 19, 2024

serngawy Nov 19, 2024 •

edited

Loading

fabriziopandini Nov 21, 2024 •

edited

Loading

serngawy Nov 22, 2024

fabriziopandini Nov 25, 2024 •

edited

Loading

AndiDog Dec 3, 2024

serngawy Dec 4, 2024

fabriziopandini Dec 4, 2024 •

edited

Loading

serngawy Dec 5, 2024

fabriziopandini Dec 6, 2024

fabriziopandini Nov 21, 2024

serngawy Nov 22, 2024 •

edited

Loading

fabriziopandini Nov 25, 2024

serngawy Nov 25, 2024

fabriziopandini Nov 21, 2024 •

edited

Loading

serngawy commented Dec 2, 2024

chrischdi Dec 9, 2024

k8s-ci-robot commented Jan 14, 2025

chrischdi Jan 15, 2025

chrischdi Jan 15, 2025

serngawy Jan 22, 2025

serngawy commented Jan 22, 2025

chrischdi commented Jan 23, 2025

sbueringer commented Jan 23, 2025 •

edited

Loading

k8s-ci-robot commented Feb 19, 2025

-		return ctrl.Result{RequeueAfter: 30 * time.Second}, r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool)
+		if err:= r.reconcileDelete(ctx, cluster, machinePool, dockerMachinePool); err != nil {
+		     return ctrl.Result{}, err
+		}
+		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

🐛 Fix e2e test for dockermachinePool #11440

Are you sure you want to change the base?

🐛 Fix e2e test for dockermachinePool #11440

Conversation

serngawy commented Nov 18, 2024 • edited Loading

k8s-ci-robot commented Nov 18, 2024

k8s-ci-robot commented Nov 18, 2024

sbueringer commented Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

serngawy Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

fabriziopandini Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabriziopandini Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabriziopandini Dec 4, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serngawy Nov 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabriziopandini Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

serngawy commented Dec 2, 2024

Choose a reason for hiding this comment

k8s-ci-robot commented Jan 14, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

serngawy commented Jan 22, 2025

chrischdi commented Jan 23, 2025

sbueringer commented Jan 23, 2025 • edited Loading

k8s-ci-robot commented Feb 19, 2025

serngawy commented Nov 18, 2024 •

edited

Loading

sbueringer commented Nov 19, 2024 •

edited

Loading

serngawy Nov 19, 2024 •

edited

Loading

fabriziopandini Nov 21, 2024 •

edited

Loading

fabriziopandini Nov 25, 2024 •

edited

Loading

fabriziopandini Dec 4, 2024 •

edited

Loading

serngawy Nov 22, 2024 •

edited

Loading

fabriziopandini Nov 21, 2024 •

edited

Loading

sbueringer commented Jan 23, 2025 •

edited

Loading