caa's StartupHandler(): should it accept Completed pods? #2075

Open
wainersm opened this issue Sep 30, 2024 · 0 comments
Labels
bug Something isn't working


Describe the bug

This is what I did:

  1. Installed CAA and created some pods (e2e tests)
  2. All pods (tests) failed and I realized I had forgotten to set the AWS VPC in peer pods configmap
  3. Deleted the CAA installation but forgot to delete the pods created
  4. Re-installed CAA, this time passing the VPC
  5. StartupHandler() started running to determine whether CAA is ready or not
  6. CAA never got Ready, because GetAllPeerPods() always returns an error

It turns out that some pods have their status transitioning between Error and Running (and vice-versa), while others stay Completed (always, no restarts). See, for example, two snapshots of the namespace:

$ kubectl get pod -n coco-pp-e2e-test-84e87435
NAME                     READY   STATUS      RESTARTS   AGE
deletion-test            0/1     Error       3          51m
env-variable-in-both     0/1     Completed   0          71m
env-variable-in-config   0/1     Completed   0          81m
env-variable-in-image    0/1     Completed   0          91m
largeimage-pod           0/1     Error       3          61m
simple-test              0/1     Error       3          121m
workdirpod               0/1     Completed   0          101m

$ kubectl get pod -n coco-pp-e2e-test-84e87435
NAME                     READY   STATUS      RESTARTS        AGE
deletion-test            1/1     Running     5 (9m37s ago)   69m
env-variable-in-both     0/1     Completed   0               89m
env-variable-in-config   0/1     Completed   0               99m
env-variable-in-image    0/1     Completed   0               109m
largeimage-pod           1/1     Running     5               79m
simple-test              1/1     Running     5 (9m37s ago)   139m
workdirpod               0/1     Completed   0               119m

Looking at the 2nd snapshot above: even when all pods are either Running or Completed, GetAllPeerPods() returns an error because it expects all pods to be Ready (see here). See the CAA logs in that scenario:

2024/09/30 18:30:25 [probe/probe] nodeName: peer-pods-worker-0
2024/09/30 18:30:25 [probe/probe] Selected pods count: 18
2024/09/30 18:30:25 [probe/probe] Dealing with PeerPod: deletion-test, with Ready condition: {Ready True 0001-01-01 00:00:00 +0000 UTC 2024-09-30 18:29:46 +0000 UTC  }
2024/09/30 18:30:25 [probe/probe] Dealing with PeerPod: env-variable-in-both, with Ready condition: {Ready False 0001-01-01 00:00:00 +0000 UTC 2024-09-30 17:10:35 +0000 UTC PodCompleted }
2024/09/30 18:30:25 [probe/probe] Not all PeerPods ready, because PeerPod env-variable-in-both is not Ready.

I'm opening this issue to raise this debate: shouldn't the check also pass when a pod is Completed? i.e. check that the existing pods are either Running (Ready) or Completed?

How to reproduce

Reproducing it should be hard, I'm afraid :(

CoCo version information

caa 0.10.0-alpha1

What TEE are you seeing the problem on

None

Failing command and relevant log output

No response

@wainersm wainersm added the bug Something isn't working label Sep 30, 2024