
Felix Exited #7440

Open
sujan46 opened this issue Dec 18, 2024 · 8 comments

@sujan46

sujan46 commented Dec 18, 2024

Environmental Info:
RKE2 Version: v1.31.3+rke2r1

Node(s) CPU architecture, OS, and Version: x86_64, Ubuntu 22.04

Cluster Configuration: 3 controlplanes, 4 linux agents and 1 windows agent

Describe the bug: The rke2 service reported the error "Felix exited", and non-HPC pods lost connectivity to the internet and to the Kubernetes gateway.

Steps To Reproduce:

  1. Create a new v1.31.3 cluster.
  2. Join a Windows node to the cluster.
  3. Launch a dummy pod (see the example spec after this list) and try to ping 8.8.8.8.
  4. After 10-15 minutes the rke2 service reports the error Felix Exited, but the service itself is not stopped.
  5. After restarting rke2, it recovers but fails again after 10-15 minutes.
  • Installed RKE2:
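
For reference, a minimal sketch of the kind of dummy pod used for the connectivity test in step 3; the pod name and image below are placeholders, not the exact pod from this cluster:

  apiVersion: v1
  kind: Pod
  metadata:
    name: ping-test                # placeholder name
  spec:
    nodeSelector:
      kubernetes.io/os: windows
    containers:
    - name: ping
      # placeholder image; any Windows image that ships ping.exe works
      image: mcr.microsoft.com/windows/servercore:ltsc2022
      command: ["ping", "-t", "8.8.8.8"]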

Expected behavior:

ping 8.8.8.8 should respond with a pong and nslookup should resolve normally.

Actual behavior:

ping 8.8.8.8 gets no response, and nslookup breaks with the error DNS request timed out.

Additional context / logs:

TimeWritten           ReplacementStrings
-----------           ------------------
12/17/2024 8:55:27 PM {Felix exited}
12/17/2024 8:40:20 PM {Running RKE2 kube-proxy [--bind-address=10.107.22.24 --enable-dsr=true
                      --feature-gates=WinDSR=true --network-name=Calico --source-vip=172.25.99.194
                      --cluster-cidr=172.25.0.0/17 --healthz-bind-address=127.0.0.1
                      --hostname-override=uls-ep-kubert28
                      --kubeconfig=C:\var\lib\rancher\rke2\agent\kubeproxy.kubeconfig --proxy-mode=kernelspace]}
12/17/2024 8:40:20 PM {WinDSR support is enabled}
12/17/2024 8:40:20 PM {HCN feature check, version={13 3} supportedFeatures={{true true true true} {true true} true
                      true true true true true true true true true true false false false false false}}
12/17/2024 8:40:20 PM {Reserved VIP for kube-proxy: 172.25.99.194}
12/17/2024 8:40:17 PM {Calico started correctly}
@manuelbuil
Contributor

Can you check the felix logs and see if you get more information?

@sujan46
Author

sujan46 commented Dec 18, 2024

@manuelbuil I have a few warnings in the Felix logs:

2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="ddafb3be-6baa-4487-a292-1e7ad485bb9a" name="6418c83b6e6d2eeb0e52d4e264252cbb329c8c37fdab25cafadb543b9123f1bf_Calico"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="fc0693ad-a9af-469e-a71a-6fadc20b0031" name="ba7714587f517d6353151e8c1a70b998ea2fed4b6d398071443a522b2a12bed2_Calico"
2024-12-18 05:25:34.403 [INFO][15024] felix/endpoint_mgr.go 560: Could not resolve hns endpoint id ip="172.25.17.196/32"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 350: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"grafana-loki/loki-canary-6jwh6", EndpointId:"eth0"}
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 440: Failed to look up one or more HNS endpoints; will schedule a retry
2024-12-18 05:25:34.403 [WARNING][15024] felix/win_dataplane.go 346: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.192
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.193

@sujan46
Author

sujan46 commented Dec 18, 2024

I also found this in the rke2 logs:

12/18/2024 7:12:53 AM {Error encountered while importing C:\var\lib\rancher\rke2\agent\images\runtime-image.txt:
                      failed to pull images from C:\var\lib\rancher\rke2\agent\images\runtime-image.txt: rpc error:
                      code = Unknown desc = failed to pull and unpack image
                      "artifactory.xxxx.com:6609/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64": failed to
                      extract layer sha256:a982c1cdcfe20bc827701769532a931379ec341822f0d096b394f4f5c46c8a6f:
                      hcsshim::ProcessBaseLayer
                      \\?\C:\var\lib\rancher\rke2\agent\containerd\io.containerd.snapshotter.v1.windows\snapshots\198:
                      The system cannot find the path specified.: unknown}

@sujan46
Author

sujan46 commented Dec 18, 2024

We also have node address autodetection enabled for the Calico installation:

  installation:
    calicoNetwork:
      nodeAddressAutodetectionV4:
        canReach: <gateway ip>
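
For reference, a sketch of where this snippet typically sits when customizing the bundled Calico chart in RKE2, via a HelmChartConfig manifest placed in /var/lib/rancher/rke2/server/manifests/ on a server node (chart name and namespace follow the RKE2 defaults):

  apiVersion: helm.cattle.io/v1
  kind: HelmChartConfig
  metadata:
    name: rke2-calico
    namespace: kube-system
  spec:
    valuesContent: |-
      installation:
        calicoNetwork:
          nodeAddressAutodetectionV4:
            canReach: <gateway ip>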

@brandond
Member

brandond commented Dec 18, 2024

failed to pull and unpack image "artifactory.xxxx.com:6609/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64"

This is fine; it's not a real Windows image that is used to run a pod. This message can be ignored.

@sujan46
Author

sujan46 commented Dec 18, 2024

I added a fresh Windows node and still faced the same issue. When we start adding workloads (~40 pods), it breaks.

@manuelbuil
Contributor

Perhaps you are running out of IPs on the Windows node? Check the output of:

kubectl get ipamblocks.crd.projectcalico.org $YOURWINDOWSNODECIDR -o yaml

That should provide further information
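
For reference, the IPAMBlock name is derived from the block CIDR (dots and the slash replaced with dashes), and the fields worth checking look roughly like the sketch below; all values are illustrative only, not taken from this cluster:

  apiVersion: crd.projectcalico.org/v1
  kind: IPAMBlock
  metadata:
    name: 172-25-17-192-26         # block CIDR with dots/slash turned into dashes
  spec:
    cidr: 172.25.17.192/26
    affinity: host:<windows-node-name>
    allocations:                   # one entry per address; null means free
    - null
    - 0
    - null
    unallocated:                   # ordinals of the still-free addresses above
    - 0
    - 2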

@sujan46
Author

sujan46 commented Dec 19, 2024

@manuelbuil We are facing the issue with just 30 pods. The steps I followed to reproduce:

  1. After the Felix Exited error, restarting the rke2 service temporarily fixes it.
  2. I launched 30 Windows pods; IP allocation seemed to be happening fine, and I was able to ping github.com, for example, from within the pods.
  3. I noticed we had enough IPs left for allocation. As soon as I started to terminate the pods, I got the Felix Exited error again.

IP allocation example:

  allocations:
  - 0
  - 0
  - 0
  - null
  - 14
  - 11
  - 6
  - null
  - 23
  - 16
  - null
  - 9

We even tried rke2 v1.28.15 and encountered the same error.

Just FYI, we reverted back to rke2 v1.28.10 with Calico v3.27.3 and everything seems to be working as expected. It seems like the latest rke2 version coupled with the newer Calico version causes these errors.
