
Felix Exited #7440

Open
sujan46 opened this issue Dec 18, 2024 · 8 comments

@sujan46

sujan46 commented Dec 18, 2024

Environmental Info:
RKE2 Version: v1.31.3+rke2r1

Node(s) CPU architecture, OS, and Version: x86_64, Ubuntu 22.04

Cluster Configuration: 3 controlplanes, 4 linux agents and 1 windows agent

Describe the bug: The rke2 service reported the error "Felix exited", and non-HPC pods lost connectivity to the internet and to the Kubernetes gateway.

Steps To Reproduce:

  1. Create a new v1.31.3 cluster.
  2. Join a Windows node to the cluster.
  3. Launch a dummy pod (see the example spec after this list) and try to ping 8.8.8.8.
  4. After 10-15 minutes the rke2 service reports the error Felix Exited, but the service itself is not stopped.
  5. After restarting rke2, it recovers but fails again after 10-15 minutes.
  • Installed RKE2:
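
For reference, a minimal sketch of the kind of dummy pod used for the connectivity test in step 3; the pod name and image below are placeholders, not the exact pod from this cluster:

  apiVersion: v1
  kind: Pod
  metadata:
    name: ping-test                # placeholder name
  spec:
    nodeSelector:
      kubernetes.io/os: windows
    containers:
    - name: ping
      # placeholder image; any Windows image that ships ping.exe works
      image: mcr.microsoft.com/windows/servercore:ltsc2022
      command: ["ping", "-t", "8.8.8.8"]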

Expected behavior:

ping 8.8.8.8 should respond with a pong and nslookup should resolve normally.

Actual behavior:

ping 8.8.8.8 gets no response, and nslookup breaks with the error DNS request timed out.

Additional context / logs:

TimeWritten           ReplacementStrings
-----------           ------------------
12/17/2024 8:55:27 PM {Felix exited}
12/17/2024 8:40:20 PM {Running RKE2 kube-proxy [--bind-address=10.107.22.24 --enable-dsr=true
                      --feature-gates=WinDSR=true --network-name=Calico --source-vip=172.25.99.194
                      --cluster-cidr=172.25.0.0/17 --healthz-bind-address=127.0.0.1
                      --hostname-override=uls-ep-kubert28
                      --kubeconfig=C:\var\lib\rancher\rke2\agent\kubeproxy.kubeconfig --proxy-mode=kernelspace]}
12/17/2024 8:40:20 PM {WinDSR support is enabled}
12/17/2024 8:40:20 PM {HCN feature check, version={13 3} supportedFeatures={{true true true true} {true true} true
                      true true true true true true true true true true false false false false false}}
12/17/2024 8:40:20 PM {Reserved VIP for kube-proxy: 172.25.99.194}
12/17/2024 8:40:17 PM {Calico started correctly}
@manuelbuil
Contributor

Can you check the felix logs and see if you get more information?

@sujan46
Author

sujan46 commented Dec 18, 2024

@manuelbuil I have a few warnings in the Felix logs:

2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="ddafb3be-6baa-4487-a292-1e7ad485bb9a" name="6418c83b6e6d2eeb0e52d4e264252cbb329c8c37fdab25cafadb543b9123f1bf_Calico"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="fc0693ad-a9af-469e-a71a-6fadc20b0031" name="ba7714587f517d6353151e8c1a70b998ea2fed4b6d398071443a522b2a12bed2_Calico"
2024-12-18 05:25:34.403 [INFO][15024] felix/endpoint_mgr.go 560: Could not resolve hns endpoint id ip="172.25.17.196/32"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 350: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"grafana-loki/loki-canary-6jwh6", EndpointId:"eth0"}
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 440: Failed to look up one or more HNS endpoints; will schedule a retry
2024-12-18 05:25:34.403 [WARNING][15024] felix/win_dataplane.go 346: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.192
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.193

@sujan46
Author

sujan46 commented Dec 18, 2024

I also found this in the rke2 logs:

12/18/2024 7:12:53 AM {Error encountered while importing C:\var\lib\rancher\rke2\agent\images\runtime-image.txt:
                      failed to pull images from C:\var\lib\rancher\rke2\agent\images\runtime-image.txt: rpc error:
                      code = Unknown desc = failed to pull and unpack image
                      "artifactory.xxxx.com:6609/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64": failed to
                      extract layer sha256:a982c1cdcfe20bc827701769532a931379ec341822f0d096b394f4f5c46c8a6f:
                      hcsshim::ProcessBaseLayer
                      \\?\C:\var\lib\rancher\rke2\agent\containerd\io.containerd.snapshotter.v1.windows\snapshots\198:
                      The system cannot find the path specified.: unknown}

@sujan46
Author

sujan46 commented Dec 18, 2024

We also have node address autodetection enabled for the Calico installation:

  installation:
    calicoNetwork:
      nodeAddressAutodetectionV4:
        canReach: <gateway ip>
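
For reference, a sketch of where this snippet typically sits when customizing the bundled Calico chart in RKE2, via a HelmChartConfig manifest placed in /var/lib/rancher/rke2/server/manifests/ on a server node (chart name and namespace follow the RKE2 defaults):

  apiVersion: helm.cattle.io/v1
  kind: HelmChartConfig
  metadata:
    name: rke2-calico
    namespace: kube-system
  spec:
    valuesContent: |-
      installation:
        calicoNetwork:
          nodeAddressAutodetectionV4:
            canReach: <gateway ip>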

@brandond
Member

brandond commented Dec 18, 2024

failed to pull and unpack image "artifactory.xxxx.com:6609/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64"

This is fine; it's not a real Windows image that is used to run a pod. This message can be ignored.

@sujan46
Author

sujan46 commented Dec 18, 2024

I added a fresh Windows node and still faced the same issue. When we start adding workloads (~40 pods), it breaks.

@manuelbuil
Contributor

Perhaps you are running out of IPs on the Windows node? Check the output of:

kubectl get ipamblocks.crd.projectcalico.org $YOURWINDOWSNODECIDR -o yaml

That should provide further information
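
For reference, the IPAMBlock name is derived from the block CIDR (dots and the slash replaced with dashes), and the fields worth checking look roughly like the sketch below; all values are illustrative only, not taken from this cluster:

  apiVersion: crd.projectcalico.org/v1
  kind: IPAMBlock
  metadata:
    name: 172-25-17-192-26         # block CIDR with dots/slash turned into dashes
  spec:
    cidr: 172.25.17.192/26
    affinity: host:<windows-node-name>
    allocations:                   # one entry per address; null means free
    - null
    - 0
    - null
    unallocated:                   # ordinals of the still-free addresses above
    - 0
    - 2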

@sujan46
Author

sujan46 commented Dec 19, 2024

@manuelbuil We are facing the issue with just 30 pods. The steps I followed to reproduce:

  1. After the Felix Exited error, restarting the rke2 service temporarily fixes it.
  2. I launched 30 Windows pods; IP allocation seemed to be happening fine, and I was able to ping github.com, for example, from within the pods.
  3. I noticed we had enough IPs left for allocation. As soon as I started to terminate the pods, I got the Felix Exited error again.

IP allocation example:

  allocations:
  - 0
  - 0
  - 0
  - null
  - 14
  - 11
  - 6
  - null
  - 23
  - 16
  - null
  - 9

We even tried rke2 v1.28.15 and encountered the same error.

Just FYI, we reverted back to rke2 v1.28.10 with Calico v3.27.3 and everything seems to be working as expected. It seems like the latest rke2 version coupled with the newer Calico version causes these errors.
