- Add modules to Terraform;
- Tests?
- Branch `use-latest-ubuntu-server-image`; check the commit log and links for more details.
- Ubuntu 22.04 apparently introduced a change that broke the cgroup driver configuration, and the previous `containerd` configuration stopped working. What does work is forcing `containerd` to use `systemd`'s cgroup driver (see the sketch after the links below).
  - kubernetes/kubernetes#110177
  - containerd/containerd#4203
  - kubernetes/kubeadm#2449
  - containerd/containerd#4581
  - https://github.com/containerd/containerd/blob/main/docs/cri/config.md
  - https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd-systemd
  - https://kubernetes.io/docs/setup/production-environment/container-runtimes/#systemd-cgroup-driver
  - https://kubernetes.io/docs/concepts/architecture/cgroups/#using-cgroupv2
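A minimal sketch of the fix, assuming the stock `containerd` package on Ubuntu 22.04; this follows the approach from the container-runtimes docs linked above:

```bash
# Regenerate the default containerd config, then force the systemd cgroup driver.
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
# Flip SystemdCgroup under the runc runtime options from false to true:
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
```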
- Tried `bash ./kubetail -l app.kubernetes.io/component=speaker -n metallb-system` and did `curl -kv https://192.168.56.46/#/workloads?namespace=default`; got:
```
Will tail 2 logs...
metallb-speaker-2zj7t
metallb-speaker-l4dr8
[metallb-speaker-l4dr8] W0301 19:14:42.866262 1 warnings.go:70] metallb.io v1beta1 AddressPool is deprecated, consider using IPAddressPool
[metallb-speaker-2zj7t] {"caller":"node_controller.go:42","controller":"NodeReconciler","level":"info","start reconcile":"/master-1","ts":"2023-03-01T19:15:36Z"}
[metallb-speaker-2zj7t] {"caller":"speakerlist.go:271","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2023-03-01T19:15:36Z"}
[metallb-speaker-2zj7t] {"caller":"node_controller.go:64","controller":"NodeReconciler","end reconcile":"/master-1","level":"info","ts":"2023-03-01T19:15:36Z"}
[metallb-speaker-l4dr8] {"caller":"node_controller.go:42","controller":"NodeReconciler","level":"info","start reconcile":"/worker-4","ts":"2023-03-01T19:15:51Z"}
[metallb-speaker-l4dr8] {"caller":"speakerlist.go:271","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2023-03-01T19:15:51Z"}
[metallb-speaker-l4dr8] {"caller":"node_controller.go:64","controller":"NodeReconciler","end reconcile":"/worker-4","level":"info","ts":"2023-03-01T19:15:51Z"}
```
- These look like pretty standard logs; it doesn't look like they appear when I execute the `curl` command.
- Tried `sudo tcpdump -i enp0s8 host 192.168.56.46 -vv` on both `worker-4` and `master-1`, which are the only active nodes:

```
tcpdump: listening on enp0s9, link-type EN10MB (Ethernet), snapshot length 262144 bytes
19:13:41.295688 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has kubernetes-dashboard.my-cluster.local tell Slav-PC, length 46
19:13:42.315033 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has kubernetes-dashboard.my-cluster.local tell Slav-PC, length 46
19:13:43.338968 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has kubernetes-dashboard.my-cluster.local tell Slav-PC, length 46
```
- The requests are coming in, but they seem to get dropped somewhere; no replies go out.
- Tried all active network interfaces, but traffic comes in only on `enp0s9` (the bridged adapter).
- Turned on promiscuous mode (`sudo ifconfig enp0s8 promisc`) on both nodes and tried `tcpdump` again; same results.
- If I do the `curl` command from `worker-4` (where the dashboard is deployed), I get success, but nothing in `tcpdump`.
- Tried this SO answer; the value is 0 for both nodes, which seems ok.
- Checked `/etc/hosts.allow` and `/etc/hosts.deny` on both nodes; no rules there.
- Tried setting `externalTrafficPolicy: Local` in the k8s service; no success.
- Tried removing the strict ARP resource (apparently needed only when using IPVS); same thing. Enabling it again is sketched below.
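For reference, the strict ARP toggle lives in the kube-proxy ConfigMap; this is essentially the snippet from MetalLB's installation docs:

```bash
# Enable strictARP in kube-proxy's IPVS settings (needed by MetalLB under IPVS).
kubectl get configmap kube-proxy -n kube-system -o yaml \
  | sed -e "s/strictARP: false/strictARP: true/" \
  | kubectl apply -f - -n kube-system
```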
- Set the log level to `all` in the DaemonSet; seeing debug messages, but nothing useful.
- I can connect to the nodes, though, so the host machine knows about them and how to access them.
- `tcpdump`-ing the gateway (`192.168.56.1`) shows the same results - ARP requests, but no replies. The master's `tcpdump` shows more traffic; on the workers only ARP appears.
- Tried configuring BGP using some default configurations (roughly as sketched below); same result - internally it works, from the host it doesn't.
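A sketch of the kind of "default" BGP configuration meant here, using MetalLB's CRDs; the ASNs, peer address, and pool name are assumptions:

```bash
# Hypothetical MetalLB BGP setup: peer with the host-only gateway and
# advertise an existing IPAddressPool (named "pool" here for illustration).
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: gateway-peer
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64501
  peerAddress: 192.168.56.1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgp-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - pool
EOF
```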
- Tried setting kube-proxy to IPVS (see the sketch below), same result; same with also enabling promiscuous mode on `enp0s8`.
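A sketch of switching kube-proxy to IPVS on a running cluster, assuming the stock kubeadm-managed ConfigMap:

```bash
# Set mode: "ipvs" in the kube-proxy ConfigMap, then restart the DaemonSet
# so the pods pick up the new mode. The IPVS kernel modules must be loaded.
kubectl -n kube-system get configmap kube-proxy -o yaml \
  | sed -e 's/mode: ""/mode: "ipvs"/' \
  | kubectl apply -f -
kubectl -n kube-system rollout restart daemonset kube-proxy
```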
- Recreated the cluster with the latest settings (IPVS, gateway 192.168.56.1 for the host-only network). The behavior is now a bit different (the output below is seen only on the master; `tcpdump` on the workers is silent):
```
vagrant@master-1:~$ sudo tcpdump -i enp0s8 host 192.168.56.46 -vv
tcpdump: listening on enp0s8, link-type EN10MB (Ethernet), snapshot length 262144 bytes
06:03:29.834424 IP (tos 0x0, ttl 64, id 51942, offset 0, flags [DF], proto TCP (6), length 60)
    _gateway.52868 > master-1.https: Flags [S], cksum 0x62c0 (correct), seq 581808951, win 64240, options [mss 1460,sackOK,TS val 2385626232 ecr 0,nop,wscale 7], length 0
06:03:52.883432 IP (tos 0x0, ttl 64, id 16155, offset 0, flags [DF], proto TCP (6), length 60)
    _gateway.37884 > master-1.https: Flags [S], cksum 0xf565 (correct), seq 134224830, win 64240, options [mss 1460,sackOK,TS val 2385649281 ecr 0,nop,wscale 7], length 0
06:03:53.898432 IP (tos 0x0, ttl 64, id 16156, offset 0, flags [DF], proto TCP (6), length 60)
    _gateway.37884 > master-1.https: Flags [S], cksum 0xf16e (correct), seq 134224830, win 64240, options [mss 1460,sackOK,TS val 2385650296 ecr 0,nop,wscale 7], length 0
06:03:55.914412 IP (tos 0x0, ttl 64, id 16157, offset 0, flags [DF], proto TCP (6), length 60)
    _gateway.37884 > master-1.https: Flags [S], cksum 0xe98e (correct), seq 134224830, win 64240, options [mss 1460,sackOK,TS val 2385652312 ecr 0,nop,wscale 7], length 0
06:03:57.994345 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has master-1 tell _gateway, length 46
06:03:57.994368 ARP, Ethernet (len 6), IPv4 (len 4), Reply master-1 is-at 08:00:27:96:6b:0b (oui Unknown), length 28
06:04:00.042448 IP (tos 0x0, ttl 64, id 16158, offset 0, flags [DF], proto TCP (6), length 60)
    _gateway.37884 > master-1.https: Flags [S], cksum 0xd96e (correct), seq 134224830, win 64240, options [mss 1460,sackOK,TS val 2385656440 ecr 0,nop,wscale 7], length 0
06:04:08.234394 IP (tos 0x0, ttl 64, id 16159, offset 0, flags [DF], proto TCP (6), length 60)
    _gateway.37884 > master-1.https: Flags [S], cksum 0xb96e (correct), seq 134224830, win 64240, options [mss 1460,sackOK,TS val 2385664632 ecr 0,nop,wscale 7], length 0
```
- Changing `externalTrafficPolicy` from `Local` to `Cluster` in the `LoadBalancer` service resolved the problem (one way to do that is sketched below).
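A sketch of the switch; the Service name and namespace below are placeholders for the dashboard's actual ones:

```bash
# Hypothetical names: adjust the namespace/Service to match the deployment.
kubectl -n kubernetes-dashboard patch svc kubernetes-dashboard \
  -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
```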
- https://www.practicalnetworking.net has good materials on networking.
- Good resource on using `tcpdump`.
- The error is:
```
2023-03-11 10:59:22.772 [ERROR][79] felix/health.go 360: Health endpoint failed, trying to restart it... error=listen tcp: lookup localhost on 8.8.8.8:53: no such host
```
- it appears kind of randomly when I recreate the cluster;
- Tried setting the IP autodetection method with `interface` (sketch below), but only `enp0s8` got as far as not setting autodetection manually at all. The other options I tried (`enp0s3,enp0s8` or `enp0s3`) resulted in earlier errors.
  - Similar information here.
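A sketch of what setting the autodetection method looks like, using Calico's documented `IP_AUTODETECTION_METHOD` environment variable on the `calico-node` DaemonSet:

```bash
# Restrict Calico's IP autodetection to the host-only interface.
kubectl -n kube-system set env daemonset/calico-node \
  IP_AUTODETECTION_METHOD=interface=enp0s8
```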
- Tried setting localhost for the probes and it worked.
  - Reference here.
  - It seems more like a patch than a solution, though.
- It seems there are various ways to configure Felix, as per their documentation. However, automating the addition of `FELIX_HEALTHHOST` seems clumsy with them. The easiest approach would be to use `kubectl patch` with a patch file (sketched after this list), or even easier, `kubectl set env daemonset/calico-node FELIX_HEALTHHOST="127.0.0.1"`.
- Finally (2023-03-19), the patch was implemented in Ansible and this seems to resolve the issue.
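A sketch of the `kubectl patch` variant, assuming a strategic-merge patch against the stock `calico-node` DaemonSet:

```bash
# Write the strategic-merge patch; env entries merge by name on the
# container called "calico-node".
cat > felix-healthhost-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: calico-node
          env:
            - name: FELIX_HEALTHHOST
              value: "127.0.0.1"
EOF
kubectl -n kube-system patch daemonset calico-node \
  --patch-file felix-healthhost-patch.yaml
```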
- This link helped the most. It turned out that the best way was to just copy the scrape configs Linkerd's Prometheus came with into a secret and have the "external" Prometheus include it (sketch below).
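A sketch of that wiring, assuming the "external" Prometheus is operator-managed; all names here are placeholders, and the scrape configs themselves are the ones copied from Linkerd's bundled Prometheus:

```bash
# Put the copied scrape_configs into a secret the Prometheus Operator can read.
kubectl -n monitoring create secret generic linkerd-scrape-configs \
  --from-file=scrape-configs.yaml=linkerd-scrape-configs.yaml

# Then reference the secret from the Prometheus custom resource:
#   spec:
#     additionalScrapeConfigs:
#       name: linkerd-scrape-configs
#       key: scrape-configs.yaml
```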
- It turned out this was intentional LVM configuration. Check the documentation of `subiquity`. There are two solutions - either make the configuration explicit using the `sizing-policy` key (sketch below), or provision more disk space adhering to Subiquity's rules.
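A sketch of the explicit variant in an autoinstall `user-data`, assuming Subiquity's documented LVM layout options:

```bash
# Hypothetical autoinstall snippet: tell the LVM layout to use the whole disk
# instead of the default "scaled" policy that leaves free space in the VG.
cat > user-data <<'EOF'
#cloud-config
autoinstall:
  version: 1
  storage:
    layout:
      name: lvm
      sizing-policy: all
EOF
```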
- Had to use `break_system_packages` for pip to be able to install `kubernetes` system-wide. Not recommended.
- Had to use the new Kubernetes repositories; documentation. This means the Kubernetes version in the key/repo URLs now has to be updated whenever a new Kubernetes version is to be installed (sketch below)!
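For reference, a sketch of the new `pkgs.k8s.io` setup from the Kubernetes install docs; note the minor version (v1.28 here as an example) baked into both URLs:

```bash
# Both the signing key and the repo line are versioned per Kubernetes minor.
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key \
  | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' \
  | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
```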
- `10-kubeadm.conf` has moved to `/usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf`. Previously it was in `/etc/...`, and apparently it is possible to put overrides there. Also needed to edit the file in `/etc/default/kubelet`, adding the argument. The latter did work (sketch below).
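A sketch of the `/etc/default/kubelet` approach; the note doesn't name the argument, so `--node-ip` below is just a hypothetical example of an extra kubelet flag:

```bash
# kubeadm's drop-in sources this file and appends $KUBELET_EXTRA_ARGS
# to the kubelet command line.
echo 'KUBELET_EXTRA_ARGS="--node-ip=192.168.56.11"' | sudo tee /etc/default/kubelet
sudo systemctl restart kubelet
```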