Swap stats is not shown as part of the metrics/resource endpoint #3834

Open
iholder101 opened this issue Dec 30, 2024 · 11 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.

Comments

@iholder101

What happened:

The metrics/resource endpoint returns the following:

> kubectl get --raw "/api/v1/nodes/k8s-dev-worker/proxy/metrics/resource"
# HELP container_cpu_usage_seconds_total [STABLE] Cumulative cpu time consumed by the container in core-seconds
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="kindnet-cni",namespace="kube-system",pod="kindnet-ndczz"} 1.121325 1735032838055
container_cpu_usage_seconds_total{container="kube-proxy",namespace="kube-system",pod="kube-proxy-l5jhs"} 1.100665 1735032838936
container_cpu_usage_seconds_total{container="metrics-server",namespace="kube-system",pod="metrics-server-8598789fdb-nw6cq"} 7.333964 1735032837430
# HELP container_memory_working_set_bytes [STABLE] Current working set of the container in bytes
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{container="kindnet-cni",namespace="kube-system",pod="kindnet-ndczz"} 3.2923648e+07 1735032838055
container_memory_working_set_bytes{container="kube-proxy",namespace="kube-system",pod="kube-proxy-l5jhs"} 4.0628224e+07 1735032838936
container_memory_working_set_bytes{container="metrics-server",namespace="kube-system",pod="metrics-server-8598789fdb-nw6cq"} 4.7026176e+07 1735032837430
# HELP container_start_time_seconds [STABLE] Start time of the container since unix epoch in seconds
# TYPE container_start_time_seconds gauge
container_start_time_seconds{container="kindnet-cni",namespace="kube-system",pod="kindnet-ndczz"} 1.7350309825441425e+09
container_start_time_seconds{container="kube-proxy",namespace="kube-system",pod="kube-proxy-l5jhs"} 1.7350309819809804e+09
container_start_time_seconds{container="metrics-server",namespace="kube-system",pod="metrics-server-8598789fdb-nw6cq"} 1.7350309993126562e+09
# HELP node_cpu_usage_seconds_total [STABLE] Cumulative cpu time consumed by the node in core-seconds
# TYPE node_cpu_usage_seconds_total counter
node_cpu_usage_seconds_total 71.41304 1735032832343
# HELP node_memory_working_set_bytes [STABLE] Current working set of the node in bytes
# TYPE node_memory_working_set_bytes gauge
node_memory_working_set_bytes 2.134016e+08 1735032832343
# HELP pod_cpu_usage_seconds_total [STABLE] Cumulative cpu time consumed by the pod in core-seconds
# TYPE pod_cpu_usage_seconds_total counter
pod_cpu_usage_seconds_total{namespace="kube-system",pod="kindnet-ndczz"} 1.145182 1735032830497
pod_cpu_usage_seconds_total{namespace="kube-system",pod="kube-proxy-l5jhs"} 1.108676 1735032837395
pod_cpu_usage_seconds_total{namespace="kube-system",pod="metrics-server-8598789fdb-nw6cq"} 7.336168 1735032831254
# HELP pod_memory_working_set_bytes [STABLE] Current working set of the pod in bytes
# TYPE pod_memory_working_set_bytes gauge
pod_memory_working_set_bytes{namespace="kube-system",pod="kindnet-ndczz"} 3.3222656e+07 1735032830497
pod_memory_working_set_bytes{namespace="kube-system",pod="kube-proxy-l5jhs"} 4.0914944e+07 1735032837395
pod_memory_working_set_bytes{namespace="kube-system",pod="metrics-server-8598789fdb-nw6cq"} 4.732928e+07 1735032831254
# HELP resource_scrape_error [STABLE] 1 if there was an error while getting container metrics, 0 otherwise
# TYPE resource_scrape_error gauge
resource_scrape_error 0

As can be seen, swap stats are not shown here:

> kubectl get --raw "/api/v1/nodes/k8s-dev-worker/proxy/metrics/resource" | grep -i swap

(no output)

What you expected to happen:
Swap stats to be included in the metrics/resource endpoint output.

How to reproduce it (as minimally and precisely as possible):

  1. Bring up a cluster
  2. Install metrics server
  3. Run kubectl get --raw "/api/v1/nodes/<NODE-NAME>/proxy/metrics/resource" | grep -i swap (a condensed sketch of all three steps follows below)
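
A condensed sketch of those steps (the cluster name is illustrative; note that on kind, metrics-server typically also needs the --kubelet-insecure-tls flag added to its container args):

# 1. Bring up a cluster (single node by default).
kind create cluster --name swap-test
# 2. Install metrics-server from its official release manifest.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# 3. Query the resource metrics endpoint of the first node and look for swap.
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/resource" | grep -i swap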

Anything else we need to know?:
Swap stats were introduced in kubernetes/kubernetes#118865, which also shows the expected output.
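
For reference, that PR adds swap usage gauges at node, pod, and container level (node_swap_usage_bytes, pod_swap_usage_bytes, and container_swap_usage_bytes), so on a swap-enabled node the endpoint should include lines roughly like these (values and timestamps illustrative):

node_swap_usage_bytes 1.2345678e+07 1735032832343
pod_swap_usage_bytes{namespace="kube-system",pod="kindnet-ndczz"} 0 1735032830497
container_swap_usage_bytes{container="kindnet-cni",namespace="kube-system",pod="kindnet-ndczz"} 0 1735032838055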

Environment:

  • kind version (use kind version): kind v0.22.0 go1.21.7 linux/amd64
  • Runtime info (use docker info, podman info or nerdctl info): docker 27.2.1
  • OS (e.g. from /etc/os-release): Fedora 39
  • Kubernetes version (use kubectl version): 1.32 (main)
  • Any proxies or other special environment settings? No
@iholder101 iholder101 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 30, 2024
@BenTheElder
Member

FWIW you're on an outdated release, but I would not be surprised that some kernel info like this is not working properly; the "node containers" are a bit leaky, and I don't think SIG Node officially supports this environment.

What's your use case? This will probably take a bit of debugging ...

@iholder101
Author

> FWIW you're on an outdated release, but I would not be surprised that some kernel info like this is not working properly; the "node containers" are a bit leaky, and I don't think SIG Node officially supports this environment.

I can try to test it on a current release if that would be valuable.

> What's your use case? This will probably take a bit of debugging ...

Just a development environment, was trying to work on swap metrics.
This is not urgent to me by any means.

@BenTheElder
Member

BenTheElder commented Jan 24, 2025

I suspect we'll see the same thing but ... worth a shot.

Makes sense, sorry; there hasn't been a ton of demand for metrics overall, and they're not part of conformance. We have some known issues around e.g. CPU and memory reflecting the underlying host (which is then repeated for each cluster/node). It's messy, and ideally we'd require more cooperation from kubelet and/or cAdvisor to mitigate.

Maybe kubelet has relevant logs?

FWIW swap support is a recent thing in Kubernetes; historically Kubernetes has recommended disabling swap, and by default kubelet even refused to start with swap enabled (it was possible to opt out, with a warning log instead).
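
As a sketch, the modern opt-in looks roughly like the following KubeletConfiguration fragment; failSwapOn, the NodeSwap feature gate, and memorySwap.swapBehavior are the relevant fields, though exact defaults vary by Kubernetes version:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow kubelet to start on a host with swap enabled (historically it refused by default).
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  # LimitedSwap caps how much swap workloads may use.
  swapBehavior: LimitedSwap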

EDIT: of course, @iholder101 is working on the swap support; "development environment" is ambiguous for kind, though the Kubernetes project itself is our first priority.

For some SIG node work, you might have better luck with hack/local-up-cluster.sh in the main Kubernetes repo. It will turn the host into a single node cluster.
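
Roughly, assuming a kubernetes/kubernetes checkout (paths vary by setup):

# Build and start a single-node cluster directly on the host.
cd ~/go/src/k8s.io/kubernetes   # your k/k checkout
./hack/local-up-cluster.sh
# In another terminal, point kubectl at the cluster it created.
export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
kubectl get nodes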

@iholder101
Author

Thanks for the answer @BenTheElder!

> Makes sense, sorry; there hasn't been a ton of demand for metrics overall, and they're not part of conformance. We have some known issues around e.g. CPU and memory reflecting the underlying host (which is then repeated for each cluster/node). It's messy, and ideally we'd require more cooperation from kubelet and/or cAdvisor to mitigate.

Thanks for the detailed explanation!

> Maybe kubelet has relevant logs?

Yeah, I can use some logs, but when working on adding a swap metric, for example, I have to actually test that it's there and working as expected. As of now, I don't see a way of doing so without actually taking some nodes and creating a cluster out of them with something like kubeadm, which is a tiresome process.

> EDIT: of course, @iholder101 is working on the swap support; "development environment" is ambiguous for kind, though the Kubernetes project itself is our first priority.

> For some SIG node work, you might have better luck with hack/local-up-cluster.sh in the main Kubernetes repo. It will turn the host into a single node cluster.

As @dims has expressed here, it seems that the local-up cluster is in "maintenance mode" and is not really being developed anymore. He claims that kind has become the de-facto development platform for SIG node.

In any case, the local-up cluster also seems to not support calling /metrics/resource for some reason :(

> k get node
NAME        STATUS   ROLES    AGE   VERSION
localhost   Ready    <none>   39s   v0.0.0-master+$Format:%H$

> k get --raw "/api/v1/nodes/localhost/proxy/metrics/resource"
Error from server (BadRequest): address not allowed

@aojea
Contributor

aojea commented Jan 29, 2025

I think you can "emulat" local-up by running kind with. single node using host-network

@dims
Member

dims commented Jan 29, 2025

For the record, here's what I said, @iholder101:

[screenshot of the quoted message]

@iholder101
Author

> For the record, here's what I said, @iholder101:
>
> [screenshot of the quoted message]

Right, thanks for bringing the source. For the record, you've also said:
"in general this script is as-is. if it works it works if it doesn't it doesn't. so you will need to dig into it.".

If so, is it fair to assume that the local-up cluster is now in "maintenance mode" and that kind is now the de-facto development environment for SIG node? Sorry if I didn't understand correctly.

@BenTheElder
Member

Thanks, I wasn't aware. I don't think SIG Node has an official stance as a SIG, but previously maintainers had indicated a preference for running kubelet directly on a host versus in kind, which is understandable (consider e.g. #1422 ...). I haven't been fully clear on whether kind is even considered supported for node work.

For a lot of node development, node_e2e tests are commonly used against a single target host over SSH (there's a script for this with GCE, and I think dims worked out something with EC2, but I don't know if there's e.g. a limactl approach yet), but I can't speak for the SIG.

I would be happy to have this working correctly, and I'd consider any proposed bug fixes, but I don't personally have much spare time at the moment :(.

I think you can "emulat" local-up by running kind with. single node using host-network

Host network won't work and isn't currently supported; I also think it would be more confusing than just using a single node, since other aspects would still be containerized.
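
(For completeness, a single node is kind's default; an explicit config sketch, assuming kind accepts a config on stdin via --config=-:)

cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
EOF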


If you or someone else can dig more into what's happening here, we can consider options to patch.
/help

@k8s-ci-robot
Contributor

@BenTheElder:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

> If you or someone else can dig more into what's happening here, we can consider options to patch.
> /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jan 29, 2025
@iholder101
Author

> I would be happy to have this working correctly, and I'd consider any proposed bug fixes, but I don't personally have much spare time at the moment :(.

Ah, sorry to hear that @BenTheElder! Wish you the best and a fast recovery!

> If you or someone else can dig more into what's happening here, we can consider options to patch.

I'll try to find the time, although TBH I'm pretty overloaded myself.
Generally, I've always thought the k8s space could benefit from a better development environment. In KubeVirt we work with something called KubevirtCI, which is based on VM nodes brought up with Vagrant. This makes for a very flexible dev env, since you can basically configure the node however you want; however, it demands some modification to work properly with custom k8s code. Perhaps sometime soon I'll find the time to do so.

@BenTheElder
Member

> Ah, sorry to hear that @BenTheElder! Wish you the best and a fast recovery!

Thank you! I'm up and down, it is still really limiting my throughput so I'm having to be more strategic about time use...

> Generally, I've always thought the k8s space could benefit from a better development environment. In KubeVirt we work with something called KubevirtCI, which is based on VM nodes brought up with Vagrant. This makes for a very flexible dev env, since you can basically configure the node however you want; however, it demands some modification to work properly with custom k8s code. Perhaps sometime soon I'll find the time to do so.

FWIW I don't think Kubernetes would adopt vagrant today due to licensing concerns.
We actually recently switched some of kind's own testing away for this reason (we are using lima).

Kubernetes had a Vagrant-based solution when I first worked on the project; it has since been removed. I built kind in part as a replacement, but it admittedly has some trade-offs that may be unsuitable for some node work.

I think a limactl solution for node_e2e would be moderately popular.

For other parts of the project, it's hard to beat the speed of containers, and there are fewer issues with them.
There would probably be some pushback on hosting another cluster tool (we have many, there are lots of external competing projects, and it's a lot to maintain); kind had some as-is, but at the time there were no container-based options, and we were originally scoped pretty tightly to testing Kubernetes (minikube has since adopted container bits from kind, but it is still aimed at external developers rather than k/k developers, while kind has accepted that there are other use cases, though we explicitly prioritize Kubernetes).

node_e2e doesn't have that problem, though, in that it's not end-user facing and doesn't even really involve clusters.
