
Ratelimit metrics are not working as expected in v1.3.0 #3444

Open
pratiklotia opened this issue Feb 27, 2025 · 9 comments
Labels
kind/bug Something isn't working

Comments

@pratiklotia

What happened?

We upgraded tetragon to v1.3.0.

The tetragon_export_ratelimit_events_dropped_total metric (renamed from tetragon_ratelimit_dropped_total in this release), which tracks the number of rate-limited events, does not seem to work as expected. The metric exists but always reports a value of 0.

Log output included below.

Fix: I haven't looked into the codebase yet, but I suspect some part of the code was missed while renaming the metric.

Impact: We don't have visibility into how well our rate-limit filters are working, which makes it hard to tune the policy correctly.

Tetragon Version

v1.3.0

Kernel Version

6.8.0-1016-aws

Kubernetes Version

Server version 1.30

Bugtool

No response

Relevant log output

tetragon_export_ratelimit_events_dropped_total{account="<snip>", cluster="<snip>", cluster_group="<snip>", endpoint="metrics", instance="<snip>", job="tetragon", namespace="<snip>", node="<snip>", pod="<snip>", prometheus="<snip>", provider="<snip>", service="tetragon", tetragon_pod="<snip>", vpc="<snip>", zone="<snip>"} -  0

Anything else?

No response

pratiklotia added the kind/bug (Something isn't working) label on Feb 27, 2025
@mtardy
Member

mtardy commented Feb 28, 2025

Hey, so this is the commit that did the rename: 1848253 ("exporter: Rename ratelimit drops counter"), and here's the full PR that includes some other changes around that rename: #2890. This commit moves where the counter is incremented, but only slightly: 5231093 ("Move ratelimitmetrics inside pkg/exporter").

Here's an extract from the diff. It doesn't fundamentally change the code: instead of the counter being incremented in the Drop() function itself, it's now incremented by the caller, after the drop occurred.

diff --git a/pkg/exporter/exporter.go b/pkg/exporter/exporter.go
index bc8bf0729..2d21a0e70 100644
--- a/pkg/exporter/exporter.go
+++ b/pkg/exporter/exporter.go
@@ -56,6 +56,7 @@ func (e *Exporter) Start() error {
 func (e *Exporter) Send(event *tetragon.GetEventsResponse) error {
        if e.rateLimiter != nil && !e.rateLimiter.Allow() {
                e.rateLimiter.Drop()
+               rateLimitDropped.Inc()
                return nil
        }
diff --git a/pkg/ratelimit/ratelimit.go b/pkg/ratelimit/ratelimit.go
index 404395225..52c5512a4 100644
--- a/pkg/ratelimit/ratelimit.go
+++ b/pkg/ratelimit/ratelimit.go
@@ -78,5 +77,4 @@ func (r *RateLimiter) reportRateLimitInfo(encoder encoder.EventEncoder) {

 func (r *RateLimiter) Drop() {
        atomic.AddUint64(&r.dropped, 1)
-       ratelimitmetrics.RateLimitDropped.Inc()
 }
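
Looking at the moved call, the counter should still be bumped on every exporter-side drop. As a quick sanity check that the counter itself behaves, it can be exercised in isolation with prometheus/client_golang's testutil. This is only a minimal sketch; the counter registration below is illustrative and not copied from the Tetragon source:

package exporter_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// Illustrative stand-in for the rateLimitDropped counter shown in the diff above.
var rateLimitDropped = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "tetragon_export_ratelimit_events_dropped_total",
	Help: "Number of events dropped by the export rate limiter.",
})

func TestRateLimitDroppedIncrements(t *testing.T) {
	before := testutil.ToFloat64(rateLimitDropped)

	// Simulate the Send() path: the limiter disallows the event, the event is
	// dropped, and the caller increments the counter.
	rateLimitDropped.Inc()

	if diff := testutil.ToFloat64(rateLimitDropped) - before; diff != 1 {
		t.Fatalf("expected counter to increase by 1, got %v", diff)
	}
}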

My questions would be: did this ever work for you (seems like yes since you mention an upgrade)? If yes, for which version/commit? And how do you test this so we can try to reproduce on our side? Thanks!

@sp3nx0r

sp3nx0r commented Feb 28, 2025

We were on 1.1.2 until earlier this week, and it turns out this metric was 0 at that point as well. It seems as if the metric's Inc() doesn't actually get called by the agent when a rate limit is triggered. We do see very clear spikes when the rate limit kicks in, but tetragon_export_ratelimit_events_dropped_total is always 0 in our scrapes.

[screenshot attached]

@pratiklotia
Author

IIRC we weren't using rate-limit filters while on 1.1.2, so it's hard to say whether the metric worked or not.

@mtardy
Member

mtardy commented Mar 3, 2025

Hey @kevsecurity, did you observe this metric working?

@kevsecurity
Contributor

Hey @kevsecurity, did you observe this metric working?

Not looked at it! Maybe @lambdanis knows about it?

@lambdanis
Contributor

@pratiklotia @sp3nx0r Did you observe rate limiting itself working as expected? I'm not sure whether this metric ever worked correctly before; I'll check.

@pratiklotia
Author

@lambdanis When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down. It goes further down when the rate-limit window is wider (say, 1m instead of 10s). That indicates that rate limiting is working as expected.

@sp3nx0r

sp3nx0r commented Mar 5, 2025

Event graphs show a sawtooth pattern from the rate limiting, so that's definitely working. But the rate-limit metric remains 0 (it's been 0 "forever", it seems, based on our metric retention).

[screenshots attached]

@lambdanis
Contributor

When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down.

I see, the gotcha here is that Tetragon provides two different rate-limiting mechanisms. You're using rate limits embedded in a TracingPolicy, but that is not what the tetragon_export_ratelimit_events_dropped_total metric measures. The metric counts events rate limited on export, which is configured with the export-rate-limit flag or the tetragon.exportRateLimit Helm value. That's actually why the metric got renamed: to make it clear that it's tied to export, not to policy.

There is no metric reporting events rate limited by a policy at the moment, I'm afraid. It would definitely be useful, so feel free to open an issue, or a PR, to add it.
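
For anyone interested in picking that up, such a metric could look roughly like the sketch below. The metric name, label, and helper are hypothetical, not something that exists in the current codebase:

package ratelimitmetrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter for events dropped by a TracingPolicy rate-limit
// action, labeled by policy so individual policies can be tuned.
var policyRateLimitDropped = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tetragon_policy_ratelimit_events_dropped_total",
		Help: "Number of events dropped by a TracingPolicy rate-limit action.",
	},
	[]string{"policy"},
)

// RecordPolicyDrop would be called wherever a policy rate limiter drops an event.
func RecordPolicyDrop(policy string) {
	policyRateLimitDropped.WithLabelValues(policy).Inc()
}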
