
Ratelimit metrics are not working as expected in v1.3.0 #3444

Open
pratiklotia opened this issue Feb 27, 2025 · 9 comments
Labels
kind/bug Something isn't working

Comments

@pratiklotia

What happened?

We upgraded tetragon to v1.3.0.

The tetragon_export_ratelimit_events_dropped_total metric (renamed from tetragon_ratelimit_dropped_total in this release), which tracks the number of rate-limited events, does not seem to work as expected. The metric exists but always reports a value of 0.

Log output included below.

Fix: I haven't looked into the codebase yet, but I suspect some part of the code was missed while renaming the metric.

Impact: We don't have visibility into how well our rate-limit filters are working, which makes it hard to tune the policy correctly.

Tetragon Version

v1.3.0

Kernel Version

6.8.0-1016-aws

Kubernetes Version

Server version 1.30

Bugtool

No response

Relevant log output

tetragon_export_ratelimit_events_dropped_total{account="<snip>", cluster="<snip>", cluster_group="<snip>", endpoint="metrics", instance="<snip>", job="tetragon", namespace="<snip>", node="<snip>", pod="<snip>", prometheus="<snip>", provider="<snip>", service="tetragon", tetragon_pod="<snip>", vpc="<snip>", zone="<snip>"} -  0

Anything else?

No response

pratiklotia added the kind/bug (Something isn't working) label on Feb 27, 2025
@mtardy
Member

mtardy commented Feb 28, 2025

Hey, so this is the commit that did the rename: 1848253 ("exporter: Rename ratelimit drops counter"), and here's the full PR that includes some other changes around that rename: #2890. This commit moves where the counter is incremented, but only slightly: 5231093 ("Move ratelimitmetrics inside pkg/exporter").

Here's an extract from the diff. It doesn't fundamentally change the code: instead of the counter being incremented in the Drop() function itself, it's now incremented by the caller, after the drop occurred.

diff --git a/pkg/exporter/exporter.go b/pkg/exporter/exporter.go
index bc8bf0729..2d21a0e70 100644
--- a/pkg/exporter/exporter.go
+++ b/pkg/exporter/exporter.go
@@ -56,6 +56,7 @@ func (e *Exporter) Start() error {
 func (e *Exporter) Send(event *tetragon.GetEventsResponse) error {
        if e.rateLimiter != nil && !e.rateLimiter.Allow() {
                e.rateLimiter.Drop()
+               rateLimitDropped.Inc()
                return nil
        }
diff --git a/pkg/ratelimit/ratelimit.go b/pkg/ratelimit/ratelimit.go
index 404395225..52c5512a4 100644
--- a/pkg/ratelimit/ratelimit.go
+++ b/pkg/ratelimit/ratelimit.go
@@ -78,5 +77,4 @@ func (r *RateLimiter) reportRateLimitInfo(encoder encoder.EventEncoder) {

 func (r *RateLimiter) Drop() {
        atomic.AddUint64(&r.dropped, 1)
-       ratelimitmetrics.RateLimitDropped.Inc()
 }
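
Looking at the moved call, the counter should still be bumped on every exporter-side drop. As a quick sanity check that the counter itself behaves, it can be exercised in isolation with prometheus/client_golang's testutil. This is only a minimal sketch; the counter registration below is illustrative and not copied from the Tetragon source:

package exporter_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// Illustrative stand-in for the rateLimitDropped counter shown in the diff above.
var rateLimitDropped = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "tetragon_export_ratelimit_events_dropped_total",
	Help: "Number of events dropped by the export rate limiter.",
})

func TestRateLimitDroppedIncrements(t *testing.T) {
	before := testutil.ToFloat64(rateLimitDropped)

	// Simulate the Send() path: the limiter disallows the event, the event is
	// dropped, and the caller increments the counter.
	rateLimitDropped.Inc()

	if diff := testutil.ToFloat64(rateLimitDropped) - before; diff != 1 {
		t.Fatalf("expected counter to increase by 1, got %v", diff)
	}
}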

My questions would be: did this ever work for you (seems like yes since you mention an upgrade)? If yes, for which version/commit? And how do you test this so we can try to reproduce on our side? Thanks!

@sp3nx0r

sp3nx0r commented Feb 28, 2025

We were on 1.1.2 until earlier this week, and it turns out this metric was 0 at that point as well. It seems as if the metric's Inc() doesn't actually get called by the agent when a rate limit is triggered. We do see very clear spikes when the rate limit kicks in, but tetragon_export_ratelimit_events_dropped_total is always 0 in our scrapes.

[screenshot attached]

@pratiklotia
Author

IIRC we weren't using rate-limit filters while on 1.1.2, so it's hard to say whether the metric worked or not.

@mtardy
Member

mtardy commented Mar 3, 2025

Hey @kevsecurity, did you observe this metric working?

@kevsecurity
Contributor

Hey @kevsecurity, did you observe this metric working?

Not looked at it! Maybe @lambdanis knows about it?

@lambdanis
Contributor

@pratiklotia @sp3nx0r Did you observe rate limiting itself working as expected? I'm not sure whether this metric ever worked correctly before; I'll check.

@pratiklotia
Author

@lambdanis When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down. It goes further down when the rate-limit window is wider (say, 1m instead of 10s). That indicates that rate limiting is working as expected.

@sp3nx0r

sp3nx0r commented Mar 5, 2025

Event graphs show a sawtooth pattern from the rate limiting, so that's definitely working. But the rate-limit metric remains 0 (it's been 0 "forever", it seems, based on our metric retention).

[screenshots attached]

@lambdanis
Contributor

When we add ratelimit actions to our TracingPolicies, we see the number of exported events go down.

I see, the gotcha here is that Tetragon provides two different rate-limiting mechanisms. You're using rate limits embedded in a TracingPolicy, but that is not what the tetragon_export_ratelimit_events_dropped_total metric measures. The metric counts events rate limited on export, which is configured with the export-rate-limit flag or the tetragon.exportRateLimit Helm value. That's actually why the metric got renamed: to make it clear that it's tied to export, not to policy.

There is no metric reporting events rate limited by a policy at the moment, I'm afraid. It would definitely be useful, so feel free to open an issue, or a PR, to add it.
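
For anyone interested in picking that up, such a metric could look roughly like the sketch below. The metric name, label, and helper are hypothetical, not something that exists in the current codebase:

package ratelimitmetrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter for events dropped by a TracingPolicy rate-limit
// action, labeled by policy so individual policies can be tuned.
var policyRateLimitDropped = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tetragon_policy_ratelimit_events_dropped_total",
		Help: "Number of events dropped by a TracingPolicy rate-limit action.",
	},
	[]string{"policy"},
)

// RecordPolicyDrop would be called wherever a policy rate limiter drops an event.
func RecordPolicyDrop(policy string) {
	policyRateLimitDropped.WithLabelValues(policy).Inc()
}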
