enforce max series for metrics queries #4525

Open

wants to merge 18 commits into main
Conversation

Contributor

@ie-pham ie-pham commented Jan 7, 2025

What this PR does: Adds a config option to enforce a maximum number of time series returned by a metrics query. The limit is enforced at four levels: the front-end combiner, the querier combiner, metrics-generator local blocks, and metrics evaluation. The configuration is set in the query-frontend config and is passed to all levels as maxSeries in the QueryRangeRequest proto.

New config: max_response_series (default: 1000)
Setting the value to 0 will disable this feature.
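
For reference, the setting sits under the query-frontend config; the exact nesting here is assumed from the docs snippet reviewed below:

query_frontend:
  metrics:
    # Maximum number of time series returned for a metrics query.
    # Setting this to 0 disables the limit.
    [max_response_series: <int> | default = 1000]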

Approach: Keep track of the number of series for every observe and observeSeries call and exit as soon as the limit is reached. Whatever series were generated up to that point are truncated at the limit and returned as partial results. This may mean the partial results are not very useful, since each series could contain as little as a single data point.
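
A minimal sketch of that idea, with illustrative names only (not the actual combiner types):

// seriesLimiter tracks how many series have been observed so far.
type seriesLimiter struct {
	maxSeries   int // 0 disables the limit
	seriesCount int
}

// observeSeries adds newly observed series and reports whether the limit is hit.
func (l *seriesLimiter) observeSeries(newSeries int) bool {
	l.seriesCount += newSeries
	return l.maxSeries > 0 && l.seriesCount >= l.maxSeries
}

// truncateSeries trims a result set down to the configured maximum.
func truncateSeries[S any](series []S, maxSeries int) []S {
	if maxSeries > 0 && len(series) > maxSeries {
		return series[:maxSeries]
	}
	return series
}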

Which issue(s) this PR fixes:
Fixes #4219

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@knylander-grafana knylander-grafana left a comment

Thank you for adding docs.


metrics:
# Maximum number of time series returned for a metrics query.
[max_response_series: <int> | default = 1000]

Member

this is an interesting choice. normally we would communicate the max series through a query param from the frontend to the queriers. the negative of your approach is that we have to make sure that 2 settings are aligned or tempo may appear subtly broken. the advantage is that we don't repeatedly marshal something like series=1000 once for every subquery.

can you bring this up with the team and see if we have consensus either way?
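
For context, a rough sketch of the query-param alternative mentioned above, with an illustrative parameter name:

package frontend // illustrative placement

import (
	"net/url"
	"strconv"
)

// addMaxSeriesParam sketches the query-param approach: the frontend appends
// the limit to each subquery instead of relying on matching config on both
// sides. The parameter name is illustrative.
func addMaxSeriesParam(subqueryURL string, maxSeries int) (string, error) {
	u, err := url.Parse(subqueryURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("maxSeries", strconv.Itoa(maxSeries)) // marshaled once per subquery
	u.RawQuery = q.Encode()
	return u.String(), nil
}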

Member

this does make me wonder if we should have a shared section of the config for querying like we do for storage. that feels like overkill for one setting tho.

Contributor Author

switching to passing the config in the request proto since it needs to get passed all the way to the metrics evaluator

Member

should this be removed from the docs then?

Contributor

@javiermolinar javiermolinar left a comment

How does the metric query look now? Are the series evenly distributed? I'm asking because we have an issue with exemplars where we enforce a similar limit, and they appear to be skewed



@@ -62,9 +72,22 @@ func NewQueryRange(req *tempopb.QueryRangeRequest) (Combiner, error) {
diff := diffResponse(prevResp, resp)
// store resp for next diff
prevResp = resp
prevTotalSeries := totalSeries
totalSeries += len(diff.Series)

Member

i don't think this logic is correct for streaming. this assumes that all of diff.Series is new. i think you should just check resp.Series directly as it represents the complete response at this time.

also, it seems odd that in quit we use combiner.MaxSeriesReached() but we use a different calculation in diff and finalize. can we not use combiner.MaxSeriesReached() in all cases?
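
A sketch of that suggestion (signature illustrative, not the actual combiner code):

// maxSeriesReached derives the check from the complete response rather than
// summing len(diff.Series), which assumes every diffed series is new.
func maxSeriesReached[S any](respSeries []S, maxSeries int) bool {
	return maxSeries > 0 && len(respSeries) >= maxSeries
}

Calling this with resp.Series in diff and finalize would also make the check consistent with the one used in quit.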

@@ -34,6 +34,7 @@ func newQueryRangeStreamingGRPCHandler(cfg Config, next pipeline.AsyncRoundTripp
if req.Step == 0 {
req.Step = traceql.DefaultQueryRangeStep(req.Start, req.End)
}
req.MaxSeries = uint32(cfg.Metrics.Sharder.MaxResponseSeries)

Member

i'm a little weirded out by just overwriting this value. should we 400 if this is set to non-zero b/c we don't accept it being set externally? do we have any other settings that use this pattern we can pull from?

Member

perhaps we should treat it the same way we do limits? if it's specified in the request (and not larger than the max) honor it else use the max?
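
Something like the usual limits treatment, as a sketch (not the final code):

// clampMaxSeries sketches the limits-style treatment: honor a value supplied
// on the request as long as it is positive and not larger than the configured
// maximum, otherwise fall back to the configured maximum.
func clampMaxSeries(requested, configuredMax uint32) uint32 {
	if requested > 0 && requested <= configuredMax {
		return requested
	}
	return configuredMax
}

The handler would then do something like req.MaxSeries = clampMaxSeries(req.MaxSeries, uint32(cfg.Metrics.Sharder.MaxResponseSeries)) instead of overwriting the value unconditionally.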

Contributor Author

this makes sense

@@ -271,6 +273,7 @@ func (s *queryRangeSharder) buildBackendRequests(ctx context.Context, tenantID s
FooterSize: m.FooterSize,
// DedicatedColumns: dc, for perf reason we pass dedicated columns json in directly to not have to realloc object -> proto -> json
Exemplars: exemplars,
MaxSeries: uint32(s.cfg.MaxResponseSeries),

Member

this method is cloning the upstream request. should we do that here as well instead of copying the value from the config?

@@ -115,6 +116,7 @@ func (s queryRangeSharder) RoundTrip(pipelineRequest pipeline.Request) (pipeline
}
}
req.Exemplars = maxExemplars
req.MaxSeries = uint32(s.cfg.MaxResponseSeries)

Member

i find it confusing that we set this value on the request in the handler, again here, and in the backend requests below.

@@ -540,16 +541,24 @@ func (i *instance) QueryRange(ctx context.Context, req *tempopb.QueryRangeReques
for _, p := range processors {
err = p.QueryRange(ctx, req, rawEval, jobEval)
if err != nil {
fmt.Printf("error in query range: %v\n", err)

Member

remove?

bunch of similar test lines below


// listens for series counts and updates the total series count.
go func() {
for count := range seriesCountCh {

Member

can we find a simpler way to do this? there's a lot of moving parts here to just count the current series.

i'm also concerned about the cost of repeatedly calling .Results() on the series. does this require a lot of work? how does it compare for different functions like rates/quantiles/avgs/etc?

also, the metrics aggregated at this level are different than those in the frontend. for instance quantile_over_time will produce considerably more series here than at the frontend b/c it is a histogram here that is aggregated into quantiles in the frontend. how does this work into our solution?

also also, if series are truncated here do we pass that back in the response?

Contributor Author

Correct me if I'm wrong but I thought the series from "jobEval.ObserveSeries" are already aggregated series. So at worst, we would have the truncated amount of series. But if the environment has more than one metrics-generator, there are additional series that would get aggregated again. I think this would suffice, but what do you think?

The .Results() is calling this https://github.com/grafana/tempo/blob/main/pkg/traceql/engine_metrics.go#L716

Since these are the raw, not-yet-aggregated results, it's a lot harder to count them. And since (I think) we avoid calling this repeatedly for performance reasons, calling it only once at the end https://github.com/grafana/tempo/blob/main/modules/generator/instance.go#L548, do you think maybe we should just not keep track of the series on the head and wal blocks?

Member

Correct me if I'm wrong but I thought the series from "jobEval.ObserveSeries" are already aggregated series. So at worst, we would have the truncated amount of series.

It's dependent on the metric type and aggregation level. So when you aggregate quantiles at the lower levels you are actually aggregating histograms and passing those up to the frontend. A histogram has considerably more series than a quantile: up to N times more, where N is the number of buckets in the histogram.

When you aggregate simple rates, the number of series is the same at every level.
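
To make that concrete with illustrative numbers: a quantile_over_time query grouped into 100 underlying series, backed by a histogram with 64 buckets, can produce up to 100 × 64 = 6,400 series at the generator level, which the frontend collapses back to one series per group per requested quantile; a plain rate over the same grouping stays at 100 series at every level.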

The .Results() is calling this https://github.com/grafana/tempo/blob/main/pkg/traceql/engine_metrics.go#L716

yeah, that looks potentially expensive. I did track down all the .Observe calls at the series level and the count they are returning seems a lot cheaper

@@ -326,6 +326,10 @@ type QueryRangeCombiner struct {
req *tempopb.QueryRangeRequest
eval *MetricsFrontendEvaluator
metrics *tempopb.SearchMetrics

maxSeries int

Member

you only need 2 of these 3, right?

why not remove maxSeriesReached and change the function below to

func (q *QueryRangeCombiner) MaxSeriesReached() bool {
	return q.seriesCount >= q.maxSeries
}

@@ -313,15 +313,15 @@ type VectorAggregator interface {
// RangeAggregator sorts spans into time slots
// TODO - for efficiency we probably combine this with VectorAggregator (see todo about CountOverTimeAggregator)
type RangeAggregator interface {
Observe(s Span)
Observe(s Span) int

Member

at the span level the GroupingAggregator is stored directly on the MetricsAggregate object and is the only thing we store there:

	a.agg = NewGroupingAggregator(a.op.String(), func() RangeAggregator {
		return NewStepAggregator(q.Start, q.End, q.Step, innerAgg)
	}, a.by, byFunc, byFuncLabel)

of the span level aggregators it also seems like we actually only need to count series in the GroupingAggregator. should we change a.agg to be a concrete type and only add the ability to count series to that object, instead of polluting all these interfaces and returning meaningless data (0s) from some Aggregators?
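
A rough sketch of that direction, with illustrative names rather than the actual types:

// Sketch: store the grouping aggregator as a concrete type on the metrics
// aggregate and expose the series count only there, leaving Observe on the
// other aggregator interfaces unchanged. Names are illustrative.
type groupingAggregatorSketch struct {
	series map[string]struct{} // one entry per distinct label grouping
}

// SeriesCount reports how many distinct series have been observed so far.
func (g *groupingAggregatorSketch) SeriesCount() int {
	return len(g.series)
}

type metricsAggregateSketch struct {
	agg *groupingAggregatorSketch // concrete type instead of RangeAggregator
}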


@@ -167,6 +167,7 @@ func (m *MetricsCompare) observe(span Span) {
}
totals[i]++
})
return len(m.seriesAgg.Results())

Member

can we get tests on all these series aggregators' observes returning the total series count?
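
A sketch of the kind of test this could be; newTestAggregator and newTestSpan are hypothetical helpers, not existing functions:

package traceql_test // illustrative placement

import "testing"

// Sketch only: feed spans with distinct grouping values into an aggregator
// and assert the value returned by observe matches the number of distinct
// series. newTestAggregator and newTestSpan are hypothetical helpers.
func TestObserveReturnsSeriesCount(t *testing.T) {
	agg := newTestAggregator(t)

	var got int
	for i := 0; i < 5; i++ {
		// each span carries a distinct grouping value, so it starts a new series
		got = agg.observe(newTestSpan(t, i))
	}

	if got != 5 {
		t.Fatalf("expected 5 series, got %d", got)
	}
}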

Successfully merging this pull request may close these issues: Limit series produced by TraceQL Metrics

5 participants