Wrong query result from aggregation Operators #703
Thanks for reaching out, let's jump into it!
I believe there may be a typo in this example; as described, this configuration has some overlapping DCs.
This sounds like maybe the servergroup configuration isn't quite right -- the NodeReplacer (which does the max/rewrite) runs at the top level, while all of the servergroup merging is done lower down. So this does sound like an issue with the servergroup configuration rather than with the aggregation rewrite in NodeReplacer.
This seems correct; it looks like an issue with the promxy servergroup config not quite matching your setup.
If we are running into
I'd be happy to give a hand here! Could you provide your promxy config, or at least the servergroup configuration, as well as re-iterating the downstreams, their data, and the desired merging behavior? I think from there we'll be able to make some progress :)
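The two merge layers described above can be sketched with a toy model (an assumption based on this thread, not promxy's actual code): raw series with identical label sets collapse into one copy when server-group results are merged, while an already-aggregated value is merged across groups with `max()`:

```python
# Toy model of the two merge layers described in this thread
# (an assumption for illustration, not promxy's actual code).

def merge_series(groups):
    """Lower layer: union of raw series, keyed by their full label set."""
    seen = {}
    for series_list in groups:
        for labels, value in series_list:
            seen.setdefault(labels, value)  # duplicates collapse to one copy
    return seen

def merge_aggregation(values):
    """Top layer: NodeReplacer-style merge of per-group aggregate values."""
    return max(values)

# Two server groups that both see the shared "google-us-dc1" tenant:
g1 = [(("instance=a", "dc=us-dc1"), 1), (("instance=b", "dc=dc1"), 1)]
g2 = [(("instance=a", "dc=us-dc1"), 1), (("instance=c", "dc=dc2"), 1)]

print(len(merge_series([g1, g2])))            # 3: the overlap deduped correctly
print(merge_aggregation([len(g1), len(g2)]))  # 2: max of per-group counts
```

The gap between the two printed values shows the failure mode under discussion: each group's `count()` is evaluated locally (2 each) and then maxed, which undercounts relative to counting the merged, deduplicated set (3).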
@jacksontj here is the promxy configuration for us to kick off this analysis :D

```yaml
promxy:
  server_groups:
    - http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
        http_headers:
          X-Scope-OrgID: dc1|google-us-dc1|google-us-dc2
      path_prefix: /prometheus
      remote_read: false
      scheme: https
      static_configs:
        - targets:
            - mimir.dc1.local:8080
    - http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
        http_headers:
          X-Scope-OrgID: dc2|google-us-dc1|google-us-dc2
      path_prefix: /prometheus
      remote_read: false
      scheme: https
      static_configs:
        - targets:
            - mimir.dc2.local:8080
    - http_client:
        dial_timeout: 1s
        tls_config:
          insecure_skip_verify: true
        http_headers:
          X-Scope-OrgID: dc3|google-eu-dc1|google-eu-dc2
      path_prefix: /prometheus
      remote_read: false
      scheme: https
      static_configs:
        - targets:
            - mimir.dc3.local:8080
```
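One thing worth noting in the config above: the `X-Scope-OrgID` headers give the first two server groups overlapping tenant sets, so `google-us-dc1` and `google-us-dc2` are served by both NA groups. A quick sketch of the overlap:

```python
# Tenant sets taken from the X-Scope-OrgID headers in the config above.
sg = {
    "sg1": set("dc1|google-us-dc1|google-us-dc2".split("|")),
    "sg2": set("dc2|google-us-dc1|google-us-dc2".split("|")),
    "sg3": set("dc3|google-eu-dc1|google-eu-dc2".split("|")),
}

# The two NA server groups both serve the shared google-us tenants:
print(sorted(sg["sg1"] & sg["sg2"]))  # ['google-us-dc1', 'google-us-dc2']

# The EU group does not overlap with either NA group:
print(sorted(sg["sg1"] & sg["sg3"]))  # []
```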
👋 I work with @paulojmdias and, though I have less experience with this stack, I think I can answer some of the questions. I have tried in my message to be clear about when I am referring to a server group and when I am referring to a datacenter (DC). In the example queries, we use
Mimir dc3 server group is our EU group; I believe eu-dc3/4/5 were omitted for simplicity in the original post. It could have been:
(The same can be assumed for the servergroup config Paulo shared yesterday, dc3 server group
Mimir dc1 server group and mimir dc2 server group are both in NA and separately contain metrics from themselves. Our cloud metrics — our expected behavior when making a query such as
I have the following setup with 3 server groups.

When we do the following query:

```
count(up{label_key="label_value"}) by (region)
```

we have the following results:

If we remove the aggregator and do the query `count(up{label_key="label_value"})`, I expect to get the value `918`, but in truth promxy returns the max value from the 3 server groups we have, which is `404`, and in this case comes from the sum of the data which resides on Grafana mimir dc1.

I also ran a test: I added a dedicated label to each server group, named `__dc__`, and when we do the query `count(count(up{stack="persistence"}) without (__dc__))`, we get the desired value, which is `918`.

However, if we run the expected query `count(up{stack="persistence"})`, we get the value `980`, since it counts the values from `google-us-dc1` and `google-us-dc2` twice: when we add custom labels per server group, we are telling promxy that the data on each server group is unique, which is not the case.

Although we are using Mimir, in the end it is the Prometheus query API that we are using, so I don't feel it is related.

We are not overriding the `prefer_max` option, and we are using version `v0.0.91`.

I already tried to debug the Promxy code, but I ran out of ideas and decided to open this issue. I'm open to contributing either way if I find something 🙌
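The three reported values (`404`, `980`, `918`) can be reconciled with a small simulation. The per-tenant series counts below are entirely hypothetical, chosen only so that the totals match the numbers in this issue; everything else follows from the tenant overlap between the first two server groups:

```python
# Hypothetical per-tenant unique series counts, chosen only so the
# totals reproduce the numbers reported in this issue.
tenants = {
    "dc1": 342, "dc2": 300, "dc3": 214,
    "google-us-dc1": 30, "google-us-dc2": 32,
}
# Which tenants each server group reads (from the X-Scope-OrgID headers):
groups = {
    "sg1": ["dc1", "google-us-dc1", "google-us-dc2"],
    "sg2": ["dc2", "google-us-dc1", "google-us-dc2"],
    "sg3": ["dc3"],
}
per_group = {g: sum(tenants[t] for t in ts) for g, ts in groups.items()}

# Identical labels across groups -> the aggregation is merged with max():
print(max(per_group.values()))  # 404 (the value promxy returned)

# Unique __dc__ label per group -> every group's series look distinct,
# so a plain count() effectively sums, double-counting the shared tenants:
print(sum(per_group.values()))  # 980

# Deduplicating by tenant instead gives the expected answer:
print(sum(tenants.values()))    # 918
```

This makes the reported discrepancy mechanical: `980 - 918 = 62` is exactly the shared `google-us-*` data counted twice, and `404` is simply the largest single group's total.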