Wrong query result from aggregation Operators #703

Open
paulojmdias opened this issue Dec 13, 2024 · 3 comments · May be fixed by #704
@paulojmdias

paulojmdias commented Dec 13, 2024

I have the following setup with 3 server groups

promxy
  -> Grafana mimir dc1
      -> dc1
      -> google-us-dc1
      -> google-us-dc2
  -> Grafana mimir dc2
      -> dc2
      -> google-us-dc1
      -> google-us-dc2
  -> Grafana mimir dc3
      -> dc3
      -> google-eu-dc1
      -> google-eu-dc2

When we run the query count(up{label_key="label_value"}) by (region), we get the following results:

{region="dc1"}            342
{region="google-us-dc1"}  31
{region="google-us-dc2"}  31
{region="dc2"}            341
{region="dc3"}            30
{region="google-eu-dc1"}  25
{region="google-eu-dc2"}  24
{region="google-eu-dc3"}  29
{region="google-eu-dc4"}  36
{region="google-eu-dc5"}  29

If we remove the by (region) grouping and run count(up{label_key="label_value"}), I expect the value 918, but promxy is actually returning the max value from the 3 server groups we have, which is 404. In this case it comes from the sum of the data that resides on Grafana mimir dc1:

{region="dc1"}            342
{region="google-us-dc1"}  31
{region="google-us-dc2"}  31
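
The arithmetic behind 404 versus 918 can be checked with a small model. This is only a sketch of one hypothesis, not promxy's actual code path: it assumes the label-less count() is answered separately by each server group and that the three answers, which share the same empty label set, are merged as one series by taking the max. The group names are made up for illustration, and the google-eu-dc3/4/5 membership of the dc3 group is an assumption.

```python
# Per-region series counts, as reported by count(up{...}) by (region).
REGION_COUNTS = {
    "dc1": 342, "google-us-dc1": 31, "google-us-dc2": 31,
    "dc2": 341,
    "dc3": 30, "google-eu-dc1": 25, "google-eu-dc2": 24,
    "google-eu-dc3": 29, "google-eu-dc4": 36, "google-eu-dc5": 29,
}

# Regions visible to each server group (hypothetical group names;
# the eu-dc3..5 membership is an assumption).
SERVER_GROUPS = {
    "mimir-dc1": ["dc1", "google-us-dc1", "google-us-dc2"],
    "mimir-dc2": ["dc2", "google-us-dc1", "google-us-dc2"],
    "mimir-dc3": ["dc3", "google-eu-dc1", "google-eu-dc2",
                  "google-eu-dc3", "google-eu-dc4", "google-eu-dc5"],
}


def pushed_down_count(regions):
    """What one server group would answer for a label-less count()."""
    return sum(REGION_COUNTS[r] for r in regions)


per_group = {g: pushed_down_count(rs) for g, rs in SERVER_GROUPS.items()}

# Assumption: identical (empty) label sets are merged by taking the max.
merged_by_max = max(per_group.values())

# What the user expects: every region counted exactly once.
expected_total = sum(REGION_COUNTS.values())

print(per_group)
print(merged_by_max, expected_total)
```

Under these assumptions the model gives 404 for the merged result and 918 for the expected total, which matches the numbers reported here.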

I also ran a test: I added a dedicated label, named __dc__, to each server group. With the query count(count(up{stack="persistence"}) without (__dc__)) we get the desired value, 918.
However, with the expected query count(up{stack="persistence"}) we get 980, because the series from google-us-dc1 and google-us-dc2 are counted twice: adding a custom label per server group tells promxy that the data in each server group is unique, which is not the case.
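
To see where 980 and 918 come from, here is a minimal sketch that models each series as a set of labels. The instance names are synthetic, the group names stand in for the __dc__ values, and it assumes the replicated google-us series carry identical labels in both server groups; none of this is taken from promxy itself.

```python
# Series counts per region inside each server group (keys double as
# hypothetical __dc__ label values).
GROUPS = {
    "dc1": {"dc1": 342, "google-us-dc1": 31, "google-us-dc2": 31},
    "dc2": {"dc2": 341, "google-us-dc1": 31, "google-us-dc2": 31},
    "dc3": {"dc3": 30, "google-eu-dc1": 25, "google-eu-dc2": 24,
            "google-eu-dc3": 29, "google-eu-dc4": 36, "google-eu-dc5": 29},
}


def series(region, n, extra=()):
    """Build n synthetic label sets for one region."""
    return {
        frozenset({("region", region), ("instance", f"{region}-{i}"), *extra})
        for i in range(n)
    }


def merged(with_dc_label):
    """Union of all series across server groups, optionally tagged with __dc__."""
    out = set()
    for group, regions in GROUPS.items():
        extra = (("__dc__", group),) if with_dc_label else ()
        for region, n in regions.items():
            out |= series(region, n, extra)
    return out


def drop_label(label_sets, name):
    """Mimic `without (name)`: remove one label, collapsing duplicates."""
    return {frozenset(kv for kv in ls if kv[0] != name) for ls in label_sets}


tagged = merged(with_dc_label=True)
print(len(tagged))                        # models count(up{...})
print(len(drop_label(tagged, "__dc__")))  # models count(count(up{...}) without (__dc__))
```

With the __dc__ label attached, the 62 replicated google-us series become distinct in each group, so the first figure is 62 higher than the second.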

Although we are using Mimir, in the end it is a Prometheus query API that we are using, so I don't think Mimir is the cause.

We are not overriding the prefer_max option and we are using the version v0.0.91.

I already tried debugging the promxy code, but I ran out of ideas and decided to open this issue. I'm open to contributing either way if I find something 🙌

@jacksontj
Owner

Thanks for reaching out, let's jump into it!

I have the following setup with 3 server groups

I believe there may be a typo in this example; as described, this configuration has some overlapping DCs (google-us-dc1 is in mimir dc1 and dc2). Given that the example below has eu-dc1..5 -- I'm assuming mimir dc2 was supposed to be eu? (since otherwise I don't see eu dc3,4,5).

but promxy is actually returning the max value from the 3 server groups we have,

This sounds like maybe the servergroup configuration isn't quite right -- as the NodeReplacer (that does the max/rewrite) is done at the top-level. All of the servergroup merging is done lower down. So this does sound like an issue with the servergroup configuration rather than the aggregation rewrite in NodeReplacer.

Although we are using Mimir, in the end it is a Prometheus query API that we are using, so I don't think Mimir is the cause.

This seems correct; it looks like an issue with the promxy servergroup config not quite matching your setup.

We are not overriding the prefer_max option and we are using the version v0.0.91.

If we are running into prefer_max we are definitely hitting a servergroup configuration issue. The prefer_max is intended to handle merging of data within a servergroup (defined as a set of API endpoints that "have the same data").

I ran out of ideas and decided to open this issue

I'd be happy to give a hand here! Could you provide your promxy config, or at least the servergroup configuration? It would also help to restate the downstreams, their data, and the desired merging behavior. I think from there we'll be able to make some progress :)

@paulojmdias
Author

@jacksontj here is the promxy configuration for us to kick off this analysis :D

promxy:
  server_groups:
  - http_client:
      dial_timeout: 1s
      tls_config:
        insecure_skip_verify: true
    http_headers:
      X-Scope-OrgID: dc1|google-us-dc1|google-us-dc2
    path_prefix: /prometheus
    remote_read: false
    scheme: https
    static_configs:
    - targets:
      - mimir.dc1.local:8080
  - http_client:
      dial_timeout: 1s
      tls_config:
        insecure_skip_verify: true
    http_headers:
      X-Scope-OrgID: dc2|google-us-dc1|google-us-dc2
    path_prefix: /prometheus
    remote_read: false
    scheme: https
    static_configs:
    - targets:
      - mimir.dc2.local:8080
  - http_client:
      dial_timeout: 1s
      tls_config:
        insecure_skip_verify: true
    http_headers:
      X-Scope-OrgID: dc3|google-eu-dc1|google-eu-dc2
    path_prefix: /prometheus
    remote_read: false
    scheme: https
    static_configs:
    - targets:
      - mimir.dc3.local:8080

@mut3

mut3 commented Jan 14, 2025

👋 I work with @paulojmdias and though I have less experience with this stack, I think I can answer some of the questions.

I have tried in my message to be clear about when I am referring to a server group and when I am referring to a datacenter (DC). In the example queries, we use region as a label for the individual datacenters.

I believe there may be a typo in this example; as described, this configuration has some overlapping DCs (google-us-dc1 is in mimir dc1 and dc2). Given that the example below has eu-dc1..5 -- I'm assuming mimir dc2 was supposed to be eu? (since otherwise I don't see eu dc3,4,5).

The mimir dc3 server group is our EU group. I believe eu-dc3/4/5 were omitted for simplicity in the original post; it could have been:

promxy
  ...
  -> Grafana mimir dc3
      -> dc3
      -> google-eu-dc1
      -> google-eu-dc2
      -> google-eu-dc3
      -> google-eu-dc4
      -> google-eu-dc5

(The same can be assumed for the servergroup config Paulo shared yesterday, dc3 server group X-Scope-OrgID headers include all google-eu DCs)
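
For illustration, the full header value for the dc3 server group could be built as below. This is a sketch that assumes the tenant IDs follow the same naming pattern as the region names and are joined with "|", as in the config shared above; `scope_org_id` is a made-up helper name.

```python
def scope_org_id(tenants):
    """Join tenant IDs with '|', the separator used in the shared config."""
    return "|".join(tenants)


# Assumption: tenant IDs match the region names listed for the EU group.
EU_TENANTS = ["dc3"] + [f"google-eu-dc{i}" for i in range(1, 6)]

print(scope_org_id(EU_TENANTS))
# dc3|google-eu-dc1|google-eu-dc2|google-eu-dc3|google-eu-dc4|google-eu-dc5
```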

I'd be happy to give a hand here! Could you provide your promxy config, or at least the servergroup configuration? It would also help to restate the downstreams, their data, and the desired merging behavior. I think from there we'll be able to make some progress :)

The mimir dc1 and mimir dc2 server groups are both in NA, and each contains its own metrics. Our cloud metrics from google-us-dc1/google-us-dc2 are sent to both for redundancy. We can assume that the series with region=~"google-us-dc[12]" in the mimir dc1 server group match those in the mimir dc2 server group during normal operation. For all other DCs, there is only one copy of their data across the three server groups.

Our expected behavior when making a query such as count(up{label_key="label_value"}) is that series from google-us-dc1 and google-us-dc2 are deduplicated and all other DCs are retained. See the original post for actual behavior.
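
The desired merge can be sketched as follows: find the regions that appear in more than one server group, keep a single copy of each, and leave everything else untouched. The group names are hypothetical, and the sketch assumes that replicated regions report the same count in every group that holds them, as in normal operation.

```python
from collections import Counter

# Per-region counts as seen by each server group (hypothetical group names).
RESULTS = {
    "mimir-dc1": {"dc1": 342, "google-us-dc1": 31, "google-us-dc2": 31},
    "mimir-dc2": {"dc2": 341, "google-us-dc1": 31, "google-us-dc2": 31},
    "mimir-dc3": {"dc3": 30, "google-eu-dc1": 25, "google-eu-dc2": 24,
                  "google-eu-dc3": 29, "google-eu-dc4": 36, "google-eu-dc5": 29},
}

# Regions held by more than one server group are the replicated ones.
seen = Counter(region for per_region in RESULTS.values() for region in per_region)
replicated = sorted(region for region, n in seen.items() if n > 1)


def dedupe_by_region(results):
    """Keep one value per region; replicas are assumed to agree, so max is safe."""
    merged = {}
    for per_region in results.values():
        for region, count in per_region.items():
            merged[region] = max(count, merged.get(region, 0))
    return merged


deduped = dedupe_by_region(RESULTS)
print(replicated)
print(sum(deduped.values()))
```

Here only google-us-dc1 and google-us-dc2 are flagged as replicated, and the deduplicated total matches the 918 from the original post.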
