Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

approx_percentile: Does not enforce constant percentage between input rows within an aggregation #24592

Open
yuandagits opened this issue Feb 18, 2025 · 5 comments
Labels

Comments

@yuandagits
Copy link
Contributor

yuandagits commented Feb 18, 2025

In Meta production environment, we noticed failures in the aggregate function approx_percentile with Presto-cpp and not Presto-Java. Specifically, certain queries in Presto-cpp throws an exception like VeloxUserError: base->equalValueAt(base, baseRow, baseFirstRow) Percentile argument must be constant for all input rows.

According to the Presto documentation, this is actually the desired behavior.

However, in Presto-Java, it seems like the last percentile value amongst the rows is respected and all of the previous percentiles are ignored. This goes against the intended behavior described by the documentation.

Presto-Java Behavior

SELECT 
  APPROX_PERCENTILE(score, pct) 
FROM 
  (
    SELECT 
      * 
    FROM 
      (
        VALUES 
          (0.01, 0.1), 
          (0.02, 0.5), 
          (0.03, 0.9)
      ) AS t(score, pct)
  )

 _col0 
-------
  0.03 
(1 row)


SELECT 
  APPROX_PERCENTILE(score, pct) 
FROM 
  (
    SELECT 
      * 
    FROM 
      (
        VALUES 
          (0.01, 0.9), 
          (0.02, 0.5), 
          (0.03, 0.1)
      ) AS t(score, pct)
  )
 _col0 
-------
  0.01 
(1 row)

SELECT 
  APPROX_PERCENTILE(score, pct) 
FROM 
  (
    SELECT 
      * 
    FROM 
      (
        VALUES 
          (0.01, 0.9), 
          (0.02, 0.1), 
          (0.03, 0.5)
      ) AS t(score, pct)
  )
 _col0 
-------
0.02

Presto-C++ Behavior

SELECT 
  APPROX_PERCENTILE(score, pct) 
FROM 
  (
    SELECT 
      * 
    FROM 
      (
        VALUES 
          (0.01, 0.1), 
          (0.02, 0.5), 
          (0.03, 0.9)
      ) AS t(score, pct)
  )

Query 20250218_202100_07481_tj3cr failed: base->equalValueAt(base, baseRow, baseFirstRow) Percentile argument must be constant for all input rows: 0.5 vs. 0.1

The behavior in Java actually seems unintentional and may produce results that the user is not knowingly expecting. We are raising this issue to the community to solicit ideas for how we can move forward.

Thanks!

cc @spershin @rschlussel @kgpai @amitkdutta @Yuhta @kagamiori

@kaikalur
Copy link
Contributor

Java behavior is correct and we have seen people use non-const second arg.

@rschlussel
Copy link
Contributor

Java behavior is correct and we have seen people use non-const second arg.

It's weird behavior to just use whichever value happens to come last for the whole aggregation. Even allowing different values only for different groups is a bit surprising, that each group is effectively running a different aggregation. IMO throwing an exception is the better thing to do.

@kaikalur
Copy link
Contributor

Java behavior is correct and we have seen people use non-const second arg.

It's weird behavior to just use whichever value happens to come last for the whole aggregation. Even allowing different values only for different groups is a bit surprising, that each group is effectively running a different aggregation. IMO throwing an exception is the better thing to do.

Well we can error out if you get different values for different rows but it doesn't have to be constant. I guess that's what the title is saying?

@rschlussel
Copy link
Contributor

Ah okay, so i think we agree. It doesn't have to be a literal value, but it should be the same value for all rows.

@yuandagits
Copy link
Contributor Author

@kaikalur Apologies for being unclear, I've updated with more examples and the title to be (hopefully more clear)

@yuandagits yuandagits changed the title approx_percentile: Does not enforce constant percentage between all input rows approx_percentile: Does not enforce constant percentage between input rows within an aggregation Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🆕 Unprioritized
Development

No branches or pull requests

3 participants