
Inconsistent behavior in json functions #24563

Open
bikramSingh91 opened this issue Feb 14, 2025 · 6 comments
bikramSingh91 commented Feb 14, 2025

While working on implementing JSON functions in Velox for Presto-CPP, I came across some inconsistencies that I wanted to highlight and get the community's opinion on whether it makes sense for us to fix them, either both in Presto-Java and Presto-CPP or just in Presto-CPP.

1. json_extract Produces Non-Canonicalized JSON as Output

We expect the JSON type to be canonicalized whenever it exists in a query. Since Presto does not store JSON as a type that can be written to files, this canonicalization is usually done when a VARCHAR column is converted to JSON using one of the JSON functions (typically json_parse). This matters because it affects the equality of two JSON objects: results become inconsistent if JSONs are sometimes canonicalized and sometimes not.

For example, suppose the input string is:

{ "key_2": 2, "key_3": 3, "key_1": 1 }

json_parse will canonicalize this to:

presto:default> select JSON_PARSE('{ "key_2": 2, "key_3": 3, "key_1": 1 }');
_col0
---------------------------------
{"key_1":1,"key_2":2,"key_3":3}

NOTE: canonicalization puts the keys in ascending order and removes spaces.
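For reference, the canonical form described above (keys in ascending order, whitespace removed) can be approximated in Python; this is only an illustrative sketch, not Presto's actual implementation:

```python
import json

def canonicalize(raw: str) -> str:
    # Parse, then re-serialize with sorted keys and no extra whitespace,
    # mirroring what json_parse's canonicalization produces.
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

print(canonicalize('{ "key_2": 2, "key_3": 3, "key_1": 1 }'))
# {"key_1":1,"key_2":2,"key_3":3}
```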

Now, if there are two inputs:

{ "key_2": 2, "key_3": 3, "key_1": 1 }
{ "key_1": 1, "key_2": 2, "key_3": 3 }

These should be considered equal JSONs when parsed and compared; otherwise, this can result in correctness issues. If we use json_extract, which returns a JSON object type, it would extract and return a non-canonicalized JSON:

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_3": 3, "key_1": 1 }', '$');
_col0
---------------------------------
{"key_2":2,"key_3":3,"key_1":1}

NOTE: Spaces are removed, but the keys remain in their original order (they are not sorted).

The same issue occurs in json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Example of a possible correctness issue:
A GROUP BY on the JSON column returning wrong results:

SELECT
    ("json_extract"((output_fields), '$.key')) c0
FROM (
    SELECT
        *
    FROM (
        VALUES
            ('{"key": {"inner_1":"1", "inner_2":"2"}}'),
            ('{"key": {"inner_2":"2", "inner_1":"1"}}')
    ) AS t (output_fields)
)
GROUP BY
    1;

              c0               
-------------------------------
 {"inner_1":"1","inner_2":"2"} 
 {"inner_2":"2","inner_1":"1"} 

Recommendation: The expectation is that once a VARCHAR is converted to JSON, it should be in a canonicalized form. Therefore, we should always canonicalize the resultant JSON in all functions that take VARCHAR as prospective JSON input and output a JSON.
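The recommended behavior can be sketched in Python (a hypothetical model, with `json_extract_canonical` as an invented name, not Presto code): parse and validate the whole input first, then emit the extracted value in canonical form.

```python
import json

def json_extract_canonical(raw: str, key: str):
    # Hypothetical sketch of the recommendation: fully parse (and thereby
    # validate) the input before extracting, then emit canonical JSON.
    try:
        doc = json.loads(raw)
    except ValueError:
        return None  # invalid JSON -> NULL
    if not isinstance(doc, dict) or key not in doc:
        return None
    return json.dumps(doc[key], sort_keys=True, separators=(",", ":"))

# Both orderings of the inner object extract to the same canonical string,
# so the GROUP BY example above would produce a single group.
a = json_extract_canonical('{"key": {"inner_1":"1", "inner_2":"2"}}', "key")
b = json_extract_canonical('{"key": {"inner_2":"2", "inner_1":"1"}}', "key")
assert a == b == '{"inner_1":"1","inner_2":"2"}'
```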

2. json_extract, json_extract_scalar, and json_array_get Accept Invalid JSON and Can Produce Valid Results

If the JSON is invalid, it will still produce valid output if and only if the target key/index specified in the path comes before the part where the JSON becomes invalid.

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_1');
_col0
-------
"z"

NOTE: "z"a1" is the invalid part. If you json_parse the above string, it will throw an error.
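A strict parser rejects the string that json_extract navigated through above; the following Python sketch illustrates the same distinction (using the standard `json` module as a stand-in for json_parse's validation):

```python
import json

# The string accepted by json_extract above is rejected by a strict parser,
# which is effectively what json_parse does.
invalid = '{ "key_2": 2, "key_1": "z"a1" }'
try:
    json.loads(invalid)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True

print("strict parse failed:", parse_failed)  # strict parse failed: True
```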

The same issue occurs with json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }, {"z"aa"}]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Why is this a problem?

  • This makes the behavior unpredictable in the face of invalid JSON.
  • Since json_extract does not canonicalize the input, this can return a valid result in one input and invalid in another, even if the two inputs are considered equal when canonicalized.
  • Moreover, json_extract behaves inconsistently depending on the path. For example, Jayway treats the leading $. as optional, so $.key_2 should be equivalent to key_2. However, as shown below, the two forms return different values when the JSON is invalid.
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', 'key_2');
_col0
-------
NULL

NOTE: Path with $.key_2 returns a value, but key_2 returns NULL.

Recommendation: If we decide to move ahead with canonicalization in all functions that produce JSON from VARCHAR (as discussed in #1), then we can catch invalid JSONs in that step and always return NULL.

3. json_array_get Returns Invalid JSON When Output is a String Scalar

Since we are already on this topic and talking about json_array_get, I thought this would be a good time to discuss this known problem in json_array_get (this is already highlighted in Presto docs, see [Presto Documentation](https://prestodb.io/docs/current/functions/json.html#json_array_get-json_array-index-json)).

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- returns JSON 'a' (invalid JSON)

As we decide to add support for this in Presto-CPP (Velox), I was wondering if there are any objections to ensuring we support the correct behavior, that is, a string scalar is correctly enclosed in quotes:

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- returns JSON '"a"' (valid JSON)
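The quoting distinction is easy to see in Python: serializing the extracted element (rather than returning its raw contents) always yields valid JSON text, with string scalars enclosed in quotes. This is a sketch of the expected behavior, not the Velox implementation:

```python
import json

# Extracting an element and re-serializing it yields valid JSON text:
# a string scalar comes back quoted, unlike current json_array_get output.
arr = json.loads('["a", [3, 9], "c"]')
print(json.dumps(arr[0]))                          # "a"   (valid JSON)
print(json.dumps(arr[1], separators=(",", ":")))   # [3,9]
```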

cc: @amitkdutta @spershin @kgpai @kevinwilfong @Yuhta @kaikalur @feilong-liu @rschlussel @tdcmeehan @aditi-pandit

kgpai commented Feb 14, 2025

@tdcmeehan This is a major correctness hole with json_extract, right?

json_extract:

 SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_EXTRACT(x, '$') AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

------- Returns

{"key_2":2,"key_3":3,"key_1":1} | 1
{"key_3":3,"key_2":2,"key_1":1} | 1

compared to json_parse:

SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_PARSE(x) AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

---- Returns


{"key_1":1,"key_2":2,"key_3":3} | 2
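The difference in group counts can be modeled in Python (an illustration only; the row strings are the ones from the queries above): grouping the non-canonical extract output keeps two groups, while canonicalizing first collapses them into one.

```python
import json
from collections import Counter

rows = ['{ "key_2": 2, "key_3": 3, "key_1": 1 }',
        '{"key_3": 3, "key_2": 2, "key_1": 1 }']

def canonical(raw: str) -> str:
    # Model of json_parse: sorted keys, no whitespace.
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

# json_extract-like output (spaces stripped, key order preserved): two groups.
print(Counter(r.replace(" ", "") for r in rows))
# json_parse-like output (canonicalized): one group with count 2.
print(Counter(canonical(r) for r in rows))
```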

@spershin

Ideally, as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.
Depends how hard it is to achieve.

Can it be achieved by making a canonicalize() call before returning the result?
This would introduce a performance regression, which we could mitigate by wrapping it in a config property that we only turn on for verification purposes; it could also be enabled by whoever wants to opt in to this behavior.

kgpai commented Feb 14, 2025

as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.

Hi Sergey, maybe I misunderstand, but you are saying that if everything is moving to Prestissimo, then we shouldn't have to fix Presto Java, right? (It reads the other way.)

This would introduce a performance regression,

This should take back seat to correctness in my view, but agree with having a config property.

@rschlussel

I believe @spershin is suggesting that we shouldn't worry that we're introducing a behavior change in Java since we are moving to that behavior for c++ anyway.

  1. I'm not sure it's technically a correctness issue because json isn't guaranteed to be in any particular order. But it's at the very least an inconvenient and unexpected behavior, which is almost as bad. @spershin's suggestion here makes sense to me.
  2. This seems similar to "json_parse parses bad jsons" #24090 and probably can be fixed the same way.
  3. Yes, I think it would be good to ensure the new implementation has correct behavior.

@rschlussel

for 1, it looks like json_parse uses this SORTED_MAPPER to write out the canonicalized JSON: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/JsonFunctions.java#L147. We would need to do something similar for the other json functions.

@aditi-pandit

Agree that i) canonicalization and ii) erroring on invalid JSON are the more consistent behaviors.

It would be great to have a config in Presto Java to force the native engine behavior for verification purposes. We do have some customers using JSONs with Presto Java that we will migrate to Prestissimo at some point.

bikramSingh91 pushed a commit to bikramSingh91/velox that referenced this issue Feb 20, 2025
Summary:
We recently noticed as detailed in prestodb/presto#24563
that some json functions that take varchar as input and output a json type can skip
parsing the input. This can result in non-canonicalized result and inconsistent behavior.
This change ensures that json_extract and json_extract_scalar always parse the input
in these instances.

Differential Revision: D69611952