
Inconsistent behavior in json functions #24563

Open
bikramSingh91 opened this issue Feb 14, 2025 · 6 comments
bikramSingh91 commented Feb 14, 2025

While working on implementing JSON functions in Velox for Presto-CPP, I came across some inconsistencies that I wanted to highlight and get the community's opinion on whether it makes sense for us to fix them, either both in Presto-Java and Presto-CPP or just in Presto-CPP.

1. json_extract Produces Non-Canonicalized JSON as Output

We expect the JSON type to be canonicalized whenever it exists in a query. Since Presto does not store JSON as a type that can be written to files, this canonicalization is usually done when a VARCHAR column is converted to JSON using one of the JSON functions (typically json_parse). This matters because it affects the equality of two JSON objects: results become inconsistent if JSONs are sometimes canonicalized and sometimes not.

For example, suppose the input string is:

{ "key_2": 2, "key_3": 3, "key_1": 1 }

json_parse will canonicalize this to:

presto:default> select JSON_PARSE('{ "key_2": 2, "key_3": 3, "key_1": 1 }');
_col0
---------------------------------
{"key_1":1,"key_2":2,"key_3":3}

NOTE: canonicalization puts the keys in ascending order and removes spaces.
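For reference, the canonical form described above (keys in ascending order, whitespace removed) can be approximated in Python; this is only an illustrative sketch, not Presto's actual implementation:

```python
import json

def canonicalize(raw: str) -> str:
    # Parse, then re-serialize with sorted keys and no extra whitespace,
    # mirroring what json_parse's canonicalization produces.
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

print(canonicalize('{ "key_2": 2, "key_3": 3, "key_1": 1 }'))
# {"key_1":1,"key_2":2,"key_3":3}
```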

Now, if there are two inputs:

{ "key_2": 2, "key_3": 3, "key_1": 1 }
{ "key_1": 1, "key_2": 2, "key_3": 3 }

These should be considered equal JSONs when parsed and compared; otherwise, this can result in correctness issues. If we use json_extract, which returns a JSON object type, it would extract and return a non-canonicalized JSON:

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_3": 3, "key_1": 1 }', '$');
_col0
---------------------------------
{"key_2":2,"key_3":3,"key_1":1}

NOTE: Spaces are removed, but the keys remain in their original order (they are not sorted).

The same issue occurs in json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Example of a possible correctness issue:
A GROUP BY on the JSON column returning wrong results:

SELECT
    ("json_extract"((output_fields), '$.key')) c0
FROM (
    SELECT
        *
    FROM (
        VALUES
            ('{"key": {"inner_1":"1", "inner_2":"2"}}'),
            ('{"key": {"inner_2":"2", "inner_1":"1"}}')
    ) AS t (output_fields)
)
GROUP BY
    1;

              c0               
-------------------------------
 {"inner_1":"1","inner_2":"2"} 
 {"inner_2":"2","inner_1":"1"} 

Recommendation: The expectation is that once a VARCHAR is converted to JSON, it should be in a canonicalized form. Therefore, we should always canonicalize the resultant JSON in all functions that take VARCHAR as prospective JSON input and output a JSON.
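The recommended behavior can be sketched in Python (a hypothetical model, with `json_extract_canonical` as an invented name, not Presto code): parse and validate the whole input first, then emit the extracted value in canonical form.

```python
import json

def json_extract_canonical(raw: str, key: str):
    # Hypothetical sketch of the recommendation: fully parse (and thereby
    # validate) the input before extracting, then emit canonical JSON.
    try:
        doc = json.loads(raw)
    except ValueError:
        return None  # invalid JSON -> NULL
    if not isinstance(doc, dict) or key not in doc:
        return None
    return json.dumps(doc[key], sort_keys=True, separators=(",", ":"))

# Both orderings of the inner object extract to the same canonical string,
# so the GROUP BY example above would produce a single group.
a = json_extract_canonical('{"key": {"inner_1":"1", "inner_2":"2"}}', "key")
b = json_extract_canonical('{"key": {"inner_2":"2", "inner_1":"1"}}', "key")
assert a == b == '{"inner_1":"1","inner_2":"2"}'
```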

2. json_extract, json_extract_scalar, and json_array_get Accept Invalid JSON and Can Produce Valid Results

If the JSON is invalid, it will still produce valid output if and only if the target key/index specified in the path comes before the part where the JSON becomes invalid.

presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_1');
_col0
-------
"z"

NOTE: "z"a1" is the invalid part. If you json_parse the above string, it will throw an error.
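A strict parser rejects the string that json_extract navigated through above; the following Python sketch illustrates the same distinction (using the standard `json` module as a stand-in for json_parse's validation):

```python
import json

# The string accepted by json_extract above is rejected by a strict parser,
# which is effectively what json_parse does.
invalid = '{ "key_2": 2, "key_1": "z"a1" }'
try:
    json.loads(invalid)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True

print("strict parse failed:", parse_failed)  # strict parse failed: True
```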

The same issue occurs with json_array_get:

presto:default> select json_array_get('[{ "key_2": 2, "key_1": 1 }, {"z"aa"}]', 0);
_col0
-----------------------
{"key_2":2,"key_1":1}

Why is this a problem?

  • This makes the behavior unpredictable in the face of invalid JSON.
  • Since json_extract does not canonicalize the input, this can return a valid result in one input and invalid in another, even if the two inputs are considered equal when canonicalized.
  • Moreover, json_extract behaves inconsistently depending on the path. For example, Jayway treats the leading $. as optional, so $.key_2 should be equivalent to key_2. However, as shown below, the two forms return different values when the JSON is invalid.
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', '$.key_2');
_col0
-------
2
presto:default> select JSON_EXTRACT('{ "key_2": 2, "key_1": "z"a1" }', 'key_2');
_col0
-------
NULL

NOTE: Path with $.key_2 returns a value, but key_2 returns NULL.

Recommendation: If we decide to move ahead with canonicalization in all functions that produce JSON from VARCHAR (as discussed in #1), then we can catch invalid JSONs in that step and always return NULL.

3. json_array_get Returns Invalid JSON When Output is a String Scalar

Since we are already on this topic and talking about json_array_get, I thought this would be a good time to discuss this known problem in json_array_get (this is already highlighted in Presto docs, see [Presto Documentation](https://prestodb.io/docs/current/functions/json.html#json_array_get-json_array-index-json)).

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- returns JSON 'a' (invalid JSON)

As we decide to add support for this in Presto-CPP (Velox), I was wondering if there are any objections to ensuring we support the correct behavior, that is, a string scalar is correctly enclosed in quotes:

SELECT json_array_get('["a", [3, 9], "c"]', 0); -- returns JSON '"a"' (valid JSON)
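The quoting distinction is easy to see in Python: serializing the extracted element (rather than returning its raw contents) always yields valid JSON text, with string scalars enclosed in quotes. This is a sketch of the expected behavior, not the Velox implementation:

```python
import json

# Extracting an element and re-serializing it yields valid JSON text:
# a string scalar comes back quoted, unlike current json_array_get output.
arr = json.loads('["a", [3, 9], "c"]')
print(json.dumps(arr[0]))                          # "a"   (valid JSON)
print(json.dumps(arr[1], separators=(",", ":")))   # [3,9]
```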

cc: @amitkdutta @spershin @kgpai @kevinwilfong @Yuhta @kaikalur @feilong-liu @rschlussel @tdcmeehan @aditi-pandit

kgpai commented Feb 14, 2025

@tdcmeehan This is a major correctness hole with json_extract, right?

json_extract:

 SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_EXTRACT(x, '$') AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

------- Returns

{"key_2":2,"key_3":3,"key_1":1} | 1
{"key_3":3,"key_2":2,"key_1":1} | 1

compared to json_parse:

SELECT
    y,
    COUNT(1)
FROM (
    SELECT
        JSON_PARSE(x) AS y
    FROM (
        VALUES
            ('{ "key_2": 2, "key_3": 3, "key_1": 1 }'),
            ('{"key_3": 3, "key_2": 2, "key_1": 1 }')
    ) AS t(x)
)
GROUP BY
    1

---- Returns


{"key_1":1,"key_2":2,"key_3":3} | 2
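The difference in group counts can be modeled in Python (an illustration only; the row strings are the ones from the queries above): grouping the non-canonical extract output keeps two groups, while canonicalizing first collapses them into one.

```python
import json
from collections import Counter

rows = ['{ "key_2": 2, "key_3": 3, "key_1": 1 }',
        '{"key_3": 3, "key_2": 2, "key_1": 1 }']

def canonical(raw: str) -> str:
    # Model of json_parse: sorted keys, no whitespace.
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

# json_extract-like output (spaces stripped, key order preserved): two groups.
print(Counter(r.replace(" ", "") for r in rows))
# json_parse-like output (canonicalized): one group with count 2.
print(Counter(canonical(r) for r in rows))
```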

@spershin

Ideally, as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.
Depends how hard it is to achieve.

Can it be achieved by making a canonicalize() call before returning the result?
This would introduce a performance regression, which we could mitigate by wrapping it in a config property that we only turn on for verification purposes; it could also be enabled by whoever wants to opt in to this behavior.

kgpai commented Feb 14, 2025

as everything eventually moves to Prestissimo and it will behave 'canonically', then I see no reason of not fixing Presto Java.

Hi Sergey, maybe I misunderstand, but you are saying that if everything is moving to Prestissimo, then we shouldn't have to fix Presto Java, right? (It reads the other way.)

This would introduce a performance regression,

This should take back seat to correctness in my view, but agree with having a config property.

@rschlussel

I believe @spershin is suggesting that we shouldn't worry that we're introducing a behavior change in Java since we are moving to that behavior for c++ anyway.

  1. I'm not sure it's technically a correctness issue because json isn't guaranteed to be in any particular order. But it's at the very least an inconvenient and unexpected behavior, which is almost as bad. @spershin's suggestion here makes sense to me.
  2. This seems similar to "json_parse parses bad jsons" #24090 and probably can be fixed the same way.
  3. Yes, I think it would be good to ensure the new implementation has correct behavior.

@rschlussel

for 1, it looks like json_parse uses this SORTED_MAPPER to write out the canonicalized JSON: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/JsonFunctions.java#L147. We would need to do something similar for the other json functions.

@aditi-pandit

Agree that i) canonicalization and ii) erroring on invalid JSON are the more consistent behaviors.

It would be great to have a config in Presto Java to force the native engine behavior for verification purposes. We do have some customers using JSONs with Presto Java that we will migrate to Prestissimo at some point.

bikramSingh91 pushed a commit to bikramSingh91/velox that referenced this issue Feb 20, 2025
Summary:
We recently noticed as detailed in prestodb/presto#24563
that some json functions that take varchar as input and output a json type can skip
parsing the input. This can result in non-canonicalized result and inconsistent behavior.
This change ensures that json_extract and json_extract_scalar always parse the input
in these instances.

Differential Revision: D69611952