some ideas for improvement #5

tfriedel · 2023-12-06T12:16:57Z

Hi,

I needed to parse JSON returned by ChatGPT and was looking for a best-effort JSON parser, which is how I found this.
Good work on doing this!
I noticed that some unit tests currently fail, for example parsing "12." to 12. test_incomplete_string also fails, because no exception is raised. Why would raising be the expected behaviour there? Shouldn't a string that doesn't end with a " still be parsed?

Dealing with incomplete JSON is only one of the issues, and actually not my main one.
I found that ChatGPT often includes literal newlines inside string values.

For this I found you can use

import json

json_decoder = json.JSONDecoder(strict=False)
json_decoder.decode(s)

With strict=False the decoder accepts control characters, such as literal newlines, inside strings and preserves them. That is not really legal JSON, but it is what we want here.
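
To make the difference concrete, here is a small sketch using only the standard library. The sample text and the helper name `parse_lenient` are made up for illustration; the sample stands in for the kind of output an LLM might produce.

```python
import json

# A JSON document whose string value contains a literal (unescaped) line break.
text = '{"x": "1st line\n2nd line"}'


def parse_lenient(s: str):
    """Decode JSON while tolerating control characters inside strings."""
    return json.JSONDecoder(strict=False).decode(s)


# The default (strict) decoder rejects the literal newline.
try:
    json.loads(text)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# The lenient decoder accepts it and keeps the newline in the value.
lenient = parse_lenient(text)
```

Note that strict=False only relaxes the control-character check; the input still has to be complete JSON.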

There's also a similar project for JavaScript:
https://github.com/beenotung/best-effort-json-parser

And an older Python-based JSON parser that is supposedly also able to parse somewhat illegal JSON:
https://pypi.org/project/demjson/

This one is outdated and doesn't work with newer Python versions.

There may be some ideas for features or edge cases in those projects to help you improve this library!
It would be nice to have a very robust parser for GPT-produced JSON.

iw4p · 2023-12-06T13:43:34Z

Hi
Sure! Thank you. If you want to contribute, feel free.

korabs-x · 2024-04-04T09:27:08Z

@tfriedel We had the same issue with newlines not being parsed correctly and found this one instead, which works for us:
https://github.com/promplate/partial-json-parser

iw4p · 2024-04-04T09:52:32Z

@korabs-x Can you give me more details to fix it for the next version?

korabs-x · 2024-04-04T09:58:03Z

Sure!
The inconsistency here should make it most clear:

>>> parser.parse('{"x": "1st line\\n2nd line').get('x')
'1st line\\n2nd line'
>>> parser.parse('{"x": "1st line\\n2nd line"').get('x')
'1st line\n2nd line'

In both cases the \n should be decoded to a real newline rather than left as an escaped sequence. Right now it stays escaped as long as the string is unfinished, and switches to being decoded once the closing quote is present.
Hope it helps, thanks for the otherwise great library!
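
For readers unsure what "escaped" means here, this standard-library sketch shows the difference between the raw JSON text, which contains a backslash followed by `n`, and the decoded value, which contains a real newline. The variable names are only illustrative.

```python
import json

# Raw JSON text of a string value: the two characters backslash and n, not a newline.
raw = '"1st line\\n2nd line"'

# Decoding turns the escape sequence into an actual newline character.
decoded = json.loads(raw)

raw_has_backslash_n = "\\n" in raw
raw_has_real_newline = "\n" in raw
decoded_has_real_newline = "\n" in decoded
```

A consistent parser would return the decoded form whether or not the closing quote has arrived yet.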

iw4p · 2024-04-04T11:26:32Z

Oh, I see! It's a little tricky to handle and fix, but I'll do my best. Ty!

pthimon · 2024-04-10T12:43:02Z

If it helps, this is how I worked around two issues: the LLM output containing real line breaks (by using strict=False), and the partial-string case not handling escaped line breaks (by passing the string through json.loads).

import json

from partialjson.json_parser import JSONParser  # assumed import path for this library's parser


class LenientJSONParser(JSONParser):
    def parse_string(self, s, e):
        end = s.find('"', 1)
        # Skip escaped quotes: a quote is escaped only if preceded by an odd number of backslashes
        while end != -1 and (end - len(s[:end].rstrip('\\'))) % 2 == 1:
            end = s.find('"', end + 1)
        if end == -1:
            # Add the missing closing quote and parse as a JSON string so that escapes are decoded
            return json.loads(s + '"', strict=False), ""
        str_val = s[:end + 1]
        s = s[end + 1:]
        # strict=False tolerates literal newline characters inside the string
        return json.loads(str_val, strict=False), s
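
The same idea can be tried without the library. The sketch below is a standalone variant, not the library's implementation: the function name `decode_partial_json_string` is hypothetical, and it additionally drops an escape sequence that was cut off by truncation (a trailing backslash or an incomplete \uXXXX), which would otherwise make json.loads raise.

```python
import json


def decode_partial_json_string(fragment: str) -> str:
    """Decode a JSON string literal that may be missing its closing quote.

    `fragment` must begin with the opening double quote. Anything after an
    unescaped closing quote is ignored.
    """
    if not fragment.startswith('"'):
        raise ValueError("fragment must start with a double quote")
    i = 1
    while i < len(fragment):
        ch = fragment[i]
        if ch == '"':
            break  # unescaped closing quote found
        if ch == "\\":
            # \uXXXX spans 6 characters, every other escape spans 2.
            width = 6 if fragment[i + 1 : i + 2] == "u" else 2
            if i + width > len(fragment):
                break  # escape sequence cut off by truncation: drop it
            i += width
        else:
            i += 1
    # Re-add the closing quote and let json.loads decode the escapes;
    # strict=False tolerates literal newlines inside the string.
    return json.loads(fragment[:i] + '"', strict=False)


unfinished = decode_partial_json_string('"1st line\\n2nd line')
finished = decode_partial_json_string('"1st line\\n2nd line"')
```

With this approach the unfinished and finished inputs decode to the same value, which is the consistency asked for earlier in the thread.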

iw4p · 2024-08-03T18:42:57Z

Hi everyone @pthimon @tfriedel @korabs-x
A strict mode has been added. You can set it to false to keep the \n characters.
