\\udc81 surrogates not allowed #94

Open
visch opened this issue Jan 30, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

visch commented Jan 30, 2023

https://meltano.slack.com/archives/CMN8HELB0/p1675083559152229

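See if this character works with target-postgres. I believe we have a test that covers many combinations of unusual Unicode characters. Note that \udc81 is not a valid UTF-8 character: U+DC81 is a low surrogate, which is only meaningful as the second half of a UTF-16 surrogate pair, so on its own it cannot be encoded as UTF-8.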
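As a quick check of what \udc81 is, here is a minimal sketch using only the Python standard library. The shortened record and the `is_surrogate` helper are illustrative assumptions, not code from target-postgres.

```python
import json

# A shortened, made-up record carrying the same escape as the failing test line.
line = '{"id": 7, "strings": ["bbb", "\\udc81"]}'

# Python's json module accepts the escape and yields a one-character str.
record = json.loads(line)
value = record["strings"][1]


def is_surrogate(ch: str) -> bool:
    """Return True if ch is in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


# Encoding that str as UTF-8 fails, because surrogates are not scalar values.
try:
    value.encode("utf-8")
    error_reason = None
except UnicodeEncodeError as exc:
    error_reason = exc.reason  # the text that also appears in the issue title
```

The parse succeeds and the failure only shows up at encode time, which is consistent with the record passing through the tap and target until it reaches the database.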

I took a look. If you add this change

visch@DESKTOP-9BDPA9T:~/git/target-postgres$ git diff
diff --git a/target_postgres/tests/data_files/encoded_strings.singer b/target_postgres/tests/data_files/encoded_strings.singer
index 6e4c9c9..2407d77 100644
--- a/target_postgres/tests/data_files/encoded_strings.singer
+++ b/target_postgres/tests/data_files/encoded_strings.singer
@@ -29,4 +29,5 @@
 {"type": "RECORD", "stream": "test_strings_in_arrays", "record": {"id": 4, "strings": ["\u006D", "\u0101", "\u0199"]}}
 {"type": "RECORD", "stream": "test_strings_in_arrays", "record": {"id": 5, "strings": ["aaa", "Double quoting: \\u0041 \\u0001"]}}
 {"type": "RECORD", "stream": "test_strings_in_arrays", "record": {"id": 6, "strings": ["bbb", "Control Characters in string: \u0041 \u0001"]}}
+{"type": "RECORD", "stream": "test_strings_in_arrays", "record": {"id": 7, "strings": ["bbb", "Control Characters in string: \u0041 \u0001 \udc81"]}}
 {"type": "STATE", "value": {"test_strings": 11, "test_strings_in_objects": 11, "test_strings_in_arrays": 6}}

and run the related test, psycopg2 raises:

E           sqlalchemy.exc.DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type json
E           LINE 1: ...n string: A \u0001"']::JSONB[]),(7, ARRAY['"bbb"','"Control ...
E                                                                        ^
E           DETAIL:  Unicode low surrogate must follow a high surrogate.
E           CONTEXT:  JSON data, line 1: ...
E
E           [SQL: INSERT INTO temp_test_strings_in_arrays_94e1245a_f2e0_4a0d_b6c7_33ff224f37e6 (id, strings) VALUES (%(id)s, %(strings)s::JSONB[])]
E           [parameters: ({'id': 1, 'strings': ['"simple string"', '"\\u03b1\\u03c0\\u03bb\\u03ae \\u03c3\\u03c5\\u03bc\\u03b2\\u03bf\\u03bb\\u03bf\\u03c3\\u03b5\\u03b9\\u03c1\\u03ac"', '"\\u7b80\\u5355\\u7684\\u5b57\\u4e32"']}, {'id': 2, 'strings': ['"cha\\u00eene simple"', '"quoted \\"string\\""']}, {'id': 3, 'strings': ['"various \\" \\\\ / \\n escape sequences"']}, {'id': 4, 'strings': ['"m"', '"\\u0101"', '"\\u0199"']}, {'id': 5, 'strings': ['"aaa"', '"Double quoting: \\\\u0041 \\\\u0001"']}, {'id': 6, 'strings': ['"bbb"', '"Control Characters in string: A \\u0001"']}, {'id': 7, 'strings': ['"bbb"', '"Control Characters in string: A \\u0001 \\udc81"']})]
E           (Background on this error at: https://sqlalche.me/e/14/9h9h)

.venv/lib/python3.8/site-packages/psycopg2/extras.py:1299: DataError
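The same condition can be spotted without a database by scanning a parsed record for surrogate code points before it is sent. The sketch below is a hypothetical pre-flight check, not part of target-postgres; `iter_surrogates` and the `$`-style paths are names invented for illustration. It relies on the fact that Python's `json` module combines valid surrogate pairs into a single code point, so any surrogate left in a decoded string is unpaired.

```python
import json
from typing import Any, Iterator, Tuple


def iter_surrogates(value: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (path, escaped code point) for each surrogate found in value."""
    if isinstance(value, str):
        for ch in value:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                yield path, f"\\u{ord(ch):04x}"
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_surrogates(item, f"{path}.{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from iter_surrogates(item, f"{path}[{index}]")


# The record added to encoded_strings.singer in the diff above.
message = json.loads(
    '{"type": "RECORD", "stream": "test_strings_in_arrays", '
    '"record": {"id": 7, "strings": ["bbb", '
    '"Control Characters in string: \\u0041 \\u0001 \\udc81"]}}'
)

problems = list(iter_surrogates(message["record"]))
```

Here `problems` holds one entry pointing at the second element of `strings`, the same value Postgres rejects. Dictionary keys are not checked in this sketch.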

I think the way to support this is to let users override the column type so that this data is stored in a binary format in Postgres instead of text. That requires three things: support for data type overrides (Meltano already allows this via a schema override), support for BLOBs in the target, and a mapping from JSON Schema to the binary type somewhere.
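To illustrate the binary idea, here is a minimal sketch of a lossless round trip through bytes using Python's `surrogatepass` error handler. The `to_binary` and `from_binary` helpers are hypothetical, and the choice of encoding is an assumption; nothing here is an existing target-postgres feature.

```python
def to_binary(text: str) -> bytes:
    """Encode text to bytes, letting lone surrogates through unchanged."""
    return text.encode("utf-8", errors="surrogatepass")


def from_binary(data: bytes) -> str:
    """Reverse to_binary, restoring any lone surrogates."""
    return data.decode("utf-8", errors="surrogatepass")


# The string from the failing record, after JSON decoding.
original = "Control Characters in string: A \u0001 \udc81"

stored = to_binary(original)    # bytes that could go into a binary column
restored = from_binary(stored)  # identical to the original str
```

The resulting bytes are not valid strict UTF-8, so they would need a binary column (for example `bytea`) rather than `text` or `jsonb`, and any reader would have to decode them with the same error handler.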

visch added the enhancement label Jan 30, 2023