-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support Singer Decimal as a config item #1890
Comments
So, I think there's two parts to this AFAICT. Support at the tapThis is arguably already support by the typing helpers via jsonschema = PropertiesList(
Property("my_number", CustomType({"type": "string", "format": "singer.decimal"})),
) A nicer wrapper could also be added. Support at the target
|
Hi @edgarrmondragon, To show an example of my usage of Singer Decimal. I have a working copying along with the usage of msgspec on the traditional singer-python frameworks used by some non-SDK taps. I am seeing some good performance from doing this. https://github.com/s7clarke10/pipelinewise-singer-python. For testing this I have created a branch on pipelinewise-tap-oracle to use msgspec. Of note is the use of a boolean config setting use_singer_decimal in the tap. As a user of a tap, I have a choice whether I wish to have decimal and float data emitted as a string - If that setting is omitted it will out default to outputting decimals and floats in their lossy native format - important for backwards compatibility. I should note however, there is logic in a tap around the schema definition to define the correct JSON data types when it is discovered so that the records matching the schema. Here is an example of output with (pipelinewise-tap-oracle) [ pipelinewise-tap-oracle]$ tap-oracle -c oracle_config_test.json --catalog oracle_properties_test.json
time=2023-08-17 10:12:04 name=singer level=INFO message=Selected streams: ['STEVE-STEVE_TEST']
time=2023-08-17 10:12:04 name=singer level=INFO message=No currently_syncing found
time=2023-08-17 10:12:04 name=singer level=INFO message=Beginning sync of stream(STEVE-STEVE_TEST) with sync method(full)
time=2023-08-17 10:12:04 name=singer level=INFO message=Stream STEVE-STEVE_TEST is using full_table replication
time=2023-08-17 10:12:04 name=singer level=INFO message=Singer Decimal Enabled! Floats and Decimals will be output as strings
time=2023-08-17 10:12:04 name=singer level=INFO message=Selected streams: ['STEVE-STEVE_TEST']
time=2023-08-17 10:12:04 name=singer level=INFO message=No currently_syncing found
time=2023-08-17 10:12:04 name=singer level=INFO message=Beginning sync of stream(STEVE-STEVE_TEST) with sync method(full)
time=2023-08-17 10:12:04 name=singer level=INFO message=Stream STEVE-STEVE_TEST is using full_table replication
time=2023-08-17 10:12:04 name=singer level=INFO message=Singer Decimal Enabled! Floats and Decimals will be output as strings
{"type":"SCHEMA","stream":"STEVE-STEVE_TEST","schema":{"properties":{"COL1":{"maxLength":100,"type":["null","string"]},"COL2":{"maxLength":50,"type":["null","string"]},"DECIMAL_FIELD":{"format":"singer.decimal","type":["null","string"],"additionalProperties":{"scale_precision":"(10,2)"}},"INTEGER_FIELD":{"type":["null","integer"]},"FLOAT_FIELD":{"format":"singer.decimal","type":["null","string"]}},"type":"object"},"key_properties":[]}
time=2023-08-17 10:12:04 name=singer level=INFO message=dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.131.2.102)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=pndww2db.moh.govt.nz)))
{"type":"STATE","value":{"bookmarks":{"STEVE-STEVE_TEST":{"last_replication_method":"FULL_TABLE","version":1692223924745}},"currently_syncing":"STEVE-STEVE_TEST"}}
{"type":"ACTIVATE_VERSION","stream":"STEVE-STEVE_TEST","version":1692223924745}
time=2023-08-17 10:12:04 name=singer level=INFO message=select SELECT "COL1" , "COL2" , "DECIMAL_FIELD" , "INTEGER_FIELD" , "FLOAT_FIELD" , NULL as ORA_ROWSCN
FROM STEVE.STEVE_TEST
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Decimal Test","COL2":"201.35","DECIMAL_FIELD":"201.35","INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Integer Test","COL2":"438437439439374974943","DECIMAL_FIELD":null,"INTEGER_FIELD":438437439439374974943,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Float Test","COL2":"3.14159265359","DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":"3.14159265359"},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":null,"COL2":"Hello","DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"World","COL2":null,"DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
time=2023-08-17 10:12:04 name=singer level=INFO message=METRIC: b'{"type":"counter","metric":"record_count","value":5,"tags":{"schema":"STEVE","table":"STEVE_TEST"}}'
{"type":"ACTIVATE_VERSION","stream":"STEVE-STEVE_TEST","version":1692223924745}
{"type":"STATE","value":{"bookmarks":{"STEVE-STEVE_TEST":{"last_replication_method":"FULL_TABLE","version":1692223924745,"ORA_ROWSCN":null}},"currently_syncing":null}} |
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
I've reopened cause someone brought it up in Slack1, and commenting here on minimal requirements for shipping this:
As for more implementation details, I'm not a fan of using
... but I'm willing to compromise if its use is common enough in the Singer ecosystem. Footnotes |
Adding the comments I added to the slack channel. So when we transfer data from databases like SQL Server, Sybase, DB2, Oracle, we contributed to the Meltano Hub default variants of these taps to allow you to emit the numeric data e.g. decimal / float data as strings i.e. not scientific notation - this requires the singer_decimal config item to be set to true. When this is set, logic in the tap also provides additional information we require to convert it from a string back into a number in the target loader once it has been transferred without any loss in precision. This is achieved by the taps having logic built-in to also give some hints as to what the original column definition was in the original source system e.g. "additionalProperties":{"scale_precision":"(38,20)". We have our own variant of the original pipelinewise target snowflake which has been patched and a few bugs removed - https://github.com/mjsqu/pipelinewise-target-snowflake . This variant has logic to recognize the singer decimal and make use of the additional properties. If you use this variant, it will automatically convert and store the data as a number with the correct scale and precision as per the source system. This has been very important for us as we found during our testing that there is a loss of precision if the numeric based decimal and floats were not emitted as strings via the enablement of singer decimal. I have been meaning to raise a PR to the Meltanolabs version of target-snowflake to add our logic so it could be our preferred loader. It is worth noting for Oracle Databases however that singer decimal is harder as Oracle doesn't really set a scale and precision. It is often set as just number and the user can store a mix of integers and decimals. If this is the case we will not emit any precision and scale and that data will be transferred as a string. It would be the responsibility of the data transform e.g. dbt to set an appropriate data type as it is too dangerous to guess what is correct. We will still transfer it as a string to avoid python and scientific notation leading rounding issues. |
This https://github.com/mjsqu/pipelinewise-target-snowflake/blob/e2a216c3cfa3d56fe66902e611812fe146528996/target_snowflake/db_sync.py#L99-L100 section in our variant of target-snowflake which will set the correct datatype in Snowflake for singer-decimal columns. The actual PR is here : mjsqu/pipelinewise-target-snowflake@558a0d6 A couple of notable enhancements we made to the original pipelinewise version.
|
Got a draft PR built on top on recent JSON schema mapping improvements: #2786 |
Thank you @edgarrmondragon, I think this will be a very useful feature for people building database taps and targets to receive the singer_decimal hint. |
Feature scope
Taps (catalog, state, stream maps, tests, etc.)
Description
There have been discussions before about whether it is a best practice to emit Decimal, Integer, and Float datatypes as a string or in their native formats.
I have mentioned this in passing in this JSON dicussion #1046 . The author of msgspec strongly recommends sending numeric data as a string.
There was an informal standard called Singer Decimal where the tap would emit the Numeric data as a string (and if possible) try and turn it back into a numeric format in the target.
You may ask, why would you do this? The key reasons are :
For a number of database related taps we have implement a setting called
use_singer_decimal
. When this is set to true, it will send the data as a string but provides additional information about the scale and precision (if available) in the singer messages. When the target receives the message, if it has this additional singer decimal data, it can cast the data with the correct datatype. The good thing about this approach is that it is a setting, therefore based on your use case, you can emit numeric data in its native numeric format (default setting), or preferable enable theuse_singer_decimal
setting to gain more accurate numeric data.We have found by using this methodology, there is no loss of precision. It would be great to introduce this ability into the SDK as it will ensure accurate numeric data being sent.
Examples of enhanced database taps which support this are:
Example enhanced target with support for decoding singer-decimal data:
More detail for tap-db2 written by @mjsqu .
DECIMALs from DB2 are given format property "singer.decimal" in the schema, with type property "number". The numeric scale & precision are stored as
schema[additionalProperties][scale_precision]
, e.g."additionalProperties":{"scale_precision":"(15,9)"}
.When the tap writes the RECORD message, "singer.decimal" informs the data value to be output into the message as a string
The snowflake target obtains the scale/precision from the schema message to create a column of type NUMBER(scale, precision).
The text was updated successfully, but these errors were encountered: