Skip to content
This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Handling empty schema #88

Open
A-Costa opened this issue Jun 29, 2020 · 3 comments · May be fixed by #87
Open

Handling empty schema #88

A-Costa opened this issue Jun 29, 2020 · 3 comments · May be fixed by #87
Labels
enhancement New feature or request

Comments

@A-Costa
Copy link

A-Costa commented Jun 29, 2020

Current Issue

According to singer docs here

Schemas are required, but they can be defined in the broadest terms - a JSON Schema of '{}' validates all data points.
But right now, pipelinewise-target-snowflake would break if fed with SCHEMA messages that define an empty schema.

In my use case, the source was pipelinewise-tap-salesforce. In this tap, when salesforce defines a field with datatype anyType, like with History objects, the schema will be empty in singer.

This is a sample stream of messages generated by the salesforce tap:

{"type": "STATE", "value": {"current_stream": "AccountHistory"}}
{"type": "SCHEMA", "stream": "AccountHistory", "schema": {"additionalProperties": false, "properties": {"AccountId": {"type": ["null", "string"]}, "CreatedById": {"type": ["null", "string"]}, "CreatedDate": {"anyOf": [{"format": "date-time", "type": "string"}, {"type": ["string", "null"]}]}, "Field": {"type": ["null", "string"]}, "Id": {"type": "string"}, "IsDeleted": {"type": ["null", "boolean"]}, "NewValue": {}, "OldValue": {}}, "type": "object"}, "key_properties": ["Id"], "bookmark_properties": ["CreatedDate"]}
{"type": "ACTIVATE_VERSION", "stream": "AccountHistory", "version": 1593166361466}
{"type": "RECORD", "stream": "AccountHistory", "record": {"AccountId": "WoYq3nK08PuOT5rYkR", "CreatedById": "4lkUFTtWU6ktX19cpe", "CreatedDate": "2016-12-07T21:28:07.000000Z", "Field": "accountUpdatedByLead", "Id": "wWZS7XwK5Bs5EFZwYC", "IsDeleted": false, "NewValue": null, "OldValue": null}, "version": 1593166361466, "time_extracted": "2020-06-26T10:12:41.467536Z"}
{"type": "STATE", "value": {"current_stream": "AccountHistory", "bookmarks": {"AccountHistory": {"version": 1593166361466, "CreatedDate": "2016-12-07T21:28:07.000000Z"}}}}
{"type": "RECORD", "stream": "AccountHistory", "record": {"AccountId": "KFO6SPVsoURdtmlxg6", "CreatedById": "pM1Ui74VCiijGHSXYs", "CreatedDate": "2016-12-08T02:42:23.000000Z", "Field": "Website", "Id": "dAyRQJr9RQ1YVIVh6g", "IsDeleted": false, "NewValue": "example.com", "OldValue": null}, "version": 1593166361466, "time_extracted": "2020-06-26T10:12:41.467536Z"}
{"type": "STATE", "value": {"current_stream": "AccountHistory", "bookmarks": {"AccountHistory": {"version": 1593166361466, "CreatedDate": "2016-12-08T02:42:23.000000Z"}}}}
{"type": "RECORD", "stream": "AccountHistory", "record": {"AccountId": "uIpoNCSWjcCebhMCG9", "CreatedById": "jidWq314JDa92a0Dnt", "CreatedDate": "2016-12-09T16:36:38.000000Z", "Field": "Some_Field__c", "Id": "6AoUgVJF8IhQTlIGMp", "IsDeleted": false, "NewValue": true, "OldValue": false}, "version": 1593166361466, "time_extracted": "2020-06-26T10:12:41.467536Z"}
{"type": "STATE", "value": {"current_stream": "AccountHistory", "bookmarks": {"AccountHistory": {"version": 1593166361466, "CreatedDate": "2016-12-09T16:36:38.000000Z"}}}}
{"type": "RECORD", "stream": "AccountHistory", "record": {"AccountId": "wfmR8DzpDpbiDeY2jj", "CreatedById": "XUGP2VNVDXHsRxpVbf", "CreatedDate": "2016-12-10T20:10:23.000000Z", "Field": "Another_Field__c", "Id": "aJgpPh9JkrRyIaMxu3", "IsDeleted": false, "NewValue": 120, "OldValue": 100}, "version": 1593166361466, "time_extracted": "2020-06-26T10:12:41.467536Z"}
{"type": "STATE", "value": {"current_stream": "AccountHistory", "bookmarks": {"AccountHistory": {"version": 1593166361466, "CreatedDate": "2016-12-10T20:10:23.000000Z"}}}}

It is possible to see in the SCHEMA message: "NewValue": {}, "OldValue": {}

When this sequence of message is piped to pipelinewise-target-snowflake, a KeyError exception is risen:

Traceback (most recent call last):
  File "/home/vagrant/pipelinewise/.virtualenvs/target-snowflake/bin/target-snowflake", line 33, in <module>
    sys.exit(load_entry_point('pipelinewise-target-snowflake', 'console_scripts', 'target-snowflake')())
  File "/vagrant/pipelinewise-target-snowflake/target_snowflake/__init__.py", line 446, in main
    persist_lines(config, singer_messages, table_cache)
  File "/vagrant/pipelinewise-target-snowflake/target_snowflake/__init__.py", line 189, in persist_lines
    adjust_timestamps_in_record(o['record'], schemas[stream])
  File "/vagrant/pipelinewise-target-snowflake/target_snowflake/__init__.py", line 148, in adjust_timestamps_in_record
    if 'string' in schema['properties'][key]['type'] and \
KeyError: 'type'

Proposed Solution

The proposed solution is to fix the bugs that cause the KeyError exceptions in the code, and at the same time implement a way to treat schemaless fields as snowflake Variant datatype.

@A-Costa A-Costa linked a pull request Jun 29, 2020 that will close this issue
@koszti
Copy link
Contributor

koszti commented Jun 30, 2020

What do you think, how we should deal with anyOf? How we can decide which datatype is the good one for snowflake when creating the target table?

@A-Costa
Copy link
Author

A-Costa commented Jul 6, 2020

Hi @koszti , thank you for your reply.

That's a good question, even if not exactly related to this specific case. Maybe good to open a different issue about that?

I guess using Snowflake's Variant type would be a solution for that too. How do you currently deal with it?

@dlouseiro
Copy link

@A-Costa @koszti I opened this PR with a proposed solution.

Still have to work a bit on the tests, but so far is working smoothly while using the code in my fork to sync all my salesforce entities.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants