Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to use the generate_views when using fixed format data schema #34

Open
jcbmllgn opened this issue Apr 27, 2023 · 0 comments
Open

Comments

@jcbmllgn
Copy link
Contributor

jcbmllgn commented Apr 27, 2023

I'm running into an issue where this BigQuery target is unable to load data which contains nested JSON.

I have the following config:

  loaders:
  - name: target-bigquery
    variant: z3z1ma
    pip_url: git+https://github.com/z3z1ma/target-bigquery.git
    config:
      project: analytics-384214
      upsert: true
      schema_resolver_version: 2
      method: batch_job
      timeout: 1200
      generate_view: true

      column_name_transforms:
        quote: true
        add_underscore_when_invalid: true

Here's a small sample of the data coming from tap-postgres:

{"type": "SCHEMA", "stream": "public-channel_participants", "schema": {"type": "object", "properties": {"id": {"type": ["string"]}, "participant_id": {"type": ["null", "string"]}, "participant_type": {"type": ["null", "string"]}, "message_channel_id": {"type": ["null", "string"]}, "created_at": {"type": ["null", "string"], "format": "date-time"}, "updated_at": {"type": ["null", "string"], "format": "date-time"}, "discarded_at": {"type": ["null", "string"], "format": "date-time"}, "sources": {"type": ["null", "array"], "items": {"$ref": "#/definitions/sdc_recursive_string_array"}}, "discarded_reason": {"type": ["null", "string"]}, "created_by_id": {"type": ["null", "string"]}}, "definitions": {"sdc_recursive_integer_array": {"type": ["null", "integer", "array"], "items": {"$ref": "#/definitions/sdc_recursive_integer_array"}}, "sdc_recursive_number_array": {"type": ["null", "number", "array"], "items": {"$ref": "#/definitions/sdc_recursive_number_array"}}, "sdc_recursive_string_array": {"type": ["null", "string", "array"], "items": {"$ref": "#/definitions/sdc_recursive_string_array"}}, "sdc_recursive_boolean_array": {"type": ["null", "boolean", "array"], "items": {"$ref": "#/definitions/sdc_recursive_boolean_array"}}, "sdc_recursive_timestamp_array": {"type": ["null", "string", "array"], "format": "date-time", "items": {"$ref": "#/definitions/sdc_recursive_timestamp_array"}}, "sdc_recursive_object_array": {"type": ["null", "object", "array"], "items": {"$ref": "#/definitions/sdc_recursive_object_array"}}}}, "key_properties": ["id"], "bookmark_properties": []}
{"type": "RECORD", "stream": "public-channel_participants", "record": {"created_at": "2022-08-26T07:02:47.150141+00:00", "created_by_id": null, "discarded_at": null, "discarded_reason": null, "id": "a6e343be-9044-4c88-88c5-53e6101c7ccd", "message_channel_id": "ae44d92a-a18f-4553-ae24-67e8a5751e76", "participant_id": "717b15f5-164f-4747-a6b3-23221e64f2b7", "participant_type": "Chart", "sources": ["iehp"], "updated_at": "2022-10-19T07:04:09.323666+00:00"}, "version": 1682583863391, "time_extracted": "2023-04-27T08:24:23.391786Z"}
{"type": "RECORD", "stream": "public-channel_participants", "record": {"created_at": "2022-05-27T19:02:53.032560+00:00", "created_by_id": null, "discarded_at": null, "discarded_reason": null, "id": "834abe65-87d9-403d-b793-ae6510a091df", "message_channel_id": "02a8b9f2-274c-4706-b1ce-6fd8f81b1415", "participant_id": "d4446cf2-9ff9-4b5c-8b3a-fc303cfe9905", "participant_type": "Chart", "sources": [], "updated_at": "2022-05-27T19:02:53.032560+00:00"}, "version": 1682583863391, "time_extracted": "2023-04-27T08:24:23.391786Z"}

Here are the errors from the logs when i run meltano run tap-postgres target-bigquery:

2023-04-27T10:19:19.277544Z [info     ] Traceback (most recent call last): cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.278220Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/bin/target-bigquery", line 8, in <module> cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.278453Z [info     ]     sys.exit(TargetBigQuery.cli()) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.278631Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279208Z [info     ]     return self.main(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279378Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 1055, in main cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279461Z [info     ]     rv = self.invoke(ctx)      cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279524Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279588Z [info     ]     return ctx.invoke(self.callback, **ctx.params) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279698Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279757Z [info     ]     return __callback(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279837Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 578, in cli cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279965Z [info     ]     target.listen(file_input)  cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280024Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/io_base.py", line 34, in listen cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280078Z [info     ]     self._process_lines(file_input) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280128Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 278, in _process_lines cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280177Z [info     ]     counter = super()._process_lines(file_input) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280226Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/io_base.py", line 78, in _process_lines cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281028Z [info     ]     self._process_schema_message(line_dict) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281206Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 378, in _process_schema_message cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281264Z [info     ]     _ = self.get_sink(         cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281718Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/target.py", line 482, in get_sink cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281804Z [info     ]     return self.add_sink(stream_name, schema, key_properties) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281869Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 240, in add_sink cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281924Z [info     ]     sink = sink_class(         cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281977Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/batch_job.py", line 101, in __init__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282535Z [info     ]     super().__init__(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282589Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/core.py", line 303, in __init__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282779Z [info     ]     self.create_target(key_properties=key_properties) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282839Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 289, in wrapped_f cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.283045Z [info     ]     return self(f, *args, **kw) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.283336Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 379, in __call__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.283413Z [info     ]     do = self.iter(retry_state=retry_state) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.284951Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 314, in iter cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.285092Z [info     ]     return fut.result()        cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.285153Z [info     ]   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.293900Z [info     ]     return self.__get_result() cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.293978Z [info     ]   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.294997Z [info     ]     raise self._exception      cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.295084Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 382, in __call__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.295137Z [info     ]     result = fn(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.298823Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/core.py", line 471, in create_target cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.298893Z [info     ]     self.client.query(         cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299758Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1520, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299844Z [info     ]     do_get_result()            cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299923Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/api_core/retry.py", line 349, in retry_wrapped_func cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299974Z [info     ]     return retry_target(       cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300024Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/api_core/retry.py", line 191, in retry_target cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300473Z [info     ]     return target()            cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300579Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1510, in do_get_result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300708Z [info     ]     super(QueryJob, self).result(retry=retry, timeout=timeout) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.301373Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/cloud/bigquery/job/base.py", line 911, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302222Z [info     ]     return super(_AsyncJob, self).result(timeout=timeout, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302303Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/api_core/future/polling.py", line 261, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302659Z [info     ]     raise self._exception      cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302774Z [info     ] google.api_core.exceptions.BadRequest: 400 Invalid cast from STRING to JSON at [12:24] cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres

I'm believe there's a bug somewhere in the generate_view_statement function in core.py, more specifically in how the sources column in the above records is parse and turned into a view. Here's the view SQL generated by generate_view_statement, this SQL does not compile in BigQuery:

SELECT
    JSON_VALUE(data, '$.id') as `id`,
    JSON_VALUE(data, '$.participant_id') as `participant_id`,
    JSON_VALUE(data, '$.participant_type') as `participant_type`,
    JSON_VALUE(data, '$.message_channel_id') as `message_channel_id`,
    CAST(JSON_VALUE(data, '$.created_at') as TIMESTAMP) as `created_at`,
    CAST(JSON_VALUE(data, '$.updated_at') as TIMESTAMP) as `updated_at`,
    CAST(JSON_VALUE(data, '$.discarded_at') as TIMESTAMP) as `discarded_at`,
        ARRAY(
          SELECT   CAST(JSON_VALUE(sources__rows, '$.sources') as JSON) as `sources`
          FROM UNNEST(
              JSON_QUERY_ARRAY(data, '$.sources')
          ) AS sources__rows
          WHERE   CAST(JSON_VALUE(sources__rows, '$.sources') as JSON) IS NOT NULL
        ) AS sources,
    JSON_VALUE(data, '$.discarded_reason') as `discarded_reason`,
    JSON_VALUE(data, '$.created_by_id') as `created_by_id`,
    _sdc_batched_at as _sdc_batched_at,
    _sdc_extracted_at as _sdc_extracted_at,
    _sdc_deleted_at as _sdc_deleted_at,
    _sdc_received_at as _sdc_received_at,
    _sdc_table_version as _sdc_table_version,
    _sdc_sequence as _sdc_sequence,
 FROM `analytics-384214`.`source_arc_postgres_dev`.`public_channel_participants`

Here's the error I get when i copy/paste the generated SQL into BigQuery:

image

The error goes away when I set generate_view to false. However, this isn't ideal because the data is of course much easier to work with if denormalized.

Let me know if there's any other context I can provide! I would like to be able to use the package with generate_view: true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant