
Postgres BIT is a string but converted to NUMERIC in BigQuery #740

Open
AlexStansfield opened this issue Jul 14, 2021 · 5 comments

@jmriego, we ran into issues with our bit fields.

As described in the Postgres docs, a bit field is a string made up of 1s and 0s (https://www.postgresql.org/docs/8.2/datatype-bit.html).

If you have a bit(x) where x > 1, it's possible to have a value that starts with a 0. However, the fastsync to BigQuery tries to put it into a NUMERIC field.

That results in issues like this:

ERROR: Could not parse '011111111111111111111111111111111111111111111111111111111111111111111111111111' as NUMERIC for field provinces (position 65) starting at location 0  with message 'Invalid NUMERIC value: 011111111111111111111111111111111111111111111111111111111111111111111111111111'

The offending line appears to be here:

When I changed NUMERIC to STRING and ran it again, all was well.
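A minimal way to reproduce this outside the pipeline, assuming the casts are run directly in BigQuery (the wide pattern below is just a stand-in for the real value from the error above):

-- Hypothetical standalone BigQuery queries, not taken from the pipeline.
-- A wide bit pattern, here 0 followed by sixty 1s:
SELECT CAST(CONCAT('0', REPEAT('1', 60)) AS NUMERIC);   -- fails: BigQuery rejects this as a NUMERIC value
-- The same pattern loads fine when the column is a STRING:
SELECT CONCAT('0', REPEAT('1', 60)) AS provinces;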

jmriego commented Jul 14, 2021

Thanks @AlexStansfield, I'm having a look now. I have to compare this to what the non-fastsync replication is doing and try to get the same result. It should be easy enough with all this detail.

jmriego commented Jul 14, 2021

That's strange. I tested tap-postgres and it seems like it's not reading the BIT column at all:

create table jmtest (
    id INTEGER,
    name VARCHAR,
    somebytes BIT(8),
    etl_updated_timestamp TIMESTAMP
);

@koszti when I run the tap in discovery mode it is not detecting the type of somebytes.
I'm going to prepare an MR to fix this issue and load it correctly as a NUMERIC (in fastsync), to be as similar to the Snowflake target as possible. Does that make sense?
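As a quick sanity check of what discovery has to work with, a sketch run directly against the source Postgres (not part of the tap):

-- How the somebytes column of the jmtest table above appears in the Postgres catalog
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'jmtest' AND column_name = 'somebytes';
-- Expected: data_type 'bit' with character_maximum_length 8,
-- the type that discovery apparently leaves unmapped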

jmriego commented Jul 14, 2021

Sorry for all these messages, @AlexStansfield @koszti.

So what's really happening is that this is trying to load too many bits into a NUMERIC field; the failure is not because of the leading zeroes. From what I can see this is affecting all targets, not just BigQuery:

  • BigQuery has a maximum of 38 digits
  • Snowflake also has a maximum of 38 digits
  • Redshift also has a maximum of 38 digits
  • The Postgres target records it into a DOUBLE PRECISION column

I think if we are going to fix this, it makes sense to make the same change in all targets. What do you think?
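A quick way to see that the leading zero by itself is not the problem (a sketch; the cast behaves the same in BigQuery and Postgres):

-- A short value with a leading zero parses fine; the zero is simply dropped
SELECT CAST('01111111' AS NUMERIC);   -- 1111111, no error
-- So the failure above comes from the number of digits exceeding the NUMERIC precision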

AlexStansfield commented Jul 15, 2021

@jmriego

I see what you mean about what's causing the error.

However, I was wondering how a bit value like 00011001 would be recorded in a NUMERIC field. My understanding is it would translate to 11001, which isn't what we'd really want.
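For illustration (a sketch; the behaviour is the same in Postgres and BigQuery):

-- Parsing the bit pattern as a number silently drops the leading zeros
SELECT CAST('00011001' AS NUMERIC) AS as_numeric;   -- 11001
-- The original 8-bit width, and with it the leading zeros, is no longer recoverable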

jmriego commented Jul 15, 2021

@AlexStansfield I'm afraid that's the case. I am collaborating on adding the BigQuery capabilities to PipelineWise, but that exact same replication is done in the other targets.
I understand what you mean, though. You could always LPAD it with 0s to the length of the source data, but the problem is that you don't know the length of the source column once you are already in the target table.
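For what it's worth, the LPAD idea would look roughly like this in the target, but only under the assumption that the original bit length (8 here) is known from somewhere else; the table name is a placeholder:

-- Hypothetical repair in the target, assuming the source column is known to be BIT(8)
SELECT LPAD(CAST(somebytes AS STRING), 8, '0') AS somebytes_bits
FROM some_dataset.some_table;
-- Without the declared bit length from the source, the pad width is a guess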
