Is it expected for column names to be alpha-sorted? #222
Comments
Adding my 2 cents to the question. It would be great if there was a choice / option on whether the columns are alpha sorted or not if I understand this comment correctly. Even if the default behaviour was to alpha sort columns, it would be nice to be able to override this via an environment variable. |
Hey Aaron, I honestly don't recall why columns are being sorted here. One thing I reckon the sorting might help with is keeping the order of columns in the CSV matched to the order in the COPY/MERGE commands. I could be wrong though; this needs to be tested. Have you tried removing the sorting and testing if loading still works? @koszti do you perhaps recall why we're sorting here? |
Hi, @Samira-El - Thanks for your comments. I've not tested myself, but I can understand if there could be some other factors at play. |
Interesting. We never wanted the columns to be alpha-sorted and I'm not sure why the extra sort is here. 🤔 Even if we do the sort, the Originally we wanted the columns to be in the same order as in the original database, but the Singer spec doesn't support this option. In the Singer messages the columns are dict keys, and the order is more or less random. |
Hi, a colleague of mine has re-raised the issue of the ordering of columns written to Snowflake and the preference to retain the tap ordering if possible. I'm wondering if the sorting had to do with older versions of Python (3.5 and below), which didn't guarantee the order of dictionary items. CPython 3.6 retains insertion order as an implementation detail, and from Python 3.7 onwards a dictionary is guaranteed to retain insertion order. Given that older versions of Python (3.6 and below) are generally being deprecated, is it worth looking at the ability to retain the original tap column order? Perhaps it could be implemented via an environment variable setting which by default sorts alphabetically for backwards compatibility, but can be overridden to skip the sort, so that if you are running Python 3.7+ you get the original tap column ordering? I haven't experimented with this, but I'm interested in your thoughts and whether this is a feature you would support. |
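To illustrate the point about dictionary ordering, here is a minimal sketch, assuming Python 3.7+ and a made-up schema fragment (the property names are hypothetical, not taken from any real tap). It only shows that `json.loads` keeps keys in the order they appear in the message, and what an alphabetical sort does to that order.

```python
import json

# Hypothetical SCHEMA fragment; the property names are made up for illustration.
schema_json = '{"properties": {"zip": {}, "id": {}, "name": {}}}'

# json.loads builds plain dicts, which keep insertion order on Python 3.7+.
properties = json.loads(schema_json)["properties"]

tap_order = list(properties)      # order as emitted by the tap
alpha_order = sorted(properties)  # order after an alphabetical sort

print(tap_order)
print(alpha_order)
```

On Python 3.7+ the first list follows the message order and the second is alphabetical, so dropping the sort is enough to keep the tap's order, provided the tap emits its properties in the order you want.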
Hi Steve, I don't have any strong opinions about keeping the alphabetical sorting. I don't fancy having this as a feature flag (it'd be yet another path in the code). I reckon we can get rid of the sort and honor the order of fields as present in the schema emitted by the tap. Would be happy to review a PR with this change. |
Hi Samira,
Thank you very much for your feedback on that. I will look to build and test this and will create a PR when I am happy with the change.
As an FYI, I have pushed through two PRs for tap-s3-csv with a new feature and a bug fix. I note that the integration tests pass locally when I set the required environment variables, but they seem to fail in the GitHub CI/CD tests.
Thanks
Steve
From: Samira El Aabidi ***@***.***>
Sent: Friday, 17 June 2022 8:36 PM
To: transferwise/pipelinewise-target-snowflake ***@***.***>
Cc: Steve Clarke ***@***.***>; Comment ***@***.***>
Subject: Re: [transferwise/pipelinewise-target-snowflake] Is it expected for column names to be alpha-sorted? (Issue #222)
|
We've been running with this commit from my fork for quite a while; I notice it is exactly the same as the change suggested at the top of this thread. The incoming SCHEMA messages dictate the ordering of columns, so we have also been configuring our taps to provide an ordered list of columns (based on their column ordinal ID rather than name). Once a SCHEMA message has been processed, it doesn't matter what order the keys in the RECORD messages take. |
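A rough sketch of that behaviour, assuming Python 3.7+ and a made-up `users` stream (the stream, field names and values are hypothetical, and this is not the target's actual loading code): the column order is taken once from the SCHEMA message, and every RECORD is then laid out in that order regardless of its own key order.

```python
import json

# Simulated Singer messages; stream name, fields and values are made up.
messages = [
    json.dumps({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {"properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "email": {"type": "string"},
        }},
        "key_properties": ["id"],
    }),
    # RECORD keys deliberately arrive in a different order each time.
    json.dumps({"type": "RECORD", "stream": "users",
                "record": {"email": "a@example.com", "id": 1, "name": "Ada"}}),
    json.dumps({"type": "RECORD", "stream": "users",
                "record": {"name": "Bob", "email": "b@example.com", "id": 2}}),
]

columns = []
rows = []
for line in messages:
    msg = json.loads(line)
    if msg["type"] == "SCHEMA":
        # Column order comes from the SCHEMA message, with no sorting.
        columns = list(msg["schema"]["properties"])
    elif msg["type"] == "RECORD":
        # Look each column up by name, so the record's key order is irrelevant.
        rows.append([msg["record"].get(col) for col in columns])

print(columns)
print(rows)
```

Because each row is built by looking up columns by name, the same list can drive both the file layout and any column list passed to the load statement, which keeps the two aligned without sorting.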
Now I recall: the reason for sorting is simply to verify that there are no duplicated column names, so I agree with @aaronsteers' code change. The |
It looks like columns may be alpha-sorted and I'm not sure if this is to-be-expected / intentional.
The sorting appears to happen here in flatten_schema:
https://github.com/transferwise/pipelinewise-target-snowflake/blob/master/target_snowflake/flattening.py#L68:L73
Seems like this might also work for the dupe check, while still returning the (original) pre-sort column ordering:
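As a minimal sketch of that idea (not the code in `flattening.py`; the function name `build_columns` and the `(name, definition)` pair layout are assumptions made for illustration), duplicates can be detected by counting names, which leaves the incoming order untouched:

```python
from collections import Counter


def build_columns(items):
    """Check (name, definition) pairs for duplicate names, keeping their order.

    `items` is assumed to be a list of (column_name, column_schema) tuples in
    the order the flattening step produced them.
    """
    counts = Counter(name for name, _ in items)
    duplicates = [name for name, count in counts.items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate column name(s) produced in schema: {duplicates}")
    # dict() keeps the order of the input pairs on Python 3.7+.
    return dict(items)


# Example with made-up columns: the order is preserved, not alphabetised.
flattened = build_columns([("zip", {"type": "string"}), ("id", {"type": "integer"})])
print(list(flattened))
```

Counting avoids the need to sort before grouping, so the duplicate check and the column order no longer depend on each other.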
Curious as to the thoughts around this - if sorting is necessary for other reasons or if I'm perhaps reading this incorrectly.
Thanks in advance!