bug: @dlt.transformer ignores write_disposition="replace" #2108
Hey @joscha, without looking at your test too deeply, are you maybe confusing write_disposition "replace" with "merge"? If you set it to replace, merge keys are not used, but the table is fully replaced with each pipeline run.
hi @sh-rp, I updated the description and code to remove the merge key; it's the primary key I was after, sorry for the confusion. Would you mind looking at the code? I am also happy to pair, have a call, or add more detail to it, if needed.
@joscha the primary key is also not used when using replace. What exactly do you expect the primary key to do? If you need deduplication or merging, you'll have to use the merge write_disposition. In your example the transformer yields 4 rows altogether and these seem to end up in the destination table, which is the correct behavior.
I was under the impression the primary key denotes uniqueness, do I have that wrong?
I expect a newer (later) record entering the system to replace an older (earlier) record if they have equal primary keys. So if I have a stream of data that yields the same record multiple times, the database ends up with only one of them. I.e.: [
{ id: 1, seq: 0 },
{ id: 1, seq: 1 },
{ id: 1, seq: 2 },
{ id: 1, seq: 3 },
{ id: 1, seq: 4 },
{ id: 1, seq: 5 },
] with
I do want that. Basically I want to treat whatever comes from the external API as truth, not keeping any previous data. How do I combine both of these behaviours?
The primary key is just a column hint; it does not do anything on its own. dlt does not enforce uniqueness. You can set the hint to translate to a primary key or unique key on some destinations, but that would make your example fail, because you'd be trying to insert two records with the same primary key. If you want deduplication on the incoming records but still want to do a replace, you'll have to truncate the table before the pipeline run and use a merge write disposition with primary keys and possibly a dedup_sort hint: https://dlthub.com/docs/general-usage/incremental-loading#delete-insert-strategy
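To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours discussed above. It does not call dlt at all; `replace_load` and `dedup_last_wins` are hypothetical helper names that only model the semantics as described in this thread (replace keeps every yielded row, while deduplication on a primary key with a sort column keeps the latest row per key).

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def replace_load(rows: Iterable[Row]) -> List[Row]:
    """Model of "replace" as described in this thread: the table is emptied,
    then every yielded row is stored. Primary keys play no role."""
    return list(rows)


def dedup_last_wins(rows: Iterable[Row], primary_key: str, sort_key: str) -> List[Row]:
    """Model of deduplication on a primary key: for each key, keep only the
    row with the highest sort_key value (a later row wins on ties)."""
    latest: Dict[Any, Row] = {}
    for row in rows:
        key = row[primary_key]
        if key not in latest or row[sort_key] >= latest[key][sort_key]:
            latest[key] = row
    return list(latest.values())


# The stream from the example above: six records sharing id 1.
rows = [{"id": 1, "seq": n} for n in range(6)]

replaced = replace_load(rows)
deduped = dedup_last_wins(rows, primary_key="id", sort_key="seq")

print(len(replaced))  # 6: every yielded row is kept
print(deduped)        # [{'id': 1, 'seq': 5}]: the last record wins
```

In this model, the "last yield wins" outcome only appears once deduplication is applied explicitly, which matches the suggestion to combine a truncated table with a merge write disposition.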
Okay, thank you, will give that a try. This behaviour was not obvious to me. I think for two reasons:
I wonder if it would make sense for me to open a pull request to the docs to add this information? It would definitely have helped me. WDYT? I will try your suggestion above and close this issue in the meantime.
Description
When a transformer yields duplicates, they are not merged as defined in the given write_disposition. See the attached test, which is expected to pass; however, it fails.
Expected
Multiple yields overwrite each other. Last yield wins.
Thus the expected resulting table should look like this:
however, it looks like this:
even though primary_key has been defined and write_disposition is "replace".