Skip to content
This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Draft: Fix FULL_TABLE replication to comply with Singer spec #96

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

paultiplady
Copy link
Contributor

@paultiplady paultiplady commented Mar 1, 2022

Describe the bug

When it detects an incrementing PK, the "full table" sync mode actually runs an incremental sync with an ephemeral "last_pk_fetched" that is computed at the beginning of the extraction, stashed in the singer state, and then cleared at the end of the extraction. This seems to be an optimization to allow the tap to be retried and resume on failures. However it's not conformant to the Singer spec, and seems to break Meltano.

Due to an issue in Meltano (it seems that clear_bookmark isn't correctly clearing the key from future jobs' state), the tap is being started with stale last_pk_fetched bookmarks.

This means changes will be missed where rows have been updated, as the tap will just skip to the rows with new PKs. I believe this breaks compatibility with the Singer spec on how sync-mode is supposed to work: https://github.com/singer-io/getting-started/blob/master/docs/SYNC_MODE.md#replication-method.

Expected behavior

Full table mode should perform a full table extraction, as described in the spec and this tap's docs: "Full-table replication extracts all data from the source table each time the tap is invoked."

The Singer spec says full table mode should depend ONLY on the start_date in the config, and replicates all available records for every sync.

FIXES: #91

The Singer Spec clearly states that FULL_TABLE replication mode
must replicate "all available records dating back to a `start_date`".

This tap erroneously switches to an incremental replication
strategy if it detects an auto-incrementing primary key. This is
both unsafe (it will omit updates to existing rows) and in violation
of the spec.

Remove this auto-incremental-update logic from the full-table
mode.
@paultiplady paultiplady changed the title Fix full table Fix FULL_TABLE replication to comply with Singer spec Mar 1, 2022
@paultiplady paultiplady changed the title Fix FULL_TABLE replication to comply with Singer spec Draft: Fix FULL_TABLE replication to comply with Singer spec Mar 1, 2022
@@ -18,7 +18,7 @@ def generate_bookmark_keys(catalog_entry):
stream_metadata = md_map.get((), {})
replication_method = stream_metadata.get('replication-method')

base_bookmark_keys = {'last_pk_fetched', 'max_pk_values', 'version', 'initial_full_table_complete'}
base_bookmark_keys = {'last_pk_fetched', 'version', 'initial_full_table_complete'}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Remove last_pk_fetched too.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Full table" sync does not extract the full table for auto-incrementing PKs
1 participant