Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure that Postgres always get a consistent snapshot #929

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

judahrand
Copy link
Contributor

@judahrand judahrand commented Apr 2, 2022

Problem

It is currently possible for PipelineWise to select an LSN to stream from which will result in missing data once the initial snapshot has been conducted.

This is due to the current mechanism of just selecting either the most recent LSN on the primary or the most recent LSN which has been replayed on the replica: https://github.com/thread/pipelinewise/blob/cfd2c2091eb60a5b71ea38486594ccfb91a4fe1c/pipelinewise/fastsync/commons/tap_postgres.py#L297-L322

This is not a safe thing to do as it does not guarantee a consistent view of the database.

Fixes: #930

Proposed changes

Inspired by how Debezium (another CDC tool written in Java) deals with initial snapshots of Postgres databases we will take advantage of two behaviours:

  1. When a replica slot is created Postgres calculates the first LSN at which the slot will provide a consistent view of the database (it also sets confirmed_flush_lsn to this value). We will stream from that point forward but use the current state of the database as the snapshot.
  2. Since PipelineWise only relieves logically decoded messages of complete transactions it should be safe to assume that any LSN that PipelineWise receives (and confirms) provides a consistent view of the database. Therefore, if the replication slot already exists we will take the current value of confirmed_flush_lsn to be the point at which we will start streaming.

In both of these cases, we may have data which will also be streamed to us by the replication slot but this should not be an issue as we are relying on an 'at least once' delivery mechanism already.

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • I have read the CONTRIBUTING doc
  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

@judahrand
Copy link
Contributor Author

@Samira-El This PR is ready for review, I believe.

@judahrand judahrand force-pushed the postgres-snapshot branch 4 times, most recently from b9ee0f5 to ce27325 Compare April 5, 2022 09:07
@judahrand judahrand mentioned this pull request Apr 5, 2022
@judahrand judahrand force-pushed the postgres-snapshot branch from 0b98a99 to d59e5d5 Compare April 5, 2022 14:32
@judahrand judahrand force-pushed the postgres-snapshot branch 3 times, most recently from 04eaffb to c04292a Compare April 19, 2022 08:47
This change ensures that the LSN at which the replication slot is
first consistent is availiable to the caller of the method.
There is a replication lag between a primary and replica database.
We get information about what data the replication slot can provide
from the primary but perform the initial data snapshot on the replica.
Therefore, we must make sure that the replica has caught up to at
least the LSN which we are going to begin streaming from the
replication slot.
@judahrand
Copy link
Contributor Author

@Samira-El

@judahrand judahrand closed this Apr 19, 2022
@judahrand judahrand reopened this Apr 19, 2022
@judahrand
Copy link
Contributor Author

@Samira-El

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tap Postgres FastSync inconsistent snapshot for logical replication
1 participant