This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Add replica option #131

Closed
wants to merge 4 commits into from

mvgijssel
Problem

In our data architecture we don't want to extract data directly from the primary Postgres instance, as this can have a (big) impact on the live production system. Unfortunately, as stated in the README, logical replication doesn't work for Postgres read replicas.

Proposed changes

To benefit from log based syncs without doing the bulk of the syncing on the primary instance, the proposed change is to add an option for a read replica, which will be used for all the traditional streams. This means that incremental and full table syncs will go to the read replica, which is also what happens for the initial sync, and the log based syncs go to the primary.

NOTE: This change includes the changes from #130 because HackerOne relies on those changes as well

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

@Samira-El
Contributor

Hey @mvgijssel, thanks for this PR!

I more or less understand the need behind this PR, we do the same thing but in Pipelinewise FastSync.

But just for my own understanding, the outcome you want to have is that for a data pipeline that has incremental, full table and log based streams:

  • incremental streams to sync off of replica
  • full table to sync off of replica
  • log based to do initial sync off of replica and continue off of primary

is this correct?

@mvgijssel
Author

mvgijssel commented Nov 2, 2021

I more or less understand the need behind this PR, we do the same thing but in Pipelinewise FastSync.

Didn't know about this! I'm definitely going to check this out

is this correct?

Yes! We want the bulk of the work to be on the replica and only the WAL processing on the primary.
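The routing confirmed in this exchange can be sketched as a small lookup. This is only an illustration, not code from the PR: `target_instance` and its `initial_sync_done` flag are hypothetical names, and the sketch assumes the standard Singer replication method names `INCREMENTAL`, `FULL_TABLE` and `LOG_BASED`.

```python
def target_instance(replication_method, initial_sync_done=False):
    """Return which instance a stream should read from (hypothetical helper).

    Incremental and full table streams always read from the replica.
    Log based streams do their initial sync on the replica and then
    continue from the primary, since logical replication needs the primary.
    """
    if replication_method in ('INCREMENTAL', 'FULL_TABLE'):
        return 'replica'
    if replication_method == 'LOG_BASED':
        return 'primary' if initial_sync_done else 'replica'
    raise ValueError(f'Unknown replication method: {replication_method}')
```

With this shape, only the WAL processing of log based streams ever touches the primary.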

@Samira-El Samira-El added the enhancement New feature or request label Jan 5, 2022
Contributor

@judahrand judahrand left a comment

@Samira-El What do you think about these changes? They do seem useful. We are planning to deploy Pipelinewise to replicate from Postgres to BigQuery here at Thread. We are largely planning to use logical replication; however, we have some views which we would also like to replicate, and being able to force those to use the replica would be good.

Comment on lines +426 to +427
'replica_user': args.config['replica_user'],
'replica_password': args.config['replica_password'],
Contributor

Author

Yeah I think that would be a great default!

@@ -38,14 +38,27 @@ def fully_qualified_table_name(schema, table):
     return '"{}"."{}"'.format(canonicalize_identifier(schema), canonicalize_identifier(table))


-def open_connection(conn_config, logical_replication=False):
+def open_connection(conn_config, logical_replication=False, primary_connection=False):
Contributor

Author

Yeah I think that's a good idea!

@@ -116,8 +116,7 @@ def sync_method_for_streams(streams, state, default_replication_method):
             continue

         if replication_method == 'LOG_BASED' and stream_metadata.get((), {}).get('is-view'):
-            raise Exception(f'Logical Replication is NOT supported for views. ' \
-                            f'Please change the replication method for {stream["tap_stream_id"]}')
+            continue
Contributor

Why is the behaviour here changed? This exception is useful isn't it? Rather than failing silently.
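To illustrate the point about failing loudly, here is a minimal sketch of the check as a standalone function. It is not the tap's actual code: `reject_log_based_views` is a hypothetical helper, and the flat stream dictionaries with `replication_method` and `is_view` keys are an assumed simplification of the Singer metadata used in the diff above.

```python
def reject_log_based_views(streams):
    """Raise if any view is configured for LOG_BASED replication.

    `streams` is assumed to be a list of dicts with the keys
    'tap_stream_id', 'replication_method' and (optionally) 'is_view'.
    """
    for stream in streams:
        if stream['replication_method'] == 'LOG_BASED' and stream.get('is_view'):
            # Raising stops the run, so the misconfiguration is visible,
            # whereas `continue` would skip the view without any signal.
            raise Exception(
                'Logical Replication is NOT supported for views. '
                f'Please change the replication method for {stream["tap_stream_id"]}'
            )
```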

Author

Yeah sorry this is a specific hack for HackerOne, this shouldn't be part of this PR! Same for the changes related to TOAST'ed Postgres values.

Contributor

@judahrand judahrand left a comment

@mvgijssel Does open_connection not need changing elsewhere to use use_replica to determine whether to connect to the replica or not rather than defaulting to the replica? Otherwise, what is the point in the use_replica flag?
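One way the flag could gate the choice is sketched below. This is an assumption-laden illustration rather than the PR's implementation: `connection_settings` is a hypothetical helper that only selects settings and opens no connection, and the `replica_host` and `replica_port` keys are assumed by analogy with the `replica_user` and `replica_password` keys shown in the diff.

```python
def connection_settings(conn_config, primary_connection=False):
    """Choose primary or replica settings (hypothetical helper).

    The replica is used only when `use_replica` is set in the config and
    the caller has not explicitly asked for the primary.
    """
    use_replica = bool(conn_config.get('use_replica')) and not primary_connection
    prefix = 'replica_' if use_replica else ''
    settings = {'dbname': conn_config['dbname']}
    for key in ('host', 'port', 'user', 'password'):
        # Fall back to the primary value when a replica key is missing.
        settings[key] = conn_config.get(prefix + key, conn_config[key])
    return settings
```

Under this sketch, log based syncs would pass `primary_connection=True`, and a config without `use_replica` behaves exactly as before.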

@judahrand judahrand mentioned this pull request Jan 11, 2022
@judahrand
Contributor

@mvgijssel I hope you don't mind that I've opened a separate PR with just the replica option changes + some documentation. I hope the change is more likely to get merged with documentation and as a standalone change.

@mvgijssel
Author

@mvgijssel I hope you don't mind that I've opened a separate PR with just the replica option changes + some documentation. I hope the change is more likely to get merged with documentation and as a standalone change.

No worries! Happy this is being picked up ❤️

@Samira-El
Contributor

Hey @mvgijssel, this feature is now available thanks to Judah's PR #145 which was more minimal in changes.
Closing this PR.

@Samira-El Samira-El closed this Jan 25, 2022
@Samira-El Samira-El added the duplicate This issue or pull request already exists label Jan 25, 2022