Add replica option #131

mvgijssel · 2021-10-08T15:57:41Z

Problem

In our data architecture we don't want to extract data directly from the primary Postgres instance, as this can have a (big) impact on the live production system. Unfortunately, as stated in the README, logical replication doesn't work for Postgres read replicas.

Proposed changes

To benefit from log based syncs without doing the bulk of the syncing on the primary instance, the proposed change is to add an option for a read replica, which will be used for all the traditional streams. This means that incremental and full table syncs will go to the read replica (this covers the initial sync), and the log based syncs go to the primary.
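A rough sketch of the routing described above, not the code in this PR. The function name `pick_instance` and the `initial_sync_done` flag are hypothetical and only illustrate which instance each kind of stream would read from.

```python
# Hypothetical helper for illustration only; the tap does not expose this function.
def pick_instance(replication_method, initial_sync_done=False):
    """Return 'primary' or 'replica' for a stream.

    Only log based streams that have finished their initial sync need the
    primary, because logical replication is not available on a read replica.
    """
    if replication_method == 'LOG_BASED' and initial_sync_done:
        return 'primary'
    # INCREMENTAL, FULL_TABLE and the initial sync of LOG_BASED streams
    # are ordinary queries, so they can run against the replica.
    return 'replica'
```

With this split, the primary only serves WAL processing, and every bulk read lands on the replica.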

NOTE: This change includes the changes from #130 because HackerOne relies on those changes as well

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation Update (if none of the other choices apply)

Checklist

Description above provides context of the change
I have added tests that prove my fix is effective or that my feature works
Unit tests for changes (not needed for documentation changes)
CI checks pass with my changes
Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
Commits follow "How to write a good git commit message"
Relevant documentation is updated including usage instructions

Samira-El · 2021-11-02T08:33:39Z

Hey @mvgijssel, thanks for this PR!

I more or less understand the need behind this PR, we do the same thing but in Pipelinewise FastSync.

But just for my own understanding, the outcome you want to have is that for a data pipeline that has incremental, full table and log based streams:

incremental streams to sync off of replica
full table to sync off of replica
log based to do initial sync off of replica and continue off of primary

is this correct?

mvgijssel · 2021-11-02T12:51:55Z

I more or less understand the need behind this PR, we do the same thing but in Pipelinewise FastSync.

Didn't know about this! I'm definitely going to check this out

is this correct?

Yes! We want the bulk of the work to be on the replica and only the WAL processing on the primary

judahrand

@Samira-El What do you think about these changes? They do seem useful. We are planning to deploy Pipelinewise to replicate from Postgres to BigQuery here at Thread. We are largely planning to use logical replication, however, we have some views which we would also like to replicate and being able to force those to use the replica would be good.

judahrand · 2022-01-11T13:44:57Z

tap_postgres/__init__.py

+            'replica_user': args.config['replica_user'],
+            'replica_password': args.config['replica_password'],


Could these not default to the same credentials as the primary? Similar to: https://github.com/transferwise/pipelinewise/blob/206e75e630933d1a2b2ab9afb36261a55ccf0e4c/pipelinewise/fastsync/commons/tap_postgres.py#L161-L164

Yeah I think that would be a great default!
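One way the suggested default could look, as a sketch only. It assumes the primary credentials live under the `user` and `password` config keys; the helper name `replica_credentials` is made up for this example.

```python
# Illustrative sketch; `replica_credentials` is a hypothetical helper and the
# 'user' / 'password' keys for the primary are an assumption.
def replica_credentials(config):
    """Fall back to the primary credentials when replica ones are not set."""
    return {
        'replica_user': config.get('replica_user', config['user']),
        'replica_password': config.get('replica_password', config['password']),
    }
```

Using `dict.get` with a fallback avoids the `KeyError` that `args.config['replica_user']` would raise for configs that only define the primary credentials.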

judahrand · 2022-01-11T13:46:17Z

tap_postgres/db.py

@@ -38,14 +38,27 @@ def fully_qualified_table_name(schema, table):
    return '"{}"."{}"'.format(canonicalize_identifier(schema), canonicalize_identifier(table))


-def open_connection(conn_config, logical_replication=False):
+def open_connection(conn_config, logical_replication=False, primary_connection=False):


Maybe use prioritize_primary similar to: https://github.com/transferwise/pipelinewise/blob/206e75e630933d1a2b2ab9afb36261a55ccf0e4c/pipelinewise/fastsync/commons/tap_postgres.py#L129

Yeah I think that's a good idea!

judahrand · 2022-01-11T13:47:53Z

tap_postgres/__init__.py

@@ -116,8 +116,7 @@ def sync_method_for_streams(streams, state, default_replication_method):
            continue

        if replication_method == 'LOG_BASED' and stream_metadata.get((), {}).get('is-view'):
-            raise Exception(f'Logical Replication is NOT supported for views. ' \
-                            f'Please change the replication method for {stream["tap_stream_id"]}')
+            continue 


Why is the behaviour here changed? This exception is useful isn't it? Rather than failing silently.

Yeah sorry this is a specific hack for HackerOne, this shouldn't be part of this PR! Same for the changes related to TOAST'ed Postgres values.

judahrand

@mvgijssel Does open_connection not need changing elsewhere to use use_replica to determine whether to connect to the replica or not rather than defaulting to the replica? Otherwise, what is the point in the use_replica flag?
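To make the question concrete, here is a sketch of how the flag could gate the choice. It is not the PR's implementation: `select_connection_params` and the `replica_host` key are hypothetical, `use_replica` is assumed to be a boolean config value, and `prioritize_primary` follows the naming suggested earlier in this review.

```python
# Hypothetical sketch; it only picks connection parameters and does not
# open a database connection.
def select_connection_params(conn_config, prioritize_primary=False):
    """Return the host and credentials a connection should use."""
    # Connect to the replica only when it is explicitly enabled and the
    # caller has not asked for the primary (e.g. for logical replication).
    use_replica = conn_config.get('use_replica', False) and not prioritize_primary
    prefix = 'replica_' if use_replica else ''
    return {
        'host': conn_config[prefix + 'host'],
        'user': conn_config[prefix + 'user'],
        'password': conn_config[prefix + 'password'],
    }
```

With this shape, leaving `use_replica` unset keeps the existing behaviour of always connecting to the primary.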

judahrand · 2022-01-11T16:45:33Z

@mvgijssel I hope you don't mind that I've opened a separate PR with just the replica option changes + some documentation. I hope the change is more likely to get merged with documentation and as a standalone change.

mvgijssel · 2022-01-11T21:45:36Z

@mvgijssel I hope you don't mind that I've opened a separate PR with just the replica option changes + some documentation. I hope the change is more likely to get merged with documentation and as a standalone change.

No worries! Happy this is being picked up ❤️

Samira-El · 2022-01-25T07:46:31Z

Hey @mvgijssel, this feature is now available thanks to Judah's PR #145 which was more minimal in changes.
Closing this PR.

deanmorin and others added 3 commits September 15, 2021 15:41

Fixes transferwise#107

f8b4440

Enable traditional streams to extract from replica instance

992d705

Don't process views

a073f5c

Samira-El added the enhancement New feature or request label Jan 5, 2022

judahrand reviewed Jan 11, 2022

judahrand mentioned this pull request Jan 11, 2022

Add use_secondary config flag #145

Merged

13 tasks

mvgijssel force-pushed the add-replica-option branch from 7516ab5 to a073f5c Compare January 13, 2022 10:19

Output partial schema and record when TOAST columns are missing

acd0b0f

This was referenced Jan 18, 2022

Add use_replica config flag thread/pipelinewise-tap-postgres#2

Merged

Perform logical replication after initial sync #144

Open

Samira-El closed this Jan 25, 2022

Samira-El added the duplicate This issue or pull request already exists label Jan 25, 2022
