feat: Adding pg_legacy_replication verified source using decoderbufs #589

neuromantik33 · 2024-12-23T03:33:44Z

Differences between `pg_legacy_replication` and `pg_replication`

Overview

pg_legacy_replication is a fork of the verified pg_replication source. The primary goal of this fork is to provide logical replication capabilities for Postgres instances running versions earlier than 10, when the pgoutput plugin was not yet available. This fork draws inspiration from the original pg_replication source and the decoderbufs library, which is actively maintained by Debezium.

Key Differences from `pg_replication`

Replication User Ownership Requirements

One of the limitations of native Postgres replication is that the replication user must own the tables in order to add them to a publication.
Additionally, once a table is added to a publication, it cannot be removed; doing so requires creating a new replication slot, which loses any state tracking.

Limitations in `pg_replication`

The current pg_replication implementation has several limitations:

It supports only a single initial snapshot of the data.
It requires CREATE access to the source database in order to perform the initial snapshot.
Superuser access is required to replicate entire Postgres schemas.
While the pg_legacy_replication source theoretically reads the entire WAL across all schemas, the current implementation using dlt transformers restricts this functionality. In practice, this has not been a common use case.
The implementation is opinionated in its approach to data transfer. Specifically, when updates or deletes are required, it defaults to a merge write disposition, which replicates live data without tracking changes over time.
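
To make the last point concrete, here is a minimal, self-contained sketch of the difference between a merge-like and an append-like load. It does not use dlt; the event shape and the helper names `apply_merge` and `apply_append` are hypothetical and only mimic what each write disposition does to a stream of row changes.

```python
from typing import Any, Dict, List

# Hypothetical change events, loosely modelled on decoded replication messages.
Change = Dict[str, Any]

CHANGES: List[Change] = [
    {"op": "insert", "id": 1, "name": "alice", "lsn": 10},
    {"op": "update", "id": 1, "name": "alicia", "lsn": 20},
    {"op": "insert", "id": 2, "name": "bob", "lsn": 30},
    {"op": "delete", "id": 2, "name": None, "lsn": 40},
]


def apply_merge(changes: List[Change]) -> Dict[int, Dict[str, Any]]:
    """Keep only the latest state per primary key (merge-like behaviour)."""
    table: Dict[int, Dict[str, Any]] = {}
    for change in changes:
        if change["op"] == "delete":
            # A delete removes the live row; no trace of it remains.
            table.pop(change["id"], None)
        else:
            # Inserts and updates overwrite whatever was there before.
            table[change["id"]] = {"id": change["id"], "name": change["name"]}
    return table


def apply_append(changes: List[Change]) -> List[Change]:
    """Keep every change as its own row (append-like behaviour)."""
    return [dict(change) for change in changes]


live = apply_merge(CHANGES)
history = apply_append(CHANGES)
print(live)          # only the current row for id 1 survives
print(len(history))  # every change is retained
```

With merge, the earlier value of row 1 and the whole lifetime of row 2 are gone; with append, they remain queryable by `lsn`.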

Features of `pg_legacy_replication`

This fork of pg_replication addresses the aforementioned limitations and introduces the following improvements:

Adheres to the dlt philosophy by treating the WAL as an upstream resource. This replication stream is then transformed into various dlt resources, with customizable options for write disposition, file formats, type hints, etc., specified at the resource level rather than at the source level.
Supports an initial snapshot of all tables using the transaction slot isolation level. Additionally, ad-hoc snapshots can be performed using the serializable deferrable isolation level, similar to pg_dump.
Emphasizes the use of pyarrow and parquet formats for efficient data storage and transfer. A dedicated backend has been implemented to support these formats.
Replication messages are decoded using Protocol Buffers (protobufs) in C, rather than relying on native Python byte buffer parsing. This ensures greater efficiency and performance.
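
The first point, one upstream stream fanned out into per-table resources that each carry their own options, can be sketched in plain Python. This is an illustration under stated assumptions, not the source's actual API: `TableOptions`, `wal_stream`, and `table_resource` are hypothetical names, and the messages are hard-coded stand-ins for decoded replication messages.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


@dataclass
class TableOptions:
    """Hypothetical per-table settings, held by the resource rather than the source."""
    write_disposition: str = "append"


def wal_stream() -> Iterator[Row]:
    """Stand-in for the decoded replication stream (the upstream resource)."""
    yield {"table": "users", "op": "insert", "data": {"id": 1}}
    yield {"table": "orders", "op": "insert", "data": {"id": 7}}
    yield {"table": "users", "op": "update", "data": {"id": 1}}


def table_resource(
    upstream: Iterable[Row], table: str, options: TableOptions
) -> Iterator[Row]:
    """Derive one table's rows from the shared stream and tag them with its options."""
    for msg in upstream:
        if msg["table"] == table:
            yield {
                **msg["data"],
                "_op": msg["op"],
                "_disposition": options.write_disposition,
            }


# Materialize the stream once so several table resources can read the same messages.
messages: List[Row] = list(wal_stream())
users = list(table_resource(messages, "users", TableOptions(write_disposition="merge")))
orders = list(table_resource(messages, "orders", TableOptions()))
print(users)
print(orders)
```

Other per-table settings mentioned above (file format, type hints) would sit alongside `write_disposition` in the same options object.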

Next steps

Add support for the wal2json replication plugin. This is particularly important for environments such as Amazon RDS, which supports wal2json, as opposed to on-premise or Google Cloud SQL instances that support decoderbufs.
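
As a rough idea of what such support would need to handle, the sketch below turns a wal2json payload into row dictionaries. It assumes wal2json's format-version 1 layout (one JSON object per transaction with a `change` array); the sample payload and the function name `rows_from_wal2json` are made up for illustration and are not part of this PR.

```python
import json
from typing import Any, Dict, List

# Sample payload assumed to follow wal2json's format-version 1 layout.
SAMPLE = """
{"change": [
  {"kind": "insert", "schema": "public", "table": "users",
   "columnnames": ["id", "name"], "columntypes": ["integer", "text"],
   "columnvalues": [1, "alice"]},
  {"kind": "delete", "schema": "public", "table": "users",
   "oldkeys": {"keynames": ["id"], "keytypes": ["integer"], "keyvalues": [1]}}
]}
"""


def rows_from_wal2json(payload: str) -> List[Dict[str, Any]]:
    """Flatten each change in a transaction into a table/op/data dictionary."""
    rows: List[Dict[str, Any]] = []
    for change in json.loads(payload)["change"]:
        if change["kind"] == "delete":
            # Deletes carry only the old key columns, not the full row.
            keys = change["oldkeys"]
            data = dict(zip(keys["keynames"], keys["keyvalues"]))
        else:
            data = dict(zip(change["columnnames"], change["columnvalues"]))
        rows.append(
            {
                "table": f'{change["schema"]}.{change["table"]}',
                "op": change["kind"],
                "data": data,
            }
        )
    return rows


rows = rows_from_wal2json(SAMPLE)
for row in rows:
    print(row)
```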


Nicolas ESTRADA added 30 commits December 16, 2024 14:22

fix: finally got pg_replication tests working as is

59e7557

feat: got decoderbufs to run and compile in docker

79220b7

chore: updated protobuf to latest compatible version

9de0835

chore: copying all files from pg_replication; format-lint is reformatting other sources -_-

75a0f7f

wip: saving work

73704af

wip: saving work

7d1b8e7

wip: saving work

ecbf98d

wip: removed all references to publications

3ed14da

fix: applied suggested changes mentioned here dlt-hub/dlt#1920

9fe0301

wip: saving work

197ba82

wip: finally got snapshot to work

c897ee0

chore: simply cleaning up

d303c04

chore: need to find a better way to clean up the underlying engine

6566fe4

wip: handling begin/commit

70d40a0

wip: saving work

f703431

wip: saving work

f001633

wip: saving work

c0df7c9

wip: saving work

aa464d5

wip: making progress

db09568

wip: saving work

c3c0518

refactor: some test refactoring

a5b1a87

wip: saving work

7fad621

wip: saving work

fbc65bc

wip: cleaning up + refactor

1299b60

wip: cleaning up + refactor

f44853b

wip: cleaning up + refactor

46200ca

wip: slowly progressing

f0f0146

wip: all tests pass now to update docs and cleanup

cd8d906

wip: still trying to get it work with all versions of dlt

02851f4

wip

beef6ea

Nicolas ESTRADA added 27 commits December 16, 2024 14:28

fix: small type corrections for pg9.6

fd4638b

fix: exposing table options for later arrow support

526eff3

wip: saving work for arrow

2f5ad15

wip: first test with arrow passing

32063e2

wip: almost done passing all tests

28f463d

wip: some arrow tests are still not passing

385e8a6

fix: done with pyarrow; too many issues with duckdb atm

a291b69

wip: some bug fixes

ba23505

wip: small refactoring

5993fb4

wip: duckdb needs patching, trying out new max_lsn

6db693a

wip: some refactoring of options to make certain features togglable

c53c9f9

wip: lsn and deleted ts are optional

ba1c3fc

feat: added optional transaction id

6b960df

feat: added optional commit timestamp

9fa9d98

fix: never handled missing type and added text oid mapping

1947029

fix: added some logging and bug fixes

7a7ba30

chore: basic refactoring

a752581

fix: minor corrections

4184ca9

chore: reverting back to prev state

3c7232f

chore: rebasing 1.x branch onto my own

c8f1ad2

fix: corrected bug regarding column names

7024ce7

chore: minor fixes

63b1de0

chore: small perf fixes and aligning with more adt

e8b2a0c

chore: refactoring and cleaning

4c33129

chore: finished docstrings

0b7c151

bugfix: misuse of defaultdict

ec72e36

Finally done with docs

ecc6089

rudolfix self-assigned this Dec 25, 2024

rudolfix added the `ci from fork` label (allows running tests from a PR coming from a fork) Dec 25, 2024

fix: wasn't able to execute local tests without these settings

dd5a63b
