feat: Adding pg_legacy_replication verified source using decoderbufs #589

Open
wants to merge 75 commits into
base: master
@neuromantik33 neuromantik33 commented Dec 23, 2024

Differences between pg_legacy_replication and pg_replication

Overview

pg_legacy_replication is a fork of the verified pg_replication source. The primary goal of this fork is to provide logical replication capabilities for Postgres instances running versions earlier than 10, when the pgoutput plugin was not yet available. This fork draws inspiration from the original pg_replication source and the decoderbufs library, which is actively maintained by Debezium.

Key Differences from pg_replication

Replication User Ownership Requirements

One limitation of native Postgres replication is that the replication user must own a table in order to add it to a publication.
Additionally, once a table is added to a publication, it cannot be removed; a new replication slot has to be created instead, which loses any state tracking.

Limitations in pg_replication

The current pg_replication implementation has several limitations:

  • It supports only a single initial snapshot of the data.
  • It requires CREATE access to the source database in order to perform the initial snapshot.
  • Superuser access is required to replicate entire Postgres schemas.
    While the pg_legacy_replication source theoretically reads the entire WAL across all schemas, the current implementation using dlt transformers restricts this functionality. In practice, this has not been a common use case.
  • The implementation is opinionated in its approach to data transfer. Specifically, when updates or deletes are required, it defaults to a merge write disposition, which replicates live data without tracking changes over time.
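
To make the last point concrete, here is a minimal, library-free sketch of the difference between a merge-style and an append-style load. The event shape (`op`, `id`, `status`) and both helper functions are hypothetical and only illustrate the idea; they are not the pg_replication or dlt API.

```python
from typing import Any

# Hypothetical change events; the field names are illustrative only.
events: list[dict[str, Any]] = [
    {"op": "insert", "id": 1, "status": "new"},
    {"op": "update", "id": 1, "status": "paid"},
    {"op": "insert", "id": 2, "status": "new"},
    {"op": "delete", "id": 2},
]


def apply_merge(changes: list[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Merge-like load: keep only the latest state per primary key."""
    table: dict[int, dict[str, Any]] = {}
    for ev in changes:
        if ev["op"] == "delete":
            table.pop(ev["id"], None)  # the row disappears, no trace is kept
        else:
            table[ev["id"]] = {k: v for k, v in ev.items() if k != "op"}
    return table


def apply_append(changes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Append-like load: keep every event as its own row, preserving history."""
    return [dict(ev, seq=i) for i, ev in enumerate(changes)]


merged = apply_merge(events)     # live data only
history = apply_append(events)   # full change log
```

With the merge-style load only the current row for `id` 1 survives, while the append-style load keeps all four events, including the delete.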

Features of pg_legacy_replication

This fork of pg_replication addresses the aforementioned limitations and introduces the following improvements:

  • Adheres to the dlt philosophy by treating the WAL as an upstream resource. This replication stream is then transformed into individual dlt resources, with options such as write disposition, file format, and type hints specified at the resource level rather than at the source level.
  • Supports an initial snapshot of all tables using the transaction slot isolation level. Additionally, ad-hoc snapshots can be performed using the serializable deferrable isolation level, similar to pg_dump.
  • Emphasizes the use of pyarrow and parquet formats for efficient data storage and transfer. A dedicated backend has been implemented to support these formats.
  • Replication messages are decoded using Protocol Buffers (protobufs) in C, rather than relying on native Python byte buffer parsing. This ensures greater efficiency and performance.
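
The first feature, a single upstream stream fanned out into per-table resources with their own options, can be sketched roughly as follows. This is plain Python rather than dlt code: `TableOptions`, `wal_stream`, and `route` are hypothetical stand-ins chosen for illustration, not names from this PR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class TableOptions:
    # Hypothetical per-table settings, standing in for resource-level hints.
    write_disposition: str = "append"
    file_format: str = "parquet"


def wal_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for the single upstream replication stream."""
    yield {"table": "orders", "op": "insert", "row": {"id": 1}}
    yield {"table": "users", "op": "insert", "row": {"id": 7}}
    yield {"table": "orders", "op": "update", "row": {"id": 1}}
    yield {"table": "audit_log", "op": "insert", "row": {"id": 99}}


def route(
    stream: Iterable[Dict[str, Any]], options: Dict[str, TableOptions]
) -> Dict[str, List[Dict[str, Any]]]:
    """Split one stream into per-table batches; unconfigured tables are skipped."""
    batches: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for msg in stream:
        if msg["table"] in options:
            batches[msg["table"]].append(msg)
    return dict(batches)


# Each table carries its own options instead of one setting for the whole source.
options = {
    "orders": TableOptions(write_disposition="merge"),
    "users": TableOptions(),
}
batches = route(wal_stream(), options)
```

The stream is read once, and each configured table receives only its own changes, so options such as the write disposition can differ per table.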

Next steps

  • Add support for the wal2json replication plugin. This is particularly important for environments such as Amazon RDS, which supports wal2json, as opposed to on-premises or Google Cloud SQL instances that support decoderbufs.
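
As a rough idea of what wal2json support would involve, the sketch below turns one wal2json payload into row dictionaries. It assumes the plugin's format version 1 layout (a top-level `change` list with `columnnames`/`columnvalues`, and `oldkeys` for deletes); the function name and output shape are hypothetical and not part of this PR.

```python
import json
from typing import Any, Dict, List


def rows_from_wal2json(payload: str) -> List[Dict[str, Any]]:
    """Convert one wal2json (format version 1, assumed) payload into row dicts."""
    out: List[Dict[str, Any]] = []
    for change in json.loads(payload).get("change", []):
        if change["kind"] == "delete":
            # Deletes carry only the old key columns.
            keys = change.get("oldkeys", {})
            row = dict(zip(keys.get("keynames", []), keys.get("keyvalues", [])))
        else:
            row = dict(
                zip(change.get("columnnames", []), change.get("columnvalues", []))
            )
        out.append(
            {
                "table": f'{change["schema"]}.{change["table"]}',
                "kind": change["kind"],
                "row": row,
            }
        )
    return out


# Hand-written sample payload for illustration, not captured from a real server.
sample = json.dumps(
    {
        "change": [
            {
                "kind": "insert",
                "schema": "public",
                "table": "orders",
                "columnnames": ["id", "status"],
                "columntypes": ["integer", "text"],
                "columnvalues": [1, "new"],
            },
            {
                "kind": "delete",
                "schema": "public",
                "table": "orders",
                "oldkeys": {"keynames": ["id"], "keytypes": ["integer"], "keyvalues": [1]},
            },
        ]
    }
)
rows = rows_from_wal2json(sample)
```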

@rudolfix rudolfix self-assigned this Dec 25, 2024
@rudolfix rudolfix added the ci from fork Allows to run tests from PR coming from fork label Dec 25, 2024
2 participants