Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify the primary key of the referenced table in a virtual reference #270

Open
Tracked by #258
mikfreedman opened this issue Feb 26, 2025 · 3 comments
Open
Tracked by #258
Assignees
Labels

Comments

@mikfreedman
Copy link
Contributor

Users should be able to specify the primary key of the referenced table in a virtual reference so that they can handle non-standard relationships in greenmask.

The docs currently state that:

You do not need to define primary key of the referenced table. Greenmask will automatically resolve it and use it in the join condition.

from database_subset.md.

@mikfreedman
Copy link
Contributor Author

mikfreedman commented Feb 26, 2025

Here is a sample schema that illustrates the challenge:

CREATE TABLE legacy_accounts (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE accounts (
    id SERIAL PRIMARY KEY,
    legacy_id INTEGER,
    username TEXT NOT NULL
);

-- Insert sample data with explicit IDs
INSERT INTO legacy_accounts (id, name) VALUES
(100, 'Legacy User 1'),
(101, 'Legacy User 2'),
(102, 'Legacy User 3');

INSERT INTO accounts (id, legacy_id, username) VALUES
(1, 100, 'user1'),
(2, 101, 'user2'),
(3, NULL, 'user3'); -- Account without a legacy reference

@mikfreedman
Copy link
Contributor Author

Here is a sample config.yml, note a new key called referenced_table_id

common:
  pg_bin_path: "/usr/bin/"
  tmp_dir: "/tmp"

log:
  level: "debug"
  format: "text"

storage:
  directory:
    path: "/tmp/pg_dump_test"

dump:
  pg_dump_options:
    jobs: 10
    exclude-schema: '("teSt"*|test*)'
    load-via-partition-root: true
  virtual_references:
    - schema: "public"
      name: "legacy_accounts"
      references:
        - schema: "public"
          name: "accounts"
          columns:
            - name: "id" # the primary key on the table
           referenced_table_id: "legacy_id" # the field on the accounts table to reference rather than its id column
          not_null: true
  transformation:
    - schema: "public"
      name: "accounts"
      subset_conds:
        - "public.accounts.id = 1"

This config should constrain the legacy_accounts table and only return the first row with ID 100

@wwoytenko wwoytenko self-assigned this Feb 26, 2025
@wwoytenko wwoytenko mentioned this issue Jan 31, 2025
13 tasks
@wwoytenko
Copy link
Contributor

@mikfreedman Hi!

Thank you for raising a feature request. I really appreciate that you attached the configuration and DB schema for this issue.

I would recommend implementing a referenced_table_id (or a similar attribute) that supports a list, as there might be multiple columns involved in FK/PK relationships.

Another important case to consider: what if a user creates a second primary key (PK)?
Based on my findings, there are two edge cases we should address:

  • The table does not have a PK at all (which is fine—a virtual PK will work correctly).
  • Table A has two foreign keys (FKs) pointing to table B—one via the original PK and another via a virtual PK. This scenario likely requires two JOINs, as there are distinct references. Given that we're considering this case, it's reasonable to assume table B has additional filtering conditions.

It looks like this won't affect the logic and can be implemented fairly straightforwardly.

Feel free to share you thoughts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants