[Bug] dbt-redshift microbatch materialization strategy does not use unique_key in config block #927

Open
hamzaamjad opened this issue Mar 20, 2025 · 0 comments
Labels
feature:microbatch Issues related to the microbatch incremental strategy pkg:dbt-redshift Issue affects dbt-redshift type:bug Something isn't working as documented


hamzaamjad commented Mar 20, 2025

Is this a new bug?

  - [x] I believe this is a new bug
  - [x] I have searched the existing issues, and I could not find an existing issue for this bug

Which packages are affected?

  - [ ] dbt-adapters
  - [ ] dbt-tests-adapter
  - [ ] dbt-athena
  - [ ] dbt-athena-community
  - [ ] dbt-bigquery
  - [ ] dbt-postgres
  - [x] dbt-redshift
  - [ ] dbt-snowflake
  - [ ] dbt-spark

Current Behavior

Microbatch incremental models built on Redshift do not respect the unique_key provided in the config block. When a microbatch incremental model is run, the unique_key is ignored: it is not reflected in the DML (Data Manipulation Language) generated by the microbatch incremental strategy to identify which records to delete from the existing model before inserting new records. As a result, microbatch incremental models can contain duplicate values of a unique_key.

Expected Behavior

Microbatch incremental models built on Redshift respect the unique_key provided in the config block. When a microbatch incremental model is run, the unique_key is reflected in the DML generated by the microbatch incremental strategy to identify which records to delete from the existing model before inserting new records. As a result, microbatch incremental models cannot contain duplicate values of a unique_key.
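To illustrate the difference, assume a hypothetical model analytics.orders configured with unique_key='order_id' and event_time='updated_at' (all relation, column, and timestamp values here are made up). The current DELETE bounds only on the batch window, while the expected DML would also match on the key:

```sql
-- Current behavior (sketch): only the batch window bounds the delete,
-- so an updated row whose updated_at moved out of its original batch
-- survives alongside its re-inserted copy.
delete from analytics.orders
where updated_at >= timestamp '2025-01-01 00:00:00'
  and updated_at < timestamp '2025-01-02 00:00:00';

-- Expected behavior (sketch): rows matching an incoming unique_key
-- are removed before the insert, preventing duplicates.
delete from analytics.orders
where order_id in (
    select distinct order_id
    from orders__dbt_tmp
);
```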

Steps To Reproduce

  1. Configure a dbt project in dbt Cloud running on Latest.
  2. Identify an existing model, materialized as a table or incremental model, that you would like to run using the microbatch materialization strategy.
  3. Define the configuration in the model similar to the below:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='model_unique_key',
    event_time='model_event_time',
    begin='2025-01-01',
    batch_size='model_batch_size'
) }}
```

  4. Ensure that your model has a uniqueness test configured on model_unique_key.
  5. Ensure that you have a set of records whose model_event_time places them in the latest batch. Ideally, model_event_time points at a timestamp that can change, such as record_updated_at.
  6. Run `dbt build --select your_model`.
  7. The model should run successfully, but the test for uniqueness on model_unique_key will fail.

Relevant log output

Environment

- OS: dbt Cloud
- Python: dbt Cloud Latest
- dbt-adapters: dbt Cloud Latest
- dbt-redshift: dbt Cloud Latest

Additional Context

Expected Adapter-specific Behavior based on Documentation

The Adapter-specific Behavior section in the incremental microbatch documentation specifies that:

dbt-redshift: Uses the delete+insert strategy, which "inserts" or "replaces."

This implies to a user that the microbatch materialization strategy for Redshift will mirror the expected behavior of the delete+insert materialization strategy.

The Incremental materialization strategies section in the Redshift configurations documentation specifies that:

delete+insert (default when unique_key is defined)

This implies to a user that delete+insert relies on a unique_key and will respect that unique_key.
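For reference, the default delete+insert implementation combines the key match and any incremental predicates with AND in the DELETE, roughly as follows (a sketch only; relation and column names are illustrative, not the macro's verbatim output):

```sql
-- Default delete+insert shape (sketch): key match AND-ed with predicates,
-- so only rows matching BOTH conditions are deleted.
delete from target_table
where (model_unique_key) in (
    select distinct model_unique_key
    from target_table__dbt_tmp
)
and model_event_time >= timestamp '2025-01-01 00:00:00'
and model_event_time < timestamp '2025-01-02 00:00:00';
```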

Issue in Source Code for dbt-redshift adapter

The issue is found in the incremental_merge.sql file in the dbt-adapters repository. The specific section to reference begins on Line 57.

It appears that handling for the unique_key config was intentionally removed from this microbatch strategy. I don't think this is correct -- unique_key should be an optional config value that can be provided to microbatch incremental models that are run in Redshift.

Furthermore, I'll note that the function cannot simply call out to the default implementation of delete+insert. Calling the default__get_delete_insert_merge_sql macro would generate DML that uses AND conditions in the WHERE clause of the DELETE statement, which in my testing can cause issues with microbatch models that use this strategy. Instead, I had to roll a custom microbatch incremental strategy, which is as follows:

```sql
{% macro get_incremental_microbatch_sql(arg_dict) %}

  {% if arg_dict["unique_key"] %}
    {%- set target = arg_dict["target_relation"] -%}
    {%- set unique_key = arg_dict["unique_key"] -%}
    {%- set source = arg_dict["temp_relation"] -%}
    {%- set dest_columns = arg_dict["dest_columns"] -%}
    {%- set predicates = [] -%}

    {#-- Rewrite user-supplied predicates so they reference the target relation --#}
    {%- set incremental_predicates = [] if arg_dict.get('incremental_predicates') is none else arg_dict.get('incremental_predicates') -%}
    {%- for pred in incremental_predicates -%}
        {% if "DBT_INTERNAL_DEST." in pred %}
            {%- set pred = pred | replace("DBT_INTERNAL_DEST.", target ~ ".") -%}
        {% endif %}
        {% if "dbt_internal_dest." in pred %}
            {%- set pred = pred | replace("dbt_internal_dest.", target ~ ".") -%}
        {% endif %}
        {% do predicates.append(pred) %}
    {% endfor %}

    {% if not model.batch or (not model.batch.event_time_start or not model.batch.event_time_end) -%}
        {% do exceptions.raise_compiler_error('dbt could not compute the start and end timestamps for the running batch') %}
    {% endif %}

    {#-- Add additional incremental_predicates to filter for the batch window --#}
    {% do predicates.append(model.config.event_time ~ " >= TIMESTAMP '" ~ model.batch.event_time_start ~ "'") %}
    {% do predicates.append(model.config.event_time ~ " < TIMESTAMP '" ~ model.batch.event_time_end ~ "'") %}
    {% do arg_dict.update({'incremental_predicates': predicates}) %}

    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

    {#-- Normalize unique_key to a list so single- and multi-column keys work alike --#}
    {% if unique_key is string %}
        {% set unique_key = [unique_key] %}
    {% endif %}
    {%- set unique_key_str = unique_key | join(', ') -%}

    {#-- Delete rows matching an incoming unique_key, OR-ing in any user predicates --#}
    delete from {{ target }}
    where ({{ unique_key_str }}) in (
        select distinct {{ unique_key_str }}
        from {{ source }}
    )
    {%- if incremental_predicates %}
    OR (
        {% for predicate in incremental_predicates %}
            {%- if not loop.first %}and {% endif -%} {{ predicate }}
        {% endfor %}
    )
    {%- endif -%};

    insert into {{ target }} ({{ dest_cols_csv }})
    (
        select {{ dest_cols_csv }}
        from {{ source }}
    )

  {% else %}
    {#-- No unique_key configured: fall back to the adapter's default microbatch strategy --#}
    {% do return(redshift__get_incremental_microbatch_sql(arg_dict)) %}
  {% endif %}

{% endmacro %}
```
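If the macro above is saved in your dbt project's macros/ directory, it should shadow the built-in entry point (root-project macros take precedence over package macros of the same name). The compiled DML for a batch then looks roughly like this (a sketch with illustrative relation and column names; when incremental_predicates are configured, they are additionally OR'd into the DELETE):

```sql
-- Sketch of the compiled output for one batch
delete from analytics.my_model
where (model_unique_key) in (
    select distinct model_unique_key
    from my_model__dbt_tmp
);

insert into analytics.my_model (id, model_unique_key, model_event_time)
(
    select id, model_unique_key, model_event_time
    from my_model__dbt_tmp
)
```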
@hamzaamjad hamzaamjad added triage:product In Product's queue type:bug Something isn't working as documented labels Mar 20, 2025
@amychen1776 amychen1776 added feature:microbatch Issues related to the microbatch incremental strategy pkg:dbt-redshift Issue affects dbt-redshift and removed triage:product In Product's queue labels Mar 21, 2025