[dagster-dbt] Make default backfill_policy=None always for @dbt_asset #22280

Open
smackesey wants to merge 1 commit into master from sean/standardize-dagster-dbt-default-backfill-policy

Conversation

@smackesey (Collaborator) commented Jun 4, 2024

Summary & Motivation

@dbt_assets currently sets a default backfill policy of BackfillPolicy.single_run() if there is a TimeWindowPartitionsDefinition. This is non-standard behavior since all other assets have a default backfill policy of None.

With the recent fix that makes job backfills respect backfill policies, this has caused problems for some users: all assets in a job must have the same backfill policy, so users who mix dbt and regular assets in the same job without explicitly setting a backfill policy now see failures.

This PR removes the non-standard default, which is a breaking change (but it should be noted backfill policies are experimental).
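For users who relied on the old implicit default, a minimal sketch of how to opt back in explicitly after this change (the manifest path and asset names are placeholders for illustration):

```python
from dagster import AssetExecutionContext, BackfillPolicy, DailyPartitionsDefinition
from dagster_dbt import DbtCliResource, dbt_assets

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@dbt_assets(
    manifest="target/manifest.json",  # placeholder path for illustration
    partitions_def=daily,
    # Previously implied for time-window-partitioned dbt assets; now it must
    # be set explicitly to keep single-run backfills.
    backfill_policy=BackfillPolicy.single_run(),
)
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```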

How I Tested These Changes

Added a test to make sure a job containing both a @dbt_assets definition and a regular @asset can be created.
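A rough sketch of what such a test could look like (the real test uses a dbt project fixture; the manifest path and asset names here are placeholders):

```python
from dagster import DailyPartitionsDefinition, Definitions, asset, define_asset_job
from dagster_dbt import DbtCliResource, dbt_assets

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily)
def foo() -> None: ...

@dbt_assets(manifest="target/manifest.json", partitions_def=daily)  # placeholder manifest
def my_dbt_assets(context, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

def test_job_with_dbt_and_regular_assets() -> None:
    # With backfill_policy defaulting to None for both definitions, the job
    # now resolves without a backfill-policy mismatch error.
    asset_job = define_asset_job("asset_job", [foo, my_dbt_assets])
    assert Definitions(assets=[foo, my_dbt_assets], jobs=[asset_job]).get_job_def("asset_job")
```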

@smackesey (Collaborator, Author) commented Jun 4, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@smackesey marked this pull request as ready for review on June 4, 2024 13:51
@smackesey force-pushed the sean/standardize-dagster-dbt-default-backfill-policy branch from 47115c5 to 2a1e03c on June 4, 2024 14:10
@smackesey force-pushed the sean/standardize-dagster-dbt-default-backfill-policy branch from 2a1e03c to 73a7421 on June 4, 2024 14:42
@smackesey requested review from rexledesma and schrockn on June 4, 2024 15:34

assert my_dbt_assets.backfill_policy == expected_backfill_policy


@smackesey (Collaborator, Author) commented on the diff:

This whole test is no longer needed as there is no special backfill policy logic in @dbt_assets anymore.


asset_job = define_asset_job("asset_job", [foo, my_dbt_assets])

assert Definitions(assets=[foo, my_dbt_assets], jobs=[asset_job]).get_job_def("asset_job")
@smackesey (Collaborator, Author) commented on the diff:

I confirmed this test was failing before the changes.

@rexledesma requested a review from sryza on June 4, 2024 15:38
@schrockn (Member) commented Jun 4, 2024

This is a breaking change but I think the original behavior is so counterintuitive that we should consider it a bug, unless I am missing something obvious. Want @rexledesma to weigh in ofc since he has more context.

@rexledesma (Contributor) left a review comment:

Adding @sryza here for visibility.

The non-standard default for backfill_policy in @dbt_assets was added in #17822.

The context was that the initial introduction of backfill_policy was providing a broken experience for folks. Before, they could do a single-run backfill through the UI. Afterwards, they had to manually specify BackfillPolicy.single_run() to get that behavior.

This change would remove this bandaid for users.


Part of the problem was that, after the introduction of backfill_policy, the UI did not clearly indicate that the user now had to take action in code to accomplish a single-run backfill. Do we have that guidance in the UI now?

Furthermore, when assets with mismatched backfill_policy values are combined in a job, do we have an appropriate error message prompting the user to unify the backfill_policy of these definitions?

@smackesey (Collaborator, Author) commented Jun 4, 2024

> Part of the problem was that, after the introduction of backfill_policy, the UI did not clearly indicate that the user now had to take action in code to accomplish a single-run backfill.

That makes sense, but is there something special about dbt and TimeWindowPartitionsDefinition in particular? Because it seems like one could make the same argument for any partitioned asset, but all other partitioned assets default to backfill_policy=None AFAICT.

> Do we have that guidance in the UI now?

The UI has some text notifying the user that backfill policies will be used, but I don't think it provides specific guidance like: "if you want a single-run backfill, set BackfillPolicy.single_run()." Perhaps we should be more explicit? Curious what @sryza thinks.

> Furthermore, when assets with mismatched backfill_policy values are combined in a job, do we have an appropriate error message prompting the user to unify the backfill_policy of these definitions?

Yes: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/definitions/unresolved_asset_job_definition.py#L186-L193
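For reference, a minimal sketch of the kind of mismatch that trips that check (plain @asset definitions stand in for the dbt case; the exact exception type and message are not pinned here, so the example only asserts that resolution fails):

```python
import pytest
from dagster import (
    BackfillPolicy,
    DailyPartitionsDefinition,
    Definitions,
    asset,
    define_asset_job,
)

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily, backfill_policy=BackfillPolicy.single_run())
def single_run_asset() -> None: ...

@asset(partitions_def=daily)  # backfill_policy defaults to None
def multi_run_asset() -> None: ...

def test_mismatched_backfill_policies_rejected() -> None:
    # Assets with differing backfill policies cannot share a job; the error
    # surfaces when the asset job is resolved.
    with pytest.raises(Exception):
        Definitions(
            assets=[single_run_asset, multi_run_asset],
            jobs=[define_asset_job("mixed_job", [single_run_asset, multi_run_asset])],
        ).get_job_def("mixed_job")
```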

@rexledesma (Contributor) commented:

> That makes sense, but is there something special about dbt and TimeWindowPartitionsDefinition in particular?

dagster-dbt is a pretty common entrypoint for folks into Dagster. We were just making it ergonomic for them, since things work out of the box. For dbt, a single-run backfill is a sensible default, considering that it can be accomplished by running dbt with --full-refresh.

> Yes: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_core/definitions/unresolved_asset_job_definition.py#L186-L193

The current message describes the definition error, but does not prescribe an explicit remedy. For example, which definitions should I modify to ensure that the backfill policies are the same?

I would love to have an error message that mentions the names of the underlying node definitions violating the constraint and the corresponding backfill policies.

@smackesey (Collaborator, Author) commented Jun 10, 2024

@schrockn Since @rexledesma laid out the reasoning for the current behavior, can you weigh in as to whether to proceed with this PR?

Also I've improved the error message for multiple backfill policies in an asset job here: #22433

@rexledesma (Contributor) commented:

Some related discussion is happening in issues #22254 and #22356.

@smackesey (Collaborator, Author) commented Jun 12, 2024

@rexledesma I'm not familiar enough with dagster-dbt to know this without digging: will partitioned assets created with @dbt_assets work "by default" with partition ranges (i.e. single-run backfills)? Do users have to set up anything else (like a special IO manager)?

I'm asking because there are users with partitioned dbt assets (see here) who don't seem to be using the single-run-backfill functionality, since they've been backfilling using jobs.

@rexledesma (Contributor) commented:

@smackesey They don't need a custom IO manager. Instead, they need to pass the partition information into the dbt invocation in the body of their execution function for things to work properly. https://docs.dagster.io/integrations/dbt/reference#building-incremental-models-using-partitions
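For context, the documented pattern looks roughly like this (a sketch assuming a daily partitions definition and incremental models that filter on min_date/max_date vars; names and the manifest path are placeholders):

```python
import json

from dagster import AssetExecutionContext, DailyPartitionsDefinition
from dagster_dbt import DbtCliResource, dbt_assets

daily = DailyPartitionsDefinition(start_date="2024-01-01")

@dbt_assets(manifest="target/manifest.json", partitions_def=daily)  # placeholder manifest
def partitioned_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Forward the partition (or, for a single-run backfill, the partition
    # range) to dbt as vars that the incremental models can filter on.
    time_window = context.partition_time_window
    dbt_vars = {
        "min_date": time_window.start.isoformat(),
        "max_date": time_window.end.isoformat(),
    }
    yield from dbt.cli(["build", "--vars", json.dumps(dbt_vars)], context=context).stream()
```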

@rexledesma (Contributor) left a review comment:

Not sure if we need this anymore -- did we end up fixing this behavior in the core framework?

@schrockn (Member) left a review comment:

q mgmt
