Backfill retries #23679

Merged
merged 12 commits into from
Oct 16, 2024

Conversation

jamiedemaria
Contributor

@jamiedemaria jamiedemaria commented Aug 15, 2024

Summary & Motivation

Enables re-executing a backfill, retrying either all partitions or only the failed partitions.

Re-uses some of the graphene types for run re-execution. I could create different types that are backfill specific instead.

If reexecuting from failure:
For asset backfills, this creates a new backfill that targets the set of assets that were not successfully materialized in the first backfill. For job backfills, it uses the existing fromFailure attribute to retry the job backfill.

Constraints:

  • the first backfill must be in a completed state before it can be retried
  • for asset backfills, when reexecuting from failure, at least some assets must not have been materialized in the first backfill. This differs from another action.

When a retried backfill is created we add the parent backfill id and the root backfill id as tags like we do for run retries.
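
The retry rules above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `Backfill` dataclass, `plan_retry`, and the tag key strings are hypothetical stand-ins, and asset partitions are modeled as plain strings.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet


class Strategy(Enum):
    ALL = "ALL"
    FROM_FAILURE = "FROM_FAILURE"


# Hypothetical tag keys; the real tag names are not stated in this description.
PARENT_TAG = "parent_backfill_id"
ROOT_TAG = "root_backfill_id"


@dataclass
class Backfill:
    """Hypothetical stand-in for a backfill record."""

    backfill_id: str
    completed: bool
    targeted: FrozenSet[str]
    materialized: FrozenSet[str]
    tags: Dict[str, str] = field(default_factory=dict)


def plan_retry(parent: Backfill, strategy: Strategy, new_id: str) -> Backfill:
    # Constraint 1: the first backfill must be in a completed state.
    if not parent.completed:
        raise ValueError("backfill must be completed before it can be retried")

    if strategy is Strategy.FROM_FAILURE:
        # Retry only what was not successfully materialized the first time.
        targets = parent.targeted - parent.materialized
        # Constraint 2: there must be something left to retry.
        if not targets:
            raise ValueError("nothing to retry: everything was materialized")
    else:
        targets = parent.targeted

    # The root id is inherited through a chain of retries, so a retry of a
    # retry still points back at the original backfill.
    root_id = parent.tags.get(ROOT_TAG, parent.backfill_id)
    tags = {PARENT_TAG: parent.backfill_id, ROOT_TAG: root_id}
    return Backfill(new_id, False, targets, frozenset(), tags)
```

Inheriting the root id from the parent's tags is what keeps the whole chain of retries traceable to one original backfill, mirroring how the description says run retries are tagged.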

How I Tested These Changes

new tests

Contributor Author

jamiedemaria commented Aug 15, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.



github-actions bot commented Aug 15, 2024

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-n05gsv92q-elementl.vercel.app
https://jamie-backfill-retries.core-storybook.dagster-docs.io

Built with commit 58456f5.
This pull request is being automatically deployed with vercel-action

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-retries branch 2 times, most recently from b0aa0c5 to b250d4f Compare August 16, 2024 17:10
@jamiedemaria jamiedemaria marked this pull request as ready for review August 16, 2024 19:00
@jamiedemaria jamiedemaria requested review from sryza and prha August 16, 2024 19:00
@jamiedemaria
Contributor Author

moving this back to draft since i'm going to shift focus to status and filtering (see discussion in planning doc about retries not being high priority for mvp)

@jamiedemaria jamiedemaria marked this pull request as draft August 19, 2024 14:32
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-retries branch from b250d4f to c14ad4f Compare August 21, 2024 17:54
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-retries branch 2 times, most recently from 03be000 to ddb3f7f Compare October 4, 2024 15:21
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-retries branch 4 times, most recently from 86437e4 to 96a36a0 Compare October 8, 2024 19:46
@jamiedemaria jamiedemaria mentioned this pull request Oct 8, 2024
3 tasks
@jamiedemaria jamiedemaria force-pushed the jamie/backfill-retries branch from 28dea82 to 87f8c68 Compare October 9, 2024 16:48
@jamiedemaria jamiedemaria marked this pull request as ready for review October 9, 2024 16:51
def mutate(
    self,
    graphene_info: ResolveInfo,
    reexecutionParams: GrapheneReexecutionParams,
Contributor Author

i think it could make sense to make a backfill-specific version of GrapheneReexecutionParams that would take parentBackfillId and strategy where strategy can be FROM_FAILURE or ALL. i will make that update
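
A backfill-specific params type along those lines might look something like the sketch below. It uses plain dataclasses rather than graphene so that it stays self-contained; `BackfillReexecutionParams` and `from_graphql` are hypothetical names, and only `parentBackfillId`, `strategy`, `FROM_FAILURE`, and `ALL` come from the comment above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping


class ReexecutionStrategy(Enum):
    FROM_FAILURE = "FROM_FAILURE"
    ALL = "ALL"


@dataclass(frozen=True)
class BackfillReexecutionParams:
    """Hypothetical backfill-specific counterpart to the run re-execution params."""

    parent_backfill_id: str
    strategy: ReexecutionStrategy

    @classmethod
    def from_graphql(cls, raw: Mapping[str, Any]) -> "BackfillReexecutionParams":
        # Enum lookup by value raises ValueError for an unknown strategy,
        # so malformed input is rejected before any backfill is touched.
        return cls(
            parent_backfill_id=raw["parentBackfillId"],
            strategy=ReexecutionStrategy(raw["strategy"]),
        )

    @property
    def from_failure(self) -> bool:
        return self.strategy is ReexecutionStrategy.FROM_FAILURE
```

Keeping the input to just two fields avoids carrying over run-specific fields that have no meaning for a backfill.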

backfill = graphene_info.context.instance.get_backfill(backfill_id)
from_failure = ReexecutionStrategy(strategy) == ReexecutionStrategy.FROM_FAILURE
if not backfill:
    check.failed(f"No backfill found for id: {backfill_id}")
Contributor Author

might need to add a GrapheneBackfillNotFound output type. will look into

Contributor

If backfills can only be retried from the UI, hitting this seems unlikely?

Contributor Author

yeah, the other backfill actions have this too, and i figured it doesn't hurt to have here

Contributor Author

it'd be to make it more in line w the run re-execution types which have GrapheneRunNotFound as a potential return type. i dont think it's really that necessary here though

@jamiedemaria
Contributor Author

@prha @sryza @clairelin135 pinging for review for this one!

Contributor

@clairelin135 clairelin135 left a comment

Implementation looks good to me!

Seems like this implementation will retry canceled partitions and failed partitions. Not sure if this is expected behavior, maybe we should communicate this in the UI somehow?


@jamiedemaria
Contributor Author

Seems like this implementation will retry canceled partitions and failed partitions.

Yeah, the idea is that you can take a canceled or failed backfill and retry anything that didn't work the first time. One example of an issue that has come up is k8s pods being evicted and causing a run in a backfill to fail. We've had users request the ability to retry just the partitions that failed/didn't run so that they don't have to manually make a backfill that targets just those partitions themselves.
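
As a rough illustration of "retry anything that didn't work the first time", the selection could be thought of as below. The `PartitionStatus` enum and `partitions_to_retry` helper are hypothetical and only model the behavior described in this thread; they are not names from the PR.

```python
from enum import Enum
from typing import Dict, List


class PartitionStatus(Enum):
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    NOT_STARTED = "NOT_STARTED"


def partitions_to_retry(statuses: Dict[str, PartitionStatus]) -> List[str]:
    # Anything that did not succeed is retried, which covers failed runs
    # (e.g. an evicted k8s pod) as well as canceled or never-started ones.
    return sorted(
        key for key, status in statuses.items()
        if status is not PartitionStatus.SUCCESS
    )
```

Selecting on "not successful" rather than "failed" is what makes canceled partitions part of the retry, which is the behavior the review comment asks about.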

@jamiedemaria jamiedemaria force-pushed the jamie/backfill-retries branch from 3cd9f4a to 58456f5 Compare October 15, 2024 13:45
@jamiedemaria jamiedemaria merged commit 6fdd972 into master Oct 16, 2024
2 checks passed
@jamiedemaria jamiedemaria deleted the jamie/backfill-retries branch October 16, 2024 13:28