Locking mechanism to prevent two or more workflows/tasks running in parallel #3754
Replies: 9 comments 4 replies
-
@edvardm thank you for the request, do you think this would work: https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/task_cache_serialize.html
-
That would not work, unfortunately, for two reasons. One is that I would not want to queue a task until another is finished, but rather prevent execution of / skip the just-started task if one is already running. The other is that using the cache would just waste resources here, as I'd pretty much never get hits. So the suggestion you made in #267 (comment) is spot on.
-
I think this is great! It could be really powerful. To add a little more context: the cache serialize work above rests on the premise that each task execution is uniquely identified by project / domain / task id / input values / cache version. Concurrent executions are serialized using this cache key, and it works great: only a single instance of a cached task will run at a time, and the others reuse the cached results rather than computing them separately. This scheme may or may not be extensible to this use-case, depending on scope.
A previous proposal (which I can't seem to find) discussed applying generalized serialization to tasks. IMO scope is the largest unknown here. If we just want to add a simple serialize behavior flag at the task level, that would be pretty simple. However, that solution is not nearly as ambitious as this proposal. I would certainly support an RFC to explore this in more depth and would be very happy to help with implementation.
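For reference, cache serialization is exposed in flytekit through the `cache_serialize` task option; a minimal sketch (the task body and `cache_version` value are just placeholders):
```python
from flytekit import task

# With cache=True and cache_serialize=True, identical executions of this task
# (same project / domain / task id / inputs / cache_version) are serialized:
# one instance runs, the rest wait and then reuse the cached result.
@task(cache=True, cache_serialize=True, cache_version="1.0")
def expensive_transform(x: int) -> int:
    return x * 2
```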
-
I think most folks want serial execution of scheduled launch plans. That is a much simpler and better-scoped problem to solve than global serialization.
-
Hey!
Use-case 1 (parallel tasks): We have a workflow with a Spark task that writes data to an S3 bucket. The data is appended / dynamically overwritten.
Use-case 2 (concurrent workflows): We have a workflow that contains multiple long-running tasks. The workflow has external dependencies that can arrive with delays, so it is scheduled to run roughly every 20 minutes. Several tasks have different upstream dependencies. For instance, data for Task 1 can arrive earlier than data for Task 2; this means that as soon as data for Task 1 is ready and the schedule kicks in, Task 1 will start executing while Task 2 waits for its input.
-
@edvardm how do you feel about starting an RFC from this discussion?
-
@edvardm from last week's Contributors' meetup: this idea is still considered a good fit for an RFC. You could either work on the proposal yourself, nominate someone else, or let us know if you still want to keep this entry open. Thanks for your support so far :)
-
We had a similar request and recently implemented this through custom agents (LockingAgent and UnlockingAgent). The lock is tracked in our own database, but we had thought about either using the Flyte admin database with a new schema, or etcd. The current implementation simply exits if it can't acquire the lock. We are also looking at another implementation that would act more like a Sensor and wait for its turn in the queue. My point in bringing this up is that maybe agents are a way to move forward with this. I'd love to discuss this more with folks if this is still a desired feature.
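To make the agent idea concrete, here is a minimal sketch of the lock bookkeeping such a LockingAgent might do, assuming a Postgres table `workflow_locks` with a UNIQUE constraint on `lock_name` (the table, the function names, and the exit-on-failure behavior are all assumptions, not the actual implementation):
```python
import psycopg2

def acquire_lock(conn, lock_name: str) -> bool:
    """Try to claim the named lock; False means another holder has it."""
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO workflow_locks (lock_name) VALUES (%s)",
                (lock_name,),
            )
        conn.commit()
        return True
    except psycopg2.errors.UniqueViolation:
        conn.rollback()  # row already exists: someone else holds the lock
        return False

def release_lock(conn, lock_name: str) -> None:
    """Drop the lock row so the next execution can claim it."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM workflow_locks WHERE lock_name = %s", (lock_name,))
    conn.commit()
```
The LockingAgent would call something like `acquire_lock` and exit if it returns False; the UnlockingAgent would call `release_lock` at the end of the workflow.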
-
10/24/2024 Contributors' sync notes: no active work on this; it still needs an owner to write and shepherd the RFC through the process.
-
Draft for a more elaborate RFC.
There are multiple cases where it is not desirable to have multiple tasks and/or workflows running at the same time. One way to cover all such cases would be to use named, distributed locks, so that only a single process anywhere could hold a given lock.*
An existing related issue, which is very common in ETL pipelines, is well described in #267.
From a user's perspective, I wish I had something along these lines:
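(The sketch below is purely hypothetical; nothing like `with_lock` exists in flytekit today. It is plain Python over a single Redis instance, just to illustrate the desired skip-if-held behavior; the host, key prefix, and TTL are made up.)
```python
import functools
import redis

r = redis.Redis(host="localhost", port=6379)

def with_lock(name: str, ttl_seconds: int = 1800):
    """Skip the wrapped call entirely if another process holds the named lock."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # SET NX succeeds only if nobody holds the lock; the TTL ensures
            # a crashed holder cannot block everyone else forever.
            if not r.set(f"lock:{name}", "1", nx=True, ex=ttl_seconds):
                return None  # already running elsewhere: skip, don't queue
            try:
                return fn(*args, **kwargs)
            finally:
                r.delete(f"lock:{name}")
        return wrapper
    return decorator

@with_lock("my-etl-pipeline")
def run_pipeline() -> None:
    print("doing the work")
```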
I'm not yet sure how global it could be; maybe the lock identifier could be prefixed with the project name by default. Currently I'm intending to resolve this by using Redlock as the distributed lock, likely combined with conditionals in the workflow.
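A rough sketch of the conditional part, using flytekit's real `conditional` construct and a plain single-instance Redis `SET NX` as a stand-in for Redlock (the Redis host, key name, and TTL are assumptions):
```python
import redis
from flytekit import conditional, task, workflow

@task
def try_acquire_lock() -> bool:
    # Stand-in for Redlock: one Redis node, SET NX with a TTL so a crashed
    # execution cannot hold the lock indefinitely.
    r = redis.Redis(host="localhost", port=6379)
    return bool(r.set("lock:my-etl-pipeline", "1", nx=True, px=30 * 60 * 1000))

@task
def run_etl() -> str:
    return "ran"

@task
def skip() -> str:
    return "skipped: another execution holds the lock"

@workflow
def locked_etl() -> str:
    acquired = try_acquire_lock()
    return (
        conditional("lock_gate")
        .if_(acquired.is_true())
        .then(run_etl())
        .else_()
        .then(skip())
    )
```
A real version would also need an explicit unlocking step (or rely on the TTL expiring), which is exactly where the LockingAgent/UnlockingAgent idea above would fit.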
*) Or, even neater, something like semaphores, so that k instances could run at a time; but simple distributed mutexes would be good enough.