Use explicitly added `ApplyDeferred` stages when computing automatically inserted sync points. #16782
Conversation
Force-pushed from 8bb5cab to 497ccae.
One thing I'm not fully certain of is whether this unlocks more parallelism. The naive answer is fewer sync points = more parallelism, but this depends on how the multi-threaded executor works. AFAIK, the systems are topologically sorted, and then when a system finishes, an executor thread wakes up, iterates through the list of unfinished systems, finds the first one whose dependencies ("before" systems) are satisfied, and runs it. I'm not certain how system accesses relate to this. Perhaps a system's dependencies must be satisfied AND its accesses must not conflict, otherwise it is skipped. This would imply that `ApplyDeferred` systems would always be "pushed back" as far as they can be, since `ApplyDeferred` obviously requires full world access. So even if we have two sync points, the prior systems would end up "pushing them" next to each other in many cases. This could lead to "starvation", though, so maybe that's not how it works? I need to dig in to how the executor works!
Good: this was part of the future work that we put off when merging the initial PR.
I ran the build_schedule benches and it seems about the same. The extra pass doesn't seem too expensive. I guess it was probably the extra topo sort that was expensive in the original PR.
@hymm How are you generating that benchmark? I tried running the benchmark locally, once with main and once with this PR and got the following:
Our delta seems to be about +7%. I'm on Windows, but I'd be surprised if there's such a substantial difference between OSes (that's my guess). Edit: To be clear, I'm doing:
Pretty much what you're doing, except using critcmp. I do run a few times to make sure I'm getting consistent results and that the diffs aren't in the noise of my CPU's perf. For these results I ran on my laptop; I changed to silent mode, since the normal mode was too noisy.
Ah after using critcmp I got similar results:
That's very surprising that there's such a big difference between the measuring techniques. I wonder why. Based on this chart though it seems like the runtime is basically unchanged and, more importantly, within variance (sometimes main wins and sometimes this PR wins). This PR technically should be slower, but I don't imagine by much at all.
After digging through the multi_threaded executor, I found this:
This somewhat confirms my suspicions. That said, I realized that we still unlock additional parallelism by having fewer sync points. I was initially under the impression that there would always be another system accessing the world, which would prevent exclusive systems (like sync points) from running until there were literally no non-exclusive systems left. However, if all the currently running systems finish at about the same time (which seems plausible if you have several "idle systems" - systems that iterate an empty query), that gives a chance for an exclusive system to run (and so fewer exclusive systems is better). So for once, the naive answer is correct, haha.
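The selection rule being guessed at above can be sketched as a toy model (all type and function names here are invented for illustration; the real logic lives in Bevy's multi-threaded executor): a system runs only when its dependencies are done and its access doesn't conflict, so an exclusive system such as `ApplyDeferred` keeps getting skipped while anything else is runnable.

```rust
// Toy model (invented types) of the executor's candidate selection: scan the
// topologically sorted systems and pick the first runnable one.
#[derive(Clone, Copy)]
enum Access {
    Shared,
    Exclusive, // e.g. ApplyDeferred, which needs full world access
}

struct SystemMeta {
    deps_remaining: usize, // unfinished "before" systems
    access: Access,
}

// Returns the index of the first system whose dependencies are satisfied and
// whose access does not conflict with the `running` count of active systems.
// An exclusive system can only start when nothing else is running, so it is
// naturally "pushed back" behind any runnable non-exclusive system.
fn pick_next(systems: &[Option<SystemMeta>], running: usize) -> Option<usize> {
    systems.iter().enumerate().find_map(|(i, meta)| {
        let meta = meta.as_ref()?;
        if meta.deps_remaining > 0 {
            return None;
        }
        match meta.access {
            Access::Shared => Some(i),
            Access::Exclusive if running == 0 => Some(i),
            Access::Exclusive => None,
        }
    })
}

fn main() {
    let systems = vec![
        Some(SystemMeta { deps_remaining: 0, access: Access::Exclusive }),
        Some(SystemMeta { deps_remaining: 0, access: Access::Shared }),
    ];
    // While another system is still running, the exclusive sync point is
    // skipped and the shared system (index 1) is picked instead.
    assert_eq!(pick_next(&systems, 1), Some(1));
    // Once nothing is running, the exclusive system (index 0) finally runs.
    assert_eq!(pick_next(&systems, 0), Some(0));
}
```

In this model two adjacent exclusive systems both wait for the running count to hit zero, which is why merging them into one sync point helps less than expected, but fewer exclusive systems still means fewer "everything must drain" moments.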
I rebased. It was not entirely trivial due to #11094 being merged, so many of the changes had to be moved to
whoops. Thought I had approved this already.
Just smaller nits but I like the state of it now.
```rust
fn system_has_conditions(graph: &ScheduleGraph, node: NodeId) -> bool {
    assert!(node.is_system());
```
What is the concern here motivating this assert, given that the other function has no corresponding assert?
The other one does this implicitly: `set_conditions_at` panics if the node is not a set. For systems, there's no corresponding function, but we can implement it ourselves pretty simply. We just need to make sure this only happens for systems. In practice, we will never call this with a set, but better to be safe than sorry.
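The assert-guarded check described here could be sketched like this (the `ScheduleGraph` and `NodeId` stand-ins below are invented stubs purely for illustration; the real types live in `bevy_ecs`):

```rust
use std::collections::HashMap;

// Invented stand-ins for bevy_ecs's ScheduleGraph / NodeId, just to show the
// shape of the check; these fields and methods are not the real API.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum NodeId {
    System(usize),
    Set(usize),
}

impl NodeId {
    fn is_system(self) -> bool {
        matches!(self, NodeId::System(_))
    }
}

struct ScheduleGraph {
    // Number of run conditions attached to each node.
    conditions: HashMap<NodeId, usize>,
}

// Guard with an assert so a set id can never silently slip through: unlike
// the set case, nothing here panics for us implicitly.
fn system_has_conditions(graph: &ScheduleGraph, node: NodeId) -> bool {
    assert!(node.is_system());
    graph.conditions.get(&node).copied().unwrap_or(0) > 0
}

fn main() {
    let mut conditions = HashMap::new();
    conditions.insert(NodeId::System(0), 2);
    let graph = ScheduleGraph { conditions };
    assert!(system_has_conditions(&graph, NodeId::System(0)));
    assert!(!system_has_conditions(&graph, NodeId::System(1)));
}
```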
```rust
sync_point_graph.add_edge(sync_point, target);

// Find any edges which have a different number of sync points between them and make sure
// there is a sync point between them.
for node in &topo {
```
You need to do this in a second iteration because until then you don't necessarily have all explicit sync points collected before adding new ones, right?
Not quite. The real problem is that we don't know the "distance" for the explicit sync node until it is "expanded" (the outer loop has it as `node`). Until then, we're not sure which distance the sync node applies to, since another edge may come along and "bump" it.
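The "bumping" being described can be shown with a heavily simplified sketch (names and shapes invented here, not the pass's actual code): a node's distance is the largest number of sync points on any path leading to it, so a target's distance is only final once every incoming edge has been visited.

```rust
use std::collections::HashMap;

type NodeId = usize;

// Simplified distance bookkeeping. Edges are `(from, to, needs_sync)`, where
// `needs_sync` marks an edge that requires a sync point between its endpoints.
fn compute_distances(topo: &[NodeId], edges: &[(NodeId, NodeId, bool)]) -> HashMap<NodeId, u32> {
    let mut distances: HashMap<NodeId, u32> = HashMap::new();
    for &node in topo {
        let d = *distances.entry(node).or_insert(0);
        for &(from, to, needs_sync) in edges {
            if from != node {
                continue;
            }
            // A later edge can "bump" the target's distance upward, which is
            // why an explicit sync node's distance is only known once the
            // outer loop reaches ("expands") it.
            let candidate = d + u32::from(needs_sync);
            let entry = distances.entry(to).or_insert(0);
            *entry = (*entry).max(candidate);
        }
    }
    distances
}

fn main() {
    // Nodes: 0 = A, 2 = C, 1 = B. Edge A->B needs no sync, but C->B does,
    // so B's distance is bumped from 0 to 1 when C is expanded.
    let topo = vec![0, 2, 1];
    let edges = vec![(0, 1, false), (2, 1, true)];
    let distances = compute_distances(&topo, &edges);
    assert_eq!(distances[&0], 0);
    assert_eq!(distances[&1], 1);
}
```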
```diff
@@ -80,41 +80,109 @@ impl ScheduleBuildPass for AutoInsertApplyDeferredPass {
     let mut sync_point_graph = dependency_flattened.clone();
     let topo = graph.topsort_graph(dependency_flattened, ReportCycles::Dependency)?;
```

```rust
fn set_has_conditions(graph: &ScheduleGraph, node: NodeId) -> bool {
```
This and the other two functions kind of bloat this up. It might be worth making them methods of `ScheduleGraph`, since that is the first argument for all of them. Though the third one and its `HashMap` parameter make it probably even less useful in any other (future) context.
I kinda don't want to do this. If in the future this is a desirable thing to know for other cases, then we should probably use some dynamic programming to figure out which sets/systems overall have conditions. The current approach is only "efficient" because we expect few `ApplyDeferred` stages in the first place, so querying for those in particular is probably cheaper.
Objective

If `A` and `B` needed a sync point, and `D` was an `ApplyDeferred`, an additional sync point would be generated between `A` and `B`. This can result in the following system ordering:

Where only `B` and `C` run in parallel. If we could reuse `D` as the sync point, we would get the following ordering:

Now we have two more opportunities for parallelism!
Solution

- Treat explicitly added `ApplyDeferred` nodes as creating a sync point.
- Collect the explicit `ApplyDeferred` node for each "sync point index" that we can (some required sync points may be missing!).
- When an automatic sync point would be inserted, check for an explicit `ApplyDeferred`. If one exists, use it as the sync point.

I believe this should also gracefully handle changes to the `ScheduleGraph`. Since automatically inserted sync points are inserted as systems, they won't look any different from explicit sync points, so they are also candidates for "reusing" sync points.

One thing this solution does not handle is "deduping" sync points. If you add 10 sync points explicitly, there will be at least 10 sync points. You could keep track of all the sync points at the same "distance" and then hack apart the graph to dedup those, but that could be a follow-up step (and it's more complicated, since you have to worry about transferring edges between nodes).
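The reuse step above can be sketched roughly as follows (a hypothetical, simplified model with invented names, not the PR's actual code): seed a map from sync-point "distance" to node with the explicitly added `ApplyDeferred` nodes, and only mint a new automatic sync point when no explicit one exists at that distance.

```rust
use std::collections::HashMap;

type NodeId = usize;

// Invented sketch of the reuse bookkeeping: before creating a fresh automatic
// sync point for a given "distance", check whether an explicit ApplyDeferred
// node was already recorded at that distance and reuse it.
struct SyncPoints {
    by_distance: HashMap<u32, NodeId>,
    next_auto_id: NodeId,
}

impl SyncPoints {
    // Seed the map with explicitly added ApplyDeferred nodes, keyed by the
    // sync-point "distance" they sit at.
    fn new(explicit: &[(u32, NodeId)], first_auto_id: NodeId) -> Self {
        SyncPoints {
            by_distance: explicit.iter().copied().collect(),
            next_auto_id: first_auto_id,
        }
    }

    // Reuse the sync point at `distance` if one exists; otherwise record a
    // new automatic one so later edges at the same distance share it too.
    fn get_or_insert(&mut self, distance: u32) -> NodeId {
        if let Some(&id) = self.by_distance.get(&distance) {
            return id;
        }
        let id = self.next_auto_id;
        self.next_auto_id += 1;
        self.by_distance.insert(distance, id);
        id
    }
}

fn main() {
    // One explicit ApplyDeferred (node 42) at distance 1; automatic nodes
    // start at a made-up id of 100.
    let mut sync_points = SyncPoints::new(&[(1, 42)], 100);
    assert_eq!(sync_points.get_or_insert(1), 42); // explicit node reused
    assert_eq!(sync_points.get_or_insert(2), 100); // new automatic node
    assert_eq!(sync_points.get_or_insert(2), 100); // and reused afterwards
}
```

Because automatic sync points land in the same map once created, they too get reused by later edges at the same distance, which matches the "they won't look any different to explicit sync points" observation above.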
Testing
Showcase
Bevy can now reuse explicitly added `ApplyDeferred` systems! Previously, Bevy would add new sync points between systems, ignoring the explicitly added sync points. This would reduce the parallelism of systems in some situations. Now, the parallelism has been improved!