Use explicitly added `ApplyDeferred` stages when computing automatically inserted sync points. #16782
Conversation
Force-pushed from 8bb5cab to 497ccae.
One thing I'm not fully certain of is whether this unlocks more parallelism. The naive answer is fewer sync points = more parallelism, but this depends on how the multi-threaded executor works. AFAIK, the systems are topologically sorted, and then when a system finishes, an executor thread wakes up, iterates through the list of unfinished systems, finds the first one whose dependencies ("before" systems) are satisfied, and runs it. I'm not certain how system accesses relate to this. Perhaps a system's dependencies must be satisfied AND its accesses must not conflict, otherwise it is skipped. This would imply that `ApplyDeferred` systems would always be "pushed back" as far as they can be, since `ApplyDeferred` obviously requires full world access. So even if we have two sync points, the prior systems would end up "pushing them" next to each other in many cases. This could lead to "starvation", though, so maybe that's not how it works? I need to dig in to how the executor works!
Good: this was part of the future work that we put off when merging the initial PR.
I ran the build_schedule benches and it seems about the same. The extra pass doesn't seem too expensive. I guess it was probably the extra topo sort that was expensive in the original PR.
@hymm How are you generating that benchmark? I tried running the benchmark locally, once with main and once with this PR and got the following:
Our delta seems to be about +7%. I'm on Windows, but I'd be surprised if there's such a substantial difference between OSes (that's my guess). Edit: To be clear, I'm doing:
Pretty much what you're doing, except using critcmp. I do run a few times to make sure I'm getting consistent results and that the diffs aren't in the noise of my CPU's perf. For these results I ran on my laptop; I changed to silent mode, since the normal mode was too noisy.
Ah after using critcmp I got similar results:
That's very surprising that there's such a big difference between the measuring techniques. I wonder why. Based on this chart though it seems like the runtime is basically unchanged and, more importantly, within variance (sometimes main wins and sometimes this PR wins). This PR technically should be slower, but I don't imagine by much at all.
After digging through the multi_threaded executor, I found this:
This somewhat confirms my suspicions. That said, I realized that we still unlock additional parallelism by having fewer sync points. I was initially under the impression that there would always be another system accessing the world, which would prevent exclusive systems (like sync points) from running until there were literally no non-exclusive systems left. However, if all the currently running systems finish at about the same time (which seems plausible if you have several "idle systems" - systems that iterate an empty query), that gives a chance for an exclusive system to run (and so fewer exclusive systems is better). So for once, the naive answer is correct, haha.
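The selection rule being guessed at above can be sketched as a toy model (all type and function names here are invented for illustration; the real logic lives in Bevy's multi-threaded executor): a system runs only when its dependencies are done and its access doesn't conflict, so an exclusive system such as `ApplyDeferred` keeps getting skipped while anything else is runnable.

```rust
// Toy model (invented types) of the executor's candidate selection: scan the
// topologically sorted systems and pick the first runnable one.
#[derive(Clone, Copy)]
enum Access {
    Shared,
    Exclusive, // e.g. ApplyDeferred, which needs full world access
}

struct SystemMeta {
    deps_remaining: usize, // unfinished "before" systems
    access: Access,
}

// Returns the index of the first system whose dependencies are satisfied and
// whose access does not conflict with the `running` count of active systems.
// An exclusive system can only start when nothing else is running, so it is
// naturally "pushed back" behind any runnable non-exclusive system.
fn pick_next(systems: &[Option<SystemMeta>], running: usize) -> Option<usize> {
    systems.iter().enumerate().find_map(|(i, meta)| {
        let meta = meta.as_ref()?;
        if meta.deps_remaining > 0 {
            return None;
        }
        match meta.access {
            Access::Shared => Some(i),
            Access::Exclusive if running == 0 => Some(i),
            Access::Exclusive => None,
        }
    })
}

fn main() {
    let systems = vec![
        Some(SystemMeta { deps_remaining: 0, access: Access::Exclusive }),
        Some(SystemMeta { deps_remaining: 0, access: Access::Shared }),
    ];
    // While another system is still running, the exclusive sync point is
    // skipped and the shared system (index 1) is picked instead.
    assert_eq!(pick_next(&systems, 1), Some(1));
    // Once nothing is running, the exclusive system (index 0) finally runs.
    assert_eq!(pick_next(&systems, 0), Some(0));
}
```

In this model two adjacent exclusive systems both wait for the running count to hit zero, which is why merging them into one sync point helps less than expected, but fewer exclusive systems still means fewer "everything must drain" moments.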
I rebased. It was not entirely trivial due to #11094 being merged, so many of the changes had to be moved to
whoops. Thought I had approved this already.
Just smaller nits but I like the state of it now.
```rust
fn system_has_conditions(graph: &ScheduleGraph, node: NodeId) -> bool {
    assert!(node.is_system());
```
What is the concern here motivating this assert, given that the other function has no corresponding assert?
The other one does this implicitly: `set_conditions_at` panics if the node is not a set. For systems, there's no corresponding function, but we can implement it ourselves pretty simply. We just need to make sure this only happens for systems. In practice, we will never call this with a set, but better to be safe than sorry.
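The assert-guarded check described here could be sketched like this (the `ScheduleGraph` and `NodeId` stand-ins below are invented stubs purely for illustration; the real types live in `bevy_ecs`):

```rust
use std::collections::HashMap;

// Invented stand-ins for bevy_ecs's ScheduleGraph / NodeId, just to show the
// shape of the check; these fields and methods are not the real API.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum NodeId {
    System(usize),
    Set(usize),
}

impl NodeId {
    fn is_system(self) -> bool {
        matches!(self, NodeId::System(_))
    }
}

struct ScheduleGraph {
    // Number of run conditions attached to each node.
    conditions: HashMap<NodeId, usize>,
}

// Guard with an assert so a set id can never silently slip through: unlike
// the set case, nothing here panics for us implicitly.
fn system_has_conditions(graph: &ScheduleGraph, node: NodeId) -> bool {
    assert!(node.is_system());
    graph.conditions.get(&node).copied().unwrap_or(0) > 0
}

fn main() {
    let mut conditions = HashMap::new();
    conditions.insert(NodeId::System(0), 2);
    let graph = ScheduleGraph { conditions };
    assert!(system_has_conditions(&graph, NodeId::System(0)));
    assert!(!system_has_conditions(&graph, NodeId::System(1)));
}
```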
```rust
sync_point_graph.add_edge(sync_point, target);

// Find any edges which have a different number of sync points between them and make sure
// there is a sync point between them.
for node in &topo {
```
You need to do this in a second iteration because until then you don't necessarily have all explicit sync points collected before adding new ones, right?
Not quite. The real problem is that we don't know the "distance" for the explicit sync node until it is "expanded" (the outer loop has it as `node`). Until then, we're not sure which distance the sync node applies to, since another edge may come along and "bump" it.
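The "bumping" being described can be shown with a heavily simplified sketch (names and shapes invented here, not the pass's actual code): a node's distance is the largest number of sync points on any path leading to it, so a target's distance is only final once every incoming edge has been visited.

```rust
use std::collections::HashMap;

type NodeId = usize;

// Simplified distance bookkeeping. Edges are `(from, to, needs_sync)`, where
// `needs_sync` marks an edge that requires a sync point between its endpoints.
fn compute_distances(topo: &[NodeId], edges: &[(NodeId, NodeId, bool)]) -> HashMap<NodeId, u32> {
    let mut distances: HashMap<NodeId, u32> = HashMap::new();
    for &node in topo {
        let d = *distances.entry(node).or_insert(0);
        for &(from, to, needs_sync) in edges {
            if from != node {
                continue;
            }
            // A later edge can "bump" the target's distance upward, which is
            // why an explicit sync node's distance is only known once the
            // outer loop reaches ("expands") it.
            let candidate = d + u32::from(needs_sync);
            let entry = distances.entry(to).or_insert(0);
            *entry = (*entry).max(candidate);
        }
    }
    distances
}

fn main() {
    // Nodes: 0 = A, 2 = C, 1 = B. Edge A->B needs no sync, but C->B does,
    // so B's distance is bumped from 0 to 1 when C is expanded.
    let topo = vec![0, 2, 1];
    let edges = vec![(0, 1, false), (2, 1, true)];
    let distances = compute_distances(&topo, &edges);
    assert_eq!(distances[&0], 0);
    assert_eq!(distances[&1], 1);
}
```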
```diff
@@ -80,41 +80,109 @@ impl ScheduleBuildPass for AutoInsertApplyDeferredPass {
     let mut sync_point_graph = dependency_flattened.clone();
     let topo = graph.topsort_graph(dependency_flattened, ReportCycles::Dependency)?;
```

```rust
fn set_has_conditions(graph: &ScheduleGraph, node: NodeId) -> bool {
```
This and the other two functions kind of bloat this up. It might be worth making them methods of `ScheduleGraph`, since that is the first argument for all of them. Though the third one and its `HashMap` parameter make it probably even less useful in any other (future) context.
I kinda don't want to do this. If in the future this is a desirable thing to know for other cases, then we should probably use some dynamic programming to figure out which sets/systems overall have conditions. The current approach is only "efficient" because we expect few `ApplyDeferred` stages in the first place, so querying for those in particular is probably cheaper.
Objective

If `A` and `B` needed a sync point, and `D` was an `ApplyDeferred`, an additional sync point would be generated between `A` and `B`. This can result in the following system ordering:

Where only `B` and `C` run in parallel. If we could reuse `D` as the sync point, we would get the following ordering:

Now we have two more opportunities for parallelism!
Solution

- Treat explicitly added `ApplyDeferred` nodes as creating a sync point.
- Collect the explicit `ApplyDeferred` node for each "sync point index" that we can (some required sync points may be missing!).
- When an automatic sync point would be inserted, check for an explicit `ApplyDeferred`. If one exists, use it as the sync point.

I believe this should also gracefully handle changes to the `ScheduleGraph`. Since automatically inserted sync points are inserted as systems, they won't look any different from explicit sync points, so they are also candidates for "reusing" sync points.

One thing this solution does not handle is "deduping" sync points. If you add 10 sync points explicitly, there will be at least 10 sync points. You could keep track of all the sync points at the same "distance" and then hack apart the graph to dedup those, but that could be a follow-up step (and it's more complicated, since you have to worry about transferring edges between nodes).
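The reuse step above can be sketched roughly as follows (a hypothetical, simplified model with invented names, not the PR's actual code): seed a map from sync-point "distance" to node with the explicitly added `ApplyDeferred` nodes, and only mint a new automatic sync point when no explicit one exists at that distance.

```rust
use std::collections::HashMap;

type NodeId = usize;

// Invented sketch of the reuse bookkeeping: before creating a fresh automatic
// sync point for a given "distance", check whether an explicit ApplyDeferred
// node was already recorded at that distance and reuse it.
struct SyncPoints {
    by_distance: HashMap<u32, NodeId>,
    next_auto_id: NodeId,
}

impl SyncPoints {
    // Seed the map with explicitly added ApplyDeferred nodes, keyed by the
    // sync-point "distance" they sit at.
    fn new(explicit: &[(u32, NodeId)], first_auto_id: NodeId) -> Self {
        SyncPoints {
            by_distance: explicit.iter().copied().collect(),
            next_auto_id: first_auto_id,
        }
    }

    // Reuse the sync point at `distance` if one exists; otherwise record a
    // new automatic one so later edges at the same distance share it too.
    fn get_or_insert(&mut self, distance: u32) -> NodeId {
        if let Some(&id) = self.by_distance.get(&distance) {
            return id;
        }
        let id = self.next_auto_id;
        self.next_auto_id += 1;
        self.by_distance.insert(distance, id);
        id
    }
}

fn main() {
    // One explicit ApplyDeferred (node 42) at distance 1; automatic nodes
    // start at a made-up id of 100.
    let mut sync_points = SyncPoints::new(&[(1, 42)], 100);
    assert_eq!(sync_points.get_or_insert(1), 42); // explicit node reused
    assert_eq!(sync_points.get_or_insert(2), 100); // new automatic node
    assert_eq!(sync_points.get_or_insert(2), 100); // and reused afterwards
}
```

Because automatic sync points land in the same map once created, they too get reused by later edges at the same distance, which matches the "they won't look any different to explicit sync points" observation above.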
Testing
Showcase
Bevy can now reuse explicitly added `ApplyDeferred` systems! Previously, Bevy would add new sync points between systems, ignoring the explicitly added sync points. This would reduce the parallelism of systems in some situations. Now, the parallelism has been improved!