Build batches across phases in parallel. #17764
Conversation
Currently, invocations of `batch_and_prepare_binned_render_phase` and `batch_and_prepare_sorted_render_phase` can't run in parallel because they write to scene-global GPU buffers. After PR bevyengine#17698, `batch_and_prepare_binned_render_phase` started accounting for the lion's share of the CPU time, causing us to be strongly CPU bound on scenes like Caldera when occlusion culling was on (because of the overhead of batching for the Z-prepass). Although I eventually plan to optimize `batch_and_prepare_binned_render_phase`, we can obtain significant wins now by parallelizing that system across phases.

This commit splits all GPU buffers that `batch_and_prepare_binned_render_phase` and `batch_and_prepare_sorted_render_phase` touch into separate buffers for each phase so that the scheduler will run those phases in parallel. At the end of batch preparation, we gather the render phases up into a single resource with a new *collection* phase. Because we already run mesh preprocessing separately for each phase in order to make occlusion culling work, this is actually a cleaner separation. For example, mesh output indices (the unique ID that identifies each mesh instance on the GPU) are now guaranteed to be sequential, starting from 0, which will simplify the forthcoming work to remove them in favor of the compute dispatch ID.

On Caldera, this brings the frame time down to approximately 9.1 ms with occlusion culling on.
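For readers unfamiliar with why splitting the buffers enables parallelism, here is a minimal, self-contained sketch (not the PR's actual types: `OpaquePhaseBuffers`, `PrepassPhaseBuffers`, `CollectedBuffers`, and the system names are hypothetical stand-ins). Because the two prepare systems no longer share a mutable scene-global resource, Bevy's ECS scheduler is free to run them in parallel, and a later collection system merges the per-phase results into a single resource:

```rust
use bevy_ecs::prelude::*;

// Hypothetical stand-ins for per-phase GPU buffers; Vec<u32> replaces the
// real wgpu buffers so the sketch runs without a GPU.
#[derive(Resource, Default)]
struct OpaquePhaseBuffers(Vec<u32>);

#[derive(Resource, Default)]
struct PrepassPhaseBuffers(Vec<u32>);

#[derive(Resource, Default)]
struct CollectedBuffers(Vec<u32>);

// Each prepare system writes only to its own phase's buffers, so the two
// systems don't conflict and the scheduler may run them in parallel.
fn batch_and_prepare_opaque(mut buffers: ResMut<OpaquePhaseBuffers>) {
    buffers.0.push(1);
}

fn batch_and_prepare_prepass(mut buffers: ResMut<PrepassPhaseBuffers>) {
    buffers.0.push(2);
}

// The "collection" step: gathers the per-phase data into one resource after
// both prepare systems have finished.
fn collect_buffers(
    opaque: Res<OpaquePhaseBuffers>,
    prepass: Res<PrepassPhaseBuffers>,
    mut collected: ResMut<CollectedBuffers>,
) {
    collected.0.clear();
    collected.0.extend(&opaque.0);
    collected.0.extend(&prepass.0);
}

fn main() {
    let mut world = World::new();
    world.init_resource::<OpaquePhaseBuffers>();
    world.init_resource::<PrepassPhaseBuffers>();
    world.init_resource::<CollectedBuffers>();

    let mut schedule = Schedule::default();
    schedule.add_systems((
        batch_and_prepare_opaque,
        batch_and_prepare_prepass,
        collect_buffers
            .after(batch_and_prepare_opaque)
            .after(batch_and_prepare_prepass),
    ));
    schedule.run(&mut world);
}
```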
examples/3d/occlusion_culling.rs (outdated)

@@ -185,6 +190,10 @@ fn main() {
        .set(RenderPlugin {
            allow_copies_from_indirect_parameters: true,
            ..default()
        })
        .set(PbrPlugin {
            allow_copies_from_indirect_parameters: true,
Could this be reused from the RenderPlugin, rather than having to set it in two places?
Could you elaborate as to how that would work? The problem is that the PbrPlugin can't reach into the RenderPlugin to check its value.
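To make the "two places" concrete, here is roughly how the example's setup reads with the field from the diff above. Each plugin is configured independently inside the plugin group, so `PbrPlugin` has no way to read the value handed to `RenderPlugin`. (The surrounding `App` boilerplate is a simplified assumption, not the example's full code, and the field only exists in this outdated revision of the PR.)

```rust
use bevy::pbr::PbrPlugin;
use bevy::prelude::*;
use bevy::render::RenderPlugin;

fn main() {
    App::new()
        .add_plugins(
            DefaultPlugins
                // Each `.set` replaces one plugin's configuration in isolation;
                // nothing carries the flag from RenderPlugin over to PbrPlugin.
                .set(RenderPlugin {
                    allow_copies_from_indirect_parameters: true,
                    ..default()
                })
                .set(PbrPlugin {
                    allow_copies_from_indirect_parameters: true,
                    ..default()
                }),
        )
        .run();
}
```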
I think this is fine for now, but it's a rather particular option, and I might suggest subsuming it into some kind of broader "debug renderer" setting if we ever have a second instance of this kind of thing.
Yeah, good point. Or maybe a "debug flags"?
I went ahead and switched this to a RenderDebugFlags so that we can have more of them.
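For context, a minimal sketch of what a bitflags-based debug-flags type could look like; the flag constant and plugin field names beyond `RenderDebugFlags` itself are illustrative assumptions, not necessarily what the PR landed with:

```rust
use bitflags::bitflags;

bitflags! {
    /// Debug-only renderer toggles, grouped so that new ones can be added
    /// without growing each plugin's list of boolean fields.
    #[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
    pub struct RenderDebugFlags: u8 {
        /// Illustrative flag: allow copying indirect parameter buffers back
        /// from the GPU for inspection.
        const ALLOW_COPIES_FROM_INDIRECT_PARAMETERS = 1 << 0;
    }
}

// With a flags type, the example could pass the same value to both plugins,
// e.g. (the `debug_flags` field name is an assumption for illustration):
//     let flags = RenderDebugFlags::ALLOW_COPIES_FROM_INDIRECT_PARAMETERS;
//     .set(RenderPlugin { debug_flags: flags, ..default() })
//     .set(PbrPlugin { debug_flags: flags, ..default() })
```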
LGTM. Thanks for creating the debug flags; that's a lot cleaner. Being able to lean on the ECS scheduler for this is nice and clean.