Replies: 52 comments
-
Also, you don't have to hardcode these in your program, you can use
-
One issue with this is that all of this requires rebuilding meshlet data, something that would ideally be done offline. It looks as if on AMD hardware specifically the shader is export-bound at the moment; I've looked a little at adding per-triangle culling and it does help performance significantly. Without per-triangle culling I don't seem to get any benefit from moving to the same number of max primitives as max vertices; with it, however, I do get better throughput with 64 max primitives, but that's a little too low. I'll test different configurations when I get time. Thanks for the suggestion!
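For reference, one common way to implement per-triangle culling in an EXT_mesh_shader mesh shader looks roughly like this - a sketch only, not the shader from this repo; MESH_MAXVTX, transformVertex and fetchTriangle are placeholder names:

```glsl
// Stage screen-space positions in shared memory (LDS), then mark back-facing
// triangles as culled via the per-primitive built-in gl_CullPrimitiveEXT.
shared vec2 screenXY[MESH_MAXVTX];

void emitWithPerTriangleCulling(uint vertexCount, uint primitiveCount)
{
    uint ti = gl_LocalInvocationID.x;

    if (ti < vertexCount)
    {
        vec4 clip = transformVertex(ti);          // placeholder: fetch + transform vertex ti
        screenXY[ti] = clip.xy / clip.w;
        gl_MeshVerticesEXT[ti].gl_Position = clip;
    }

    barrier();                                    // make screenXY visible to the primitive phase

    if (ti < primitiveCount)
    {
        uvec3 tri = fetchTriangle(ti);            // placeholder: read the meshlet index triple
        vec2 ab = screenXY[tri.y] - screenXY[tri.x];
        vec2 ac = screenXY[tri.z] - screenXY[tri.x];

        gl_PrimitiveTriangleIndicesEXT[ti] = tri;
        // Non-positive signed area => back-facing or degenerate (for CCW front faces).
        gl_MeshPrimitivesEXT[ti].gl_CullPrimitiveEXT = (ab.x * ac.y - ab.y * ac.x) <= 0.0;
    }
}
```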
-
I'm also wondering what happens on NV specifically with workgroups of size 64 vs 32 - and whether the shader that the driver runs is substantially different performance-wise from a shader that uses workgroup size 32 but has to process two vertices per invocation (which is what the shader I used for the NV extension did, but there it wasn't possible to test wider groups because the NV extension requires a workgroup size of 32, if I'm not mistaken).
-
No, it doesn't require rebuilding meshlet data. A workable compromise is to use a meshlet size of 64 (max 64 vertices and max 64 primitives). In this case, on NVidia you would output 1 meshlet per workgroup, and on AMD you could output 2 meshlets per workgroup. I personally haven't tested this, but it would be interesting to compare how different configs perform.
-
To your NVidia question: this is explained in one of their mesh shader blogs. As far as I understand, NVidia's problem is that it doesn't have proper workgroups, so the whole mesh shader workgroup is executed in a single warp and the workgroup is emulated using a loop. Therefore, you can get closer to what NVidia hardware actually runs if you use a workgroup size that matches their warp size.
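A purely conceptual way to picture that (illustrative only, not actual driver output):

```glsl
// Conceptual model: a local_size_x = 64 mesh shader runs on a single 32-wide warp,
// which loops over the "virtual" invocations of the workgroup.
void emulatedWorkgroupBody(uint warpLane)        // warpLane = 0..31
{
    for (uint pass = 0u; pass < 2u; ++pass)      // 64 / 32 = 2 passes
    {
        uint virtualInvocation = pass * 32u + warpLane;
        // ... original shader body, with gl_LocalInvocationID.x meaning virtualInvocation ...
    }
}
```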
-
What I meant is that varying the max sizes between vendors requires different meshlet data; just varying workgroup configurations of course doesn't. 64 & 64 is a little problematic depending on the mesh topology - I'd expect the 64-vertex limit to lead to an effective primitive count between 64 and 98 (98 corresponds to an 8x8 vertex grid: 7x7 quads, i.e. 98 triangles). Setting a primitive count limit of 64 limits you to something like 45 vertices per meshlet for smooth meshes, so you end up underutilizing the threads for vertex transformation. One other alternative is something like 128 vertices and 192 primitives, which is more balanced wrt the ratio, but still problematic because now it means we need to write all vertex data to LDS :)
Right, but a workgroup of 64 would be compiled into two sequential passes of 32 invocations each, vs a shader that uses more or less the same loop if it needs to process a meshlet with >32 vertices/primitives. I understand that using a workgroup of 64 doesn't match the hardware perfectly, but the question is where the resulting inefficiencies come from.
-
Yes, the trick is to find a meshlet size which works fine on both vendors, and then use the same meshlet size but with a slightly different workgroup config.
I think it's worth experimenting with a meshlet size of max vertices = 128, max primitives = 128, and then using a 128-sized workgroup on AMD and 32 (or 64?) on NVidia.
Unfortunately I don't know any more details beyond what I said above, only that this is their recommendation.
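For illustration, the per-vendor configuration could be expressed with compile-time defines along these lines (a sketch; MESH_MAXVTX/MESH_MAXPRIM are made-up names in the spirit of the existing MESH_WGSIZE, and the values are just the ones from this comment):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

// Passed as -D defines by the build: e.g. MESH_WGSIZE=128 on AMD, 32 (or 64) on NVidia,
// with the same 128/128 meshlet limits on both.
#ifndef MESH_WGSIZE
#define MESH_WGSIZE 128
#endif
#ifndef MESH_MAXVTX
#define MESH_MAXVTX 128
#endif
#ifndef MESH_MAXPRIM
#define MESH_MAXPRIM 128
#endif

layout(local_size_x = MESH_WGSIZE) in;
layout(triangles, max_vertices = MESH_MAXVTX, max_primitives = MESH_MAXPRIM) out;

void main()
{
    // Real shader body goes here; emitting zero outputs keeps this skeleton valid.
    SetMeshOutputsEXT(0, 0);
}
```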
-
One more thought about this. If you definitely don't want to increase the number of max output vertices but you want to use max 128 output primitives, it is still worth it (on AMD) to increase the workgroup size to 128 and make your primitive processing more parallel than it currently is.
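Concretely, the primitive phase can be written as a loop that strides by the workgroup size (a sketch; primitiveCount and fetchTriangle are placeholders), so that with a 128-wide workgroup and 128 max primitives each invocation handles at most one primitive, while a 64-wide workgroup would handle two per invocation:

```glsl
for (uint i = gl_LocalInvocationID.x; i < primitiveCount; i += MESH_WGSIZE)
    gl_PrimitiveTriangleIndicesEXT[i] = fetchTriangle(i);   // fetchTriangle: placeholder helper
```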
-
Can you elaborate on why on AMD there's a benefit to going above 64? It's not intuitively obvious that this should help, as 64 (and sometimes 32) is the HW wavefront size.
-
On RDNA2, each invocation can only really create at most 1 vertex and 1 primitive. Any other kind of access pattern is emulated by the driver. This also implies that the driver may need to launch more invocations than your specified workgroup size in order to fit a larger output. If you have a workgroup size of 64 but a max primitive count of 126, then the "real" workgroup size will be 126 (this fits in 2 waves, which have 128 invocations).
So, in fact, there are 128 invocations running but you don't utilize all of them. It is more efficient to write your code in a manner that utilizes all invocations instead of letting them sit there doing nothing most of the time. I try to explain this in my blog post "How mesh shaders are implemented in an AMD driver".
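Put as arithmetic, the model described above amounts to something like this (illustrative only, not driver code):

```glsl
// Launched invocations ~= max(workgroup size, max_vertices, max_primitives),
// rounded up to a whole number of waves.
const uint WAVE_SIZE   = 64u;
const uint WG_SIZE     = 64u;
const uint MAX_VERTS   = 64u;
const uint MAX_PRIMS   = 126u;
const uint NEEDED      = max(WG_SIZE, max(MAX_VERTS, MAX_PRIMS));             // 126
const uint HW_LAUNCHED = ((NEEDED + WAVE_SIZE - 1u) / WAVE_SIZE) * WAVE_SIZE; // 128 (2 waves)
```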
-
Ah, that explains a lot! It's indeed substantially different compared to the NV model. I didn't realize that the restriction on emission also applies to primitives - I thought it was just the vertices.
-
It seems that a few others also struggle to understand this; e.g. GravityMark has the same problem. So I think I explained it poorly... Can you suggest a good way to edit my blog post to clarify this?
-
By the way, at least in radv it looks like mesh shaders are always compiled with wave size 64. Do you know if this is a hardware restriction or a driver limitation? I can't currently test any other AMD drivers with mesh shading support... The reason I ask is that I was hoping for something like max_vertices=64 max_triangles=96 to work reasonably well with wave32, but it looks like this is inefficient, as it effectively uses the same wave configuration as max_vertices=64 max_triangles=124.
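Spelling out the wave math behind that observation (illustrative arithmetic only):

```glsl
// With Wave64, both 96 and 124 max primitives round up to 2 waves (128 lanes),
// so the two configs launch the same amount of hardware work; Wave32 would need
// only 3 waves (96 lanes) for the 96-primitive case.
const uint WAVES64_FOR_96  = (96u  + 63u) / 64u; // 2 waves = 128 invocations
const uint WAVES64_FOR_124 = (124u + 63u) / 64u; // 2 waves = 128 invocations
const uint WAVES32_FOR_96  = (96u  + 31u) / 32u; // 3 waves =  96 invocations
const uint WAVES32_FOR_124 = (124u + 31u) / 32u; // 4 waves = 128 invocations
```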
-
Also, based on GPUOpen-Drivers/llpc@772eef3, my understanding is that on GFX11 (RDNA3) row export would allow emitting more than one vertex or primitive per thread, which would be great as it would provide the much-needed flexibility for balancing performance. Not sure if GFX11 has other relevant changes for mesh shading.
-
It's just the default in our driver. You can use the
Worth a try. Yes it would be inefficient in Wave64 mode. Maybe we should add special casing for 32 and 96.
This is correct, but I haven't implemented that in RADV yet. (I am on vacation this week and will get back to work next week.) However, it will still need some shuffling between SIMD lanes.
Yes, it also has a new "fast launch" mode, which will eliminate the need for launching shader invocations that "do nothing".
-
I'm not sure - it's complicated. The discrepancy (between Windows & Linux/radv) exists according to both the overall frame rate and the individual GPU timers as captured in real time. This is with vsync off, so in theory I'd expect similar clocks when just running the app (there's little idle time), but I would need to verify this. When capturing with RGP on Windows, I see nominal performance before the capture to be the same - higher than radv - but when I capture, the resulting capture has a longer duration that more or less matches what radv captures. radv captures also show a longer capture duration, but it's less severe (e.g. radv is 5ms frame time with no capture, 5.2ms in an RGP capture; Windows is 4.5ms frame time with no capture, 5.2ms in an RGP capture).

Additionally, when using cluster culling with the aforementioned second pass experiencing long delays due to the task queue bottleneck, the RGP profiles just look very different between radv and the Windows driver - the total number of vkCmdDispatch that radv shows is a close multiple of 1024 (the queue size), whereas I am dispatching fewer mesh workgroups; the Windows RGP capture shows a more reasonable count with a much shorter duration, but there's still a gap at the end where "nothing" happens on the timeline, so I'm not sure whether this is a capture artifact or not.

Consequently, there are a lot of variables here, and to test any theory on both OSes I have to reboot back and forth, which is time consuming, so it would be better to compare radv vs amdvlk - but for that I need an amdvlk binary that just works, so I'll wait :)
-
Understood. Regardless of what RGP says, it is nice to see that the two drivers at least perform in the same ballpark.
-
FWIW, the number of draw entries is 1024, so theoretically 1024 task shader workgroups could be in flight on the GPU at any given time. Considering that the 7900 GRE has 80 CUs (note, the top Navi 31 has up to 96 CUs), each CU has 2 SIMDs, and each SIMD can have 16 waves in flight, the 7900 GRE can have 80 x 2 x 16 = 2560 waves in flight at a time (this is 96 x 2 x 16 = 3072 on the top Navi 31). Now, the task shaders here are just 1 Wave64 wave per workgroup, meaning we should be able to have at least 40% occupancy on your GPU with just task shaders (this would be 33% on the top Navi 31), assuming each task workgroup launches 0 mesh shaders - but in reality the occupancy seems to be much lower. Theoretically it should be enough to increase the number of entries to 4096 (the closest power of two after 3072) to get full occupancy, but that doesn't seem to help either. So, I suspect that either the issue is due to the firmware, or the driver does something really stupid that nobody has noticed.
-
I agree that this doesn't fully make sense. I originally expected that this is just an issue with CP throughput, but thinking about this further, unless the CP has some sort of parallel draw call rejection feature, increasing the queue size would not by itself allow the "empty" dispatch pass to accelerate more than 2x (by allowing the CP & task shaders to run in parallel). And if the CP is not the bottleneck, then it's unclear why the queue needs to be increased so much to see good gains, and why modest increases in queue size don't seem to help much.

Also, I was discussing this with a dev who had a DX12 playground with very different code, and they observed similar timings for empty dispatches with a lighter-weight task shader, so that also seems to suggest a CP bottleneck - which again makes it a little odd that the gains are so dramatic. But I could imagine some sort of horrible synchronization ping-ponging where both the CP and the task shaders are constantly stalled - the CP waiting for items to be written ahead of the current read pointer, and the task shaders waiting for available space in the ring buffer - which magnifies the slowdown well past 2x and requires the read and write regions of the ring buffer to be far apart to isolate them from conflict. It's also possible this is some sort of driver issue, but because the synchronization is entirely on the firmware side, and because this is present in all drivers (radv, amdvlk/linux, amdvlk/windows - although the latter two are the same source code of course), I'd think this is either some inefficiency in the firmware, or a fundamental design defect of the hardware.

FWIW I've hit the same type of problem - a bottleneck from empty draws in CP processing - on RDNA2 in this code base, when I was using multi draw indirect and had multiple draw calls; this was fixed for RDNA2 by switching to a single draw call (commit 463063c, YT stream https://www.youtube.com/live/eYvGruGHhUE). I went back to this commit, which allowed switching between the two modes, and this is still super valuable on RDNA3 - but whereas on the GPU I had at the time (6700 XT) doing this alone basically lifted the CP bottleneck, on the 7900 GRE it is very beneficial but not enough to eliminate the CP bottleneck. I'm not sure if this is because the firmware was different, or because the delta between shader processing capacity and CP processing capacity (or something else) changed.
-
Oh, one other thing that complicates correct comparison here that I think I should note: enabling an RGP trace in the 32K queue size configuration slows things back down. Specifically, when I use cluster culling with the slow second pass and the default 1024 queue size, I get the following timings (sorry that these are different from before; I've been optimizing some unrelated bits and pieces, so treat these in isolation from previous timings):

gpu render early 4.6ms

When I use queue size 32K, I get the following:

gpu render early 1.2ms

I am running these in a CPU-GPU synced configuration (the CPU waits for GPU frame completion), so I am inclined to trust the results, because the CPU timing can not be spoofed even if GPU counters somehow measure the wrong thing. However, when I enable MESA_VK_TRACE=rgp, I get this with 32K entries (the numbers with 1024 entries are in the same ballpark as without rgp):

gpu render early 5.2ms

This makes it especially difficult to correctly measure all of this, which is part of the "it's complicated" I referred to: I do not trust RGP captures not to skew the numbers to the point of not being useful with this code... The RGP capture for the 32K queue does show a huge gap in the second pass where seemingly nothing happens - maybe this is the actual capture being done by the firmware (?) - compared to the 1K queue capture, where the gap exists but is much smaller. But note that in the 32K capture the duration of the first render pass is similar to the 1K capture, whereas in practice (without an RGP capture) it's way faster. This is observed even in a setup where I disable everything I can disable via RADV_THREAD_TRACE_QUEUE_EVENTS=false RADV_THREAD_TRACE_INSTRUCTION_TIMING=false RADV_THREAD_TRACE_CACHE_COUNTERS=false.
-
I think what you say makes sense - that the bottleneck is the CP firmware - though I personally have no insight into what exactly is happening in there. I wouldn't be surprised, since it seems that both of these GPUs are bottlenecked by the CP in general, and the 7900 even more so. Based on what you posted, it looks like taking an RGP trace will by itself slow everything down to such an extent that we are essentially looking at a different thing entirely. If you have time, you could try setting
Sorry I haven't been able to try it myself yet; I just recently got back from XDC 2024. I would like to test this on my 7900 XTX too. Can you please give me a quick walkthrough of what I need to do to run your test cases?
-
Yeah, this basically eliminates most of the tracing overhead AFAICT; this is with that marker set to 0 and with 1024 queue items: [RGP capture] ... and this is with 32768 queue items (both captures are with cluster culling enabled): [RGP capture]. The overall frame times barely change under RGP in that setup. It would be nice to have this as a tracing option maybe, so that it could be disabled for testing draw-heavy workloads...

No worries at all, this is extremely not urgent :) If you'd like to reproduce any of the results yourself, it should be sufficient to:

git clone --recursive https://github.com/zeux/niagara
cd niagara
git checkout 745700cda87bcd268493c06b917debd611620d98
cmake . -DCMAKE_BUILD_TYPE=Release
make -j8
./niagara data/kitten.obj

For the build to work, you would need, at the minimum, Vulkan-Headers and glslang in the path; I use the Vulkan SDK for this, but IIRC installing these two separately, at least on Ubuntu, also works.

By default, the rendering pipeline uses mesh shaders, with task shaders that cull clusters based on frustum/backface but do not do per-cluster occlusion culling, which is what causes the second pass to reprocess all meshes while outputting zero mesh shading workgroups. You can toggle cluster occlusion culling by pressing K. The other useful key is M, which disables the mesh/task pipeline altogether (as well as all forms of cluster culling!) and just uses traditional raster. Because the CP bottleneck affects the first pass as well, without a large ring buffer capacity the traditional path is much faster than the mesh shading path, at least on the 7900 GRE; with the patched driver and a 32768 ring size, I get similar timings.

The title bar shows the total CPU latency, the GPU frame time (measured with GPU timestamps), the individual frame times for the first and second render passes, and some other information that can probably be ignored. Note that the window size will affect the timings - while this is a geometry-heavy scene, culling and LOD selection take the window size into account.

For testing ring sizes, I use the patch mentioned in https://github.com/zeux/niagara/issues/30#issuecomment-2407688500 on top of the latest (branch main) radv, built with LLVM disabled. By default, VSYNC is enabled; to disable it, just change the CONFIG_VSYNC variable in
-
One more comment on RDNA2: while I no longer have access to the 6700 XT, I have an integrated RDNA2 GPU in my Zen4 CPU. It's obviously a GPU of a completely different class, but what I found interesting is that when testing with the same patch, the CP bottleneck doesn't appear to exist (the frame times are pretty flat with the ring size varying all the way from 256 to 16K); however, a queue size of 32K sees a significant jump up in render time for both passes in the configuration where cluster culling is enabled. I'm not sure why this is exactly; perhaps the CP needs to process the entire buffer's worth of dispatches, and that results in excessive processing by the CP for draw commands that aren't generated. But this points to either RDNA2 being just different, or to a much weaker GPU potentially wanting smaller queue sizes (without an integrated RDNA3 to test I can't disambiguate the two).
-
I think it's not a surprise that a small GPU would have different performance characteristics. My guess is that the CP in those GPUs is roughly the same, but it has (much) fewer compute units to feed. Furthermore, memory access is much slower on APUs than on dGPUs. Considering that Navi 31 can have max 96 x 2 x 16 = 3072 waves in flight and a 32K ring buffer works best, I would extrapolate that on a GPU that has only 2 CUs - which can have max 2 x 2 x 16 = 64 waves in flight - the ring buffer could be as small as 512 or 1024. Technically, since Raphael probably has other bottlenecks, it doesn't matter as much as on a dGPU. Also, I just noticed that all of the calculations I posted here are rendered incorrectly due to Markdown formatting.
-
Yes, I agree with that. My point was that the strategy of maximizing the queue size for small payloads backfires on this GPU for some reason. Maybe some other adjustments are needed elsewhere, or the queue size could be scaled with the CU count.
-
Have you tried 512 or 1024 on that small GPU? That would be equivalent to 32K on the large GPU.
-
As noted above, all sizes from 256 up to 16K perform the same (on that small GPU). Only 32K is an outlier (and results in a significant, 50%+, frame time regression). edit:
AMDVLK scales the task ring size with the GPU configuration, so it would default to 256 for the RDNA2 iGPU (the same as radv) but 8192 for the RDNA3 7900 GRE. This "solves" the small vs big GPU issue, I suppose.
-
Final note on this: I've looked into the discrepancies in performance I've seen here with AMDVLK. I'm using the latest Mesa build (as of today, 6800cd270306ad779b72bbed754bbcf463d1c78c) + https://github.com/GPUOpen-Drivers/AMDVLK/releases/tag/v-2024.Q3.3 AMDVLK & the latest master in this repository (as of today, 4242461). All numbers are at 4K resolution using Wayland with VSYNC off. This is a larger resolution than what I was using before, and other updates in the repository shifted the numbers; as before, numbers in this comment should only be understood in isolation / relative to each other, not relative to previous reports.

Latest master now defaults to not using task shaders (which is a new mode I implemented on a stream a week-ish ago, https://youtu.be/zROUBE5pLuI). This mode eliminates the overhead associated with task dispatch by using a single mesh grid with a compute shader replacing the task shader (running on the graphics queue and generating cluster ids). That's significantly faster, has no CP overhead, and allows both testing mesh shader performance in isolation and actually stressing the parts of the GPU pipeline that are downstream of the CP. In that mode, I see radv and amdvlk at ~parity with or without cluster occlusion culling:

radv: 2.20 ms/frame with cluster occlusion, 4.30 ms/frame without it.

With task shading and mesh shading disabled (traditional raster), I see radv being slightly faster:

radv: 4.81 ms/frame

Now, with task shading and mesh shading enabled (which is the configuration that was the default in my previous comment about task shading performance; now it's optional and is enabled by pressing a key), I see:

radv: 13.28 ms/frame with cluster occlusion, 7.16 ms/frame without it.

The times are much larger than the other reported times because of the CP bottleneck we've already discussed. And, as already explained, cluster occlusion culling in this configuration is actually detrimental to overall performance because the CP bottleneck makes the second pass too slow to be usable. As I noted in the previous comment, AMDVLK actually scales the ring buffer with the GPU configuration, and my GPU should be getting 8192 items in the ring buffer instead of radv's default 1024. Changing radv to use an 8192 ring size changes the results to:

RADV_TASKN=8192 radv: 11.67 ms/frame with cluster occlusion, 6.98 ms/frame without it.

So the previous performance delta I observed is entirely explained by the ring size.

RADV_TASKN=16384 radv: 9.31 ms/frame with cluster occlusion, 6.58 ms/frame without it.

So! The good news is that radv is consistently outperforming amdvlk; the only case where that's not happening is the one where radv configures the task queue without scaling it according to the GPU shader engines, which yields a smaller queue size vs amdvlk and consequently worse performance. The bad news is that not using task shaders is significantly faster than even the configuration with 32K items.

I'm going to close this issue because I don't think I can meaningfully provide more input here. My conclusions are:
-
Is there still a way to use the code path that utilizes task shaders, for reproducing the issue? While I understand you are moving on from this problem, I'd still like to finish investigating this on RADV.
-
Yes, you can still activate task shader mode by pressing
-
Not sure what the best place to talk about this is, so I decided maybe we can discuss it here. Hope this is okay.
Looking at the current code, I noticed that the mesh shader workgroup size is 64, but the shader has:
max_vertices = 64, max_primitives = 124
This means that the shader is going to have poor occupancy on AMD HW, effectively leaving 50% of shader invocations under-utilized. Note that this is also suboptimal on NVidia HW, which prefers a workgroup size of 32. I recommend having a compile-time constant for each of these values (similar to what you do for MESH_WGSIZE) and configuring them per vendor, as in the sketch below. You can achieve this by using a "compile-time loop" (a loop using the compile-time constants), which will be optimal on both AMD and NVidia.
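For illustration, here is a sketch of that shader structure (the define values, helper functions, and counts are placeholders, not the actual shader from this repo):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

// Per-vendor compile-time configuration, e.g. MESH_WGSIZE=32 on NVidia and
// 64 (or 128) on AMD, matched to the output limits.
#ifndef MESH_WGSIZE
#define MESH_WGSIZE 64
#endif
#define MESH_MAXVTX  64
#define MESH_MAXPRIM 124

layout(local_size_x = MESH_WGSIZE) in;
layout(triangles, max_vertices = MESH_MAXVTX, max_primitives = MESH_MAXPRIM) out;

// Placeholders for whatever the real shader does per vertex / per primitive:
vec4  transformVertex(uint i) { return vec4(0.0); }
uvec3 fetchTriangle(uint i)   { return uvec3(0u); }

void main()
{
    uint vertexCount    = MESH_MAXVTX;   // in practice, read from the meshlet header
    uint primitiveCount = MESH_MAXPRIM;

    SetMeshOutputsEXT(vertexCount, primitiveCount);

    // "Compile-time loops": the trip counts are compile-time constants, so the
    // compiler can unroll them, and the same source works for any MESH_WGSIZE.
    for (uint k = 0u; k < (MESH_MAXVTX + MESH_WGSIZE - 1) / MESH_WGSIZE; ++k)
    {
        uint i = k * MESH_WGSIZE + gl_LocalInvocationID.x;
        if (i < vertexCount)
            gl_MeshVerticesEXT[i].gl_Position = transformVertex(i);
    }

    for (uint k = 0u; k < (MESH_MAXPRIM + MESH_WGSIZE - 1) / MESH_WGSIZE; ++k)
    {
        uint i = k * MESH_WGSIZE + gl_LocalInvocationID.x;
        if (i < primitiveCount)
            gl_PrimitiveTriangleIndicesEXT[i] = fetchTriangle(i);
    }
}
```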