Replies: 52 comments
-
Also, you don't have to hardcode these in your program, you can use
-
One issue with this is that all of this requires rebuilding meshlet data, something that would ideally be done offline. It looks as if on AMD hardware specifically the shader is export-bound at the moment; I've looked a little at adding per-triangle culling and it does help performance significantly. Without per-triangle culling I don't seem to get any benefit from moving to the same number of max primitives as max vertices; with it, however, I do get better throughput with 64 max primitives, but that's a little too low. I'll test different configurations when I get time. Thanks for the suggestion!
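For reference, one common way to implement per-triangle culling in an EXT_mesh_shader mesh shader looks roughly like this - a sketch only, not the shader from this repo; MESH_MAXVTX, transformVertex and fetchTriangle are placeholder names:

```glsl
// Stage screen-space positions in shared memory (LDS), then mark back-facing
// triangles as culled via the per-primitive built-in gl_CullPrimitiveEXT.
shared vec2 screenXY[MESH_MAXVTX];

void emitWithPerTriangleCulling(uint vertexCount, uint primitiveCount)
{
    uint ti = gl_LocalInvocationID.x;

    if (ti < vertexCount)
    {
        vec4 clip = transformVertex(ti);          // placeholder: fetch + transform vertex ti
        screenXY[ti] = clip.xy / clip.w;
        gl_MeshVerticesEXT[ti].gl_Position = clip;
    }

    barrier();                                    // make screenXY visible to the primitive phase

    if (ti < primitiveCount)
    {
        uvec3 tri = fetchTriangle(ti);            // placeholder: read the meshlet index triple
        vec2 ab = screenXY[tri.y] - screenXY[tri.x];
        vec2 ac = screenXY[tri.z] - screenXY[tri.x];

        gl_PrimitiveTriangleIndicesEXT[ti] = tri;
        // Non-positive signed area => back-facing or degenerate (for CCW front faces).
        gl_MeshPrimitivesEXT[ti].gl_CullPrimitiveEXT = (ab.x * ac.y - ab.y * ac.x) <= 0.0;
    }
}
```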
-
I'm also wondering what happens on NV specifically with workgroups of size 64 vs 32 - and whether the shader that the driver runs is substantially different performance-wise from a shader that uses workgroup size 32 but has to process two vertices per invocation (which is what the shader I used for the NV extension did, but there it wasn't possible to test wider groups because the NV extension requires a workgroup size of 32, if I'm not mistaken).
-
No, it doesn't require rebuilding meshlet data. A workable compromise is to use a meshlet size of 64 (max 64 vertices and max 64 primitives). In this case, on NVidia you would output 1 meshlet per workgroup, and on AMD you could output 2 meshlets per workgroup. I personally haven't tested this, but it would be interesting to compare how different configs perform.
-
To your NVidia question: this is explained in one of their mesh shader blogs. As far as I understand, NVidia's problem is that it doesn't have proper workgroups, so the whole mesh shader workgroup is executed in a single warp and the workgroup is emulated using a loop. Therefore, you can get closer to what NVidia hardware actually runs if you use a workgroup size that matches their warp size.
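A purely conceptual way to picture that (illustrative only, not actual driver output):

```glsl
// Conceptual model: a local_size_x = 64 mesh shader runs on a single 32-wide warp,
// which loops over the "virtual" invocations of the workgroup.
void emulatedWorkgroupBody(uint warpLane)        // warpLane = 0..31
{
    for (uint pass = 0u; pass < 2u; ++pass)      // 64 / 32 = 2 passes
    {
        uint virtualInvocation = pass * 32u + warpLane;
        // ... original shader body, with gl_LocalInvocationID.x meaning virtualInvocation ...
    }
}
```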
-
What I meant is that varying the max sizes between vendors requires different meshlet data; just varying workgroup configurations of course doesn't. 64 & 64 is a little problematic depending on the mesh topology - I'd expect the 64-vertex limit to lead to an effective primitive count between 64 and 98 (98 corresponds to an 8x8 vertex grid: 7x7 quads, i.e. 98 triangles). Setting a primitive count limit of 64 limits you to something like 45 vertices per meshlet for smooth meshes, so you end up underutilizing the threads for vertex transformation. One other alternative is something like 128 vertices and 192 primitives, which is more balanced wrt the ratio, but still problematic because now it means we need to write all vertex data to LDS :)
Right, but a workgroup of 64 would be compiled into two sequential passes of 32 invocations each, vs a shader that uses more or less the same loop if it needs to process a meshlet with >32 vertices/primitives. I understand that using a workgroup of 64 doesn't match the hardware perfectly, but the question is where the resulting inefficiencies come from.
-
Yes, the trick is to find a meshlet size which works fine on both vendors, and then use the same meshlet size but with a slightly different workgroup config.
I think it's worth experimenting with a meshlet size of max vertices = 128, max primitives = 128, and then using a 128-sized workgroup on AMD and 32 (or 64?) on NVidia.
Unfortunately I don't know any more details beyond what I said above, only that this is their recommendation.
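For illustration, the per-vendor configuration could be expressed with compile-time defines along these lines (a sketch; MESH_MAXVTX/MESH_MAXPRIM are made-up names in the spirit of the existing MESH_WGSIZE, and the values are just the ones from this comment):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

// Passed as -D defines by the build: e.g. MESH_WGSIZE=128 on AMD, 32 (or 64) on NVidia,
// with the same 128/128 meshlet limits on both.
#ifndef MESH_WGSIZE
#define MESH_WGSIZE 128
#endif
#ifndef MESH_MAXVTX
#define MESH_MAXVTX 128
#endif
#ifndef MESH_MAXPRIM
#define MESH_MAXPRIM 128
#endif

layout(local_size_x = MESH_WGSIZE) in;
layout(triangles, max_vertices = MESH_MAXVTX, max_primitives = MESH_MAXPRIM) out;

void main()
{
    // Real shader body goes here; emitting zero outputs keeps this skeleton valid.
    SetMeshOutputsEXT(0, 0);
}
```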
-
One more thought about this. If you definitely don't want to increase the number of max output vertices but you want to use max 128 output primitives, it is still worth it (on AMD) to increase the workgroup size to 128 and make your primitive processing more parallel than it currently is.
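Concretely, the primitive phase can be written as a loop that strides by the workgroup size (a sketch; primitiveCount and fetchTriangle are placeholders), so that with a 128-wide workgroup and 128 max primitives each invocation handles at most one primitive, while a 64-wide workgroup would handle two per invocation:

```glsl
for (uint i = gl_LocalInvocationID.x; i < primitiveCount; i += MESH_WGSIZE)
    gl_PrimitiveTriangleIndicesEXT[i] = fetchTriangle(i);   // fetchTriangle: placeholder helper
```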
-
Can you elaborate on why on AMD there's a benefit to going above 64? It's not intuitively obvious that this should help, as 64 (and sometimes 32) is the HW wavefront size.
-
On RDNA2, each invocation can only really create at most 1 vertex and 1 primitive. Any other kind of access pattern is emulated by the driver. This also implies that the driver may need to launch more invocations than your specified workgroup size in order to fit a larger output. If you have a workgroup size of 64 but a max primitive count of 126, then the "real" workgroup size will be 126 (this fits in 2 waves, which have 128 invocations).
So, in fact, there are 128 invocations running but you don't utilize all of them. It is more efficient to write your code in a manner that utilizes all invocations instead of letting them sit there doing nothing most of the time. I try to explain this in my blog post "How mesh shaders are implemented in an AMD driver".
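Put as arithmetic, the model described above amounts to something like this (illustrative only, not driver code):

```glsl
// Launched invocations ~= max(workgroup size, max_vertices, max_primitives),
// rounded up to a whole number of waves.
const uint WAVE_SIZE   = 64u;
const uint WG_SIZE     = 64u;
const uint MAX_VERTS   = 64u;
const uint MAX_PRIMS   = 126u;
const uint NEEDED      = max(WG_SIZE, max(MAX_VERTS, MAX_PRIMS));             // 126
const uint HW_LAUNCHED = ((NEEDED + WAVE_SIZE - 1u) / WAVE_SIZE) * WAVE_SIZE; // 128 (2 waves)
```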
-
Ah, that explains a lot! It's indeed substantially different compared to the NV model. I didn't realize that the restriction on emission also applies to primitives - I thought it was just the vertices.
-
It seems that a few others also struggle to understand this; e.g. GravityMark has the same problem. So I think I explained it poorly... Can you suggest a good way to edit my blog post to clarify this?
-
By the way, at least in radv it looks like mesh shaders are always compiled with wave size 64. Do you know if this is a hardware restriction or a driver limitation? I can't currently test any other AMD drivers with mesh shading support... The reason I ask is that I was hoping for something like max_vertices=64 max_triangles=96 to work reasonably well with wave32, but it looks like this is inefficient, as it effectively uses the same wave configuration as max_vertices=64 max_triangles=124.
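Spelling out the wave math behind that observation (illustrative arithmetic only):

```glsl
// With Wave64, both 96 and 124 max primitives round up to 2 waves (128 lanes),
// so the two configs launch the same amount of hardware work; Wave32 would need
// only 3 waves (96 lanes) for the 96-primitive case.
const uint WAVES64_FOR_96  = (96u  + 63u) / 64u; // 2 waves = 128 invocations
const uint WAVES64_FOR_124 = (124u + 63u) / 64u; // 2 waves = 128 invocations
const uint WAVES32_FOR_96  = (96u  + 31u) / 32u; // 3 waves =  96 invocations
const uint WAVES32_FOR_124 = (124u + 31u) / 32u; // 4 waves = 128 invocations
```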
-
Also, based on GPUOpen-Drivers/llpc@772eef3, my understanding is that on GFX11 (RDNA3) row export would allow emitting more than one vertex or primitive per thread, which would be great as it would provide the much-needed flexibility for balancing performance. Not sure if GFX11 has other relevant changes for mesh shading.
-
It's just the default in our driver. You can use the
Worth a try. Yes it would be inefficient in Wave64 mode. Maybe we should add special casing for 32 and 96.
This is correct, but I haven't implemented that in RADV yet. (I am on vacation this week and will get back to work next week.) However, it will still need some shuffling between SIMD lanes.
Yes, it also has a new "fast launch" mode, which will eliminate the need for launching shader invocations that "do nothing".
-
I'm not sure - it's complicated. The discrepancy (between Windows & Linux/radv) exists according to both the overall frame rate and the individual GPU timers as captured in real time. This is with vsync off, so in theory I'd expect similar clocks when just running the app (there's little idle time), but I would need to verify this. When capturing with RGP on Windows, I see nominal performance before the capture to be the same - higher than radv - but when I capture, the resulting capture has a longer duration that more or less matches what radv captures. radv captures also show a longer capture duration, but it's less severe (e.g. radv is 5ms frame time with no capture, 5.2ms in an RGP capture; Windows is 4.5ms frame time with no capture, 5.2ms in an RGP capture).

Additionally, when using cluster culling with the aforementioned second pass experiencing long delays due to the task queue bottleneck, the RGP profiles just look very different between radv and the Windows driver - the total number of vkCmdDispatch that radv shows is a close multiple of 1024 (the queue size), whereas I am dispatching fewer mesh workgroups; the Windows RGP capture shows a more reasonable count with a much shorter duration, but there's still a gap at the end where "nothing" happens on the timeline, so I'm not sure whether this is a capture artifact or not.

Consequently, there are a lot of variables here, and to test any theory on both OSes I have to reboot back and forth, which is time consuming, so it would be better to compare radv vs amdvlk - but for that I need an amdvlk binary that just works, so I'll wait :)
-
Understood. Regardless of what RGP says, it is nice to see that the two drivers at least perform in the same ballpark.
-
FWIW, the number of draw entries is 1024, so theoretically 1024 task shader workgroups could be in flight on the GPU at any given time. Considering that the 7900 GRE has 80 CUs (note, the top Navi 31 has up to 96 CUs), each CU has 2 SIMDs, and each SIMD can have 16 waves in flight, the 7900 GRE can have 80 x 2 x 16 = 2560 waves in flight at a time (this is 96 x 2 x 16 = 3072 on the top Navi 31). Now, the task shaders here are just 1 Wave64 wave per workgroup, meaning we should be able to have at least 40% occupancy on your GPU with just task shaders (this would be 33% on the top Navi 31), assuming each task workgroup launches 0 mesh shaders - but in reality the occupancy seems to be much lower. Theoretically it should be enough to increase the number of entries to 4096 (the closest power of two after 3072) to get full occupancy, but that doesn't seem to help either. So, I suspect that either the issue is due to the firmware, or the driver does something really stupid that nobody has noticed.
-
I agree that this doesn't fully make sense. I originally expected that this is just an issue with CP throughput, but thinking about this further, unless the CP has some sort of parallel draw call rejection feature, increasing the queue size would not by itself allow the "empty" dispatch pass to accelerate more than 2x (by allowing the CP & task shaders to run in parallel). And if the CP is not the bottleneck, then it's unclear why the queue needs to be increased so much to see good gains, and why modest increases in queue size don't seem to help much.

Also, I was discussing this with a dev who had a DX12 playground with very different code, and they observed similar timings for empty dispatches with a lighter-weight task shader, so that also seems to suggest a CP bottleneck - which again makes it a little odd that the gains are so dramatic. But I could imagine some sort of horrible synchronization ping-ponging where both the CP and the task shaders are constantly stalled - the CP waiting for items to be written ahead of the current read pointer, and the task shaders waiting for available space in the ring buffer - which magnifies the slowdown well past 2x and requires the read and write regions of the ring buffer to be far apart to isolate them from conflict. It's also possible this is some sort of driver issue, but because the synchronization is entirely on the firmware side, and because this is present in all drivers (radv, amdvlk/linux, amdvlk/windows - although the latter two are the same source code of course), I'd think this is either some inefficiency in the firmware, or a fundamental design defect of the hardware.

FWIW I've hit the same type of problem - a bottleneck from empty draws in CP processing - on RDNA2 in this code base, when I was using multi draw indirect and had multiple draw calls; this was fixed for RDNA2 by switching to a single draw call (commit 463063c, YT stream https://www.youtube.com/live/eYvGruGHhUE). I went back to this commit, which allowed switching between the two modes, and this is still super valuable on RDNA3 - but whereas on the GPU I had at the time (6700 XT) doing this alone basically lifted the CP bottleneck, on the 7900 GRE it is very beneficial but not enough to eliminate the CP bottleneck. I'm not sure if this is because the firmware was different, or because the delta between shader processing capacity and CP processing capacity (or something else) changed.
-
Oh, one other thing that complicates correct comparison here that I think I should note: enabling an RGP trace in the 32K queue size configuration slows things back down. Specifically, when I use cluster culling with the slow second pass and the default 1024 queue size, I get the following timings (sorry that these are different from before; I've been optimizing some unrelated bits and pieces, so treat these in isolation from previous timings):

gpu render early 4.6ms

When I use queue size 32K, I get the following:

gpu render early 1.2ms

I am running these in a CPU-GPU synced configuration (the CPU waits for GPU frame completion), so I am inclined to trust the results, because the CPU timing can not be spoofed even if GPU counters somehow measure the wrong thing. However, when I enable MESA_VK_TRACE=rgp, I get this with 32K entries (the numbers with 1024 entries are in the same ballpark as without rgp):

gpu render early 5.2ms

This makes it especially difficult to correctly measure all of this, which is part of the "it's complicated" I referred to: I do not trust RGP captures not to skew the numbers to the point of not being useful with this code... The RGP capture for the 32K queue does show a huge gap in the second pass where seemingly nothing happens - maybe this is the actual capture being done by the firmware (?) - compared to the 1K queue capture, where the gap exists but is much smaller. But note that in the 32K capture the duration of the first render pass is similar to the 1K capture, whereas in practice (without an RGP capture) it's way faster. This is observed even in a setup where I disable everything I can disable via RADV_THREAD_TRACE_QUEUE_EVENTS=false RADV_THREAD_TRACE_INSTRUCTION_TIMING=false RADV_THREAD_TRACE_CACHE_COUNTERS=false.
-
I think what you say makes sense - that the bottleneck is the CP firmware - though I personally have no insight into what exactly is happening in there. I wouldn't be surprised, since it seems that both of these GPUs are bottlenecked by the CP in general, and the 7900 even more so. Based on what you posted, it looks like taking an RGP trace will by itself slow everything down to such an extent that we are essentially looking at a different thing entirely. If you have time, you could try setting
Sorry I haven't been able to try it myself yet; I just recently got back from XDC 2024. I would like to test this on my 7900 XTX too. Can you please give me a quick walkthrough of what I need to do to run your test cases?
-
Yeah, this basically eliminates most of the tracing overhead AFAICT; this is with that marker set to 0 and with 1024 queue items: [RGP capture] ... and this is with 32768 queue items (both captures are with cluster culling enabled): [RGP capture]. The overall frame times barely change under RGP in that setup. It would be nice to have this as a tracing option maybe, so that it could be disabled for testing draw-heavy workloads...

No worries at all, this is extremely not urgent :) If you'd like to reproduce any of the results yourself, it should be sufficient to:

git clone --recursive https://github.com/zeux/niagara
cd niagara
git checkout 745700cda87bcd268493c06b917debd611620d98
cmake . -DCMAKE_BUILD_TYPE=Release
make -j8
./niagara data/kitten.obj

For the build to work, you would need, at the minimum, Vulkan-Headers and glslang in the path; I use the Vulkan SDK for this, but IIRC installing these two separately, at least on Ubuntu, also works.

By default, the rendering pipeline uses mesh shaders, with task shaders that cull clusters based on frustum/backface but do not do per-cluster occlusion culling, which is what causes the second pass to reprocess all meshes while outputting zero mesh shading workgroups. You can toggle cluster occlusion culling by pressing K. The other useful key is M, which disables the mesh/task pipeline altogether (as well as all forms of cluster culling!) and just uses traditional raster. Because the CP bottleneck affects the first pass as well, without a large ring buffer capacity the traditional path is much faster than the mesh shading path, at least on the 7900 GRE; with the patched driver and a 32768 ring size, I get similar timings.

The title bar shows the total CPU latency, the GPU frame time (measured with GPU timestamps), the individual frame times for the first and second render passes, and some other information that can probably be ignored. Note that the window size will affect the timings - while this is a geometry-heavy scene, culling and LOD selection take the window size into account.

For testing ring sizes, I use the patch mentioned in https://github.com/zeux/niagara/issues/30#issuecomment-2407688500 on top of the latest (branch main) radv, built with LLVM disabled. By default, VSYNC is enabled; to disable it, just change the CONFIG_VSYNC variable in
-
One more comment on RDNA2: while I no longer have access to the 6700 XT, I have an integrated RDNA2 GPU in my Zen4 CPU. It's obviously a GPU of a completely different class, but what I found interesting is that when testing with the same patch, the CP bottleneck doesn't appear to exist (the frame times are pretty flat with the ring size varying all the way from 256 to 16K); however, a queue size of 32K sees a significant jump up in render time for both passes in the configuration where cluster culling is enabled. I'm not sure why this is exactly; perhaps the CP needs to process the entire buffer's worth of dispatches, and that results in excessive processing by the CP for draw commands that aren't generated. But this points to either RDNA2 being just different, or to a much weaker GPU potentially wanting smaller queue sizes (without an integrated RDNA3 to test I can't disambiguate the two).
-
I think it's not a surprise that a small GPU would have different performance characteristics. My guess is that the CP in those GPUs is roughly the same, but it has (much) fewer compute units to feed. Furthermore, memory access is much slower on APUs than on dGPUs. Considering that Navi 31 can have max 96 x 2 x 16 = 3072 waves in flight and a 32K ring buffer works best, I would extrapolate that on a GPU that has only 2 CUs - which can have max 2 x 2 x 16 = 64 waves in flight - the ring buffer could be as small as 512 or 1024. Technically, since Raphael probably has other bottlenecks, it doesn't matter as much as on a dGPU. Also, I just noticed that all of the calculations I posted here are rendered incorrectly due to Markdown formatting.
-
Yes, I agree with that. My point was that the strategy of maximizing the queue size for small payloads backfires on this GPU for some reason. Maybe some other adjustments are needed elsewhere, or the queue size could be scaled with the CU count.
-
Have you tried 512 or 1024 on that small GPU? That would be equivalent to 32K on the large GPU.
-
As noted above, all sizes from 256 up to 16K perform the same (on that small GPU). Only 32K is an outlier (and results in a significant, 50%+, frame time regression). edit:
AMDVLK scales the task ring size with the GPU configuration, so it would default to 256 for the RDNA2 iGPU (the same as radv) but 8192 for the RDNA3 7900 GRE. This "solves" the small vs big GPU issue, I suppose.
-
Final note on this: I've looked into the discrepancies in performance I've seen here with AMDVLK. I'm using the latest Mesa build (as of today, 6800cd270306ad779b72bbed754bbcf463d1c78c) + https://github.com/GPUOpen-Drivers/AMDVLK/releases/tag/v-2024.Q3.3 AMDVLK & the latest master in this repository (as of today, 4242461). All numbers are at 4K resolution using Wayland with VSYNC off. This is a larger resolution than what I was using before, and other updates in the repository shifted the numbers; as before, numbers in this comment should only be understood in isolation / relative to each other, not relative to previous reports.

Latest master now defaults to not using task shaders (which is a new mode I implemented on a stream a week-ish ago, https://youtu.be/zROUBE5pLuI). This mode eliminates the overhead associated with task dispatch by using a single mesh grid with a compute shader replacing the task shader (running on the graphics queue and generating cluster ids). That's significantly faster, has no CP overhead, and allows both testing mesh shader performance in isolation and actually stressing the parts of the GPU pipeline that are downstream of the CP. In that mode, I see radv and amdvlk at ~parity with or without cluster occlusion culling:

radv: 2.20 ms/frame with cluster occlusion, 4.30 ms/frame without it.

With task shading and mesh shading disabled (traditional raster), I see radv being slightly faster:

radv: 4.81 ms/frame

Now, with task shading and mesh shading enabled (which is the configuration that was the default in my previous comment about task shading performance; now it's optional and is enabled by pressing a key), I see:

radv: 13.28 ms/frame with cluster occlusion, 7.16 ms/frame without it.

The times are much larger than the other reported times because of the CP bottleneck we've already discussed. And, as already explained, cluster occlusion culling in this configuration is actually detrimental to overall performance because the CP bottleneck makes the second pass too slow to be usable. As I noted in the previous comment, AMDVLK actually scales the ring buffer with the GPU configuration, and my GPU should be getting 8192 items in the ring buffer instead of radv's default 1024. Changing radv to use an 8192 ring size changes the results to:

RADV_TASKN=8192 radv: 11.67 ms/frame with cluster occlusion, 6.98 ms/frame without it.

So the previous performance delta I observed is entirely explained by the ring size.

RADV_TASKN=16384 radv: 9.31 ms/frame with cluster occlusion, 6.58 ms/frame without it.

So! The good news is that radv is consistently outperforming amdvlk; the only case where that's not happening is the one where radv configures the task queue without scaling it according to the GPU shader engines, which yields a smaller queue size vs amdvlk and consequently worse performance. The bad news is that not using task shaders is significantly faster than even the configuration with 32K items.

I'm going to close this issue because I don't think I can meaningfully provide more input here. My conclusions are:
-
Is there still a way to use the code path that utilizes task shaders, for reproducing the issue? While I understand you are moving on from this problem, I'd still like to finish investigating this on RADV.
-
Yes, you can still activate task shader mode by pressing
-
Not sure what the best place to talk about this is, so I decided maybe we can discuss it here. Hope this is okay.
Looking at the current code, I noticed that the mesh shader workgroup size is 64, but the shader has:
max_vertices = 64, max_primitives = 124
This means that the shader is going to have poor occupancy on AMD HW, effectively leaving 50% of shader invocations under-utilized. Note that this is also suboptimal on NVidia HW, which prefers a workgroup size of 32. I recommend having a compile-time constant for each of these values (similar to what you do for MESH_WGSIZE) and configuring them per vendor, as in the sketch below. You can achieve this by using a "compile-time loop" (a loop using the compile-time constants), which will be optimal on both AMD and NVidia.
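For illustration, here is a sketch of that shader structure (the define values, helper functions, and counts are placeholders, not the actual shader from this repo):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

// Per-vendor compile-time configuration, e.g. MESH_WGSIZE=32 on NVidia and
// 64 (or 128) on AMD, matched to the output limits.
#ifndef MESH_WGSIZE
#define MESH_WGSIZE 64
#endif
#define MESH_MAXVTX  64
#define MESH_MAXPRIM 124

layout(local_size_x = MESH_WGSIZE) in;
layout(triangles, max_vertices = MESH_MAXVTX, max_primitives = MESH_MAXPRIM) out;

// Placeholders for whatever the real shader does per vertex / per primitive:
vec4  transformVertex(uint i) { return vec4(0.0); }
uvec3 fetchTriangle(uint i)   { return uvec3(0u); }

void main()
{
    uint vertexCount    = MESH_MAXVTX;   // in practice, read from the meshlet header
    uint primitiveCount = MESH_MAXPRIM;

    SetMeshOutputsEXT(vertexCount, primitiveCount);

    // "Compile-time loops": the trip counts are compile-time constants, so the
    // compiler can unroll them, and the same source works for any MESH_WGSIZE.
    for (uint k = 0u; k < (MESH_MAXVTX + MESH_WGSIZE - 1) / MESH_WGSIZE; ++k)
    {
        uint i = k * MESH_WGSIZE + gl_LocalInvocationID.x;
        if (i < vertexCount)
            gl_MeshVerticesEXT[i].gl_Position = transformVertex(i);
    }

    for (uint k = 0u; k < (MESH_MAXPRIM + MESH_WGSIZE - 1) / MESH_WGSIZE; ++k)
    {
        uint i = k * MESH_WGSIZE + gl_LocalInvocationID.x;
        if (i < primitiveCount)
            gl_PrimitiveTriangleIndicesEXT[i] = fetchTriangle(i);
    }
}
```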