Divide large constant buffer into subsets and implement push constants for Vulkan performance #818

SRSaunders · 2023-10-12T05:39:11Z

This PR depends on RobertBeckebans/nvrhi#6

This PR also solves #763 for Apple Silicon performance and rendering artifact elimination, as well as #804 for Intel Integrated GPU support with Vulkan (slow, but works).

This is a fairly large pull request that implements the following:

Separates the single large constant buffer into renderparm subsets (12 in total: 3 of 128 bytes in size, 6 greater than 128 bytes but less than 256 bytes, and 3 greater than 256 bytes but less than 1024 bytes)
Adds new binding layout types to associate and differentiate between the new subsets (BINDING_LAYOUT_GBUFFER, BINDING_LAYOUT_GBUFFER_SKINNED, BINDING_LAYOUT_TEXTURE, BINDING_LAYOUT_TEXTURE_SKINNED, BINDING_LAYOUT_WOBBLESKY, BINDING_LAYOUT_SSGI, BINDING_LAYOUT_SSGI_SKINNED, BINDING_LAYOUT_POST_PROCESS)
Implements push constants for Vulkan and DX12 across all platforms: Linux, macOS, Windows. This has varying degrees of performance improvement, the largest being on Vulkan for Linux and macOS. Windows Vulkan shows modest improvement dependent on the GPU vendor (Nvidia's 256 byte limit is better than AMD's 128 byte limit on Windows). Windows DX12 shows no performance improvement when using push constants vs. volatile constant buffers. Given this, I have defined a new boolean r_useDX12PushConstants cvar which is turned off by default. This can optionally be turned on using autoexec.cfg for experimentation.
Reduced the volatile constant max buffer count from 16,384 to 8,192. I believe this is sufficient but if testing reveals differently, then it could be boosted back up. Note that when push constants are enabled it reduces the requirement.
Adds basic infrastructure for static constant buffers but these are disabled for now. This could be a possibility for the future but further subset refactoring would likely be needed, and sync issues would have to be resolved.
Simplified Vulkan code by removing the barrier command list - not needed if command submit is done in correct order
Added CPU and GPU usage % counters to the on-screen HUD display for all platforms
Added MoltenVK's Vulkan-to-Metal encoding time to the HUD when available for macOS only. This capability will be available in the next release of MoltenVK (1.2.6)
Addressed a nasty bug with mis-calibrated GPU timers for macOS caused by a regression in the MoltenVK 1.2.5 release. Use either MoltenVK release <= 1.2.4 or the coming 1.2.6 release with this pull request to avoid the issue.
Modified the ssao_compute fix to support multiple HLSL versions, not just 2021. I found that my changes were required for compiling DXIL for D3D12 while remaining compatible with SPIRV for Vulkan, otherwise I would get compile failures on Windows DX12 (select does not seem to be supported when compiling for DXIL).
A minor CMakeLists fix primarily for Xcode that cleans up precompiled.h-xxxxx.gch.tmp files left around when the ZERO_CHECK target runs for regeneration.
UPDATED: Tested on Apple Silicon (M1) and disabled GPU Skinning for macOS arm64 to eliminate rendering artifacts. I will have to look at this later to see what is going on with GPU skinning on Apple Silicon.
UPDATED: Modified cmake-macos-*.sh and cmake-xcode-*.sh build scripts for openal-soft path portability across x86 and Apple Silicon. Thanks to @asemarafa for the code.
UPDATED: Works around missing Vulkan shaderStorageImageReadWithoutFormat device feature on Intel GPUs, and individually activates VK_KHR_fragment_shading_rate sub-features vs. all or none (this is supported by nvrhi).
FINAL UPDATE: Fixed uniforms change detection logic (orthogonal to push constants) which has a very significant positive impact on performance. See updated performance timings below. Also added new cvar r_useVulkanPushConstants (default on) which is useful for performance comparisons.

Tested on Windows 10 (AMD and Nvidia), Linux Manjaro, and macOS Ventura 13.5

Performance timings for this PR vs. current master, generated using a simple home-made timedemo:

Windows Nvidia System (1070 Ti)
DX12: 263 fps before, ~~255~~ 360 fps after (with r_useDX12PushConstants = 0) --> significant improvement
Vulkan: 218 fps before, ~~233~~ 333 fps after --> significant improvement

Windows AMD System (6600 XT)
DX12: 295 fps before, ~~285~~ 305 fps after (with r_useDX12PushConstants = 0) --> neutral/positive improvement
Vulkan: 150 fps before, ~~155~~ 160 fps after --> neutral/positive improvement

Linux AMD System (6600 XT)
Vulkan: 150 fps before, ~~210~~ 270 fps after --> large improvement

macOS AMD System (6600 XT)
Vulkan: 77 fps before, ~~177~~ 245 fps after --> very large improvement

macOS Apple Silicon System (M1 Air) (new)
Vulkan: 6 fps before, 85 fps after --> massive improvement

Some possible explanations are as follows:

Windows nvrhi DX12 push constants are implemented as D3D12 root constants which are just a special constant buffer of limited size (about 240 bytes for RBDoom3BFG). I suspect there is no real performance gain by using these over nvrhi's volatile constant buffers which appear to be quite efficient with low synchronization overhead on Windows. As a result r_useDX12PushConstants = 0 by default.
Nvidia offers 256 bytes of Vulkan push constants on Windows, while AMD offers only 128 bytes of Vulkan push constants on Windows. For this reason AMD Vulkan users on Windows will likely see a small positive change in performance. Nvidia users on Windows will see a greater ~~but still modest~~ improvement for Vulkan.
On Linux both Nvidia and AMD can offer 256 bytes of Vulkan push constants. Perhaps AMD's driver implementation on Linux favours push constants due to reduced synchronization overhead and the performance gains are quite significant. I would be interested in seeing some perf timings for Nvidia on Linux.
On macOS/MoltenVK/Metal there are 4096 bytes of push constants. This means that virtually all of the draw transactions are handled with push constants and volatile constant buffers are not needed. This is a good thing since constant buffer synchronization on macOS/MoltenVK/Metal appears to be expensive. Push constants do not incur any synchronization overheads and are delivered with command submission.
Overall I think this is a useful change, especially for Vulkan users on Linux and macOS. I would go as far as saying it is mandatory for macOS on Apple Silicon, as the game is virtually unplayable without it. The Windows benefits are ~~less, but still in the positive direction for Vulkan~~ significant for DX12 & Vulkan on Nvidia, but less so on AMD, at least for my test systems/GPUs.
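The platform limits described above suggest a simple per-subset decision: a renderparm subset goes through push constants only when the feature is enabled and the subset fits within the device limit. The following is a minimal sketch of that idea, not the code from this PR. The struct, the function names, and the individual subset sizes are hypothetical, chosen only to match the size classes listed in the description (3 of 128 bytes, 6 between 128 and 256 bytes, 3 between 256 and 1024 bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one renderparm subset (illustrative only).
struct RenderParmSubset
{
    const char* name;
    std::size_t sizeBytes;
};

// A subset can use push constants only if the feature is enabled (e.g. via a
// cvar) and the subset fits within the device's push constant limit.
bool UsePushConstants( const RenderParmSubset& subset, std::size_t deviceLimitBytes, bool cvarEnabled )
{
    assert( subset.sizeBytes > 0 );
    return cvarEnabled && subset.sizeBytes <= deviceLimitBytes;
}

// Counts how many subsets would be delivered as push constants; the rest
// would fall back to volatile constant buffers.
std::size_t CountPushConstantSubsets( const std::vector<RenderParmSubset>& subsets,
                                      std::size_t deviceLimitBytes, bool cvarEnabled )
{
    std::size_t count = 0;
    for( const RenderParmSubset& s : subsets )
    {
        if( UsePushConstants( s, deviceLimitBytes, cvarEnabled ) )
        {
            ++count;
        }
    }
    return count;
}

// Illustrative sizes only: 3 x 128 bytes, 6 x 192 bytes, 3 x 512 bytes.
std::vector<RenderParmSubset> ExampleSubsets()
{
    std::vector<RenderParmSubset> subsets;
    for( int i = 0; i < 3; ++i ) { subsets.push_back( { "small", 128 } ); }
    for( int i = 0; i < 6; ++i ) { subsets.push_back( { "medium", 192 } ); }
    for( int i = 0; i < 3; ++i ) { subsets.push_back( { "large", 512 } ); }
    return subsets;
}
```

With these assumed sizes, a 128 byte limit covers 3 subsets, a 256 byte limit covers 9, and a 4096 byte limit covers all 12, which is consistent with the observation that on macOS virtually all draws can avoid volatile constant buffers.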

@RobertBeckebans here is the spreadsheet I used to create the binding layout type to shader/renderparm subset mapping:
Binding to Shader Mapping v4.xlsx

RobertBeckebans · 2023-10-18T11:13:19Z

The results are awesome but I need quite a lot of testing and time to review this code. I aim for a new public release in December before Christmas.

SRSaunders · 2023-10-18T16:42:18Z

Thanks @RobertBeckebans - take your time on review. It took me a while to figure this out anyways, plus a bunch of work with the MoltenVK project to diagnose and optimize performance issues. It was my summer/fall background project :)

If you have questions please don't hesitate to post a message or send an email. I will do my best to answer.

SRSaunders · 2023-10-21T22:55:23Z

Hi @RobertBeckebans. I have pushed my final design update to this branch. Commits 2063c72 and c1f712a add one small but critical update for performance - fixing the uniforms change detection logic which reduces how often volatile constant buffers are written for the larger renderparm sets. This has a huge effect which is orthogonal to push constants, and has a positive impact across all platforms: Windows, Linux, & macOS. See updated timing info above.
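To illustrate the general idea of uniforms change detection, here is a minimal sketch under stated assumptions. It is not the logic from commits 2063c72 and c1f712a. The class, the parm and subset counts, and the membership masks are all hypothetical. The point is only that a subset is re-written when one of its own renderparms actually changed value.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kNumParms = 8;    // hypothetical; a real renderer has many more
constexpr std::size_t kNumSubsets = 2;  // hypothetical

struct Float4
{
    float v[4];
};

class RenderParmTracker
{
public:
    // membership[s] has bit p set when renderparm p belongs to subset s.
    explicit RenderParmTracker( const std::array<std::bitset<kNumParms>, kNumSubsets>& membership )
        : membership_( membership )
    {
        subsetDirty_.set(); // force the first upload of every subset
    }

    void SetParm( std::size_t index, const Float4& value )
    {
        assert( index < kNumParms );
        // Skip the write entirely when the value is bitwise identical.
        if( std::memcmp( &parms_[index], &value, sizeof( Float4 ) ) == 0 )
        {
            return;
        }
        parms_[index] = value;
        // Mark only the subsets that contain this renderparm.
        for( std::size_t s = 0; s < kNumSubsets; ++s )
        {
            if( membership_[s].test( index ) )
            {
                subsetDirty_.set( s );
            }
        }
    }

    // Returns true when the subset must be re-uploaded, then clears its flag.
    bool ConsumeDirty( std::size_t subset )
    {
        assert( subset < kNumSubsets );
        const bool wasDirty = subsetDirty_.test( subset );
        subsetDirty_.reset( subset );
        return wasDirty;
    }

private:
    std::array<std::bitset<kNumParms>, kNumSubsets> membership_;
    std::array<Float4, kNumParms> parms_{};
    std::bitset<kNumSubsets> subsetDirty_;
};
```

A draw call would check ConsumeDirty for the subset tied to its binding layout and write the constant buffer (or push constants) only when it returns true.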

I am not sure why my AMD 6600XT GPU seems to be limited when running Windows/Vulkan, with other OSs surpassing it with the same card. Perhaps it is something about AMD’s Windows drivers or AMD’s 128 byte limit for Vulkan push constants on Windows.

SRSaunders · 2023-11-30T19:27:58Z

I added a few minor things due to dependencies on previous changes within this PR:

Added a few additional comments to explain some of the important changes in this PR
Don't allocate constant buffers unless required (i.e. when push constants disabled for binding layout type)
For the statistics overlay HUD, I added GPU Memory usage and smoothing to CPU/GPU usage percentages
Made the VMA header file visible within the IDE source tree under libs/vma (CMakeLists change)
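The smoothing method used for the HUD percentages is not described here. One common way to smooth a noisy per-frame reading is an exponential moving average, sketched below purely as an assumption. The class name and the alpha parameter are hypothetical.

```cpp
#include <cassert>

// Exponential moving average for a per-frame usage percentage.
// alpha close to 1 follows new samples quickly; alpha close to 0 smooths more.
class SmoothedPercent
{
public:
    explicit SmoothedPercent( float alpha )
        : alpha_( alpha )
    {
        assert( alpha > 0.0f && alpha <= 1.0f );
    }

    float Update( float sample )
    {
        if( !initialized_ )
        {
            // Seed with the first sample so the display does not ramp up from zero.
            value_ = sample;
            initialized_ = true;
        }
        else
        {
            value_ += alpha_ * ( sample - value_ );
        }
        return value_;
    }

    float Value() const
    {
        return value_;
    }

private:
    float alpha_;
    float value_ = 0.0f;
    bool initialized_ = false;
};
```

Calling Update once per frame with the raw CPU or GPU usage and displaying the returned value keeps the HUD number from jittering between frames.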

SRSaunders · 2023-12-20T18:57:39Z

Oops, just realized that ImmediateMode was broken for using debug tools with push constants enabled.

Fixed in ee3b6f9

SRSaunders · 2023-12-28T19:53:29Z

macOS only: I just added support for the new VK_EXT_layer_settings extension used for configuring the upcoming MoltenVK 1.2.7 / Vulkan SDK 1.3.272.0. See commit f3c65ee.

@RobertBeckebans The last several commits have been added mostly due to dependencies on earlier changes within this PR. Unless something comes up in your testing, I don't plan any more changes here.

labrnth · 2024-01-01T19:22:28Z

@SRSaunders I tried building the XCode release using this branch but ran into a couple of errors. The following diff fixed the XCode build:

diff --git a/neo/renderer/GuiModel.cpp b/neo/renderer/GuiModel.cpp
index 5db04303..fb4547b5 100644
--- a/neo/renderer/GuiModel.cpp
+++ b/neo/renderer/GuiModel.cpp
@@ -395,9 +395,9 @@ void idGuiModel::EmitImGui( ImDrawData* drawData )
                        idScreenRect clipRect =
                        {
                                static_cast<short>( pcmd->ClipRect.x ),
-                               io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.w ),
+                static_cast<short>(io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.w )),
                                static_cast<short>( pcmd->ClipRect.z ),
-                               io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.y ),
+                static_cast<short>(io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.y )),
                                0.0f,
                                1.0f
                        };
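For context, the errors are presumably narrowing conversions: in brace initialization a float expression cannot implicitly initialize a short member, and Clang treats that as an error. The sketch below shows the pattern in isolation. ScreenRect is a hypothetical stand-in for idScreenRect (its member layout is assumed here), and MakeClipRect is an illustrative helper, not a function from the code base.

```cpp
#include <cassert>

// Hypothetical stand-in for idScreenRect: four short coordinates plus a depth range.
struct ScreenRect
{
    short x1, y1, x2, y2;
    float zmin, zmax;
};

ScreenRect MakeClipRect( float clipX, float clipY, float clipZ, float clipW, float displayHeight )
{
    // `displayHeight - static_cast<short>( clipW )` has type float, so placing it
    // directly inside the braces would be an ill-formed narrowing conversion.
    // The outer static_cast makes the float-to-short conversion explicit.
    ScreenRect r =
    {
        static_cast<short>( clipX ),
        static_cast<short>( displayHeight - static_cast<short>( clipW ) ),
        static_cast<short>( clipZ ),
        static_cast<short>( displayHeight - static_cast<short>( clipY ) ),
        0.0f,
        1.0f
    };
    return r;
}
```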

labrnth · 2024-01-01T19:23:48Z

Also noticed some strange artifacts on the weapons (not seeing this elsewhere). See attached screenshot.

I have observed a massive increase in FPS though, so that's great.

SRSaunders · 2024-01-01T22:22:09Z

@labrnth Thanks for trying out this PR. The compile issue you found was fixed in master with commit ab663a7. Perhaps you synced before this was merged recently.

Regarding the weapons artifacts, are you running on an Intel or Apple Silicon Mac? I thought this problem was only visible on Apple Silicon - where I disabled GPU skinning to fix it. If you are running on Intel that is quite interesting and may indicate a more general problem with skinning. In this case you could try setting r_useGPUSkinning 0 in your console, or adding set r_useGPUSkinning 0 to your autoexec.cfg file. Please let me know your results.

Good news on the performance improvement. That was my main goal for this effort. I trust you applied RobertBeckebans/nvrhi#6 - this is needed for max performance.

labrnth · 2024-01-02T03:55:53Z

@SRSaunders Ya that could be.

Yes I'm running on Apple Silicon (M2 Pro). The r_useGPUSkinning "0" fixed it but only while in campaign not the multiplayer.

SRSaunders · 2024-01-02T17:58:11Z

Thanks for catching the r_useGPUSkinning reset in multiplayer. Solved in 3b6598b (this cvar is not a cheat in any case).

SRSaunders · 2024-01-21T18:50:56Z

@RobertBeckebans I would like to know what you are planning with this PR. I have a few other improvements for Vulkan that I would like to submit (e.g. use Vulkan dynamic functions vs. static linkage for VMA setup, Optick, and MoltenVK config), but I don't want to keep adding on here, especially if you are not planning to merge in the near term.

I could simplify things by dividing this into two PRs that could be treated independently:

Separate the Uniforms Subsetting, Push Constants and Uniforms Change Detection improvements which is focussed solely on performance for both Vulkan and DX12. And yes, DX12 benefits from uniforms subsetting with the new change detection logic even without push constants enabled.
Separate the HUD GPU Memory usage & CPU/GPU Usage % features (DX12 and Vulkan), generic Vulkan changes (incl. Intel iGPU fixes, dynamic functions, etc), and macOS/MoltenVK-only stuff.

I would then close this large PR and you could look at two new smaller PRs independently.

Would this help? If so would you look at either of the new PRs for your upcoming release?

RobertBeckebans · 2024-01-22T23:58:08Z

I can see why it is necessary to split the constant buffer renderparms into smaller push constants but I don't really like the complexity it comes with. I don't see this PR merged with RBDoom 1.6 but rather in 1.7. I also would like to merge this after NVRHI has been updated to the newest version.

Separate the HUD GPU Memory usage & CPU/GPU Usage % features (DX12 and Vulkan), generic Vulkan changes (incl. Intel iGPU fixes, dynamic functions, etc), and macOS/MoltenVK-only stuff.

I also would like to see this separated out into a different branch and then it needs to be merged with the newest NVRHI later.

SRSaunders · 2024-01-24T04:29:45Z

Thanks @RobertBeckebans for your comments and advice. I will proceed with splitting the PR into two parts and resubmit.

Regarding your comments:

For the push constants work, I am fine with you waiting until 1.7 and merging after updating nvrhi. However, please note the nvrhi dependency is very small, and only requires the nvrhi push constant limits to be updated based on platform (i.e. Relax nvrhi push constant limits to permit platform-specific runtime checks nvrhi#6)
Regarding your comments about complexity, I found the main issue was determining the shader -> subset mapping and then encoding that information into data structures for use at runtime. However, once that work was done, it is relatively static and no changes are required unless shaders are updated in the future. Another factor is that I wrote the code to allow runtime choice of constant buffers vs push constants based on platform capability and cvars - that adds some complexity that I suppose could be reduced if it was hard-coded. And lastly, the renderparm change detection logic is pretty simple, but yields large improvements independent of push constants themselves. I wish I did not have to touch the renderparm naming in the shaders, but as far as I can tell you cannot use globals for push constants and they need to be part of a struct. I also wanted to retain constant buffer compatibility if push constants were disabled at runtime. So I had to do a global rename using the pc. struct prefix. If you know a better solution to this please let me know.
Regarding the other HUD, Vulkan, and macOS stuff, it works with current master and has no nvrhi version dependencies.

RobertBeckebans · 2024-01-24T11:27:50Z

The pc. prefix for the renderparms is not really a big deal. I was referring to figuring out which renderparms can be put into subsets for the used shaders and hitting the limits of constant buffers when extending the shaders in the future. You did a good job at sorting those using the Excel sheet but this still needs to go through some deeper testing and it is also kind of the opposite of what I generally wish for the overall renderer design. I like it stupid simple and it might be the case that changing the renderer from a multipass forward design where each geometry is rerendered for each light to a clustered forward+ renderer design can outperform your changes while having a design that still works with the old renderparms array.
Having all shadows in a big shadow atlas like we have right now was mandatory for forward+ and it is a feature I want to try to implement this year.

SRSaunders · 2024-01-25T23:50:47Z

Yes, the sorting work with Excel was the main effort, along with figuring out how to dynamically enable/disable push constants at runtime based on platform capabilities. This branch has been my daily driver for about 4 months now, plus a number of linux and macOS users have tried the PR with success. While I think it's pretty stable as is, getting more play and stress testing would give us more confidence.

Regarding your plans for clustered forward+ rendering, are these the papers that describe it? forward_plus.pdf 087-096.pdf

If so I am wondering whether my work with push constants can co-exist with these improvements? I presume the main issue would be whether existing shaders would need more parameters (or perhaps new shaders), and then bumping up against push constant platform limits (128 or 256 byte limit on Windows & Linux depending on GPU, a non-issue on macOS). However, if it only involves modifying the lighting shaders (ambient*, interaction*, and interactionSM*) then the solution should be fairly simple: those shaders already use >256 renderparm bytes and have push constants disabled for their associated binding layout types on Windows and Linux. On macOS they can use push constants since the limit is 4096 bytes on that platform. In other words, adding new renderparms to the three subsets for those specific lighting shader groups would not negate the performance gains from my push constants work. Is this a possibility?

If you are going to continue to support Linux and macOS, Vulkan cannot be left to fall behind in performance. My work was a small attempt to push things forward (sorry for the pun).

SRSaunders · 2024-01-25T23:52:56Z

This PR will now be closed unmerged. For existing users please see #854 and #855 for up-to-date replacements.

SRSaunders added 10 commits October 4, 2023 12:06

Divide large constant buffer into subsets and implement push constant…

5320c57

…s for Vulkan and DX12

Remove need for barrier command list on Vulkan, simplifies code and e…

9956923

…nables macOS previous command statistics

Add CPU / GPU usage % to HUD overlay and display MoltenVK's Metal enc…

5475976

…oding time when available

Refactor renderParmSets to fit within D3D12 root constant limit and r…

54f5ffb

…ename helper arrays for clarity

Fix ssao_compute.hlsl for compatibility with multiple HLSL versions (…

e96321f

…2018 and 2021)

Clamp max push constant size to nvrhi::c_MaxPushConstantSize

509b196

Define r_useDX12PushConstants cvar (default off) to control DX12 push…

5a76607

… constants

CMakeLists: Add wildcards to remove tmp files from ZERO_CHECK regener…

0f9f4f6

…ation (Xcode)

Merge latest changes from upstream master

ba2df1b

macOS: Disable GPU skinning on Apple Silicon to eliminate rendering a…

86dc341

…rtifacts

SRSaunders mentioned this pull request Oct 13, 2023

Vulkan support in MacOS? #763

Closed

SRSaunders added 2 commits October 15, 2023 16:39

macOS: Update cmake*.sh build scripts for openal-soft path portabilit…

6707d42

…y - thanks asemarafa

Work around missing Vulkan shaderStorageImageReadWithoutFormat on Int…

83b97d0

…el GPUs

SRSaunders mentioned this pull request Oct 18, 2023

Linux - Failed to create a Vulkan physical device, error code = VK_ERROR_FEATURE_NOT_PRESENT #804

Closed

SRSaunders added 2 commits October 19, 2023 12:21

Vulkan: Detect and enable fragment shading rate features at individua…

638ae85

…l per-feature granularity

Fix uniforms change detection and add r_useVulkanPushConstants cvar (…

2063c72

…default on)

SRSaunders added 4 commits October 23, 2023 21:04

Simplify change detection logic that controls writing of constant buf…

c1f712a

…fers / push constants

Add comments, remove redundant call to Vulkan getProperties, enable K…

18769ec

…hronos sync2 layer based on macOS SDK version

Memory Optimization: allocate constant buffers only when push constan…

ea6d698

…ts not enabled

Statistics HUD: smooth CPU/GPU usage, add GPU Memory for mode 3; CMak…

8a0c493

…eLists: make VMA header visible in IDE

Fix ImmediateMode so debug tools work properly with push constants

ee3b6f9

SRSaunders and others added 2 commits December 26, 2023 12:16

Merge branch 'master' into push-constants

b3f627d

macOS: Support VK_EXT_layer_settings for MoltenVK >= 1.2.7 / Vulkan S…

f3c65ee

…DK >= 1.3.272.0

Don't reset or lock r_useGPUSkinning cvar in multiplayer mode (not a …

3b6598b

…cheat)

Merge branch 'RobertBeckebans:master' into push-constants

96ddf7b

This was referenced Jan 25, 2024

Vulkan & Optick Improvements and GPU memory + CPU/GPU usage % features #854

Merged

Divide large constant buffer into subsets and implement push constants for performance #855

Open

SRSaunders closed this Jan 25, 2024

SRSaunders deleted the push-constants branch February 12, 2024 05:29
