-
Notifications
You must be signed in to change notification settings - Fork 261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Divide large constant buffer into subsets and implement push constants for Vulkan performance #818
Conversation
…s for Vulkan and DX12
…nables macOS previous command statistics
…oding time when available
…ename helper arrays for clarity
The results are awesome but I need quite a lot of testing and time to review this code. I aim for a new public release in December before Christmas. |
Thanks @RobertBeckebans - take your time on review. It took me a while to figure this out anyways, plus a bunch of work with the MoltenVK project to diagnose and optimize performance issues. It was my summer/fall background project :) If you have questions please don't hesitate to post a message or send an email. I will do my best to answer. |
…l per-feature granularity
Hi @RobertBeckebans. I have pushed my final design update to this branch. Commits 2063c72 and c1f712a add one small but critical update for performance - fixing the uniforms change detection logic which reduces how often volatile constant buffers are written for the larger renderparm sets. This has a huge effect which is orthogonal to push constants, and has a positive impact across all platforms: Windows, Linux, & macOS. See updated timing info above. I am not sure why my AMD 6600XT GPU seems to be limited when running Windows/Vulkan, with other OSs surpassing it with the same card. Perhaps something about AMD’s windows drivers or AMD’s 128 byte limit for Vulkan push constants on Windows. |
…fers / push constants
…hronos sync2 layer based on macOS SDK version
…eLists: make VMA header visible in IDE
I added a few minor things due to dependencies on previous changes within this PR:
|
Oops, just realized that ImmediateMode was broken for using debug tools with push constants enabled. Fixed in ee3b6f9 |
macOS only: I just added support for the new @RobertBeckebans The last several commits have been added mostly due to dependencies on earlier changes within this PR. Unless something comes up in your testing, I don't plan any more changes here. |
@SRSaunders I tried building the XCode release using this branch but ran into a couple of errors. The following diff fixed the XCode build:
|
@labrnth Thanks for trying out this PR. The compile issue you found was fixed in master with commit ab663a7. Perhaps you synced before this was merged recently. Regarding the weapons artifacts, are you running on an Intel or Apple Silicon Mac? I thought this problem was only visible on Apple Silicon - where I disabled GPU skinning to fix it. If you are running on Intel that is quite interesting and may indicate a more general problem with skinning. In this case you could try setting Good news on the performance improvement. That was my main goal for this effort. I trust you applied RobertBeckebans/nvrhi#6 - this is needed for max performance. |
@SRSaunders Ya that could be. Yes I'm running on Apple Silicon (M2 Pro). The r_useGPUSkinning "0" fixed it but only while in campaign not the multiplayer. |
Thanks for catching the |
@RobertBeckebans I would like to know what you are planning with this PR. I have a few other improvements for Vulkan that I would like to submit (e.g. use Vulkan dynamic functions vs. static linkage for VMA setup, Optick, and MoltenVK config), but I don't want to keep adding on here, especially if you are not planning to merge in the near term. I could simplify things by dividing this into two PRs that could be treated independently:
I would then close this large PR and you could look at two new smaller PRs independently. Would this help? If so would you look at either of the new PRs for your upcoming release? |
I can see why it is necessary to split the constant buffer renderparms into smaller push constants but I don't really like the complexity it comes with. I don't see this PR merged with RBDoom 1.6 but rather in 1.7. I also would like to merge this after NVRHI has been updated to the newest version.
I also would like to see this separated out into a different branch and then it needs to be merged with the newest NVRHI later. |
Thanks @RobertBeckebans for your comments and advice. I will proceed with splitting the PR into two parts and resubmit. Regarding your comments:
|
The pc. suffix for the renderparms is not really a big deal. I was refering to figuring out which renderparms can be put into subsets for the used shaders and hitting the limits of constant buffers when extending the shaders in the future. You did a good job at sorting those using the excel sheet but this still needs to go through some deeper testing and it is also kind of the opposite I generally wish for the overall renderer design. I like it stupid simple and it might be the case that changing the renderer from a multipass forward design where each geometry is rerendered for each light to a clustered forward+ renderer design can outperform your changes while having a design that still works with the old renderparms array. |
Yes, the sorting work with Excel was the main effort, along with figuring out how to dynamically enable/disable push constants at runtime based on platform capabilities. This branch has been my daily driver for about 4 months now, plus a number of linux and macOS users have tried the PR with success. While I think it's pretty stable as is, getting more play and stress testing would give us more confidence. Regarding your plans for clustered forward+ rendering, are these the papers that descibe it? forward_plus.pdf 087-096.pdf If so I am wondering whether my work with push constants can co-exist with these improvements? I presume the main issue would be whether existing shaders would need more parameters (or perhaps new shaders), and then bumping up against push constant platform limits (128 or 256 byte limit on Windows & Linux depending on GPU, a non-issue on macOS). However, if it only involves modifying the lighting shaders (ambient*, interaction*, and interactionSM*) then the solution should be fairly simple: those shaders already use >256 renderparm bytes and have push constants disabled for their associated binding layout types on Windows and Linux. On macOS they can use push constants since the limit is 4096 bytes on that platform. In other words, adding new renderparms to the three subsets for those specific lighting shader groups would not negate the performance gains from my push constants work. Is this a possiblity? If you are going to continue to support Linux and macOS, Vulkan cannot be left to fall behind in performance. My work was a small attempt to push things forward (sorry for the pun). |
This PR depends on RobertBeckebans/nvrhi#6
This PR also solves #763 for Apple Silicon performance and rendering artifact elimination, as well as #804 for Intel Integrated GPU support with Vulkan (slow, but works).
This is a fairly large pull request that implements the following:
BINDING_LAYOUT_GBUFFER, BINDING_LAYOUT_GBUFFER_SKINNED, BINDING_LAYOUT_TEXTURE, BINDING_LAYOUT_TEXTURE_SKINNED, BINDING_LAYOUT_WOBBLESKY, BINDING_LAYOUT_SSGI, BINDING_LAYOUT_SSGI_SKINNED, BINDING_LAYOUT_POST_PROCESS
)r_useDX12PushConstants
cvar which is turned off by default. This can optionally be turned on using autoexec.cfg for experimentation.select
does not seem to be supported when compiling for DXIL).cmake-macos-*.sh
andcmake-xcode-*.sh
build scripts for openal-soft path portability across x86 and Apple Silicon. Thanks to @asemarafa for the code.shaderStorageImageReadWithoutFormat
device feature on Intel GPUs, and individually activatesVK_KHR_fragment_shading_rate
sub-features vs. all or none (this is supported by nvrhi).r_useVulkanPushConstants
(default on) which is useful for performance comparisons.Tested on Windows 10 (AMD and Nvidia), Linux Manjaro, and macOS Ventura 13.5
Performance timings for this PR vs. current master, generated using a simple home-made timedemo:
Windows Nvidia System (1070 Ti)
DX12: 263 fps before,
255360 fps after (with r_useDX12PushConstants = 0) --> significant improvementVulkan: 218 fps before,
233333 fps after --> significant improvementWindows AMD System (6600 XT)
DX12: 295 fps before,
285305 fps after (with r_useDX12PushConstants = 0) --> neutral/positive improvementVulkan: 150 fps before,
155160 fps after --> neutral/positive improvementLinux AMD System (6600 XT)
Vulkan: 150 fps before,
210270 fps after --> large improvementmacOS AMD System (6600 XT)
Vulkan: 77 fps before,
177245 fps after --> very large improvementmacOS Apple Silicon System (M1 Air) (new)
Vulkan: 6 fps before, 85 fps after --> massive improvement
Some possible explanations are as follows:
r_useDX12PushConstants = 0
by default.but still modestimprovement for Vulkan.less, but still in the positive direction for Vulkansignificant for DX12 & Vulkan on Nvidia, but less but so on AMD, at least for my test systems/GPUs.@RobertBeckebans here is the spreadsheet I used to create the binding layout type to shader/renderparm subset mapping:
Binding to Shader Mapping v4.xlsx