[viostor] Fix performance issue (regression) #1289
1. Added missing juice in PR virtio-win#487 (f8c904c), recommended for SMP with CPU affinity plus increase in MAX_PHYS_SEGMENTS (8x) and MaxXfer size. Other performance enhancements are available, but this commit should resolve issue virtio-win#992. Signed-off-by: benyamin-codez <[email protected]>
1. Mostly backported components from vioscsi or vioscsi-proposed...
2. PR virtio-win#1289 - Added missing juice in PR virtio-win#487 (f8c904c), recommended for SMP with CPU affinity plus increase in MAX_PHYS_SEGMENTS (8x) and MaxXfer size (regression).
3. Added VioStorReadRegistryParameter() and dependency CopyBufferToAnsiString() to set max_segments. Fudged hba_id to "1".
4. Refactored multi-factor max_segments computation for use with NOPB and MaximumTransferLength.
5. Refactored memory allocation calculations and added padding to ensure alignment.
6. Significant instrumentation of the above.
7. Added instrumentation to RhelSetGuestFeatures().
8. Refactored VirtIoHwInitialize() to produce improved and clean instrumentation.
9. Added conditional compilation for STOR_PERF_DPC_REDIRECTION_CURRENT_CPU (dependent on setting of NTDDI_VERSION in project file).
10. Added notes re STOR_PERF_NO_SGL being of no effect.
11. Minor refactoring of definitions in virtio_stor.h plus added some new VQ and registry definitions.
12. Changed VIRTIO_MAX_SG to (MAX_PHYS_SEGMENTS + 1).
13. Changed type (USHORT) of num_queues to ULONG in ADAPTER_EXTENSION.
14. Added max_segments to ADAPTER_EXTENSION.
15. Added WPP trace bits for guest features and registry.
16. Refactored RhelGetDiskGeometry() to improve instrumentation.
Signed-off-by: benyamin-codez <[email protected]>
If you can take a moment, please also refer to my WIP... On 4K clusters I had the following results:
... so there are additional optimisations available. I suspect a NOPB off-by-one regression: when max_seg = 254 (default), "breaks" are recorded as 0xfe (254) rather than 0xff (255). This will impact the calculation of MaxXferLength too. Memory allocation alignment could be a culprit too, but I did not extensively test this... Would you like some more PRs...? |
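For a sense of scale, the arithmetic behind the suspected off-by-one can be sketched as follows. This is only an illustration, assuming 4 KiB pages and that the maximum transfer length scales linearly with the segment count (one page per segment); the helper names `xfer_len_bytes` and `xfer_len_kib` and the `EXAMPLE_PAGE_SIZE` constant are hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, one page per scatter-gather segment. */
#define EXAMPLE_PAGE_SIZE 4096u

/* Hypothetical helper: bytes covered by `segments` page-sized segments. */
static uint32_t xfer_len_bytes(uint32_t segments)
{
    return segments * EXAMPLE_PAGE_SIZE;
}

/* Hypothetical helper: the same figure scaled to KiB. */
static uint32_t xfer_len_kib(uint32_t segments)
{
    return xfer_len_bytes(segments) / 1024u;
}
```

Under these assumptions, 0xfe (254) segments give 1016 KiB and 0xff (255) segments give 1020 KiB, so a single missing segment costs one page per transfer. Reaching 2 MiB would take 512 segments.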
Top-notch work and debugging as always @benyamin-codez - love to see this work, keep up the good stuff! |
On the WIP branch gains, is there a specific area of work that is more impactful than another? From the commit msg, there were quite a few things stacked together. All looked solid, so I think at a project level, we'd welcome it! |
@JonKohler @vrozenfe @YanVugenfirer
Thanks Jon. Credit also to @york-king for their work in submitting issue #992. Reducing the point-in-time scope to between builds 187 and 189 was most helpful. From there some tag translation followed by a compare reveals the only
We should first consider that if having MaxXferLength = 2MiB (the upper limit) is important, then we need to set Therefore, thinking in order of importance, the splits would probably be:
Preliminary enabling PRs would be required for definitions and helpers, e.g. RegistryRead, etc. Design questions would be:
Any thoughts? |
Waiting for CI to finish before merging
@YanVugenfirer Please wait for Win11 CI too |
We ran into these shenanigans as well in our product, as there is no virtio way to "advertise" what the backend supports, so the frontend can "just do it". There is a proposed spec change floating out there that needs a champion. I was going to pick it up, but then my mother passed away and that just wrecked me for a very long time, and I haven't been able to get back to it. Hint hint nudge nudge, if you want to pick it up, I'd back you 100%! - oasis-tcs/virtio-spec#122 What is interesting is that this seems to be a Windows-specific nuance, as Linux doesn't have this issue :(
Fabulouso - sounds good to me boss. I know the NOPB issue I fixed a while back made a non-trivial speedup for large block IO, so it'd be good to "fix that" at the project level in all areas it is busted (perhaps this is the last one). On memory alignment, is that just a viostor issue? or would that apply to vioscsi as well? How do you see that, just manual tracing? or is there some other thing? Either way, if we've got pages unaligned, I suspect that's no bueno, so hopefully that's a simple targeted fix
Shamelessly, for me, a single parameter is fine (i.e. single HBA/per host sort of functionality) because we run single HBA configurations for simplicity, though I suspect this would be a nice fit-n-finish to be able to target things more purposefully. That said, like my comment below on per OS compile time stuff, I'm always in favor of anything that can "just work" and not make the user A) think too hard or B) have too many knobs on their plate and have to be an internals-focused sort of person to get it right. TLDR, it's a toss-up from me, I'd put that dead last on the list of things here.
It's nice to have a bit to be able to turn off, but I suspect if we make an "opt in" bit, it would either get A) misused, B) misunderstood and/or C) never used. It's always nice when a driver is smart enough to "turn on the right stuff" based on the OS/running environment and/or things that it sees from the "host", so if we can safely turn it on at compile time, absolutely fine by me. Also, while my mind is on the thought - do we do this for vioscsi already? IIRC no, but perhaps my brain is cooked at the end of the week.
Not that I'm aware of, but perhaps Vadim et al have some more thoughts on that. |
Thanks for your feedback, Jon. That was quite helpful. Please also accept my condolences regarding your mother. As mentioned in the OP of the spec PR you referenced:
We could try going to the maximum, which would be: I note I have only tested to 512 segments ( iirc, each virtqueue presents as an 8KiB buffer, 4KiB in each split ring. The maximum queue size presently permitted by SeaBIOS is 256 (= So, I had to quadruple check my results, but on closer scrutiny there is quite an odd observation to be made.
[*] +/- STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_THRESHOLD, 4KiB clusters
You will see that in Win10 there is very little difference between this PR and the WIP, but in Win11 this PR barely makes a difference but the WIP does, so that will inform the next cut. Also, the gains for
No I back-ported it from my The adaptExt->pageAllocationVa = (PVOID)(((ULONG_PTR)(uncachedExtensionVa) + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1)); The alternative would be to use I only noticed this was an issue because the trace was skewing the numbers when doing integer division (scaling to KiB), so accurate tracing is another benefit of these mechanics. Thanks for your insights re registry parameters. This was another Regarding the spec work, no guarantees, but I'll put it on my list. Best regards, |
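The round-up expression quoted above can be exercised in isolation. The following is a minimal user-mode sketch, assuming 4 KiB pages; `EXAMPLE_PAGE_SIZE`, `align_up_page` and `align_padding` are hypothetical names used for illustration only and do not appear in the driver.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the driver itself uses the platform PAGE_SIZE. */
#define EXAMPLE_PAGE_SIZE ((uintptr_t)4096)

/* Round addr up to the next page boundary; already-aligned input is unchanged. */
static uintptr_t align_up_page(uintptr_t addr)
{
    return (addr + (EXAMPLE_PAGE_SIZE - 1)) & ~(EXAMPLE_PAGE_SIZE - 1);
}

/* Bytes skipped by the round-up: between 0 and EXAMPLE_PAGE_SIZE - 1. */
static uintptr_t align_padding(uintptr_t addr)
{
    return align_up_page(addr) - addr;
}
```

Because the round-up can skip up to one page minus one byte, an allocation aligned this way presumably needs that much extra padding so the aligned region still fits, which is consistent with the "added padding to ensure alignment" item in the commit message.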
Ok, so I'm going to raise a few PRs so that Win11 can benefit from this one too. My guess is that this anomaly is the reason why the regression made it through, as the performance degradation may not have been evident on Win11. @vrozenfe, perhaps QE would consider performance checking for both targets in future...? I still have some |
Just wanting to check if you were interested in this... tbh, I'm not convinced that the Let me know if you want me to run it up. Best regards, |
As I thought, 1024 for Enjoy...! Ben |
Based on the following results, correct me if I'm wrong, but I think this one is ready to merge...
There is no need to increase the maximum transfer size this way. The memory allocation mechanism for SG elements and descriptors will be refactored in order to support IOMMU and DMA v3 in any case. Best regards, |
I presume you mean this comment:
...and not this PR..? I have moved the rest of this conversation to PR #1316. Perhaps in the meantime (while you work on IOMMU and DMA v3), those PRs might work well...? |