[viostor] Fix performance issue (regression) #1289

Open · wants to merge 1 commit into master

Conversation

benyamin-codez
Contributor

  1. Added missing juice in PR #487 (f8c904c), recommended for SMP with CPU affinity plus increase in MAX_PHYS_SEGMENTS (8x) and MaxXfer size.

Other performance enhancements are available, but this commit should resolve issue #992.

Signed-off-by: benyamin-codez <[email protected]>
benyamin-codez added a commit to benyamin-codez/kvm-guest-drivers-windows that referenced this pull request Feb 12, 2025
1.  Mostly backported components from vioscsi or vioscsi-proposed...
2.  PR virtio-win#1289 - Added missing juice in PR virtio-win#487 (f8c904c), recommended for SMP with CPU affinity plus increase in MAX_PHYS_SEGMENTS (8x) and MaxXfer size (regression).
3.  Added VioStorReadRegistryParameter() and dependency CopyBufferToAnsiString() to set max_segments. Fudged hba_id to "1".
4.  Refactored multi-factor max_segments computation for use with NOPB and MaximumTransferLength.
5.  Refactored memory allocation calculations and added padding to ensure alignment.
6.  Significant instrumentation of the above.
7.  Added instrumentation to RhelSetGuestFeatures().
8.  Refactored VirtIoHwInitialize() to produce improved and clean instrumentation.
9.  Added conditional compilation for STOR_PERF_DPC_REDIRECTION_CURRENT_CPU (dependent on setting of NTDDI_VERSION in project file).
10. Added notes re STOR_PERF_NO_SGL being of no effect.
11. Minor refactoring of definitions in virtio_stor.h plus added some new VQ and registry definitions.
12. Changed VIRTIO_MAX_SG to (MAX_PHYS_SEGMENTS + 1).
13. Changed type (USHORT) of num_queues to ULONG in ADAPTER_EXTENSION.
14. Added max_segments to ADAPTER_EXTENSION.
15. Added WPP trace bits for guest features and registry.
16. Refactored RhelGetDiskGeometry() to improve instrumentation.

Signed-off-by: benyamin-codez <[email protected]>
@benyamin-codez
Contributor Author

@vrozenfe @YanVugenfirer

If you can take a moment, please also refer to my WIP...

On 4K clusters I had the following results:

| Target | SEQ1M (Q8T1) | RND4K (Q32T16) | RND4K (IOPS) | RND4K (μs) |
| --- | --- | --- | --- | --- |
| Baseline (v266) | 7174.51 | 959.80 | 234325.44 | 2180.13 |
| This PR | 7187.66 | 1257.28 | 306952.15 | 1649.53 |
| WIP | 7157.31 | 1465.76 | 357850.59 | 1427.34 |

... so there are additional optimisations available.

I suspect a NOPB off-by-one regression: when max_seg = 254 (default), "breaks" are recorded as 0xfe (254) rather than 0xff (255). This will impact the MaxXferLength calculation too.
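
A minimal sketch of the relationship implied here, assuming 4 KiB pages and this PR's max_segments naming (illustrative, not the actual patch):

```c
/* Intended relationship: with max_segments = 254 (default), the reported
 * breaks should be 0xFF (255), not 0xFE (254). An off-by-one here also
 * shrinks the derived maximum transfer length by one page. */
ConfigInfo->NumberOfPhysicalBreaks = adaptExt->max_segments + 1;          /* 254 -> 0xFF */
ConfigInfo->MaximumTransferLength  = adaptExt->max_segments * PAGE_SIZE;  /* 254 * 4 KiB = 1016 KiB */
```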

Memory allocation alignment could be a culprit too, but I did not extensively test this...

Would you like some more PRs...?
Any features in particular, e.g. is the setting of max_segments via the registry of any benefit here?

@JonKohler
Contributor

Top-notch work and debugging as always @benyamin-codez - love to see this work, keep up the good stuff!

@JonKohler
Contributor

On the WIP branch gains, is there a specific area of work that is more impactful than another? From the commit msg, there were quite a few things stacked together. All looked solid, so I think at a project level, we'd welcome it!

@benyamin-codez
Contributor Author

@JonKohler @vrozenfe @YanVugenfirer

> Top-notch work and debugging as always @benyamin-codez - love to see this work, keep up the good stuff!

Thanks Jon. Credit also to @york-king for their work in submitting issue #992. Reducing the point-in-time scope to between builds 187 and 189 was most helpful. From there, some tag translation followed by a compare revealed the only viostor commit (f8c904c) worth peeking into. 8^d

> On the WIP branch gains, is there a specific area of work that is more impactful than another? From the commit msg, there were quite a few things stacked together. All looked solid, so I think at a project level, we'd welcome it!

We should first consider that if having MaxXferLength = 2MiB (the upper limit) is important, then we need to set max_segments to 512 (MAX_PHYS_SEGMENTS). However, the virtio limit is (presently) 254, so we need a mechanism to exceed that, e.g. via a registry value or some other guest feature.
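
To make those limits concrete, a worked example assuming 4 KiB pages (illustrative numbers only):

```c
/* Worked limits, assuming 4 KiB pages: */
#define PAGE_SIZE 4096u
unsigned int maxXfer_512seg = 512u * PAGE_SIZE;  /* 2,097,152 B = 2 MiB upper limit   */
unsigned int maxXfer_254seg = 254u * PAGE_SIZE;  /* 1,040,384 B = 1016 KiB virtio cap */
```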

Therefore, thinking in order of importance, the splits would probably be:

  1. NOPB off-by-one regression
  2. Memory alignment (otherwise StorPort won't align pages in VA)
  3. Instrumentation upgrades by function.

Preliminary enabling PRs would be required for definitions and helpers, e.g. RegistryRead, etc.

Design questions would be:

  1. If the RegistryRead for parameters is in scope, should the per-HBA functionality be implemented as-is? It is functionally superior but makes for a very noisy log (albeit brief) - even for the likes of me. 8^d It would be best to fix the StorPort native \Parameters\Device(d) option, but I haven't been able to get this working. The limitation here is the HwStorPortProhibitedDDIs rule-set, which rules out direct registry manipulation, otherwise we could mimic the required behaviour. (A hedged sketch of such a registry read appears after this list.)
  2. Should we use a registry parameter for enabling STOR_PERF_DPC_REDIRECTION_CURRENT_CPU instead of relying on the NTDDI_VERSION at compile-time? We could then add an entry to viostor.inf in order to enable it by default for certain configurations. I note this was a marginal optimisation, which I only observed on Windows 11 24H2.
  3. VIRTIO_MAX_SG size. My limited testing didn't need more than (MAX_PHYS_SEGMENTS + 1), but is there a specific scenario which still requires (3+MAX_PHYS_SEGMENTS)...?
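
On design question 1, a hedged sketch of the kind of read involved; StorPortRegistryRead is the documented DDI, but the value name, scope flag, and fallback below are assumptions for illustration, not the WIP's implementation:

```c
ULONG maxSegments = 254;  /* virtio default discussed above */
UCHAR buffer[sizeof(ULONG)] = {0};
ULONG bufferLength = sizeof(buffer);

/* Must be called at PASSIVE_LEVEL, e.g. during HwFindAdapter. */
if (StorPortRegistryRead(DeviceExtension,
                         (PUCHAR)"MaxPhysicalSegments", /* hypothetical value name */
                         1,                             /* scope flag per the StorPortRegistryRead docs */
                         MINIPORT_REG_DWORD,
                         buffer,
                         &bufferLength) &&
    bufferLength == sizeof(ULONG))
{
    maxSegments = *(PULONG)buffer;
}
```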

Any thoughts?

@YanVugenfirer
Collaborator

Waiting for CI to finish before merging

@kostyanf14
Member

@YanVugenfirer Please wait for Win11 CI too

@JonKohler
Contributor

> We should first consider that if having MaxXferLength = 2MiB (the upper limit) is important, then we need to set max_segments to 512 (MAX_PHYS_SEGMENTS). However, the virtio limit is (presently) 254, so we need a mechanism to exceed that, e.g. via a registry value or some other guest feature.

We ran into these shenanigans as well in our product, as there is no virtio way to "advertise" what the backend supports so the frontend can "just do it". There is a proposed spec change floating out there that needs a champion. I was going to pick it up, but then my mother passed away and that just wrecked me for a very long time, and I haven't been able to get back to it.

Hint hint nudge nudge, if you want to pick it up, I'd back you 100% ! - oasis-tcs/virtio-spec#122

What is interesting is that this seems to be a Windows-specific nuance, as Linux doesn't have this issue :(

> Therefore, thinking in order of importance, the splits would probably be:
>
>   1. NOPB off-by-one regression
>   2. Memory alignment (otherwise StorPort won't align pages in VA)
>   3. Instrumentation upgrades by function.

Fabulouso - sounds good to me boss. I know the NOPB issue I fixed a while back made a non-trivial speedup for large block IO, so it'd be good to "fix that" at the project level in all areas it is busted (perhaps this is the last one).

On memory alignment, is that just a viostor issue, or would that apply to vioscsi as well? How do you see that: just manual tracing, or is there some other thing?

Either way, if we've got pages unaligned, I suspect that's no bueno, so hopefully that's a simple targeted fix

> Design questions would be:
>
>   1. If the RegistryRead for parameters is in scope, should the per-HBA functionality be implemented as-is? It is functionally superior but makes for a very noisy log (albeit brief) - even for the likes of me. 8^d It would be best to fix the StorPort native \Parameters\Device(d) option, but I haven't been able to get this working. The limitation here is the HwStorPortProhibitedDDIs rule-set, which rules out direct registry manipulation, otherwise we could mimic the required behaviour.

Shamelessly, for me, a single parameter is fine (i.e. single-HBA/per-host sort of functionality) because we run single HBA configurations for simplicity, though I suspect this would be a nice fit-n-finish to be able to target things more purposefully.

That said, like my comment below on per OS compile time stuff, I'm always in favor of anything that can "just work" and not make the user A) think too hard or B) have too many knobs on their plate and have to be an internals-focused sort of person to get it right.

TL;DR, it's a toss-up from me, I'd put that dead last on the list of things here.

>   1. Should we use a registry parameter for enabling STOR_PERF_DPC_REDIRECTION_CURRENT_CPU instead of relying on the NTDDI_VERSION at compile-time? We could then add an entry to viostor.inf in order to enable it by default for certain configurations. I note this was a marginal optimisation, which I only observed on Windows 11 24H2.

It's nice to have a bit to be able to turn off, but I suspect if we make an "opt-in" bit, it would either get A) misused, B) misunderstood, and/or C) never used.

It's always nice when a driver is smart enough to "turn on the right stuff" based on the OS/running environment and/or things that it sees from the "host", so if we can safely turn it on at Compile, absolutely fine by me.

Also, while my mind is on the thought - do we do this for vioscsi already? IIRC no, but perhaps my brain is cooked at the end of the week.

>   1. VIRTIO_MAX_SG size. My limited testing didn't need more than (MAX_PHYS_SEGMENTS + 1), but is there a specific scenario which still requires (3+MAX_PHYS_SEGMENTS)...?

Not that I'm aware of, but perhaps Vadim et al have some more thoughts on that.

@benyamin-codez
Contributor Author

@JonKohler

Thanks for your feedback, Jon. That was quite helpful. Please also accept my condolences regarding your mother.

As mentioned in the OP of the spec PR you referenced:

> QEMU tolerates as well if guests drivers exceed the Queue Size. However currently it has a hard coded limit of max. 1024 memory segments [as] #define VIRTQUEUE_MAX_SIZE 1024, [w]hich limits the max. bulk data size per virtio message to slightly below 4MB.

We could try going to the maximum, which would be:
1024 - 2 = 1022 ==> 1022 * PAGE_SIZE = 4,088KiB MaxXferLen, StorPort_NOPB=1023
Would that be of any value to you?

I note I have only tested to 512 segments (2MiB MaxXferLen, StorPort_NOPB=513).

iirc, each virtqueue presents as an 8KiB buffer, 4KiB in each split ring.

The maximum queue size presently permitted by SeaBIOS is 256 (= MAX_CPU, where 1 vq <=> 1 vCPU). This disparity between queue size and max_segments is one reason I am interested in STOR_PERF_DPC_REDIRECTION_CURRENT_CPU.

So, I had to quadruple check my results, but on closer scrutiny there is quite an odd observation to be made.

| Target | Flavour | SEQ1M Q8T1 (MiBps) | RND4K Q32T16 (MiBps) | RND4K Q32T16 (IOPS) | RND4K Q32T16 (μs) |
| --- | --- | --- | --- | --- | --- |
| Win10 | v266 | 7179 | 1071 | 201,459 | 1956 |
| Win10 | This PR | 7189 | 1547 | 377,795 | 1352 |
| Win10 | WIP -* | 7179 | 1543 | 349,766 | 1356 |
| Win10 | WIP +* | 7180 | 1565 | 382,159 | 1338 |
| Win10 | WIP +VB | 7190 | 1568 | 382,748 | 1335 |
| Win11 | v266 | 7160 | 1038 | 253,388 | 2017 |
| Win11 | This PR | 7165 | 1041 | 254,155 | 2011 |
| Win11 | WIP -* | 7148 | 1520 | 371,015 | 1375 |
| Win11 | WIP +* | 7147 | 1513 | 369,380 | 1384 |
| Win11 | WIP +CO | 7157 | 1511 | 368,854 | 1385 |
| Win11 | WIP +GE | 7143 | 1510 | 368,697 | 1387 |

[*] +/- STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_THRESHOLD, 4KiB clusters
[VB] + STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_10_VB, 4KiB clusters
[CO] + STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_10_CO, 4KiB clusters
[GE] + STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_11_GE, 4KiB clusters

You will see that in Win10 there is very little difference between this PR and the WIP, but in Win11 this PR barely makes a difference while the WIP does, so that will inform the next cut.

Also, the gains for STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, which I previously only observed on Windows 11 24H2, appeared to swap for NTDDI_WIN_THRESHOLD, with a marginal improvement shifting from Win11 to Win10. Personally, I think it is worthwhile enabling this performance option so that DPC redirection can be set to the vCPU requesting the DPC rather than the vCPU originating the IO request with StorPort.
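
For reference, a hedged sketch of how this option is typically requested through StorPortInitializePerfOpts; the NTDDI guard is illustrative of the compile-time approach described in the commit notes, not the exact code:

```c
PERF_CONFIGURATION_DATA perfData = {0};
perfData.Version = STOR_PERF_VERSION;
perfData.Size    = sizeof(PERF_CONFIGURATION_DATA);

/* Query the supported options first, then request DPC redirection. */
if (StorPortInitializePerfOpts(DeviceExtension, TRUE, &perfData) == STOR_STATUS_SUCCESS) {
    perfData.Flags = STOR_PERF_DPC_REDIRECTION;
#if (NTDDI_VERSION >= NTDDI_WIN10_VB) /* illustrative guard only */
    perfData.Flags |= STOR_PERF_DPC_REDIRECTION_CURRENT_CPU;
#endif
    StorPortInitializePerfOpts(DeviceExtension, FALSE, &perfData);
}
```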

> On memory alignment, is that just a viostor issue, or would that apply to vioscsi as well? How do you see that: just manual tracing, or is there some other thing?

No, I back-ported it from my vioscsi WIP.

The uncachedExtensionVa and pageAllocationVa will not align unless you add uncachedExtPad2 (this is padding2 in the art). Then StorPort will return a page-aligned VA. Otherwise StorPort adds a whole page to the allocation to ensure it passes the page boundary whilst still providing enough memory. The VA is then trimmed by the bitwise comparator to PAGE_SIZE boundaries, resulting in padding1 being > 0, via:

```c
adaptExt->pageAllocationVa = (PVOID)(((ULONG_PTR)(uncachedExtensionVa) + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1));
```

The alternative would be to use `adaptExt->pageAllocationVa = (PVOID)((ULONG_PTR)(uncachedExtensionVa))` or similar, but the above is more efficient because a properly padded request will return a smaller allocation.

I only noticed this was an issue because the trace was skewing the numbers when doing integer division (scaling to KiB), so accurate tracing is another benefit of these mechanics.
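
For illustration, a minimal sketch of the padding mechanics described above, assuming a StorPortGetUncachedExtension call during HwFindAdapter; neededBytes and the other variable names are placeholders, not the WIP's actual code:

```c
/* Over-request by (PAGE_SIZE - 1) bytes (the uncachedExtPad2-style padding)
 * so the rounded-up pageAllocationVa is guaranteed to fit in the allocation. */
ULONG allocBytes = neededBytes + (PAGE_SIZE - 1);
PVOID uncachedExtensionVa = StorPortGetUncachedExtension(DeviceExtension, ConfigInfo, allocBytes);

if (uncachedExtensionVa != NULL) {
    /* Round up to the next page boundary; with the padding above, the trim
     * (padding1) is always less than PAGE_SIZE and no whole page is wasted. */
    adaptExt->pageAllocationVa =
        (PVOID)(((ULONG_PTR)uncachedExtensionVa + (PAGE_SIZE - 1)) & ~(ULONG_PTR)(PAGE_SIZE - 1));
}
```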

Thanks for your insights re registry parameters. This was another vioscsi WIP back-port. I think I will use it again here to extend max_segments as discussed above. We presently do not do any per-HBA reads; the PR for vioscsi is #1216. This permits disks with different backings to have HBAs with different max_segments values, which is important for certain pass-through devices.

Regarding the spec work, no guarantees, but I'll put it on my list.
I do not anticipate being in a position to pick it up for some time.

Best regards,
Ben

@benyamin-codez
Contributor Author

> You will see that in Win10 there is very little difference between this PR and the WIP, but in Win11 this PR barely makes a difference while the WIP does, so that will inform the next cut.

Ok, so I'm going to raise a few PRs so that Win11 can benefit from this one too.

My guess is that this anomaly is the reason why the regression made it through, as the performance degradation may not have been evident on Win11. @vrozenfe, perhaps QE would consider performance checking for both targets in future...? I still have some diskspd.exe based data collection scripts I could resurrect if we lack tooling. I didn't write the reporting end yet to parse the diskspd.exe output (I haven't quite been annoyed enough yet with trite manual data collation)... 8^d

@benyamin-codez
Contributor Author

Added the following PRs:

#1297
#1298 (waiting on 1297)
#1299 (waiting on 1298)
#1300 (waiting on 1298)
#1301
#1302 (waiting on 1297)

This should fix Win11 too.

@benyamin-codez
Contributor Author

@JonKohler

> We could try going to the maximum, which would be:
> 1024 - 2 = 1022 ==> 1022 * PAGE_SIZE = 4,088KiB MaxXferLen, StorPort_NOPB=1023
> Would that be of any value to you?

Just wanting to check if you were interested in this...

tbh, I'm not convinced that the num_queues - 2 applies to viostor...
...as it doesn't supply CONTROL and EVENT queues or MSI-X vectors for the same for that matter.
It could be: VIRTQUEUE_MAX_SIZE = 1024 ==> 1024 * PAGE_SIZE = 4,096KiB MaxXferLen, StorPort_NOPB=1025
...but that is presently untested...

Let me know if you want me to run it up.

Best regards,
Ben

@benyamin-codez
Contributor Author

@JonKohler

> We could try going to the maximum, which would be:
> 1024 - 2 = 1022 ==> 1022 * PAGE_SIZE = 4,088KiB MaxXferLen, StorPort_NOPB=1023
> Would that be of any value to you?
>
> Just wanting to check if you were interested in this...
>
> tbh, I'm not convinced that the num_queues - 2 applies to viostor... ...as it doesn't supply CONTROL and EVENT queues or MSI-X vectors for the same for that matter. It could be: VIRTQUEUE_MAX_SIZE = 1024 ==> 1024 * PAGE_SIZE = 4,096KiB MaxXferLen, StorPort_NOPB=1025 ...but that is presently untested...

As I thought, 1024 for viostor (for 4MiB xfers) and 1022 for vioscsi (for 4,088KiB xfers).
PRs #1315 and #1316, respectively.

Enjoy...!

Ben

@benyamin-codez
Contributor Author

Based on the following results, correct me if I'm wrong, but I think this one is ready to merge...

| Check | Disk Stress (LOGO) | Flush Test | Disk Verification (LOGO) |
| --- | --- | --- | --- |
| viostor-Win2022x64 | Pass | Fail | Pass |
| viostor-Win11_24H2x64_host_uefi_viommu | Fail | Pass | Fail |
| viostor-Win2025x64_host_uefi | Pass | Pass | Pass |

@vrozenfe
Collaborator

vrozenfe commented Mar 2, 2025

There is no need to increase the maximum transfer size this way. The memory allocation mechanism for SG elements and descriptors will be refactored in order to support IOMMU and DMA v3 in any case.

Best regards,
Vadim.

@benyamin-codez
Contributor Author

benyamin-codez commented Mar 2, 2025

@vrozenfe

> There is no need to increase the maximum transfer size this way. The memory allocation mechanism for SG elements and descriptors will be refactored in order to support IOMMU and DMA v3 in any case.

I presume you mean this comment:

> As I thought, 1024 for viostor (for 4MiB xfers) and 1022 for vioscsi (for 4,088KiB xfers).
> PRs #1315 and #1316, respectively.

...and not this PR..?

I have moved the rest of this conversation to PR #1316.

Perhaps in the meantime (while you work on IOMMU and DMA v3), those PRs might work well...?
At risk of hijacking this PR, I should also ask: is there a reason it shouldn't be done this way?
