pp_reverse - chunk-at-a-time string reversal (part 2/msvc fix) #23374
Open: bulk88 wants to merge 4 commits into Perl:blead from bulk88:reverse_bswap
The performance characteristics of string reversal in blead are highly variable, depending upon the capabilities of the C compiler. Some compilers are able to vectorize some cases for better performance.

This commit introduces explicit reversal and swapping of whole registers at a time, which all builds seem to be able to benefit from.

The `_swab_xx_` macros for doing this already exist in perl.h; using them for this purpose was inspired by https://dev.to/wunk/fast-array-reversal-with-simd-j3p

The bit shifting done by these macros should be portable and reasonably performant even if not optimised further, but it is likely that they will be optimised to bswap, rev, or movbe instructions.

Some performance comparisons:

1. Large string reversal, with different source & destination buffers

    my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }

gcc blead:

          2,388.30 msec task-clock     #    0.993 CPUs utilized
    10,574,195,388      cycles         #    4.427 GHz
    61,520,672,268      instructions   #    5.82  insn per cycle
    10,255,049,869      branches       #    4.294 G/sec

clang blead:

            688.37 msec task-clock     #    0.946 CPUs utilized
     3,161,754,439      cycles         #    4.593 GHz
     8,986,420,860      instructions   #    2.84  insn per cycle
       324,734,391      branches       #  471.745 M/sec

gcc patched:

            408.39 msec task-clock     #    0.936 CPUs utilized
     1,617,273,653      cycles         #    3.960 GHz
     6,422,991,675      instructions   #    3.97  insn per cycle
       644,856,283      branches       #    1.579 G/sec

clang patched:

            397.61 msec task-clock     #    0.924 CPUs utilized
     1,655,838,316      cycles         #    4.165 GHz
     5,782,487,237      instructions   #    3.49  insn per cycle
       324,586,437      branches       #  816.350 M/sec

2. Large string reversal, but reversing the buffer in-place

    my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse "foo",$x }

gcc blead:

          6,038.06 msec task-clock     #    0.996 CPUs utilized
    27,109,273,840      cycles         #    4.490 GHz
    41,987,097,139      instructions   #    1.55  insn per cycle
     5,211,350,347      branches       #  863.083 M/sec

clang blead:

          5,815.86 msec task-clock     #    0.995 CPUs utilized
    26,962,768,616      cycles         #    4.636 GHz
    47,111,208,664      instructions   #    1.75  insn per cycle
     5,211,117,921      branches       #  896.018 M/sec

gcc patched:

          1,003.49 msec task-clock     #    0.999 CPUs utilized
     4,298,242,624      cycles         #    4.283 GHz
     7,387,822,303      instructions   #    1.72  insn per cycle
       725,892,855      branches       #  723.367 M/sec

clang patched:

            970.78 msec task-clock     #    0.973 CPUs utilized
     4,436,489,695      cycles         #    4.570 GHz
     8,028,374,567      instructions   #    1.81  insn per cycle
       725,867,979      branches       #  747.713 M/sec

3. Short string reversal, different source & destination (checking performance on smaller string reversals; note: this one is very variable due to noise)

    my $x = "1234567"; my $y; for (0..10_000_000) { $y = reverse $x }

gcc blead:

            401.20 msec task-clock     #    0.916 CPUs utilized
     1,672,263,966      cycles         #    4.168 GHz
     5,564,078,603      instructions   #    3.33  insn per cycle
     1,250,983,219      branches       #    3.118 G/sec

clang blead:

            380.58 msec task-clock     #    0.998 CPUs utilized
     1,615,634,265      cycles         #    4.245 GHz
     5,583,854,366      instructions   #    3.46  insn per cycle
     1,300,935,443      branches       #    3.418 G/sec

gcc patched:

            381.62 msec task-clock     #    0.999 CPUs utilized
     1,566,807,988      cycles         #    4.106 GHz
     5,474,069,670      instructions   #    3.49  insn per cycle
     1,240,983,221      branches       #    3.252 G/sec

clang patched:

            346.21 msec task-clock     #    0.999 CPUs utilized
     1,600,780,787      cycles         #    4.624 GHz
     5,493,773,623      instructions   #    3.43  insn per cycle
     1,270,915,076      branches       #    3.671 G/sec
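For readers unfamiliar with the technique, the chunk-at-a-time idea described above can be sketched roughly as follows. This is not the code from the patch: `swap64`, `reverse_copy`, and `reverse_inplace` are hypothetical names, and `swap64` is a hand-written shift-and-mask stand-in for the perl.h byte-swap macros rather than the macros themselves. The sketch assumes 8-byte chunks and uses `memcpy` for unaligned loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 64-bit byte-swap macro: pure shifts and masks,
 * which compilers may (but need not) turn into a single bswap/rev. */
static uint64_t swap64(uint64_t v)
{
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

/* Reverse len bytes of src into dst (buffers must not overlap). */
static void reverse_copy(char *dst, const char *src, size_t len)
{
    const char *end = src + len;
    while (len >= 8) {
        uint64_t w;
        end -= 8;
        memcpy(&w, end, 8);   /* load the last unread 8 bytes */
        w = swap64(w);        /* reverse the bytes within the chunk */
        memcpy(dst, &w, 8);   /* store at the front of the destination */
        dst += 8;
        len -= 8;
    }
    while (len--)             /* fewer than 8 bytes left: byte at a time */
        *dst++ = *--end;
}

/* Reverse len bytes of s in place. */
static void reverse_inplace(char *s, size_t len)
{
    char *lo = s, *hi = s + len;
    while ((size_t)(hi - lo) >= 16) {
        uint64_t a, b;
        hi -= 8;
        memcpy(&a, lo, 8);    /* chunk from the front */
        memcpy(&b, hi, 8);    /* chunk from the back */
        a = swap64(a);
        b = swap64(b);
        memcpy(lo, &b, 8);    /* exchange the two reversed chunks */
        memcpy(hi, &a, 8);
        lo += 8;
    }
    while (hi - lo > 1) {     /* remaining middle bytes: plain swap */
        char t = *lo;
        *lo++ = *--hi;
        *hi = t;
    }
}
```

Loading a chunk as an integer, byte-swapping it, and storing it back reverses the bytes in memory regardless of host endianness, which is why the same loop works on both little- and big-endian builds.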
Part of PR #23371; see that PR for details.