pp_reverse - chunk-at-a-time string reversal (part 2/msvc fix) #23374

Open
wants to merge 4 commits into
base: blead

bulk88
Contributor

@bulk88 bulk88 commented Jun 17, 2025

Part of PR #23371; see that PR for details.

The performance characteristics of string reversal in blead are highly
variable, depending upon the capabilities of the C compiler. Some
compilers are able to vectorize some cases for better performance.

This commit introduces explicit reversal and swapping of whole
registers at a time, which all builds seem to be able to
benefit from.

The `_swab_xx_` macros for doing this already exist in perl.h. Using
them for this purpose was inspired by
https://dev.to/wunk/fast-array-reversal-with-simd-j3p
The bit shifting done by these macros should be portable and reasonably
performant even if not optimised further, but it is likely that compilers
will optimise them to bswap, rev, or movbe instructions.
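
As a rough illustration of the chunk-at-a-time idea described above, the
following is a minimal sketch and not the code from this PR. The names
`swap_bytes_64` and `reverse_copy` are hypothetical; `swap_bytes_64` is a
shift-and-mask stand-in for the perl.h macros, and the sketch assumes the
source and destination buffers do not overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the perl.h swab macros: reverse the byte
 * order of a 64-bit value using only shifts and masks. Compilers will
 * often recognise this pattern and emit a single byte-swap instruction. */
static uint64_t swap_bytes_64(uint64_t v)
{
    v = ((v & UINT64_C(0x00FF00FF00FF00FF)) << 8)
      | ((v >> 8) & UINT64_C(0x00FF00FF00FF00FF));
    v = ((v & UINT64_C(0x0000FFFF0000FFFF)) << 16)
      | ((v >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    return (v << 32) | (v >> 32);
}

/* Copy len bytes from src to dst in reverse order.
 * Assumption: the two buffers do not overlap. */
static void reverse_copy(char *dst, const char *src, size_t len)
{
    const char *s = src + len;  /* one past the last unread source byte */
    char *d = dst;

    /* Main loop: take 8 bytes from the end of the source, reverse them
     * in a register, and append them to the destination. memcpy keeps
     * the loads and stores free of alignment and aliasing problems. */
    while ((size_t)(s - src) >= 8) {
        uint64_t chunk;
        s -= 8;
        memcpy(&chunk, s, 8);
        chunk = swap_bytes_64(chunk);
        memcpy(d, &chunk, 8);
        d += 8;
    }

    /* Tail: fewer than 8 bytes remain, copy them one at a time. */
    while (s > src)
        *d++ = *--s;
}
```

Reversing the bytes of the loaded value reverses the order of the eight
bytes in memory on both little-endian and big-endian machines, so the
sketch does not need an endianness check.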

Some performance comparisons:

1. Large string reversal, with different source & destination buffers
    my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }

gcc blead:
          2,388.30 msec task-clock                       #    0.993 CPUs utilized
    10,574,195,388      cycles                           #    4.427 GHz
    61,520,672,268      instructions                     #    5.82  insn per cycle
    10,255,049,869      branches                         #    4.294 G/sec

clang blead:
            688.37 msec task-clock                       #    0.946 CPUs utilized
     3,161,754,439      cycles                           #    4.593 GHz
     8,986,420,860      instructions                     #    2.84  insn per cycle
       324,734,391      branches                         #  471.745 M/sec

gcc patched:
            408.39 msec task-clock                       #    0.936 CPUs utilized
     1,617,273,653      cycles                           #    3.960 GHz
     6,422,991,675      instructions                     #    3.97  insn per cycle
       644,856,283      branches                         #    1.579 G/sec

clang patched:
            397.61 msec task-clock                       #    0.924 CPUs utilized
     1,655,838,316      cycles                           #    4.165 GHz
     5,782,487,237      instructions                     #    3.49  insn per cycle
       324,586,437      branches                         #  816.350 M/sec

2. Large string reversal, but reversing the buffer in-place
    my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse "foo",$x }

gcc blead:
          6,038.06 msec task-clock                       #    0.996 CPUs utilized
    27,109,273,840      cycles                           #    4.490 GHz
    41,987,097,139      instructions                     #    1.55  insn per cycle
     5,211,350,347      branches                         #  863.083 M/sec

clang blead:
          5,815.86 msec task-clock                       #    0.995 CPUs utilized
    26,962,768,616      cycles                           #    4.636 GHz
    47,111,208,664      instructions                     #    1.75  insn per cycle
     5,211,117,921      branches                         #  896.018 M/sec

gcc patched:
          1,003.49 msec task-clock                       #    0.999 CPUs utilized
     4,298,242,624      cycles                           #    4.283 GHz
     7,387,822,303      instructions                     #    1.72  insn per cycle
       725,892,855      branches                         #  723.367 M/sec

clang patched:
            970.78 msec task-clock                       #    0.973 CPUs utilized
     4,436,489,695      cycles                           #    4.570 GHz
     8,028,374,567      instructions                     #    1.81  insn per cycle
       725,867,979      branches                         #  747.713 M/sec
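
For the in-place case measured above, the same idea can be applied from
both ends of the buffer at once. The following is a minimal sketch under
assumptions, not the code from this PR; `bswap64_portable` and
`reverse_in_place` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable 64-bit byte swap built from shifts and masks. */
static uint64_t bswap64_portable(uint64_t v)
{
    v = ((v & UINT64_C(0x00FF00FF00FF00FF)) << 8)
      | ((v >> 8) & UINT64_C(0x00FF00FF00FF00FF));
    v = ((v & UINT64_C(0x0000FFFF0000FFFF)) << 16)
      | ((v >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    return (v << 32) | (v >> 32);
}

/* Reverse len bytes of buf in place. */
static void reverse_in_place(char *buf, size_t len)
{
    char *lo = buf;        /* first byte not yet placed */
    char *hi = buf + len;  /* one past the last byte not yet placed */

    /* While at least 16 bytes remain, load 8 bytes from each end,
     * reverse both chunks in registers, and store them swapped. */
    while ((size_t)(hi - lo) >= 16) {
        uint64_t front, back;
        hi -= 8;
        memcpy(&front, lo, 8);
        memcpy(&back, hi, 8);
        front = bswap64_portable(front);
        back = bswap64_portable(back);
        memcpy(lo, &back, 8);
        memcpy(hi, &front, 8);
        lo += 8;
    }

    /* Fewer than 16 bytes remain in the middle: swap byte by byte. */
    while (hi - lo >= 2) {
        char tmp = *lo;
        --hi;
        *lo++ = *hi;
        *hi = tmp;
    }
}
```

Both chunks are loaded before either is stored, so the two 8-byte
windows never clobber each other even when they are adjacent.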

3. Short string reversal, different source & destination (checking performance on
smaller string reversals - note: this one's very variable due to noise)
    my $x = "1234567"; my $y; for (0..10_000_000) { $y = reverse $x }

gcc blead:
            401.20 msec task-clock                       #    0.916 CPUs utilized
     1,672,263,966      cycles                           #    4.168 GHz
     5,564,078,603      instructions                     #    3.33  insn per cycle
     1,250,983,219      branches                         #    3.118 G/sec

clang blead:
            380.58 msec task-clock                       #    0.998 CPUs utilized
     1,615,634,265      cycles                           #    4.245 GHz
     5,583,854,366      instructions                     #    3.46  insn per cycle
     1,300,935,443      branches                         #    3.418 G/sec

gcc patched:
            381.62 msec task-clock                       #    0.999 CPUs utilized
     1,566,807,988      cycles                           #    4.106 GHz
     5,474,069,670      instructions                     #    3.49  insn per cycle
     1,240,983,221      branches                         #    3.252 G/sec

clang patched:
            346.21 msec task-clock                       #    0.999 CPUs utilized
     1,600,780,787      cycles                           #    4.624 GHz
     5,493,773,623      instructions                     #    3.43  insn per cycle
     1,270,915,076      branches                         #    3.671 G/sec