pp_reverse - chunk-at-a-time string reversal #23371


Open: richardleach wants to merge 3 commits into blead

Conversation

richardleach (Contributor)

The performance characteristics of string reversal in blead are highly variable, depending upon the capabilities of the C compiler. Some compilers are able to vectorize some cases for better performance.

This commit introduces explicit reversal and swapping of whole registers at a time, which all builds seem to be able to benefit from.

The `_swab_xx_` macros for doing this already exist in perl.h; using them for this purpose was inspired by
https://dev.to/wunk/fast-array-reversal-with-simd-j3p. The bit shifting done by these macros should be portable and reasonably performant even if not optimised further, but it is likely that they will be optimised to bswap, rev, or movbe instructions.
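As a rough illustration (not the patch itself; `src`, `dst`, and `len` are placeholder names, and this assumes the perl.h context where `STRLEN`, `U64` and `_swab_64_` are available), the chunk-at-a-time idea looks like this:

```
/* sketch: reverse len bytes from src into dst, whole registers at a time */
STRLEN i = 0;
while (i + 8 <= len) {
    U64 chunk;
    memcpy(&chunk, src + i, 8);            /* load 8 bytes              */
    chunk = _swab_64_(chunk);              /* byte-reverse the register */
    memcpy(dst + len - i - 8, &chunk, 8);  /* store at mirrored offset  */
    i += 8;
}
while (i < len) {                          /* 0..7 leftover bytes       */
    dst[len - i - 1] = src[i];
    i++;
}
```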

Some performance comparisons:

1. Large string reversal, with different source & destination buffers
my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }

gcc blead:

          2,388.30 msec task-clock                       #    0.993 CPUs utilized
    10,574,195,388      cycles                           #    4.427 GHz
    61,520,672,268      instructions                     #    5.82  insn per cycle
    10,255,049,869      branches                         #    4.294 G/sec

clang blead:

            688.37 msec task-clock                       #    0.946 CPUs utilized
     3,161,754,439      cycles                           #    4.593 GHz
     8,986,420,860      instructions                     #    2.84  insn per cycle
       324,734,391      branches                         #  471.745 M/sec

gcc patched:

            408.39 msec task-clock                       #    0.936 CPUs utilized
     1,617,273,653      cycles                           #    3.960 GHz
     6,422,991,675      instructions                     #    3.97  insn per cycle
       644,856,283      branches                         #    1.579 G/sec

clang patched:

            397.61 msec task-clock                       #    0.924 CPUs utilized
     1,655,838,316      cycles                           #    4.165 GHz
     5,782,487,237      instructions                     #    3.49  insn per cycle
       324,586,437      branches                         #  816.350 M/sec

2. Large string reversal, but reversing the buffer in-place
my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse "foo",$x }

gcc blead:

          6,038.06 msec task-clock                       #    0.996 CPUs utilized
    27,109,273,840      cycles                           #    4.490 GHz
    41,987,097,139      instructions                     #    1.55  insn per cycle
     5,211,350,347      branches                         #  863.083 M/sec

clang blead:

          5,815.86 msec task-clock                       #    0.995 CPUs utilized
    26,962,768,616      cycles                           #    4.636 GHz
    47,111,208,664      instructions                     #    1.75  insn per cycle
     5,211,117,921      branches                         #  896.018 M/sec

gcc patched:

          1,003.49 msec task-clock                       #    0.999 CPUs utilized
     4,298,242,624      cycles                           #    4.283 GHz
     7,387,822,303      instructions                     #    1.72  insn per cycle
       725,892,855      branches                         #  723.367 M/sec

clang patched:

            970.78 msec task-clock                       #    0.973 CPUs utilized
     4,436,489,695      cycles                           #    4.570 GHz
     8,028,374,567      instructions                     #    1.81  insn per cycle
       725,867,979      branches                         #  747.713 M/sec

3. Short string reversal, different source & destination (checking performance on smaller string reversals; note: this one is very variable due to noise)
my $x = "1234567"; my $y; for (0..10_000_000) { $y = reverse $x }

gcc blead:

            401.20 msec task-clock                       #    0.916 CPUs utilized
     1,672,263,966      cycles                           #    4.168 GHz
     5,564,078,603      instructions                     #    3.33  insn per cycle
      1,250,983,219      branches                         #    3.118 G/sec

clang blead:

            380.58 msec task-clock                       #    0.998 CPUs utilized
     1,615,634,265      cycles                           #    4.245 GHz
     5,583,854,366      instructions                     #    3.46  insn per cycle
     1,300,935,443      branches                         #    3.418 G/sec

gcc patched:

            381.62 msec task-clock                       #    0.999 CPUs utilized
     1,566,807,988      cycles                           #    4.106 GHz
     5,474,069,670      instructions                     #    3.49  insn per cycle
     1,240,983,221      branches                         #    3.252 G/sec

clang patched:

            346.21 msec task-clock                       #    0.999 CPUs utilized
     1,600,780,787      cycles                           #    4.624 GHz
     5,493,773,623      instructions                     #    3.43  insn per cycle
     1,270,915,076      branches                         #    3.671 G/sec

  • This set of changes requires a perldelta entry, and one will be added after the 5.42 release.

@richardleach added the defer-next-dev label (This PR should not be merged yet, but await the next development cycle) on Jun 13, 2025
@t-a-k (Contributor) commented Jun 14, 2025

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

@richardleach (Contributor, Author) replied:

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Thanks, I'll look into wrapping this in architecture-based preprocessor guards.

@bulk88 (Contributor) commented Jun 15, 2025

My PR at #23330 is, by coincidence, trying to ADD THE INTRINSICS to the Perl C API that THIS PR WANTS to use. But the ticket has stalled.

50% of the reason there is no progress is that input or a vote is needed on the best name/identifier to use for a new macro.

50% of the reason is claims that Perl 5 has no valid production code reason to need to swap ASCII/binary bytes in the C level VM at runtime.

I don't actually think we need ntohll and htonll

Originally posted by @Leont in #23330 (comment)

@bulk88 (Contributor) commented Jun 15, 2025

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Perl has had the ./Configure probe for HW UA memory access since 2001 (alphaldst), but there are no P5P tuits to actually use the #define.

My long write-up of 25 years of apathy: #22886

some random related links
#16680
e8864db

https://www.nntp.perl.org/group/perl.perl5.porters/2010/03/msg157755.html

I've been complaining about this since 2015.

https://www.nntp.perl.org/group/perl.perl5.porters/2015/11/msg232849.html

#12565: my PRs to add atomics and UA got rejected, for what? To protect non-existent hardware that was melted down in Asia as eWaste 10-15 years ago. The needs of air-gapped Perl users are irrelevant; they are air-gapped and will never upgrade.

Who here can take a selfie of themselves in 2025 with a post-2010-manufactured server or desktop that DOES NOT have HW UA mem read support?

Alpha in 1999 already had hardware-supported UA mem reads, as long as you used the appropriate CC type decls/declspecs/attributes to pick the "special" unaligned instruction.

@bulk88 (Contributor) commented Jun 15, 2025

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Which ones? I'm requesting .pdf links to the official Arch ISA specs that you are discussing.

@thesamesam

SPARC and earlier ARMv5 or so? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins).

It also introduces more aliasing issues (e.g. src accessed with both U64* and U32*) which is a shame, as it brings Perl further away from dropping -fno-strict-aliasing.
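For concreteness, the memcpy idiom suggested above is a hedged sketch along these lines (helper names are mine): a fixed-size memcpy through a local temporary is both alignment-safe and strict-aliasing-safe, and optimizing compilers typically collapse it to a single unaligned load or store on targets that permit one.

```
#include <string.h>
#include <stdint.h>

/* alignment- and aliasing-safe unaligned 64-bit load/store */
static uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);   /* usually inlined to one load */
    return v;
}
static void store_u64(void *p, uint64_t v) {
    memcpy(p, &v, sizeof v);   /* usually inlined to one store */
}
```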

@thesamesam

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Thanks, I'll look into wrapping this in architecture-based preprocessor guards.

Even this isn't a great idea, because while the implementation is free to use unaligned access, that doesn't mean it's okay in C. If the compiler vectorises it, it may assume aligned access and then trap. That's happened a few times recently (subversion, numpy).

@xenu (Member) commented Jun 16, 2025

Unaligned access is undefined behaviour in C.

This can be a problem in practice even on x86, when the compiler decides to use aligned SSE/AVX instructions, as the author of this blogpost found out.

The usual workaround is to use memcpy. For some algorithms (probably not this one) it's also possible to skip the first few bytes of input until the pointer is aligned.

@xenu (Member) commented Jun 16, 2025

Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.

@bulk88 (Contributor) commented Jun 16, 2025

SPARC and earlier ARMv5 or so? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins).

Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.

But I did find this.

SPARC V9 ISA May 2002

(screenshots of the SPARC V9 manual pages: sparcv9_1, sparcv9_2)

If P5P voted to remove Windows 2000/XP/Server 2003 from Perl, that rule also applies to all same-vintage Unix boxes from the National Museum of Technology and Science of whatever country you live in.

ARM V5, or what I'd personally call an "ARM V4 CPU", is an obsolete eWaste/museum CPU. iPhone 1 and Android 1.0 introduced the ARM V7 ISA as the minimum platform. The last commercial use of the ARM v4 ISA was the ubiquitous ARM7TDMI CPU inside LG/Nokia/Motorola/Samsung 2G and 3G flip phones, and that ARM7TDMI is what ran your J2ME flip phone apps. WinCE Perl was compiled as ARMv4 by default; I once set the MSVC cmd line arg to ARMv5 Thumb (2-byte instruction format) as an experiment, but perl5XX.dll grew some KB in disk size, and I decided never to play around with ARMv5 Thumb for Perl again. Maybe >= 2010s ARM32 V5 Thumb compilers like Clang have the intelligence to compile each C function as 4-byte legacy ARMv4 or 2-byte ARMv5 and let the C linker pick the smaller machine code at link time, but I've never heard of my wishlist C compiler feature actually existing in any production big-name C compiler. In any case, ARM32 Thumb is EOL/obsolete in 2025; the 2-byte instruction format was removed by ARM LLC between the ARM32 to ARM64 switch.

I've been researching and creating screenshots for the ancient Alpha and ancient SPARC ISAs so far; my ARM info above is from memory, not fresh research. I don't remember if ARM v4 has dedicated 1-op HW unaligned memory reads/writes, or some magical and fast-enough multi-opcode unaligned mem read/write asm recipe that is better and faster than U32 u32var = p[0] | p[1] << 8 | p[2] << 16 | p[3] << 24;. The recipe is something like U32 u32var = (U32)( *((U64*)(ptr & ~0x7)) << ROLL() << ((ptr & 0x7) * 8) );, but I don't care and Perl doesn't care, that's a CC domain problem, not a Perl-in-C problem.

In any case, the 2 platforms @thesamesam mentioned have some non-ISO-C-Abstract-Machine, CC-vendor-specific, "fast enough" C grammar token to do unaligned mem R/Ws. Someone just has to find it in that C compiler's man page.

FSF LLC and GNU LLC's proprietary CC vendor specific "fast enough" C grammar token to do unaligned mem R/Ws has the ASCII string name of memcpy. But this is proprietary to the GPL 3 repo published by GNU and isn't blindly universally portable to all other C compilers.

MSVC instead wants you to use *((__unaligned unsigned __int64 *)p) on WinCE for ARM32 and ancient WinNT IA64/Alpha/SPARC/MIPS. *((__unaligned unsigned __int64 *)p) is a syntax error in a .i file for MSVC x64 mode, MSVC WinNT ARM32 mode and MSVC WinNT ARM64 mode. Mingw GCC's and MSVC's .h headers instantly #define __unaligned to nothing. MSVC i386 mode ignores __unaligned in .i mode but doesn't syntax error like MSVC for x64 / NT ARM32 / NT ARM64. I researched all of this, but never tried to prove it as true. It's irrelevant for me, my C/C++ private code, or Perl C code. I and everyone should just write the __unaligned if MSVC is detected, even if it does nothing nowadays and MS's std headers NOOP it away. Maybe one day MS will reintroduce __unaligned as best practice or mandatory. Who knows; it's harmless to write it in the 2000s-2025 time period.
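For illustration only, the MSVC-specific pattern described above might be wrapped like this (`UA_READ_U64` is a made-up macro name, and per the caveats above, how each MSVC target treats `__unaligned` varies):

```
#if defined(_MSC_VER)
/* MSVC spelling: declare the pointee as potentially unaligned */
#  define UA_READ_U64(p) (*(__unaligned const unsigned __int64 *)(p))
#else
/* elsewhere: a memcpy-based load, as in the earlier sketch */
#  define UA_READ_U64(p) load_u64(p)
#endif
```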

If we are arguing about ISO C Abstract Virtual Machine CPU compliance: the C compiler published by GNU/FSF is INVALID and non-conformant to ISO C.

Easy demonstration: add a char * memcpy(char *d, const char *s, size_t n) {} to perl's util.c file that logs every single call to STDERR. If any executions of memcpy() in Perl's C code do not appear in my shell console, the C compiler is NON-CONFORMANT and INVALID and should be promptly blocked and blacklisted in perl's ./Configure.
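A sketch of that probe (overriding memcpy like this is itself non-standard and depends on the linker letting you interpose, so treat it purely as a diagnostic toy; the standard prototype takes void *, not char *):

```
#include <stdio.h>
#include <stddef.h>

/* log every memcpy() call that actually reaches the library
   function; calls the compiler inlined away never show up here,
   which is exactly what the demonstration is probing for */
void *memcpy(void *d, const void *s, size_t n) {
    fprintf(stderr, "memcpy(%p, %p, %zu)\n", d, s, n);
    unsigned char *dp = d;
    const unsigned char *sp = s;
    while (n--) *dp++ = *sp++;
    return d;
}
```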

Perl's only project scope is to run on modern enough hardware, not on the https://en.wikipedia.org/wiki/AN/FSQ-7_Combat_Direction_Central OS or ISO C Abstract Virtual OS or https://en.wikipedia.org/wiki/Apple_II

@thesamesam

Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.

I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.

@bulk88 (Contributor) commented Jun 16, 2025

Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.

I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.

Yeah, RISC CPUs are expected to, and it's normal for them to, throw an exception/sig handler each time you do a vanilla C machine-type >1 byte mem read/write on an unaligned addr. My argument is that all CCs have some magic special token/grammar way to safely and quickly do unaligned U16/U32/U64 mem reads and writes. So the end-user dev just has to write those special magical tokens/grammar things as needed in C src code and the issue goes away. = ( [] << | [] << | [] << | [] ) is the worst possible but still acceptable solution to use in C to deal with UA memory reads/writes of >1 byte data values.
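Spelled out, that `[] << |` recipe is just byte-at-a-time assembly of the value, for example:

```
/* portable unaligned 32-bit read with no alignment assumptions;
   the bytes are interpreted as little-endian by construction */
static U32 read_u32_le(const unsigned char *p) {
    return  (U32)p[0]
         | ((U32)p[1] << 8)
         | ((U32)p[2] << 16)
         | ((U32)p[3] << 24);
}
```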

@bulk88 (Contributor) commented Jun 16, 2025

I vaguely remember from ARMv4/WinCE Perl: if someone does an unaligned U16 or U32 mem read on ARMv4, you get a free shift and a free mask as a gift from ARMv4. You don't get a SEGV on ARMv4 from UA mem reads. I think the reason (not going to google it) was that this was how ARMv4 wanted you to do UA mem reads.

U32 u32_var =    *(U32*) ua_32p
                            | *((U32*)  (((char*) ua_32p)+3)  );

Not safely and correctly SEGVing like I assumed WinCE Perl on ARM would, but instead finding a junk integer in my MSVC watch window while step-debugging in the VS IDE, was a learning experience for me.

@bulk88 (Contributor) commented Jun 16, 2025

Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.

@bulk88 makes a note to self to finally turn on the intrinsic memcpy(d, s, n <= 16); optimization for all MSVC compilers. @bulk88 doesn't know why he didn't make that PR on day 1 of his WinPerl-in-C hobby, since all of his day-job employer's binaries made with MSVC have it turned on.
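For reference, the MSVC knob in question is the intrinsic-expansion switch, which can be set per translation unit as well as globally with /Oi; a minimal sketch:

```
#ifdef _MSC_VER
/* expand small fixed-size memcpy calls inline
   instead of calling the CRT function */
#pragma intrinsic(memcpy)
#endif
```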

@bulk88 (Contributor) commented Jun 16, 2025

Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.

A bunch of links on the topic. Since C-language UA mem reads/writes were a Day 1 Win32 platform requirement, all RISC WinNT kernels ever compiled will trap, emulate, and resume from the CPU's SIGILL/SIGBUS. I can't find the article on google, but it was a 100x or 1000x wall-time delay for the NT kernel to emulate UA mem reads, vs the C compiler emitting 3-6 more CPU opcodes in the binary when the dev declares an UA C machine type.

The short summary of the links below: go read your C compiler's docs on how to write your C code correctly, using VENDOR SPECIFIC C syntax.

https://devblogs.microsoft.com/oldnewthing/20170814-00/?p=96806
https://devblogs.microsoft.com/oldnewthing/20220810-00/?p=106958
https://devblogs.microsoft.com/oldnewthing/20180409-00/?p=98465
https://devblogs.microsoft.com/oldnewthing/20210611-00/?p=105299
https://devblogs.microsoft.com/oldnewthing/20190821-00/?p=102794
https://devblogs.microsoft.com/oldnewthing/20200103-00/?p=103290
https://devblogs.microsoft.com/oldnewthing/20250605-00/?p=111250
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/avoiding-misalignment-of-fixed-precision-data-types
https://learn.microsoft.com/en-us/windows/win32/winprog64/fault-alignments
https://wiki.debian.org/ArmEabiFixes#word_accesses_must_be_aligned_to_a_multiple_of_their_size
https://web.archive.org/web/20090204204507/http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment
https://devblogs.microsoft.com/oldnewthing/20040830-00/?p=38013

FWIW, here is a list of instructions my Win 7 x64 kernel is capable of emulating. IDK why this emulation code exists or why this switch ladder exists in my Win 7 x64 kernel; I don't have a reason to need to know this. Remember, this is closed-source SW. But there must be bizarre, very rare cases where, for whatever reason, the Win 7 x64 kernel needs to silently fix up the CPU hardware fault/signal/interrupt/exception/event handler, whatever you call it.

Update: this instruction list may or may not have something to do with an x64 CPU executing temporarily in 16-bit Real Mode, probably for WinNT's Win16 emulator ntvdm.exe. Good enough answer for my curiosity to be satisfied.

```
XmAaaOp XmAadOp XmAamOp XmAasOp XmAccumImmediate XmAccumRegister XmAdcOp XmAddOp XmAddOperands XmBitScanGeneral XmBoundOp XmBsfOp XmBsrOp XmBswapOp XmBtOp XmBtcOp XmBtrOp XmBtsOp XmByteImmediate XmCallOp XmCbwOp XmClcOp XmCldOp XmCliOp XmCmcOp XmCmpsOp XmCmpxchgOp XmCompareOperands XmCwdOp XmDaaOp XmDasOp XmDecOp XmDivOp XmEffectiveOffset XmEmulateStream XmEnterOp XmEvaluateAddressSpecifier XmEvaluateIndexSpecifier XmExecuteInt1a XmFlagsRegister XmGeneralBitOffset XmGeneralRegister XmGetCodeByte XmGetImmediateSourceValue XmGetLongImmediate XmGetOffsetAddress XmGetStringAddress XmGetWordImmediate XmGroup1General XmGroup1Immediate XmGroup2By1 XmGroup2ByByte XmGroup2ByCL XmGroup3General XmGroup45General XmGroup7General XmGroup8BitOffset XmHltOp XmIdivOp XmIllOp XmImmediateEnter XmImmediateJump XmImulImmediate XmImulOp XmImulxOp XmInOp XmIncOp XmInsOp XmInt1aFindPciClassCode XmInt1aFindPciDevice XmInt1aReadConfigRegister XmInt1aWriteConfigRegister XmIntOp XmIretOp XmJcxzOp XmJmpOp XmJxxOp XmLahfOp XmLeaveOp XmLoadSegment XmLodsOp XmLongJump XmLoopOp XmMovOp XmMoveGeneral XmMoveImmediate XmMoveRegImmediate XmMoveSegment XmMoveXxGeneral XmMovsOp XmMulOp XmNegOp XmNotOp XmOpcodeEscape XmOpcodeRegister XmOrOp XmOutOp XmOutsOp XmPopGeneral XmPopOp XmPopStack XmPopaOp XmPortDX XmPortImmediate XmPrefixOpcode XmPushImmediate XmPushOp XmPushPopSegment XmPushStack XmPushaOp XmRclOp XmRcrOp XmRdtscOp XmRetOp XmRolOp XmRorOp XmSahfOp XmSarOp XmSbbOp XmScasOp XmSegmentOffset XmSetDataType XmSetLogicalResult XmSetccByte XmShiftDouble XmShlOp XmShldOp XmShortJump XmShrOp XmShrdOp XmSmswOp XmStcOp XmStdOp XmStiOp XmStosOp XmStringOperands XmSubOp XmSubOperands XmSxxOp XmTestOp XmXaddOp XmXchgOp XmXlatOpcode XmXorOp
```

@bulk88 (Contributor) commented Jun 16, 2025

I will also add: Perl/P5P devs might be able, in C, on certain permutations of CC, OS and CPU, to outsmart a C compiler's native UA mem read/write "best practices" algorithm. The engineering question is: is it legal to use 2 aligned U32 mem reads to read an unaligned U32?

What if an unknown, unnamed CPU has a hardware MMU virtual memory page size of 1 byte?

You can get one of these CPUs with a 1 byte page size MMU right now for free at https://github.com/google/sanitizers/wiki/addresssanitizer or https://en.wikipedia.org/wiki/Valgrind

In real life, as long as you aren't over-reading at the very end of a malloc block at address 4096-1 or 4096-2, 2 x U32 aligned reads are harmless. malloc() won't hand out a 4 or 5 byte memory block pressed against a 4096 boundary when you execute malloc(5).

IDK of a real-world malloc() impl that hands out raw 4096-aligned mem pages for requests >= 4096, with an unmapped mem page at malloc(0x1000) - 1.

But someone should give the real reverse-engineered meaning of this OSX API doc; read the section "Allocating Large Memory Blocks using Malloc" at https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html#//apple_ref/doc/uid/20001881-99765

But regarding the perl C API, remember that a fictional malloc(1) pressed against a 4096 boundary is illegal and impossible in the Perl C API. A fictional malloc(1) means SvPVX(sv)[0], for an sv with SvCUR(sv) == 0 || SvCUR(sv) == 1, is not '\0' null terminated, and strlen(SvPVX(sv)) is 100% guaranteed to SEGV. Perl could never execute on such an architecture or such a libc impl.

@bulk88 (Contributor) commented Jun 16, 2025

Here is a reason why memcpy() and memcmp() can't be blindly assumed to be a magical UA memory compliance tool, and a C dev STILL needs to read their CC's vendor-specific docs no matter what.

Here is a list of all call sites to the memcmp() symbol in the UCRT dll from my -O1 MSVC blead perl.

A couple calls in 1 func are fine, but anything over 5 calls points to big problems with P5P macros and C code. Example of 1 func:

S_handle_possible_posix+1013
S_handle_possible_posix+1049
S_handle_possible_posix+107B
S_handle_possible_posix+10A8
S_handle_possible_posix+10D5
S_handle_possible_posix+1102
S_handle_possible_posix+112F
S_handle_possible_posix+1155
S_handle_possible_posix+1176
S_handle_possible_posix+1199
S_handle_possible_posix+F3C
S_handle_possible_posix+FB5
S_handle_possible_posix+FD0
S_handle_possible_posix+FEB

I know someone assumed GCC's inline memcmp() optimization is part of the ISO C TC's reference unit tests for C89. It is not.

The full list of all call sites to memcmp() is below.

```
PerlIO_find_layer+6C Perl_Gv_AMupdate+2AF Perl_Gv_AMupdate+2D4 Perl__invlistEQ+BB Perl__to_utf8_fold_flags+1EF Perl__to_utf8_fold_flags+242 Perl_allocmy+25C Perl_ck_method+12B Perl_cv_ckproto_len_flags+183 Perl_do_uniprop_match+B5 Perl_do_uniprop_match+DB Perl_fbm_instr+15D Perl_fbm_instr+184 Perl_fbm_instr+245 Perl_fbm_instr+DE Perl_foldEQ_utf8_flags+30B Perl_get_and_check_backslash_N_name+26E Perl_grok_number_flags+437 Perl_grok_numeric_radix+1D0 Perl_gv_add_by_type+8B Perl_gv_autoload_pvn+53 Perl_gv_fetchmeth_pvn_autoload+6C Perl_gv_fetchmethod_pvn_flags+25F Perl_gv_fetchmethod_pvn_flags+D7 Perl_gv_fetchpvn_flags+25E Perl_gv_fullname4+BB Perl_gv_init_pvn+2E6 Perl_gv_setref+51C Perl_hv_common+A80 Perl_hv_ename_add+1A7 Perl_hv_ename_add+D1 Perl_hv_ename_delete+16B Perl_hv_ename_delete+1BF Perl_hv_ename_delete+AA Perl_is_utf8_FF_helper_+A8 Perl_leave_scope+875 Perl_magic_sethook+7B Perl_magic_sethook+AA Perl_magic_setsig+2BA Perl_magic_setsig+2EC Perl_magic_setsig+93 Perl_magic_setsig+DE Perl_mro_isa_changed_in+1E9 Perl_mro_method_changed_in+170 Perl_mro_package_moved+23C Perl_mro_package_moved+FF Perl_ninstr+78 Perl_pad_findmy_pvn+C2 Perl_pp_gelem+122 Perl_pp_gelem+1A1 Perl_pp_gelem+1CB Perl_pp_gelem+1FD Perl_pp_gelem+22F Perl_pp_gelem+257 Perl_pp_gelem+27D Perl_pp_gelem+2A4 Perl_pp_gelem+2FB Perl_pp_gelem+D1 Perl_pp_lc+465 Perl_pp_prototype+64 Perl_pp_ucfirst+30D Perl_pp_ucfirst+6EB Perl_re_intuit_start+22D Perl_re_op_compile+50A Perl_refcounted_he_chain_2hv+C1 Perl_refcounted_he_fetch_pvn+E9 Perl_regexec_flags+A93 Perl_regexec_flags+AB9 Perl_rninstr+90 Perl_save_gp+AE Perl_scan_str+629 Perl_scan_str+660 Perl_scan_str+696 Perl_scan_str+6E4 Perl_sv_cmp_flags+159 Perl_sv_cmp_locale_flags+B6 Perl_sv_eq_flags+1A7 Perl_sv_gets+605 Perl_sv_gets+7F7 Perl_upg_version+1E1 Perl_whichsig_pvn+47 Perl_xs_handshake+119 S_apply_builtin_cv_attribute+45 S_apply_builtin_cv_attribute+69 S_apply_builtin_cv_attribute+92 S_do_chomp+314 S_dofindlabel+2AE S_doopen_pm+D6 S_dopoptolabel+14D S_find_in_my_stash+37 S_force_word+1D4 S_glob_assign_glob+23F S_gv_fetchmeth_internal+269 S_gv_fetchmeth_internal+387 S_gv_fetchmeth_internal+3EC S_gv_magicalize+130 S_gv_magicalize+1A0 S_gv_magicalize+1E7 S_gv_magicalize+203 S_gv_magicalize+224 S_gv_magicalize+244 S_gv_magicalize+2DD S_gv_magicalize+30A S_gv_magicalize+32B S_gv_magicalize+37B S_gv_magicalize+3E3 S_gv_magicalize+40D S_gv_magicalize+435 S_gv_magicalize+456 S_gv_magicalize+4DE S_gv_magicalize+521 S_gv_magicalize+5AF S_gv_magicalize+5EB S_gv_magicalize+60C S_gv_magicalize+65E S_gv_magicalize+698 S_gv_magicalize+6C8 S_gv_magicalize+79F S_gv_magicalize+8D4 S_handle_possible_posix+1013 S_handle_possible_posix+1049 S_handle_possible_posix+107B S_handle_possible_posix+10A8 S_handle_possible_posix+10D5 S_handle_possible_posix+1102 S_handle_possible_posix+112F S_handle_possible_posix+1155 S_handle_possible_posix+1176 S_handle_possible_posix+1199 S_handle_possible_posix+F3C S_handle_possible_posix+FB5 S_handle_possible_posix+FD0 S_handle_possible_posix+FEB S_hv_delete_common+36A S_hv_delete_common+4D8 S_hv_delete_common+53E S_incline+F6 S_is_codeset_name_UTF8+3A S_is_codeset_name_UTF8+78 S_is_codeset_name_UTF8+8F S_is_utf8_overlong+67 S_magic_sethint_feature+11A S_magic_sethint_feature+14E S_magic_sethint_feature+180 S_magic_sethint_feature+1AC S_magic_sethint_feature+1DB S_magic_sethint_feature+20D S_magic_sethint_feature+23F S_magic_sethint_feature+26E S_magic_sethint_feature+29F S_magic_sethint_feature+311 S_magic_sethint_feature+341 S_magic_sethint_feature+375
S_magic_sethint_feature+3A9 S_magic_sethint_feature+3D7 S_magic_sethint_feature+402 S_magic_sethint_feature+42E S_magic_sethint_feature+45E S_magic_sethint_feature+492 S_magic_sethint_feature+4C6 S_magic_sethint_feature+4FA S_magic_sethint_feature+526 S_magic_sethint_feature+552 S_magic_sethint_feature+582 S_magic_sethint_feature+5B2 S_magic_sethint_feature+5DC S_magic_sethint_feature+77 S_magic_sethint_feature+E9 S_maybe_add_coresub+5F6 S_mayberelocate+107 S_mayberelocate+50 S_move_proto_attr+21C S_move_proto_attr+CA S_pad_check_dup+1F1 S_pad_check_dup+B7 S_pad_findlex+DE S_parse_LC_ALL_string+14F S_parse_uniprop_string+1091 S_parse_uniprop_string+10B5 S_parse_uniprop_string+178 S_parse_uniprop_string+1942 S_parse_uniprop_string+31A S_parse_uniprop_string+41B S_parse_uniprop_string+440 S_parse_uniprop_string+548 S_parse_uniprop_string+566 S_parse_uniprop_string+B6D S_parse_uniprop_string+BAF S_parse_uniprop_string+BCD S_parse_uniprop_string+BF5 S_parse_uniprop_string+C0E S_parse_uniprop_string+C29 S_parse_uniprop_string+C87 S_parse_uniprop_string+CE2 S_parse_uniprop_string+D2F S_parse_uniprop_string+D8A S_parse_uniprop_string+E07 S_parse_uniprop_string+E77 S_reg+100D S_reg+1038 S_reg+1063 S_reg+1093 S_reg+10C0 S_reg+10E7 S_reg+110E S_reg+11C0 S_reg+4F1 S_reg+523 S_reg+550 S_reg+576 S_reg+5A7 S_reg+5CE S_reg+64D S_reg+66E S_reg+699 S_reg+6B6 S_reg+6E1 S_reg+706 S_reg+72C S_reg+749 S_reg+77E S_reg+79F S_reg+7CA S_reg+7EB S_reg+94C S_reg+A4F S_reg+FE2 S_regatom+8AF S_regclass+16D7 S_regmatch+15A2 S_regmatch+2AE7 S_regmatch+3568 S_regrepeat+159E S_regrepeat+1674 S_regrepeat+79C S_require_file+13C7 S_require_file+1C0B S_require_file+1C41 S_require_file+6BE S_scan_ident+2FA S_setup_EXACTISH_ST+1006 S_setup_EXACTISH_ST+108F S_setup_EXACTISH_ST+108F S_setup_EXACTISH_ST+10B0 S_setup_EXACTISH_ST+10B0 S_setup_EXACTISH_ST+2123 S_setup_EXACTISH_ST+FC0 S_share_hek_flags+88 S_swallow_bom+D0 S_turkic_fc+43 S_turkic_lc+50 S_turkic_uc+43 S_unpack_rec+1C5B S_unshare_hek_or_pvn+10C XS_NamedCapture_TIEHASH+DD XS_PerlIO_get_layers+14C XS_PerlIO_get_layers+1A8 XS_PerlIO_get_layers+F2 hek_eq_pvn_flags+7B set_w32_module_name+B3 yyl_do+B4 yyl_fake_eof+159 yyl_foreach+1AE yyl_foreach+1EF yyl_foreach+320 yyl_foreach+38F yyl_hyphen+CB yyl_interpcasemod+72 yyl_interpcasemod+A3 yyl_keylookup+109 yyl_leftcurly+7E3 yyl_my+2DC yyl_my+317 yyl_slash+114 yyl_try+860 yyl_try+9EB yyl_try+AB3 yyl_try+B19
```

@bulk88 (Contributor) commented Jun 16, 2025

It also introduces more aliasing issues (e.g. src accessed with both U64* and U32*) which is a shame, as it brings Perl further away from dropping -fno-strict-aliasing.

I don't believe it's possible to write production-grade C (not C++) SW without -fno-strict-aliasing. You would be removing the (type_t) cast operator from C89/C99's grammar to write non-crashing -fyes-strict-aliasing code. Perl isn't a .o/TU file holding the hottest guts of an AES/MPEG/JPG codec where -fyes-strict-aliasing could make a benchmarkable improvement.

Either don't use GCC's -O3 / -O4 flag (-O2? -O1?) or understand you have to live with adding the -fno-strict-aliasing flag. JAPHs, GOLFs, and educational demos and samples aren't production-grade C SW.

If you remove the (type_t) and & operators from C, you now have the JavaScript ISA/WASM/ECMAScript/V8 virtual machine, and those aren't C, even if an AES decryption lib written in JS/nodeJS happens to benchmark identical to a GNU C AES decryption lib on 1 particular build number of the Chrome/Node engine.

u64_2 = _swab_64_(u64_2);
memcpy(outp + j - 8, &u64_2, 8);
memcpy(outp + i, &u64_1, 8);
i += 8;
@bulk88 (Contributor) commented on this diff hunk, Jun 16, 2025:

Why not get rid of the memcpy()s and put in the textbook reference implementation:

switch (((size_t)ptr) & 0x7) {  /* distance past an 8-byte boundary
                                   (cases 4..7 elided in this sketch) */
    case 3:  /* peel one byte here and fall through... */
    case 2:
    case 1:
    case 0:  /* aligned: nothing left to peel */
        break;
}
/* NOW START HIGH ITERATION U64 LOOP */

The above is the universal university comp-sci-student textbook implementation of memcpy() and memcmp(). All libcs from all authors from all OSes do the above impl, unless the Foo LibC/OS authors care enough to actually optimize their toy weekend hobby OS/libc for production, profit-making desktop or server usage to put food on the table.

If performance is the #1 highest priority, this code will fail a benchmark vs a revision with that mask-and-switch(), since this code permanently keeps doing HW unaligned reads/writes through the entire inner loop, whether the SVPV* string is 16 bytes, 71 bytes, 7103 bytes, or 70 MBs + 5 bytes long.

I'll bet doing native i386 HW unaligned U64 math in a loop over 32 MBs on a 2020s Core iSeries will always lose a benchmark vs the same code doing aligned U64 math in a loop over 32 MBs on the same CPU. My guess is the diff would be a 5% to at most 30% loss doing all 32 MBs with UA reads/writes on a brand new Intel Core. It's a small loss, nothing serious, but if you expect typically huge data strings to chug through, orders of magnitude larger than the C function's own machine code, then yeah, put in the 1/2 screen of new src code logic and 3-7 conditional branches to align that pointer.
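A hedged sketch of that mask-and-align prologue for the in-place case (`p` and `n` are placeholder names): peel single-byte swaps until the left pointer is 8-byte aligned, then the hot loop can use aligned accesses on that side. As the update below notes, the far end generally stays unaligned.

```
/* peel single-byte swaps until lo is 8-byte aligned */
unsigned char *lo = p;
unsigned char *hi = p + n;
while (lo < hi && ((size_t)lo & 0x7)) {
    unsigned char t = *lo;
    *lo++ = *(hi - 1);
    *--hi = t;
}
/* NOW START HIGH ITERATION U64 LOOP on the aligned left side */
```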

update:

The above algo might not work unless the total string length is divisible by U16/U32/U64.

RA32 = read, aligned, 32-bit
WU32 = write, unaligned, 32-bit

1       2       3       4       5       6       7       8       9
RA32    RA32    RA32    RA32
                                        WU32    WU32    WU32    WU32

Once the far end is unaligned, all reads and writes on the tail have to stay unaligned, because you will make bytes disappear trying to snap the tail end to become aligned. The other solution is a barrel roll/shift bit-math op that +8s the roll on every tail-side U32 read or write; in other words, +8s the shift rollover factor on every iteration of the loop on the tail/far side only.

There might be a 99% guarantee that the near/starting side of any SvPVX() ptr is aligned to 8/16, in the Perl C API only; `OOK_on` and `SvLEN()==0` are very rare.

@richardleach (Contributor, Author) replied:

Even if we copied the entire string to align it - adding an extra full copy if the source & destination buffers are different - how would we avoid an unaligned read/write at the tail end when the string isn't really conveniently sized?

@bulk88 (Contributor) replied:

I'm an auto mechanic, not a blacksmith. I can change your transmission. I can't make you a new one with a nail file.

Read this algo I invented in a fixed-width IDE so the columns line up.

I don't like this algo, but I'm not a bitwise-math PhD; I know enough to diagnose and debug the algos if needed. But google offers a lot better algos than I can invent from my head.

1       2       3       4       5       6       7       8       9       A       B       C       D       E
RA32A   RA32B   RA32C   RA32D
                                                                                WU32D   WU32C   WU32B   WU32A
                                                                                or this algo
                                                                                        32D     32C     32B     32A
                                                                                        >> 8
                                                                                        32C     32B     32A     32ZeroUB
                                                                                        do |= op, protect UB byte
                                                                                        WA32C   WA32B   WA32A   WA32ZeroUB
                                                                                (32D    32C     32B     32A)   >> (8 * 3)
                                                                                0       0       0       32Dold
!!!!!!DONE!!!!! NEXT CYCLE
                                RA32A   RA32B   RA32C   RA32D
                                                                32D     32C     32B     32A
                                                                >> 8
                                                        == 32C  32B     32A     0
                                                        0       0       0       32Dold
                                                        |=
                                                        32C     32B     32A     32Dold
                                                        WA32C   WA32B   WA32A   WA32Dold
                                                        (32D    32C     32B     32A)   >> (8 * 3)
                                                        0       0       0       32Dold2
!!!!!!DONE!!!!! NEXT CYCLE

My guess is all this bit-math algebra will be slower than letting an x86 or ARM CPU do its native UA mem R/Ws. Feeding 4-9 opcodes into the front-end decoder/compiler of the CPU can't be faster than feeding 1 opcode into the front-end compiler of the CPU and letting the CPU secretly, internally expand the 1 public-API opcode to the 4-10 chop/slice/slide/glue microfusion private-API opcodes to do a UA memory read.

From the last time I had a UA-mem-vs-Perl design thread a few years ago, from reading the Verilog src code of a FOSS 32-bit ARM v4 CPU on Github: U32 UA memory reads are penalty-free if contained to the same 128-bit/16-byte cache line. The least significant bits of the mem addr in the load/store unit decide, in hardwired TTL/CMOS transistors, whether it's the same cache line or 2 cache lines.

If same cache line, a 0-overhead barrel rotate is done to extract the U32 from the U128 L1 cache line. All aligned U32s have to do the 0-overhead barrel rotate anyway, since 3 out of 4 aligned U32s are actually unaligned memory reads @ U128.

If the UA U32 read crosses a 16-byte cache line, the load/store unit injects a pause instruction into the ARM pipeline (a 1-cycle delay, which is basically nothing), since now 2 different L1 U128 registers need to be read and transferred to the load/store unit.

UA U32 writes have no runtime overhead in the FOSS CPU. The splicing of the UA U32 into memory is done asynchronously in the write-back L1 cache/L1 write-combining unit. The new unaligned U32 value to store has already left/exited/retired from the ARM CPU pipeline, so the UA splicing into the L1 U128 registers is basically free. The FOSS ARM CPU's pipeline is 4 stages. The design of this FOSS 32-bit ARM CPU is such that all the L1 U128 registers are actually aliased/pinned/renamed/mapped to the public-API ARM general purpose registers in the instruction decoder/scheduler. You can't do 128-bit integer math with an ARM32 CPU, but this CPU's HW actually only has 128-bit registers; the types byte, short and 32b int/size_t/void * are emulated to meet the requirements of ARMv4 asm lang.

Because of the 4 stages, and the assignment/decoding of ARM asm GPRs to HW regs, the L1 register is locked at stage 1 or stage 2, and unlocked at stage 4/5. Because this is a RISC CPU, the CPU instruction decoder is always looking 1 instruction ahead of the current Instruction Pointer for a memory load opcode; that is when the L1 U128 lock takes place. By the time of the commit/store opcode, the CPU is -1 ahead of the next read/modify/write cycle to the same address. If there is an unaligned@U128, aligned@U32, U32 value to write, it's not a conflict, because the last read/modify/write cycle already unlocked the U128 L1 register, allowing the splicing of the U32 into a U128 to take place.

If an aligned@U8, truly unaligned U32 value crosses a 16-byte L1 cache line, that will cause the store opcode at stage 4/5 to issue a lock on the 2nd U128. If, 0-3 cycles later at stage 1, the instruction decoder does a -1 read-ahead looking for a load instruction, and that load instruction conflicts with the lock on the 2nd U128, the secret -1 read-ahead load instruction becomes a "do as written in ARM asm code by the developer" load instruction, and therefore isn't really a block or conflict.

Remember, all U8/U16/U32 writes are unaligned@U128, and therefore must get spliced (aka &= ~; |= ;) into a U128 register in the write-back cache.

In a real-life production CPU, if I google it, nobody can make a benchmark showing any speed difference between SSE's movaps and movups instructions on aligned addresses and L1/L2-cached memory addresses (aka the C stack, or tiny malloc blocks <= 128 bytes).

Crossing a 4096 boundary is a different case. Tripping the OS page fault handler, or making the CPU's MMU fetch a piece of the page table from a DDR stick, you get what you deserve.

So this FOSS ARM CPU's L1 cache, L2 cache, and DDR wires only work with aligned U128 units; U32s are unaligned memory accesses. In commercial CPUs, 32-bit and 64-bit integers don't exist if you have DDR-2/DDR-3/DDR-4. Sorry, pedantic ISO C gurus, your DDR-4 memory sticks only work with aligned U512 / 64-byte integers. DDR2 is a 32-byte MTU, DDR3 is a 64-byte MTU, DDR4 is also a 64-byte MTU.

Back to this pp_reverse() algorithm: doing the bit-math algebra I described above with SSE or AVX-256 or Neon registers might be faster than using x86/ARM native UA mem R/Ws and the x86/ARM GPRs. But that is beyond my skill set. IDK how to read SSE code. Reading Spanish, French, German, Polish, or Czech is easier for me as an American than SSE code, and for the first 3 of those langs I have zero educational background to be able to read them.

Writing my bit-math algebra algo for pp_reverse() in SSE probably involves keywords "shuffle", "saturate", "packed", "immediate", and IDK what SSE's buzzword or euphemism for i386's EFLAGS pseudo-register is: "immediate"? "status word"? "packed control"? I'm making things up now.

I don't see the point of spending the effort trying to write this pp_reverse algo to use aligned R/W mem ops + SSE/NEON. If someone else already invented the src code and an example/demo of how to do pp_reverse on unaligned U8 strings using aligned SSE/NEON opcodes, go ahead; I have no objection to copy-pasting that code into Perl's pp_reverse. But I don't see the effort/profit ratio to invent this algo in Perl 5 for the 1st time in the history of comp sci. Perl doesn't need to be the first to fly to Mars. Let another prog lang/SW VM with unlimited funding from a FAANG do it first.

@bulk88 (Contributor) left a review comment:

Use the retval of SvGROW() as intended and save it to a C auto. Then DIY your own SvEND, skipping a couple of redundant mem reads of struct member slot SvPVX().

Edit: let's see what gh renders this as:

https://github.com/Perl/perl5/pull/23371/files/67d79fed92f352ed9d807b90b413c4f9c94d085f#diff-a70d20047abcfcd3faacb266bd4822a5816fb7fea1754e2fcc14fc18cc13c426R6647-R6663

Update: failed, idk how to force gh to show a code sample inline in a comment without doing a three-apostrophe code block escape, copy-pasting with a mouse, then 3 more apostrophes.
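In code, the suggestion is roughly this (a sketch; `dsv` and `len` stand in for the variables in pp_reverse):

```
/* keep SvGROW()'s return value in a local instead of
   re-reading SvPVX(dsv); derive the end pointer from it */
char * const outp = SvGROW(dsv, len + 1);
char * const endp = outp + len;   /* DIY SvEND */
```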

STRLEN i = 0;
STRLEN j = len;
uint32_t u32_1, u32_2;
uint16_t u16_1, u16_2;
@bulk88 (Contributor) commented on this diff hunk:

Use perl's native U32/U16 types. Less to type, less screen-space waste, and it's more obvious this is P5P/CPAN code and not a "what's the license and author credit requirements of this FOSS file? Is this an Affero GPL or GPL 3 file?" situation. If things "go wrong" on some CC/CPU/OS permutation, P5P can fix U32/I32 very easily from Configure or perl.h. uint32_t might be hard to impossible to hook and replace.

Example: Perl's U32/U64 can be hooked and overridden by P5P because, remember, Microsoft says 2- and 3-word-long C types are satanic (long long int, long long double, etc., or unsigned long long long long, jk jk). MSVC/the toolchain always calls it __int8/16/32/64; see https://learn.microsoft.com/en-us/cpp/cpp/int8-int16-int32-int64?view=msvc-170

In any case, ISO C of the ~2000s/~2010s still has your uint32_t listed as vendor-optional; see https://en.cppreference.com/w/cpp/types/integer.html

https://stackoverflow.com/questions/24929184/can-i-rely-on-sizeofuint32-t-4

On the sarcastic side: how do you know your uint32_t isn't 5 or 8 bytes long because each uint32_t has 4 extra bytes of Reed–Solomon coding? Yes, uint32_t only holds 0 to 4294967295, but sizeof(uint32_t) == 8, and Perl is running on a control-rod servo controller at a nuclear power plant.

@richardleach (Contributor, Author) replied:

MSVC didn't like the U32/U16 that I originally used. What's the best solution there?

@bulk88 (Contributor) replied:

MSVC didn't like the U32/U16 that I originally used. What's the best solution there?

Do you have a link to the bad GH runner log?

@bulk88 (Contributor) replied:

Fixed, see my patch at the bottom of this PR.

@bulk88 (Contributor) commented Jun 16, 2025

I'll also add that byte order swapping, more specifically swapping the byte order of a 64-bit variable aka htonll()/ntohll(), was a patented technology until 2022, and perl has been doing some low-level SW piracy for decades at

https://github.com/Perl/perl5/blob/blead/perl.h#L4601

See this patent by Intel and compare it to Perl's math operators above:

https://patents.google.com/patent/US20040010676A1/en

cough cough pro- or anti-patent troll cough cough

@bulk88 (Contributor) commented Jun 17, 2025

MSVC fix branch pushed at #23374

The reason for all the MSVC-specific code: here is the very poor code gen from round 1, after getting the syntax errors fixed.

48 8B D1                   mov     rdx, rcx
4C 8B C7                   mov     r8, rdi         ; Size
48 8D 4D F7                lea     rcx, [rbp+57h+var_60] ; Dst
FF 15 8E 85 09 00          call    cs:__imp_memcpy
48 8B 55 DF                mov     rdx, [rbp+57h+Src] ; Src
48 8D 4D EF                lea     rcx, [rbp+57h+Dst] ; Dst
4C 8B C7                   mov     r8, rdi         ; Size
FF 15 7D 85 09 00          call    cs:__imp_memcpy
48 8B 55 F7                mov     rdx, [rbp+57h+var_60]
41 B9 00 FF 00 00          mov     r9d, 0FF00h
4C 8B C2                   mov     r8, rdx
48 8B C2                   mov     rax, rdx
48 C1 E8 10                shr     rax, 10h
4D 23 C7                   and     r8, r15
4C 0B C0                   or      r8, rax
48 8B CA                   mov     rcx, rdx
49 C1 E8 10                shr     r8, 10h
48 8B C2                   mov     rax, rdx
48 23 C3                   and     rax, rbx
48 C1 E1 10                shl     rcx, 10h
4C 0B C0                   or      r8, rax
41 BA 00 00 FF 00          mov     r10d, 0FF0000h
49 C1 E8 10                shr     r8, 10h
48 8B C2                   mov     rax, rdx
49 23 C5                   and     rax, r13
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
49 23 C1                   and     rax, r9
49 C1 E8 08                shr     r8, 8
48 0B C8                   or      rcx, rax
48 8B C2                   mov     rax, rdx
48 C1 E1 10                shl     rcx, 10h
49 23 C2                   and     rax, r10
48 0B C8                   or      rcx, rax
49 23 D6                   and     rdx, r14
48 C1 E1 10                shl     rcx, 10h
48 0B CA                   or      rcx, rdx
48 8B 55 EF                mov     rdx, [rbp+57h+Dst]
48 C1 E1 08                shl     rcx, 8
48 8B C2                   mov     rax, rdx
4C 0B C1                   or      r8, rcx
48 C1 E8 10                shr     rax, 10h
4C 89 45 F7                mov     [rbp+57h+var_60], r8
48 8B CA                   mov     rcx, rdx
48 C1 E1 10                shl     rcx, 10h
4C 8B C2                   mov     r8, rdx
4D 23 C7                   and     r8, r15
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
48 23 C3                   and     rax, rbx
49 C1 E8 10                shr     r8, 10h
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
49 23 C5                   and     rax, r13
49 C1 E8 10                shr     r8, 10h
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
49 23 C1                   and     rax, r9
49 C1 E8 08                shr     r8, 8
48 0B C8                   or      rcx, rax
48 8B C2                   mov     rax, rdx
48 C1 E1 10                shl     rcx, 10h
49 23 D6                   and     rdx, r14
49 23 C2                   and     rax, r10
48 0B C8                   or      rcx, rax
48 C1 E1 10                shl     rcx, 10h
48 0B CA                   or      rcx, rdx
48 8D 55 EF                lea     rdx, [rbp+57h+Dst] ; Src
48 C1 E1 08                shl     rcx, 8
4C 0B C1                   or      r8, rcx
48 8B 4D CF                mov     rcx, [rbp+57h+e] ; Dst
4C 89 45 EF                mov     [rbp+57h+Dst], r8
4C 8B C7                   mov     r8, rdi         ; Size
FF 15 98 84 09 00          call    cs:__imp_memcpy
48 8B 4D DF                mov     rcx, [rbp+57h+Src] ; Dst
48 8D 55 F7                lea     rdx, [rbp+57h+var_60] ; Src
4C 8B C7                   mov     r8, rdi         ; Size
FF 15 87 84 09 00          call    cs:__imp_memcpy
48 01 7D DF                add     [rbp+57h+Src], rdi
48 03 F7                   add     rsi, rdi
4C 2B E7                   sub     r12, rdi
48 8B 4D CF                mov     rcx, [rbp+57h+e]
49 8B C4                   mov     rax, r12
48 2B CF                   sub     rcx, rdi
48 2B C6                   sub     rax, rsi
48 89 4D CF                mov     [rbp+57h+e], rcx
48 83 F8 10                cmp     rax, 10h
0F 83 C4 FE FF FF          jnb     loc_140090DA2

Here is round 2, after turning on inline memcpy; still bad code gen:


4D 8B 0C 13             mov     r9, [r11+rdx]
49 83 EA 08             sub     r10, 8
49 8B 14 0B             mov     rdx, [r11+rcx]
48 83 C7 08             add     rdi, 8
4C 8B C2                mov     r8, rdx
48 8B C2                mov     rax, rdx
48 C1 E8 10             shr     rax, 10h
48 8B CA                mov     rcx, rdx
48 C1 E1 10             shl     rcx, 10h
4C 23 C3                and     r8, rbx
4C 0B C0                or      r8, rax
48 8B C2                mov     rax, rdx
49 23 C6                and     rax, r14
49 C1 E8 10             shr     r8, 10h
4C 0B C0                or      r8, rax
48 8B C2                mov     rax, rdx
48 23 C6                and     rax, rsi
49 C1 E8 10             shr     r8, 10h
4C 0B C0                or      r8, rax
48 8B C2                mov     rax, rdx
49 23 C4                and     rax, r12
49 C1 E8 08             shr     r8, 8
48 0B C8                or      rcx, rax
48 8B C2                mov     rax, rdx
49 23 C5                and     rax, r13
48 C1 E1 10             shl     rcx, 10h
48 0B C8                or      rcx, rax
49 23 D7                and     rdx, r15
48 8B 45 58             mov     rax, [rbp+dsv]
48 C1 E1 10             shl     rcx, 10h
48 0B CA                or      rcx, rdx
49 8B D1                mov     rdx, r9
48 C1 E1 08             shl     rcx, 8
48 23 D3                and     rdx, rbx
4C 0B C1                or      r8, rcx
49 8B C9                mov     rcx, r9
4C 89 00                mov     [rax], r8
49 8B C1                mov     rax, r9
48 C1 E8 10             shr     rax, 10h
48 0B D0                or      rdx, rax
48 C1 E1 10             shl     rcx, 10h
48 C1 EA 10             shr     rdx, 10h
49 8B C1                mov     rax, r9
49 23 C6                and     rax, r14
48 0B D0                or      rdx, rax
49 8B C1                mov     rax, r9
48 23 C6                and     rax, rsi
48 C1 EA 10             shr     rdx, 10h
48 0B D0                or      rdx, rax
49 8B C1                mov     rax, r9
49 23 C4                and     rax, r12
48 C1 EA 08             shr     rdx, 8
48 0B C8                or      rcx, rax
49 8B C1                mov     rax, r9
48 C1 E1 10             shl     rcx, 10h
49 23 C5                and     rax, r13
48 0B C8                or      rcx, rax
4D 23 CF                and     r9, r15
48 C1 E1 10             shl     rcx, 10h
49 8B C2                mov     rax, r10
49 0B C9                or      rcx, r9
48 2B C7                sub     rax, rdi
48 C1 E1 08             shl     rcx, 8
48 0B D1                or      rdx, rcx
48 8B 4D 60             mov     rcx, [rbp+arg_18]
48 89 11                mov     [rcx], rdx
48 83 C1 08             add     rcx, 8
48 8B 55 58             mov     rdx, [rbp+dsv]
48 83 EA 08             sub     rdx, 8
48 89 4D 60             mov     [rbp+arg_18], rcx
48 89 55 58             mov     [rbp+dsv], rdx
48 83 F8 10             cmp     rax, 10h
0F 83 06 FF FF FF       jnb     loc_14009093C

After all of my tweaks, it looks perfect, identical to whatever GCC and Clang would emit:

4B 8B 04 08             mov     rax, [r8+r9]
48 83 EA 08             sub     rdx, 8
4B 8B 0C 10             mov     rcx, [r8+r10]
48 83 C7 08             add     rdi, 8
48 0F C8                bswap   rax
49 89 02                mov     [r10], rax
4D 8D 52 F8             lea     r10, [r10-8]
48 8B C2                mov     rax, rdx
48 2B C7                sub     rax, rdi
48 0F C9                bswap   rcx
49 89 09                mov     [r9], rcx
4D 8D 49 08             lea     r9, [r9+8]
48 83 F8 10             cmp     rax, 10h

@bulk88 (Contributor) commented Jun 17, 2025

BEFORE (blead, 5.41.7):

C:\sources\perl5>timeit C:\pb64\bin\perl.exe  -e "my $x = \"X\"x(1024*1000*10);
my $y; for (0..1_000) { $y = reverse \"foo\",$x }"
Exit code      : 0
Elapsed time   : 10.20
Kernel time    : 2.76 (27.1%)
User time      : 7.33 (71.9%)
page fault #   : 2511948
Working set    : 35628 KB
Paged pool     : 92 KB
Non-paged pool : 7 KB
Page file size : 31732 KB

AFTER (patched):

C:\sources\perl5>timeit perl -Ilib -e "my $x = \"X\"x(1024*1000*10); my $y; for
(0..1_000) { $y = reverse \"foo\",$x }"
Exit code      : 0
Elapsed time   : 5.64
Kernel time    : 2.96 (52.6%)
User time      : 2.67 (47.3%)
page fault #   : 2511998
Working set    : 35808 KB
Paged pool     : 94 KB
Non-paged pool : 8 KB
Page file size : 31752 KB
