pp_reverse - chunk-at-a-time string reversal #23371


Open · wants to merge 3 commits into base: blead
131 changes: 121 additions & 10 deletions pp.c
@@ -6529,7 +6529,6 @@ PP(pp_unshift)
return NORMAL;
}


PP_wrapped(pp_reverse, 0, 1)
{
dSP; dMARK;
@@ -6679,10 +6678,69 @@ PP_wrapped(pp_reverse, 0, 1)
}
}
} else {
STRLEN i = 0;
STRLEN j = len;
uint32_t u32_1, u32_2;
uint16_t u16_1, u16_2;
Contributor

Use Perl's native U32/U16 types. Less to type, less screen space wasted, and it's more obvious this is P5P/CPAN code and not a "what's the license and author-credit requirement of this FOSS file? Is this an Affero GPL or GPL 3 file?" situation. If things go wrong on some CC/CPU/OS permutation, P5P can fix U32/I32 very easily from Configure or perl.h. Fixing uint32_t might be hard or impossible to hook and replace.

For example, Perl's U32/U64 can be hooked and overridden by P5P. Remember that Microsoft considers two- and three-word C type names unacceptable (long long int, long double, unsigned long long, etc.); the MSVC toolchain always calls them __int8/__int16/__int32/__int64, see https://learn.microsoft.com/en-us/cpp/cpp/int8-int16-int32-int64?view=msvc-170

In any case, ISO C (C99/C11) still lists uint32_t as optional for implementations to provide; see https://en.cppreference.com/w/cpp/types/integer.html

https://stackoverflow.com/questions/24929184/can-i-rely-on-sizeofuint32-t-4

On the sarcastic side: how do you know your uint32_t isn't 5 or 8 bytes long because each uint32_t carries 4 extra bytes of Reed–Solomon coding? Yes, uint32_t only holds 0 to 4294967295, but sizeof(uint32_t) == 8, and Perl is running on a control-rod servo controller at a nuclear power plant.
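
(For illustration only, a minimal sketch of what the suggested declarations would look like with the perl.h typedefs; this assumes U64 is available whenever HAS_QUAD is defined, as it is on typical builds.)

#ifdef HAS_QUAD
    U64 u64_1, u64_2;   /* perl.h typedef, in place of uint64_t */
#endif
    U32 u32_1, u32_2;   /* in place of uint32_t */
    U16 u16_1, u16_2;   /* in place of uint16_t */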

Contributor Author

MSVC didn't like the U32/U16 that I originally used. What's the best solution there?

Contributor

> MSVC didn't like the U32/U16 that I originally used. What's the best solution there?

Do you have a link to the bad GH runner log?

Contributor

Fixed, see my patch at the bottom of this PR.

char * outp= SvPVX(TARG);
const char *p = src + len;
while (p != src)
*outp++ = *--p;
/* Take a chunk of bytes from the front and from the
* back, reverse the bytes in each and swap the
* chunks over. This should have generally good
* performance but also is likely to be optimised
* into bswap instructions by the compiler.
*/
#ifdef HAS_QUAD
uint64_t u64_1, u64_2;
while (j - i >= 16) {
memcpy(&u64_1, src + j - 8, 8);
memcpy(&u64_2, src + i, 8);
u64_1 = _swab_64_(u64_1);
u64_2 = _swab_64_(u64_2);
memcpy(outp + j - 8, &u64_2, 8);
memcpy(outp + i, &u64_1, 8);
i += 8;
Contributor
@bulk88 bulk88 Jun 16, 2025

Why not get rid of the memcpy()s and put in the textbook reference implementation

 switch( ((size_t)ptr) & 0x7) {
    case 3:
    case 2:
    case 1:
    case 0:
}
/* NOW START HIGH ITERATION U64 LOOP */

The above is the universal comp-sci-textbook implementation of memcpy() and memcmp(). Every libc from every author on every OS does the above, unless the libc/OS authors care enough to actually optimize their toy weekend hobby OS/libc for production, profit-making desktop or server usage.

If performance is the #1 priority, this code will lose a benchmark against a revision with that mask-and-switch() prologue, since this code keeps doing HW-unaligned reads/writes through the entire inner loop, whether the SvPV* string is 16 bytes, 71 bytes, 7103 bytes, or 70 MB + 5 bytes long.

I'll bet native i386 HW-unaligned U64 accesses in a loop over 32 MB on a 2020s Core i-series will always lose a benchmark vs the same code doing aligned U64 accesses over the same 32 MB on the same CPU. My guess is the difference would be a 5% to at most 30% loss doing all 32 MB with unaligned reads/writes on a brand-new Intel Core. It's a small loss, nothing serious, but if you expect to typically chug through huge data strings that are orders of magnitude larger than the C function's own machine code, then yeah, put in the half screen of new source logic and 3-7 conditional branches to align that pointer.

Update:

The above algo might not work unless the total string length is divisible by the U16/U32/U64 chunk size:

RA32 = read, aligned, 32-bit
WU32 = write, unaligned, 32-bit

1       2       3       4       5       6       7       8       9
RA32    RA32    RA32    RA32
                                        WU32    WU32    WU32    WU32

Once the far end is unaligned, all reads and writes on the tail have to stay unaligned, because you will make bytes disappear if you try to snap the tail end into alignment. The other solution is a barrel-roll/shift bitwise op that adds 8 to the rotate amount on every tail-side U32 read or write; in other words, it bumps the shift rollover factor by 8 on every iteration of the loop, on the tail side only.

There might be a 99% guarantee that the near/starting side of any SvPVX() ptr is aligned to 8/16 in the Perl C API; `OOK_on` and `SvLEN()==0` buffers are very rare.
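
To make the alignment-prologue idea concrete, here is a minimal standalone sketch (not the PR's code, and not benchmarked): a copy-reverse that peels single bytes until the front-side reads are 8-byte aligned, while the tail-side accesses stay unaligned, as discussed above. swab64() stands in for perl.h's _swab_64_; every other name is illustrative.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for perl.h's _swab_64_ (assumption): full 64-bit byte swap. */
static uint64_t swab64(uint64_t v) {
    v = (v << 32) | (v >> 32);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    v = ((v & 0x00ff00ff00ff00ffULL) <<  8) | ((v >>  8) & 0x00ff00ff00ff00ffULL);
    return v;
}

/* Reverse src[0..len) into out[0..len); buffers must not overlap. */
static void reverse_copy_aligned_front(char *out, const char *src, size_t len) {
    size_t i = 0, j = len;

    /* Prologue: mirror single bytes until src + i sits on an 8-byte boundary. */
    while (j - i >= 2 && ((uintptr_t)(src + i) & 0x7)) {
        out[j - 1] = src[i];
        out[i]     = src[j - 1];
        i++; j--;
    }

    /* Main loop: front reads are now aligned; the mirrored tail reads and the
     * writes generally stay unaligned, which is the point made above. */
    while (j - i >= 16) {
        uint64_t head, tail;
        memcpy(&head, src + i, 8);       /* aligned load */
        memcpy(&tail, src + j - 8, 8);   /* possibly unaligned load */
        head = swab64(head);
        tail = swab64(tail);
        memcpy(out + j - 8, &head, 8);   /* reversed front chunk goes to the tail */
        memcpy(out + i, &tail, 8);       /* reversed tail chunk goes to the front */
        i += 8; j -= 8;
    }

    /* Epilogue: finish byte by byte. */
    while (i < j) {
        out[i]     = src[j - 1];
        out[j - 1] = src[i];
        i++; j--;
    }
}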

Contributor Author

Even if we copied the entire string to align it - adding an extra full copy if the source & destination buffers are different - how would we avoid an unaligned read/write at the tail end when the string isn't a convenient size?

Contributor

I'm an auto mechanic, not a blacksmith. I can change your transmission. I can't make you a new one with a nail file.

Read this algo I invented in a fixed-width editor so the columns line up.

I don't like this algo, but I'm not a bitwise-math PhD; I know enough to diagnose and debug these algos if needed. Google offers a lot better algos than I can invent from my head.

1       2       3       4       5       6       7       8       9       A       B       C       D       E
RA32A   RA32B   RA32C   RA32D
                                                                                WU32D   WU32C   WU32B   WU32A
                                                                                or this algo
                                                                                        32D     32C     32B     32A
                                                                                        >> 8
                                                                                        32C     32B     32A     32ZeroUB
                                                                                        do |= op, protect UB byte
                                                                                        WA32C   WA32B   WA32A   WA32ZeroUB
                                                                                (32D    32C     32B     32A)   >> (8 * 3)
                                                                                0       0       0       32Dold
!!!!!!DONE!!!!! NEXT CYCLE
                                RA32A   RA32B   RA32C   RA32D
                                                                32D     32C     32B     32A
                                                                >> 8
                                                        == 32C  32B     32A     0
                                                        0       0       0       32Dold
                                                        |=
                                                        32C     32B     32A     32Dold
                                                        WA32C   WA32B   WA32A   WA32Dold
                                                        (32D    32C     32B     32A)   >> (8 * 3)
                                                        0       0       0       32Dold2
!!!!!!DONE!!!!! NEXT CYCLE

My guess is all this bit-math algebra will be slower than letting an x86 or ARM CPU do its native unaligned memory reads/writes. Feeding 4-9 opcodes into the CPU's front-end decoder can't be faster than feeding it 1 opcode and letting the CPU secretly expand that 1 public-API opcode into the 4-10 chop/slice/slide/glue micro-ops it uses internally to do an unaligned memory read.
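
As a concrete illustration of the splice-with-shifts idea in the diagram above (a sketch, not proposed code): on a little-endian machine, an unaligned 32-bit value that starts byte_offset bytes (1..3) into an aligned word lo_word and spills into the next aligned word hi_word can be assembled with shifts and OR only:

#include <stdint.h>

/* byte_offset must be 1..3; offset 0 is the aligned case and needs no splice. */
static uint32_t splice_u32_le(uint32_t lo_word, uint32_t hi_word,
                              unsigned byte_offset) {
    unsigned shift = byte_offset * 8;
    return (lo_word >> shift) | (hi_word << (32 - shift));
}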

From the last unaligned-memory-vs-Perl design thread I was in a few years ago, and from reading the Verilog source of a FOSS 32-bit ARMv4 CPU on GitHub: U32 unaligned memory reads are penalty-free if they stay within the same 128-bit/16-byte cache line. The least significant bits of the memory address in the load/store unit decide, in hardwired logic, whether it's the same cache line or two cache lines.

If it's the same cache line, a zero-overhead barrel rotate extracts the U32 from the U128 L1 cache line. All aligned U32 reads go through that same zero-overhead barrel rotate anyway, since 3 out of 4 aligned U32s are effectively unaligned reads at U128 granularity.

If the unaligned U32 read crosses a 16-byte cache line, the load/store unit injects a stall into the ARM pipeline (a one-cycle delay, which is basically nothing), since now two different L1 U128 registers need to be read and transferred to the load/store unit.

Unaligned U32 writes have no runtime overhead in that FOSS CPU. Splicing the unaligned U32 into memory is done asynchronously in the write-back L1 cache / write-combining unit. The new unaligned U32 value to store has already retired from the ARM pipeline, so the splicing into the L1 U128 registers is basically free. The FOSS ARM CPU's pipeline is 4 stages. The design is such that all the L1 U128 registers are actually aliased/pinned/renamed/mapped to the public-API ARM general-purpose registers in the instruction decoder/scheduler. You can't do 128-bit integer math on an ARM32 CPU, but this CPU's HW actually only has 128-bit registers; the byte, short, and 32-bit int/size_t/void * types are emulated to meet the requirements of ARMv4 assembly.

Because of the 4 stages, and the assignment/decoding of ARM asm GPRs to HW registers, the L1 register is locked at stage 1 or 2 and unlocked at stage 4/5. Because this is a RISC CPU, the instruction decoder is always looking one instruction ahead of the current instruction pointer for a memory-load opcode; that is when the L1 U128 lock takes place. By the time of the commit/store opcode, the CPU is one instruction ahead of the next read/modify/write cycle to the same address. If there is a U32 value to write that is unaligned at U128 but aligned at U32, it's not a conflict, because the last read/modify/write cycle already unlocked the U128 L1 register, allowing the splicing of the U32 into a U128 to take place.

If a truly unaligned U32 value (aligned only at U8) crosses a 16-byte L1 cache line, the store opcode at stage 4/5 issues a lock on the second U128. If, 0-3 cycles later at stage 1, the instruction decoder's one-instruction lookahead finds a load instruction that conflicts with the lock on the second U128, that speculative lookahead load simply becomes a "do as written in the ARM asm by the developer" load instruction, and therefore isn't really a block or conflict.

Remember, all U8/U16/U32 writes are unaligned at U128 granularity, and therefore must get spliced (i.e. `&= ~mask; |= value;`) into a U128 register in the write-back cache.

On a real-life production CPU, if I google it, nobody can produce a benchmark showing any speed difference between SSE's movaps and movups instructions on aligned addresses and L1/L2-cached memory (i.e. the C stack or tiny malloc blocks <= 128 bytes).

Crossing a 4096-byte boundary is a different case. If you trip the OS page-fault handler, or make the CPU's MMU fetch a piece of the page table from a DDR stick, you get what you deserve.

So this FOSS ARM CPU's L1 cache, L2 cache, and DDR wires only work with aligned U128 units; U32 accesses are unaligned memory accesses from their point of view. In commercial CPUs, 32-bit and 64-bit integers don't exist once you have DDR2/DDR3/DDR4. Sorry, pedantic ISO C gurus: your DDR4 memory sticks only work with aligned U512 / 64-byte units. DDR2's transfer unit is 32 bytes, DDR3's is 64 bytes, and DDR4's is also 64 bytes.

Back to this pp_reverse() algorithm: doing the bit-math algebra I described above with SSE, AVX-256, or NEON registers might be faster than using x86/ARM native unaligned memory reads/writes and the x86/ARM GPRs. But that is beyond my skill set; I don't know how to read SSE code. Reading Spanish, French, German, Polish, or Czech is easier for me as an American than SSE code, and for the first three of those I have zero educational background.

Writing my bit-math algo for pp_reverse() in SSE probably involves keywords like "shuffle", "saturate", "packed", and "immediate", and I don't know what SSE's buzzword or euphemism for i386's EFLAGS pseudo-register is: "immediate"? "status word"? "packed control"? I'm making things up now.

I don't see the point of spending the effort to write this pp_reverse algo with aligned R/W memory ops + SSE/NEON. If someone else has already invented the source code and a demo of how to reverse unaligned U8 strings using aligned SSE/NEON opcodes, go ahead, I have no objection to copy-pasting that code into Perl's pp_reverse. But I don't see the effort/profit ratio in inventing this algo in Perl 5 for the first time in the history of comp sci. Perl doesn't need to be the first to fly to Mars; let another language/VM with unlimited funding from a FAANG do it first.

j -= 8;
}

if (j - i >= 8) {
memcpy(&u32_1, src + j - 4, 4);
memcpy(&u32_2, src + i, 4);
u32_1 = _swab_32_(u32_1);
u32_2 = _swab_32_(u32_2);
memcpy(outp + j - 4, &u32_2, 4);
memcpy(outp + i, &u32_1, 4);
i += 4;
j -= 4;
}
#else
while (j - i >= 8) {
memcpy(&u32_1, src + j - 4, 4);
memcpy(&u32_2, src + i, 4);
u32_1 = _swab_32_(u32_1);
u32_2 = _swab_32_(u32_2);
memcpy(outp + j - 4, &u32_2, 4);
memcpy(outp + i, &u32_1, 4);
i += 4;
j -= 4;
}
#endif
if (j - i >= 4) {
memcpy(&u16_1, src + j - 2, 2);
memcpy(&u16_2, src + i, 2);
u16_1 = _swab_16_(u16_1);
u16_2 = _swab_16_(u16_2);
memcpy(outp + j - 2, &u16_2, 2);
memcpy(outp + i, &u16_1, 2);
i += 2;
j -= 2;
}

/* Swap any remaining bytes one by one. */
while (i < j) {
outp[i] = src[j - 1];
outp[j - 1] = src[i];
i++; j--;
}
}
RETURN;
}
@@ -6695,8 +6753,8 @@ PP_wrapped(pp_reverse, 0, 1)

if (len > 1) {
/* The traditional way, operate on the current byte buffer */
char *down;
if (DO_UTF8(TARG)) { /* first reverse each character */
char *down;
U8* s = (U8*)SvPVX(TARG);
const U8* send = (U8*)(s + len);
while (s < send) {
@@ -6720,11 +6778,64 @@
}
up = SvPVX(TARG);
}
down = SvPVX(TARG) + len - 1;
while (down > up) {
const char tmp = *up;
*up++ = *down;
*down-- = tmp;
STRLEN i = 0;
STRLEN j = len;
uint32_t u32_1, u32_2;
uint16_t u16_1, u16_2;
/* Reverse the buffer in place, in chunks where possible */
#ifdef HAS_QUAD
uint64_t u64_1, u64_2;
while (j - i >= 16) {
memcpy(&u64_1, up + j - 8, 8);
memcpy(&u64_2, up + i, 8);
u64_1 = _swab_64_(u64_1);
u64_2 = _swab_64_(u64_2);
memcpy(up + j - 8, &u64_2, 8);
memcpy(up + i, &u64_1, 8);
i += 8;
j -= 8;
}

if (j - i >= 8) {
memcpy(&u32_1, up + j - 4, 4);
memcpy(&u32_2, up + i, 4);
u32_1 = _swab_32_(u32_1);
u32_2 = _swab_32_(u32_2);
memcpy(up + j - 4, &u32_2, 4);
memcpy(up + i, &u32_1, 4);
i += 4;
j -= 4;
}
#else
while (j - i >= 8) {
memcpy(&u32_1, up + j - 4, 4);
memcpy(&u32_2, up + i, 4);
u32_1 = _swab_32_(u32_1);
u32_2 = _swab_32_(u32_2);
memcpy(up + j - 4, &u32_2, 4);
memcpy(up + i, &u32_1, 4);
i += 4;
j -= 4;
}
#endif
if (j - i >= 4) {
memcpy(&u16_1, up + j - 2, 2);
memcpy(&u16_2, up + i, 2);
u16_1 = _swab_16_(u16_1);
u16_2 = _swab_16_(u16_2);
memcpy(up + j - 2, &u16_2, 2);
memcpy(up + i, &u16_1, 2);
i += 2;
j -= 2;
}

/* Finally, swap any remaining bytes one-by-one. */
while (i < j) {
unsigned char tmp = up[i];
up[i] = up[j - 1];
up[j - 1] = tmp;
i++;
j--;
}
}
(void)SvPOK_only_UTF8(TARG);