pp_reverse - chunk-at-a-time string reversal #23371
base: blead
@@ -6529,7 +6529,6 @@ PP(pp_unshift)
    return NORMAL;
}

PP_wrapped(pp_reverse, 0, 1)
{
    dSP; dMARK;
@@ -6679,10 +6678,69 @@ PP_wrapped(pp_reverse, 0, 1)
            }
        }
    } else {
        STRLEN i = 0;
        STRLEN j = len;
        uint32_t u32_1, u32_2;
        uint16_t u16_1, u16_2;
        char * outp = SvPVX(TARG);
        const char *p = src + len;
        while (p != src)
            *outp++ = *--p;
        /* Take a chunk of bytes from the front and from the
         * back, reverse the bytes in each and swap the
         * chunks over. This should have generally good
         * performance but also is likely to be optimised
         * into bswap instructions by the compiler.
         */
#ifdef HAS_QUAD
        uint64_t u64_1, u64_2;
        while (j - i >= 16) {
            memcpy(&u64_1, src + j - 8, 8);
            memcpy(&u64_2, src + i, 8);
            u64_1 = _swab_64_(u64_1);
            u64_2 = _swab_64_(u64_2);
            memcpy(outp + j - 8, &u64_2, 8);
            memcpy(outp + i, &u64_1, 8);
            i += 8;
Why not get rid of the
The above is the universal university-comp-sci-student textbook implementation. If performance is the #1 priority, this code will fail a benchmark against a revision that masks off the low address bits and uses a switch() to align the pointer, since this code keeps doing hardware-unaligned reads/writes through the entire inner loop, whether the SvPV* string is 16 bytes, 71 bytes, 7103 bytes, or 70 MB + 5 bytes long. I'll bet that doing native i386 hardware-unaligned U64 math in a loop over 32 MB on a 2020s Core i-series will always lose a benchmark against the same code doing aligned U64 math over the same 32 MB on the same CPU. My guess is the difference would be a 5% to at most 30% loss doing all 32 MB with unaligned reads/writes on a brand-new Intel Core. It's a small loss, nothing serious, but if you expect typically huge data strings to chug through, orders of magnitude larger than the C function's own machine code, then yes, put in the half screen of new source logic and the 3-7 conditional branches needed to align that pointer.

update: the above algo might not work unless the total string length is divisible by the U16/U32/U64 chunk size (RA32 = read align 32).
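To make that suggestion concrete, here is a minimal, hypothetical sketch of such an alignment prologue (not part of the patch): peel leading bytes until the front pointer is 8-byte aligned, then run the wide loop. It assumes an in-place reverse and uses the GCC/Clang `__builtin_bswap64` in place of perl's `_swab_64_`; a switch() on the low address bits would just be an unrolled form of the peel loop. Note that the tail-side access generally stays unaligned:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only: in-place reverse with a byte-peel
 * prologue so that the *front* pointer is 8-byte aligned before the
 * wide loop starts.  The back pointer may still be unaligned. */
static void reverse_with_aligned_front(unsigned char *buf, size_t len)
{
    size_t i = 0, j = len;

    /* peel bytes until buf + i sits on an 8-byte boundary */
    while (i < j && ((uintptr_t)(buf + i) & 7) != 0) {
        unsigned char tmp = buf[i];
        buf[i] = buf[j - 1];
        buf[j - 1] = tmp;
        i++; j--;
    }

    /* wide loop: front loads/stores are now aligned, the tail side is not */
    while (j - i >= 16) {
        uint64_t front, back;
        memcpy(&front, buf + i, 8);      /* aligned */
        memcpy(&back, buf + j - 8, 8);   /* possibly unaligned */
        front = __builtin_bswap64(front);
        back  = __builtin_bswap64(back);
        memcpy(buf + i, &back, 8);
        memcpy(buf + j - 8, &front, 8);
        i += 8; j -= 8;
    }

    /* whatever is left (< 16 bytes), byte by byte */
    while (i < j) {
        unsigned char tmp = buf[i];
        buf[i] = buf[j - 1];
        buf[j - 1] = tmp;
        i++; j--;
    }
}
```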
Once the far end is unaligned, all reads and writes on the tail have to stay unaligned, because you will make bytes disappear if you try to snap the tail end onto an aligned boundary. The other solution is a barrel-roll/shift bit-math op that adds 8 to the roll on every tail-side U32 read or write; in other words, it adds 8 to the shift rollover factor on every iteration of the loop, on the tail/far side only. There might be a 99% guarantee that the near/starting side of any SvPVX() is aligned.

Even if we copied the entire string to align it - adding an extra full copy if the source & destination buffers are different - how would we avoid an unaligned read/write at the tail end when the string isn't really conveniently sized?

I'm an auto mechanic, not a blacksmith. I can change your transmission; I can't make you a new one with a nail file. Read this algo I invented in a fixed-width IDE so the columns line up. I don't like this algo, but I'm not a bitwise-math PhD; I know enough to diagnose and debug the algos if needed. But Google offers a lot better algos than I can invent from my head.
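For reference, the usual shape of that shift-combine ("barrel roll") trick, reading an unaligned word using only aligned loads, is roughly the following. This is a generic textbook illustration, not the algorithm referenced above; it assumes little-endian byte order and that a few bytes past the requested word are safely readable:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: fetch a 32-bit value from a possibly unaligned
 * address p by combining the two 4-byte-aligned words that straddle it.
 * Little-endian math; the second load may touch up to 3 bytes past the
 * end of the word at p, so the caller must know those bytes are readable. */
static uint32_t load32_via_aligned(const unsigned char *p)
{
    uintptr_t a = (uintptr_t)p;
    const unsigned char *base = (const unsigned char *)(a & ~(uintptr_t)3);
    unsigned shift = (unsigned)(a & 3) * 8;
    uint32_t lo, hi;

    memcpy(&lo, base, 4);          /* aligned word containing p[0] */
    if (shift == 0)
        return lo;                 /* p was aligned after all */
    memcpy(&hi, base + 4, 4);      /* next aligned word */
    return (lo >> shift) | (hi << (32 - shift));
}
```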
My guess is that all this bit-math algebra will be slower than letting an x86 or ARM CPU do its native unaligned memory reads/writes. Feeding 4-9 opcodes into the CPU's front-end decoder can't be faster than feeding it 1 opcode and letting the CPU internally expand that 1 public-API opcode into the 4-10 chop/slice/slide/glue micro-fused private-API opcodes needed to do an unaligned memory read.

From the last time I was in an unaligned-memory-vs-Perl design thread a few years ago, and from reading the Verilog source of a FOSS 32-bit ARMv4 CPU on GitHub: unaligned U32 memory reads are penalty-free if they stay within the same 128-bit/16-byte cache line. The least significant bits of the memory address in the load/store unit decide, in hardwired logic, whether it is the same cache line or two cache lines. If it is the same cache line, a zero-overhead barrel rotate extracts the U32 from the U128 L1 cache line. All aligned U32s have to do that zero-overhead barrel rotate anyway, since 3 out of 4 aligned U32s are actually unaligned memory reads at U128 granularity. If the unaligned U32 read crosses a 16-byte cache line, the load/store unit injects a pause instruction into the ARM pipeline (a one-cycle delay, which is basically nothing), since two different L1 U128 registers now need to be read and transferred to the load/store unit.

Unaligned U32 writes have no runtime overhead in that FOSS CPU. The splicing of the unaligned U32 into memory is done asynchronously in the write-back L1 cache/L1 write-combining unit. The new unaligned U32 value to store has already retired from the ARM pipeline, so the splicing into the L1 U128 registers is basically free.

That FOSS ARM CPU's pipeline is 4 stages, and it is designed so that all the L1 U128 registers are aliased/pinned/renamed/mapped onto the public-API ARM general-purpose registers in the instruction decoder/scheduler. You can't do 128-bit integer math on an ARM32 CPU, but this CPU's hardware actually only has 128-bit registers. Because of the 4 stages, and the assignment of ARM asm GPRs to hardware registers at decode, the L1 register is locked at stage 1 or 2 and unlocked at stage 4/5. Because this is a RISC CPU, the instruction decoder is always looking one instruction ahead of the current instruction pointer for a memory-load opcode; that is when the L1 U128 lock takes place. By the time of the commit/store opcode, the CPU is already one instruction ahead of the next read/modify/write cycle to the same address. If there is a U32 value to write that is unaligned at U128 but aligned at U32, that is not a conflict, because the last read/modify/write cycle already unlocked the U128 L1 register, allowing the splicing of the U32 into a U128 to take place. If a truly unaligned (byte-aligned) U32 value crosses a 16-byte L1 cache line, the store opcode at stage 4/5 issues a lock on the second U128. If, 0-3 cycles later at stage 1, the decoder's one-ahead look for a load instruction finds one that conflicts with the lock on the second U128, that speculative look-ahead load simply becomes a "do as written in the ARM asm by the developer" load, and therefore isn't really a block or a conflict. Remember, all U8/U16/U32 writes are unaligned at U128 granularity and therefore must get spliced.

On a real-life production CPU, if I google it, nobody can produce a benchmark showing any speed difference between SSE's aligned and unaligned loads. Crossing a 4096-byte boundary is a different case.
Trip the OS page-fault handler, or make the CPU's MMU fetch a piece of the page table from a DDR stick, and you get what you deserve. For this FOSS ARM CPU, the L1 cache, the L2 cache, and the DDR wires only work in aligned U128 units; U32s are unaligned memory accesses. On commercial CPUs, 32-bit and 64-bit integers don't exist if you have DDR2/DDR3/DDR4. Sorry, pedantic ISO C gurus: your DDR4 memory sticks only work with aligned U512 / 64-byte integers. DDR2's MTU is 32 bytes, DDR3's is 64 bytes, DDR4's is also 64 bytes.

Back to this PR. I don't see the point of spending the effort of writing my bit-math algebra algo, or of trying to write this pp_reverse algo to use aligned read/write memory ops + SSE/NEON. If someone else has already invented the source code and an example/demo of how to do pp_reverse on unaligned U8 strings using aligned SSE/NEON opcodes, go ahead, I have no objection to copy-pasting that code into Perl's pp_reverse. But I don't see the effort/profit ratio in inventing this algo in Perl 5 for the first time in the history of comp sci. Perl doesn't need to be the first to fly to Mars. Let another prog lang/SW VM with unlimited funding from a FAANG do it first.
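A quick way to reason about those two cases (staying inside one 16-byte line vs. crossing it, and likewise for a 4096-byte page) is a power-of-two remainder check, for example:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Does a `size`-byte access at address p stay inside one `block`-byte
 * unit?  `block` must be a power of two, e.g. 16 for a cache line or
 * 4096 for a page.  Illustration only. */
static bool fits_in_one_block(const void *p, size_t size, uintptr_t block)
{
    return (((uintptr_t)p & (block - 1)) + size) <= block;
}

/* fits_in_one_block(p, 4, 16)   -> unaligned U32 within one cache line?
 * fits_in_one_block(p, 8, 4096) -> U64 access without touching a 2nd page? */
```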
            j -= 8;
        }
        if (j - i >= 8) {
            memcpy(&u32_1, src + j - 4, 4);
            memcpy(&u32_2, src + i, 4);
            u32_1 = _swab_32_(u32_1);
            u32_2 = _swab_32_(u32_2);
            memcpy(outp + j - 4, &u32_2, 4);
            memcpy(outp + i, &u32_1, 4);
            i += 4;
            j -= 4;
        }
#else
        while (j - i >= 8) {
            memcpy(&u32_1, src + j - 4, 4);
            memcpy(&u32_2, src + i, 4);
            u32_1 = _swab_32_(u32_1);
            u32_2 = _swab_32_(u32_2);
            memcpy(outp + j - 4, &u32_2, 4);
            memcpy(outp + i, &u32_1, 4);
            i += 4;
            j -= 4;
        }
#endif
        if (j - i >= 4) {
            memcpy(&u16_1, src + j - 2, 2);
            memcpy(&u16_2, src + i, 2);
            u16_1 = _swab_16_(u16_1);
            u16_2 = _swab_16_(u16_2);
            memcpy(outp + j - 2, &u16_2, 2);
            memcpy(outp + i, &u16_1, 2);
            i += 2;
            j -= 2;
        }

        /* Swap any remaining bytes one by one. */
        while (i < j) {
            outp[i] = src[j - 1];
            outp[j - 1] = src[i];
            i++; j--;
        }
    }
    RETURN;
}
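As an aside for readers following along, the chunk-swap scheme above is easy to exercise stand-alone. A minimal sketch, with `__builtin_bswap64`/`__builtin_bswap32` (GCC/Clang builtins) standing in for perl's `_swab_64_`/`_swab_32_` macros and all names purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reverse len bytes from src into dst using 8-byte and 4-byte chunks,
 * falling back to single bytes for the remainder.  src and dst may be
 * the same buffer or fully distinct, but must not partially overlap. */
static void reverse_chunks(const char *src, char *dst, size_t len)
{
    size_t i = 0, j = len;

    while (j - i >= 16) {               /* swap 8-byte chunks front <-> back */
        uint64_t a, b;
        memcpy(&a, src + j - 8, 8);
        memcpy(&b, src + i, 8);
        a = __builtin_bswap64(a);
        b = __builtin_bswap64(b);
        memcpy(dst + j - 8, &b, 8);
        memcpy(dst + i, &a, 8);
        i += 8;
        j -= 8;
    }
    if (j - i >= 8) {                   /* then one pair of 4-byte chunks */
        uint32_t a, b;
        memcpy(&a, src + j - 4, 4);
        memcpy(&b, src + i, 4);
        a = __builtin_bswap32(a);
        b = __builtin_bswap32(b);
        memcpy(dst + j - 4, &b, 4);
        memcpy(dst + i, &a, 4);
        i += 4;
        j -= 4;
    }
    while (i < j) {                     /* finally, any leftover bytes */
        dst[i] = src[j - 1];
        dst[j - 1] = src[i];
        i++;
        j--;
    }
}

int main(void)
{
    const char *s = "Hello, chunked reversal!";
    char out[64] = {0};

    reverse_chunks(s, out, strlen(s));
    printf("%s\n", out);                /* prints "!lasrever deknuhc ,olleH" */
    return 0;
}
```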
@@ -6695,8 +6753,8 @@ PP_wrapped(pp_reverse, 0, 1)

    if (len > 1) {
        /* The traditional way, operate on the current byte buffer */
        char *down;
        if (DO_UTF8(TARG)) { /* first reverse each character */
            char *down;
            U8* s = (U8*)SvPVX(TARG);
            const U8* send = (U8*)(s + len);
            while (s < send) {
@@ -6720,11 +6778,64 @@ PP_wrapped(pp_reverse, 0, 1)
            }
            up = SvPVX(TARG);
        }
        down = SvPVX(TARG) + len - 1;
        while (down > up) {
            const char tmp = *up;
            *up++ = *down;
            *down-- = tmp;
        STRLEN i = 0;
        STRLEN j = len;
        uint32_t u32_1, u32_2;
        uint16_t u16_1, u16_2;
        /* Reverse the buffer in place, in chunks where possible */
#ifdef HAS_QUAD
        uint64_t u64_1, u64_2;
        while (j - i >= 16) {
            memcpy(&u64_1, up + j - 8, 8);
            memcpy(&u64_2, up + i, 8);
            u64_1 = _swab_64_(u64_1);
            u64_2 = _swab_64_(u64_2);
            memcpy(up + j - 8, &u64_2, 8);
            memcpy(up + i, &u64_1, 8);
            i += 8;
            j -= 8;
        }

        if (j - i >= 8) {
            memcpy(&u32_1, up + j - 4, 4);
            memcpy(&u32_2, up + i, 4);
            u32_1 = _swab_32_(u32_1);
            u32_2 = _swab_32_(u32_2);
            memcpy(up + j - 4, &u32_2, 4);
            memcpy(up + i, &u32_1, 4);
            i += 4;
            j -= 4;
        }
#else
        while (j - i >= 8) {
            memcpy(&u32_1, up + j - 4, 4);
            memcpy(&u32_2, up + i, 4);
            u32_1 = _swab_32_(u32_1);
            u32_2 = _swab_32_(u32_2);
            memcpy(up + j - 4, &u32_2, 4);
            memcpy(up + i, &u32_1, 4);
            i += 4;
            j -= 4;
        }
#endif
        if (j - i >= 4) {
            memcpy(&u16_1, up + j - 2, 2);
            memcpy(&u16_2, up + i, 2);
            u16_1 = _swab_16_(u16_1);
            u16_2 = _swab_16_(u16_2);
            memcpy(up + j - 2, &u16_2, 2);
            memcpy(up + i, &u16_1, 2);
            i += 2;
            j -= 2;
        }

        /* Finally, swap any remaining bytes one-by-one. */
        while (i < j) {
            unsigned char tmp = up[i];
            up[i] = up[j - 1];
            up[j - 1] = tmp;
            i++;
            j--;
        }
    }
    (void)SvPOK_only_UTF8(TARG);
Use perl's native U32/U16 types. Less to type, less screen-space waste, and it is more obvious that this is P5P/CPAN code and not a "what's the license and author-credit requirements of this FOSS file? Is this an Affero GPL or GPL 3 file?" situation. If things "go wrong" on some CC/CPU/OS permutation, P5P can fix U32/I32 very easily from Configure or perl.h; fixing uint32_t might be hard to impossible to hook and replace. For example, Perl's U32/U64 can be hooked and overridden by P5P. Remember that Microsoft says C types that are 2 or 3 words long are satanic: long long int, long long double, etc., or unsigned long long long long jkjk. MSVC and its toolchain always call them __int8/__int16/__int32/__int64; see https://learn.microsoft.com/en-us/cpp/cpp/int8-int16-int32-int64?view=msvc-170

In any case, ISO C of the ~2000s/~2010s still lists uint32_t as vendor-optional; see https://en.cppreference.com/w/cpp/types/integer.html and https://stackoverflow.com/questions/24929184/can-i-rely-on-sizeofuint32-t-4

On the sarcasm side, how do you know your uint32_t isn't 5 or 8 bytes long because each uint32_t carries 4 extra bytes of Reed–Solomon coding? Yes, uint32_t only holds 0 to 4294967295, but sizeof(uint32_t) == 8, and Perl is running on a control-rod servo controller at a nuclear power plant.
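For illustration, the 64-bit arm of the loop rewritten with perl's own typedefs would look something like the fragment below (a hypothetical drop-in for the patch, assuming perl.h is in scope; whether MSVC is happy with these types is exactly the question that follows):

```c
/* Hypothetical variant of the patch's inner loop using perl.h's U64
 * typedef instead of <stdint.h> uint64_t; `up`, `i` and `j` are the
 * variables already declared in the patch. */
#ifdef HAS_QUAD
        U64 w1, w2;
        while (j - i >= 16) {
            memcpy(&w1, up + j - 8, 8);
            memcpy(&w2, up + i, 8);
            w1 = _swab_64_(w1);
            w2 = _swab_64_(w2);
            memcpy(up + j - 8, &w2, 8);
            memcpy(up + i, &w1, 8);
            i += 8;
            j -= 8;
        }
#endif
```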
MSVC didn't like the U32/U16 that I originally used. What's the best solution there?
Do you have a link to the bad GH runner log?
Fixed, see my patch at the bottom of this PR.