pp_reverse - chunk-at-a-time string reversal #23371
The performance characteristics of string reversal in blead are highly variable, depending upon the capabilities of the C compiler. Some compilers are able to vectorize some cases for better performance. This commit introduces explicit reversal and swapping of whole registers at a time, which all builds seem to be able to benefit from. The `_swab_xx_` macros for doing this already exist in perl.h; using them for this purpose was inspired by https://dev.to/wunk/fast-array-reversal-with-simd-j3p. The bit shifting done by these macros should be portable and reasonably performant if not optimised further, but it is likely that they will be optimised to bswap, rev, or movbe instructions.

Some performance comparisons:

1. Large string reversal, with different source & destination buffers

```
my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }
```

```
gcc blead:       2,388.30 msec task-clock   # 0.993 CPUs utilized
        10,574,195,388 cycles               # 4.427 GHz
        61,520,672,268 instructions         # 5.82 insn per cycle
        10,255,049,869 branches             # 4.294 G/sec

clang blead:       688.37 msec task-clock   # 0.946 CPUs utilized
         3,161,754,439 cycles               # 4.593 GHz
         8,986,420,860 instructions         # 2.84 insn per cycle
           324,734,391 branches             # 471.745 M/sec

gcc patched:       408.39 msec task-clock   # 0.936 CPUs utilized
         1,617,273,653 cycles               # 3.960 GHz
         6,422,991,675 instructions         # 3.97 insn per cycle
           644,856,283 branches             # 1.579 G/sec

clang patched:     397.61 msec task-clock   # 0.924 CPUs utilized
         1,655,838,316 cycles               # 4.165 GHz
         5,782,487,237 instructions         # 3.49 insn per cycle
           324,586,437 branches             # 816.350 M/sec
```

2. Large string reversal, but reversing the buffer in-place

```
my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse "foo",$x }
```

```
gcc blead:       6,038.06 msec task-clock   # 0.996 CPUs utilized
        27,109,273,840 cycles               # 4.490 GHz
        41,987,097,139 instructions         # 1.55 insn per cycle
         5,211,350,347 branches             # 863.083 M/sec

clang blead:     5,815.86 msec task-clock   # 0.995 CPUs utilized
        26,962,768,616 cycles               # 4.636 GHz
        47,111,208,664 instructions         # 1.75 insn per cycle
         5,211,117,921 branches             # 896.018 M/sec

gcc patched:     1,003.49 msec task-clock   # 0.999 CPUs utilized
         4,298,242,624 cycles               # 4.283 GHz
         7,387,822,303 instructions         # 1.72 insn per cycle
           725,892,855 branches             # 723.367 M/sec

clang patched:     970.78 msec task-clock   # 0.973 CPUs utilized
         4,436,489,695 cycles               # 4.570 GHz
         8,028,374,567 instructions         # 1.81 insn per cycle
           725,867,979 branches             # 747.713 M/sec
```

3. Short string reversal, different source & destination (checking performance on smaller string reversals - note: this one's very variable due to noise)

```
my $x = "1234567"; my $y; for (0..10_000_000) { $y = reverse $x }
```

```
gcc blead:         401.20 msec task-clock   # 0.916 CPUs utilized
         1,672,263,966 cycles               # 4.168 GHz
         5,564,078,603 instructions         # 3.33 insn per cycle
         1,250,983,219 branches             # 3.118 G/sec

clang blead:       380.58 msec task-clock   # 0.998 CPUs utilized
         1,615,634,265 cycles               # 4.245 GHz
         5,583,854,366 instructions         # 3.46 insn per cycle
         1,300,935,443 branches             # 3.418 G/sec

gcc patched:       381.62 msec task-clock   # 0.999 CPUs utilized
         1,566,807,988 cycles               # 4.106 GHz
         5,474,069,670 instructions         # 3.49 insn per cycle
         1,240,983,221 branches             # 3.252 G/sec

clang patched:     346.21 msec task-clock   # 0.999 CPUs utilized
         1,600,780,787 cycles               # 4.624 GHz
         5,493,773,623 instructions         # 3.43 insn per cycle
         1,270,915,076 branches             # 3.671 G/sec
```
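For readers who just want the shape of the change, the chunk-at-a-time idea described above reduces to something like the sketch below. This is illustrative only, not the patch: `bswap64` is a hand-rolled stand-in for perl.h's `_swab_64_` macro, and `reverse_copy` is a hypothetical helper covering only the distinct-buffer case (the patch also handles the in-place case benchmarked above).

```
#include <stdint.h>
#include <string.h>

/* Stand-in for perl.h's _swab_64_ macro: portable shifts/masks that
 * compilers typically lower to a single bswap/rev/movbe. */
static uint64_t bswap64(uint64_t v) {
    return ((v & 0x00000000000000ffULL) << 56) |
           ((v & 0x000000000000ff00ULL) << 40) |
           ((v & 0x0000000000ff0000ULL) << 24) |
           ((v & 0x00000000ff000000ULL) <<  8) |
           ((v & 0x000000ff00000000ULL) >>  8) |
           ((v & 0x0000ff0000000000ULL) >> 24) |
           ((v & 0x00ff000000000000ULL) >> 40) |
           ((v & 0xff00000000000000ULL) >> 56);
}

/* Reverse src into dst (distinct buffers), one 8-byte chunk from each end
 * per iteration; memcpy keeps the loads/stores free of alignment UB. */
static void reverse_copy(char *dst, const char *src, size_t len) {
    size_t i = 0, j = len;
    while (j - i >= 16) {
        uint64_t lo, hi;
        memcpy(&lo, src + i, 8);
        memcpy(&hi, src + j - 8, 8);
        lo = bswap64(lo);               /* reverse bytes inside each chunk */
        hi = bswap64(hi);
        memcpy(dst + j - 8, &lo, 8);    /* and swap the chunks' positions  */
        memcpy(dst + i, &hi, 8);
        i += 8;
        j -= 8;
    }
    while (i < j) {                     /* 0..15 leftover bytes in the middle */
        dst[i] = src[j - 1];
        dst[j - 1] = src[i];
        i++;
        j--;
    }
}
```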
I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.
Thanks, I'll look into wrapping this in architecture-based preprocessor guards.
My PR at #23330, by coincidence, is trying to ADD THE INTRINSICS to the Perl C API that THIS PR WANTS to use. But the ticket has stalled. 50% of the reason there is no progress is b/c input or a vote is needed on the best name/identifier to use for a new macro. The other 50% is claims that Perl 5 has no valid production-code reason to need to swap ASCII/binary bytes in the C-level VM at runtime.
Perl has had my long write-up of 25 years of apathy, #22886. Some random related links: https://www.nntp.perl.org/group/perl.perl5.porters/2010/03/msg157755.html I've been complaining since 2015 about this: https://www.nntp.perl.org/group/perl.perl5.porters/2015/11/msg232849.html #12565 My PRs to add atomics and UA got rejected, for what? To protect non-existent hardware that was melted down in Asia as eWaste 10-15 years ago. The needs of air-gapped Perl users are irrelevant; they are air-gapped and will never upgrade. Who here can take a selfie of themselves in 2025 with a post-2010-manufactured server or desktop that DOES NOT have HW UA mem read support? Alpha in 1999 already had hardware-supported UA mem reads as long as you use the appropriate CC typedecls/declspecs/attributes to pick the "special" unaligned instruction.
Which ones? I'm requesting .pdf links to the official architecture ISA specs that you are discussing.
SPARC, and earlier ARM (v5 or so)? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins). It also introduces more aliasing issues (e.g. …).
Even this isn't a great idea, because while the implementation is free to use unaligned access, that doesn't mean it's okay in C. If the compiler vectorises it, it may assume aligned access and then trap. That's happened a few times recently (subversion, numpy).
Unaligned access is undefined behaviour in C. This can be a problem in practice even on x86, when the compiler decides to use aligned SSE/AVX instructions, as the author of this blogpost found out. The usual workaround is to use memcpy. For some algorithms (probably not this one) it's also possible to skip the first few bytes of input until the pointer is aligned.
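A minimal illustration of the two spellings being contrasted here (hypothetical helper names, not code from the patch):

```
#include <stdint.h>
#include <string.h>

/* UB if p is not suitably aligned: may trap, or get auto-vectorised into
 * aligned loads that fault, as described above. */
static uint32_t load_u32_cast(const char *p) {
    return *(const uint32_t *)p;
}

/* Well-defined for any p: compilers lower this memcpy to a single unaligned
 * load where that is cheap, and to byte loads elsewhere. */
static uint32_t load_u32_memcpy(const char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```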
Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.
Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in the SPARC V7 Sept 1988 ISA. But I did find this in the SPARC V9 ISA, May 2002: … If P5P voted to remove Windows 2000/XP/Server 2003 from Perl, that rule also applies to all same-vintage Unix boxes from the National Museum of Technology and Science of whatever country you live in. ARM V5, or what I'd personally call an "ARM V4 CPU", is an obsolete eWaste/museum CPU. iPhone 1 and Android 1.0 introduced the ARM V7 ISA as the minimum platform. The last commercial use of the ARM v4 ISA was the ubiquitous ARM7TDMI CPU inside LG, Nokia, Motorola and Samsung 2G and 3G flip phones, and that ARM7TDMI is what ran your … I've been researching and creating screenshots for the ancient Alpha and ancient SPARC ISAs so far; my ARM info above is from memory, not fresh research. I don't remember if ARM v4 has a dedicated 1-op HW unaligned memory read/write, or some magical and fast-enough multi-opcode unaligned mem read/write asm recipe that is better and faster than … In any case, the 2 platforms @thesamesam mentioned have some non-ISO-C-Abstract-Machine, CC-vendor-specific "fast enough" C grammar token to do unaligned mem R/Ws. Someone just has to find it in that C compiler's man page. FSF LLC and GNU LLC's proprietary, CC-vendor-specific "fast enough" C grammar token for unaligned mem R/Ws has the ASCII string name of …; MSVC instead wants you to use …. If we are arguing about ISO C Abstract Virtual Machine CPU compliance: the C compiler published by GNU/FSF is …. Easy demonstration: add a …. Perl's only project scope is to run on modern enough hardware, not on the https://en.wikipedia.org/wiki/AN/FSQ-7_Combat_Direction_Central OS, or the ISO C Abstract Virtual OS, or https://en.wikipedia.org/wiki/Apple_II
I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.
Yeah, RISC CPUs are expected to, and it's normal for them to, throw an exception/sig handler each time you do a vanilla C machine-type (>1 byte) mem read/write on an unaligned addr. My argument is that all CCs have some magic special token/grammar way to safely and quickly do unaligned U16/U32/U64 mem reads and writes. So the end-user dev just has to write those special magical tokens/grammar things as needed in the C src code and the issue goes away.
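One concrete example of such a vendor-specific spelling, for GCC/Clang only and purely as a sketch (this is not what the PR uses): a packed, may_alias wrapper type makes a plain member access compile to whatever unaligned-load sequence the target needs.

```
#include <stdint.h>

/* GCC/Clang-specific: packed drops the field's alignment requirement to 1,
 * may_alias sidesteps strict-aliasing complaints, so a plain member access
 * compiles to a safe unaligned load on any target. */
typedef struct {
    uint32_t v;
} __attribute__((packed, may_alias)) unaligned_u32;

static uint32_t load_u32_unaligned(const void *p) {
    return ((const unaligned_u32 *)p)->v;
}
```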
I vaguely remember from ARMv4/WinCE Perl: if someone does an unaligned U16 or U32 mem read on ARMv4, you get a free shift and a free mask as a gift from ARMv4. You don't get a SEGV on ARMv4 from UA mem reads. I think the reason (not going to google it) was that this was how ARMv4 wanted you to do UA mem reads.
Not safely and correctly SEGVing like I assumed WinCE Perl on ARM would, but instead finding a junk integer in my MSVC watch window while step-debugging in the VS IDE, was a learning experience for me.
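A rough C model of that pre-ARMv6 behaviour, as I understand it (hypothetical function, for illustration only): an `LDR` from an unaligned address returns the aligned word rotated by the low address bits, which is exactly the kind of junk integer described above.

```
#include <stdint.h>
#include <string.h>

/* Model of LDR from an unaligned address on ARMv4/ARMv5: the aligned word
 * containing the address is loaded, then rotated right by 8 * (addr & 3)
 * bits, which is not the byte sequence a C programmer expects. */
static uint32_t armv4_ldr_model(const uint8_t *mem, uint32_t addr) {
    uint32_t aligned, rot = (addr & 3u) * 8u;
    memcpy(&aligned, mem + (addr & ~3u), 4);   /* the word the bus fetches */
    return rot ? (aligned >> rot) | (aligned << (32u - rot)) : aligned;
}
```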
@bulk88 makes note to self to finally turn on intrinsic …
A bunch of links on the topic. Since C-language UA mem reads/writes have been a Day 1 Win32 platform requirement, all RISC WinNT kernels ever compiled will trap, emulate and resume the CPU's … A bunch of links; the short summary is: go read your C compiler's docs on how to write your C code correctly using VENDOR SPECIFIC C syntax. https://devblogs.microsoft.com/oldnewthing/20170814-00/?p=96806

FWIW, here is a list of instructions my Win 7 x64 kernel is capable of emulating. IDK why this emulation code exists or why this switch ladder exists in my Win 7 x64 kernel; I don't have a reason to need to know this. Remember, this is closed-source SW. But there must be bizarre, very rare cases where, for whatever reason, the Win 7 x64 kernel needs to silently fix up the CPU hardware fault/signal/interrupt/exception/event handler, whatever u call it. Update: this instruction list may or may not have something to do with an x64 CPU executing temporarily in 16-bit Real Mode, probably for WinNT's Win16 emulator.
```
XmAaaOp
XmAadOp
XmAamOp
XmAasOp
XmAccumImmediate
XmAccumRegister
XmAdcOp
XmAddOp
XmAddOperands
XmBitScanGeneral
XmBoundOp
XmBsfOp
XmBsrOp
XmBswapOp
XmBtOp
XmBtcOp
XmBtrOp
XmBtsOp
XmByteImmediate
XmCallOp
XmCbwOp
XmClcOp
XmCldOp
XmCliOp
XmCmcOp
XmCmpsOp
XmCmpxchgOp
XmCompareOperands
XmCwdOp
XmDaaOp
XmDasOp
XmDecOp
XmDivOp
XmEffectiveOffset
XmEmulateStream
XmEnterOp
XmEvaluateAddressSpecifier
XmEvaluateIndexSpecifier
XmExecuteInt1a
XmFlagsRegister
XmGeneralBitOffset
XmGeneralRegister
XmGetCodeByte
XmGetImmediateSourceValue
XmGetLongImmediate
XmGetOffsetAddress
XmGetStringAddress
XmGetWordImmediate
XmGroup1General
XmGroup1Immediate
XmGroup2By1
XmGroup2ByByte
XmGroup2ByCL
XmGroup3General
XmGroup45General
XmGroup7General
XmGroup8BitOffset
XmHltOp
XmIdivOp
XmIllOp
XmImmediateEnter
XmImmediateJump
XmImulImmediate
XmImulOp
XmImulxOp
XmInOp
XmIncOp
XmInsOp
XmInt1aFindPciClassCode
XmInt1aFindPciDevice
XmInt1aReadConfigRegister
XmInt1aWriteConfigRegister
XmIntOp
XmIretOp
XmJcxzOp
XmJmpOp
XmJxxOp
XmLahfOp
XmLeaveOp
XmLoadSegment
XmLodsOp
XmLongJump
XmLoopOp
XmMovOp
XmMoveGeneral
XmMoveImmediate
XmMoveRegImmediate
XmMoveSegment
XmMoveXxGeneral
XmMovsOp
XmMulOp
XmNegOp
XmNotOp
XmOpcodeEscape
XmOpcodeRegister
XmOrOp
XmOutOp
XmOutsOp
XmPopGeneral
XmPopOp
XmPopStack
XmPopaOp
XmPortDX
XmPortImmediate
XmPrefixOpcode
XmPushImmediate
XmPushOp
XmPushPopSegment
XmPushStack
XmPushaOp
XmRclOp
XmRcrOp
XmRdtscOp
XmRetOp
XmRolOp
XmRorOp
XmSahfOp
XmSarOp
XmSbbOp
XmScasOp
XmSegmentOffset
XmSetDataType
XmSetLogicalResult
XmSetccByte
XmShiftDouble
XmShlOp
XmShldOp
XmShortJump
XmShrOp
XmShrdOp
XmSmswOp
XmStcOp
XmStdOp
XmStiOp
XmStosOp
XmStringOperands
XmSubOp
XmSubOperands
XmSxxOp
XmTestOp
XmXaddOp
XmXchgOp
XmXlatOpcode
XmXorOp
```
I will also add: Perl/P5P devs might be able, in C, in certain permutations of CC, OS and CPU, to outsmart a C compiler's native UA mem read/write "best practices" algorithm. The engineering question is: is it legal to use 2 aligned U32 mem reads to read an unaligned U32? What if an unknown, unnamed CPU has a hardware MMU virtual-memory page size of 1 byte? You can get one of these CPUs with a 1-byte-page-size MMU right now for free at https://github.com/google/sanitizers/wiki/addresssanitizer or https://en.wikipedia.org/wiki/Valgrind. In real life, as long as you aren't over-reading at the very end of a malloc block at address 4096-1 or 4096-2, 2 x U32 aligned reads are harmless. malloc() won't hand out a 4- or 5-byte aligned memory block pressed against a 4096 boundary when you execute …. IDK of a real-world …. But someone should give the real reverse-engineered meaning of this OSX API doc; read the section "Allocating Large Memory Blocks using Malloc" at https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html#//apple_ref/doc/uid/20001881-99765 But regarding the perl C API, remember that a fictional …
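For concreteness, the "2 aligned U32 reads" trick being debated looks roughly like the little-endian sketch below (hypothetical helper; it deliberately over-reads up to 3 bytes past the value, which is exactly what ASan/Valgrind would flag, and the casts still raise the aliasing questions mentioned earlier):

```
#include <stdint.h>

/* Read an unaligned 32-bit value at p using only 4-byte-aligned loads:
 * load the two aligned words that straddle it and stitch them together. */
static uint32_t load_u32_two_aligned(const unsigned char *p) {
    uintptr_t addr    = (uintptr_t)p;
    const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3);
    unsigned shift    = (unsigned)(addr & 3) * 8;
    if (shift == 0)                      /* already aligned: one load is enough */
        return w[0];
    return (w[0] >> shift) | (w[1] << (32 - shift));   /* may read past p+3 */
}
```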
Here is a reason why …. Here is a list of all call sites to …. A couple of calls in 1 func are fine, but anything over 5 calls points to big problems with P5P macros and C code; example of 1 func: …
I know someone assumed GCC's inline …. Full list of all call sites to …:
```
PerlIO_find_layer+6C
Perl_Gv_AMupdate+2AF
Perl_Gv_AMupdate+2D4
Perl__invlistEQ+BB
Perl__to_utf8_fold_flags+1EF
Perl__to_utf8_fold_flags+242
Perl_allocmy+25C
Perl_ck_method+12B
Perl_cv_ckproto_len_flags+183
Perl_do_uniprop_match+B5
Perl_do_uniprop_match+DB
Perl_fbm_instr+15D
Perl_fbm_instr+184
Perl_fbm_instr+245
Perl_fbm_instr+DE
Perl_foldEQ_utf8_flags+30B
Perl_get_and_check_backslash_N_name+26E
Perl_grok_number_flags+437
Perl_grok_numeric_radix+1D0
Perl_gv_add_by_type+8B
Perl_gv_autoload_pvn+53
Perl_gv_fetchmeth_pvn_autoload+6C
Perl_gv_fetchmethod_pvn_flags+25F
Perl_gv_fetchmethod_pvn_flags+D7
Perl_gv_fetchpvn_flags+25E
Perl_gv_fullname4+BB
Perl_gv_init_pvn+2E6
Perl_gv_setref+51C
Perl_hv_common+A80
Perl_hv_ename_add+1A7
Perl_hv_ename_add+D1
Perl_hv_ename_delete+16B
Perl_hv_ename_delete+1BF
Perl_hv_ename_delete+AA
Perl_is_utf8_FF_helper_+A8
Perl_leave_scope+875
Perl_magic_sethook+7B
Perl_magic_sethook+AA
Perl_magic_setsig+2BA
Perl_magic_setsig+2EC
Perl_magic_setsig+93
Perl_magic_setsig+DE
Perl_mro_isa_changed_in+1E9
Perl_mro_method_changed_in+170
Perl_mro_package_moved+23C
Perl_mro_package_moved+FF
Perl_ninstr+78
Perl_pad_findmy_pvn+C2
Perl_pp_gelem+122
Perl_pp_gelem+1A1
Perl_pp_gelem+1CB
Perl_pp_gelem+1FD
Perl_pp_gelem+22F
Perl_pp_gelem+257
Perl_pp_gelem+27D
Perl_pp_gelem+2A4
Perl_pp_gelem+2FB
Perl_pp_gelem+D1
Perl_pp_lc+465
Perl_pp_prototype+64
Perl_pp_ucfirst+30D
Perl_pp_ucfirst+6EB
Perl_re_intuit_start+22D
Perl_re_op_compile+50A
Perl_refcounted_he_chain_2hv+C1
Perl_refcounted_he_fetch_pvn+E9
Perl_regexec_flags+A93
Perl_regexec_flags+AB9
Perl_rninstr+90
Perl_save_gp+AE
Perl_scan_str+629
Perl_scan_str+660
Perl_scan_str+696
Perl_scan_str+6E4
Perl_sv_cmp_flags+159
Perl_sv_cmp_locale_flags+B6
Perl_sv_eq_flags+1A7
Perl_sv_gets+605
Perl_sv_gets+7F7
Perl_upg_version+1E1
Perl_whichsig_pvn+47
Perl_xs_handshake+119
S_apply_builtin_cv_attribute+45
S_apply_builtin_cv_attribute+69
S_apply_builtin_cv_attribute+92
S_do_chomp+314
S_dofindlabel+2AE
S_doopen_pm+D6
S_dopoptolabel+14D
S_find_in_my_stash+37
S_force_word+1D4
S_glob_assign_glob+23F
S_gv_fetchmeth_internal+269
S_gv_fetchmeth_internal+387
S_gv_fetchmeth_internal+3EC
S_gv_magicalize+130
S_gv_magicalize+1A0
S_gv_magicalize+1E7
S_gv_magicalize+203
S_gv_magicalize+224
S_gv_magicalize+244
S_gv_magicalize+2DD
S_gv_magicalize+30A
S_gv_magicalize+32B
S_gv_magicalize+37B
S_gv_magicalize+3E3
S_gv_magicalize+40D
S_gv_magicalize+435
S_gv_magicalize+456
S_gv_magicalize+4DE
S_gv_magicalize+521
S_gv_magicalize+5AF
S_gv_magicalize+5EB
S_gv_magicalize+60C
S_gv_magicalize+65E
S_gv_magicalize+698
S_gv_magicalize+6C8
S_gv_magicalize+79F
S_gv_magicalize+8D4
S_handle_possible_posix+1013
S_handle_possible_posix+1049
S_handle_possible_posix+107B
S_handle_possible_posix+10A8
S_handle_possible_posix+10D5
S_handle_possible_posix+1102
S_handle_possible_posix+112F
S_handle_possible_posix+1155
S_handle_possible_posix+1176
S_handle_possible_posix+1199
S_handle_possible_posix+F3C
S_handle_possible_posix+FB5
S_handle_possible_posix+FD0
S_handle_possible_posix+FEB
S_hv_delete_common+36A
S_hv_delete_common+4D8
S_hv_delete_common+53E
S_incline+F6
S_is_codeset_name_UTF8+3A
S_is_codeset_name_UTF8+78
S_is_codeset_name_UTF8+8F
S_is_utf8_overlong+67
S_magic_sethint_feature+11A
S_magic_sethint_feature+14E
S_magic_sethint_feature+180
S_magic_sethint_feature+1AC
S_magic_sethint_feature+1DB
S_magic_sethint_feature+20D
S_magic_sethint_feature+23F
S_magic_sethint_feature+26E
S_magic_sethint_feature+29F
S_magic_sethint_feature+311
S_magic_sethint_feature+341
S_magic_sethint_feature+375
S_magic_sethint_feature+3A9
S_magic_sethint_feature+3D7
S_magic_sethint_feature+402
S_magic_sethint_feature+42E
S_magic_sethint_feature+45E
S_magic_sethint_feature+492
S_magic_sethint_feature+4C6
S_magic_sethint_feature+4FA
S_magic_sethint_feature+526
S_magic_sethint_feature+552
S_magic_sethint_feature+582
S_magic_sethint_feature+5B2
S_magic_sethint_feature+5DC
S_magic_sethint_feature+77
S_magic_sethint_feature+E9
S_maybe_add_coresub+5F6
S_mayberelocate+107
S_mayberelocate+50
S_move_proto_attr+21C
S_move_proto_attr+CA
S_pad_check_dup+1F1
S_pad_check_dup+B7
S_pad_findlex+DE
S_parse_LC_ALL_string+14F
S_parse_uniprop_string+1091
S_parse_uniprop_string+10B5
S_parse_uniprop_string+178
S_parse_uniprop_string+1942
S_parse_uniprop_string+31A
S_parse_uniprop_string+41B
S_parse_uniprop_string+440
S_parse_uniprop_string+548
S_parse_uniprop_string+566
S_parse_uniprop_string+B6D
S_parse_uniprop_string+BAF
S_parse_uniprop_string+BCD
S_parse_uniprop_string+BF5
S_parse_uniprop_string+C0E
S_parse_uniprop_string+C29
S_parse_uniprop_string+C87
S_parse_uniprop_string+CE2
S_parse_uniprop_string+D2F
S_parse_uniprop_string+D8A
S_parse_uniprop_string+E07
S_parse_uniprop_string+E77
S_reg+100D
S_reg+1038
S_reg+1063
S_reg+1093
S_reg+10C0
S_reg+10E7
S_reg+110E
S_reg+11C0
S_reg+4F1
S_reg+523
S_reg+550
S_reg+576
S_reg+5A7
S_reg+5CE
S_reg+64D
S_reg+66E
S_reg+699
S_reg+6B6
S_reg+6E1
S_reg+706
S_reg+72C
S_reg+749
S_reg+77E
S_reg+79F
S_reg+7CA
S_reg+7EB
S_reg+94C
S_reg+A4F
S_reg+FE2
S_regatom+8AF
S_regclass+16D7
S_regmatch+15A2
S_regmatch+2AE7
S_regmatch+3568
S_regrepeat+159E
S_regrepeat+1674
S_regrepeat+79C
S_require_file+13C7
S_require_file+1C0B
S_require_file+1C41
S_require_file+6BE
S_scan_ident+2FA
S_setup_EXACTISH_ST+1006
S_setup_EXACTISH_ST+108F
S_setup_EXACTISH_ST+108F
S_setup_EXACTISH_ST+10B0
S_setup_EXACTISH_ST+10B0
S_setup_EXACTISH_ST+2123
S_setup_EXACTISH_ST+FC0
S_share_hek_flags+88
S_swallow_bom+D0
S_turkic_fc+43
S_turkic_lc+50
S_turkic_uc+43
S_unpack_rec+1C5B
S_unshare_hek_or_pvn+10C
XS_NamedCapture_TIEHASH+DD
XS_PerlIO_get_layers+14C
XS_PerlIO_get_layers+1A8
XS_PerlIO_get_layers+F2
hek_eq_pvn_flags+7B
set_w32_module_name+B3
yyl_do+B4
yyl_fake_eof+159
yyl_foreach+1AE
yyl_foreach+1EF
yyl_foreach+320
yyl_foreach+38F
yyl_hyphen+CB
yyl_interpcasemod+72
yyl_interpcasemod+A3
yyl_keylookup+109
yyl_leftcurly+7E3
yyl_my+2DC
yyl_my+317
yyl_slash+114
yyl_try+860
yyl_try+9EB
yyl_try+AB3
yyl_try+B19
```
force-pushed from eb866fe to 8a7804b
I don't believe it's possible to write production-grade C (not C++) SW without …. Either don't use GCC's -O3 / -O4 flag (-O2??? -O1???), or understand you have to live with adding the …. If you remove …
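Assuming the elided flag here is about strict aliasing (my guess from context), the classic kind of type punning it affects looks like this; a standalone demo, unrelated to the patch itself:

```
#include <stdio.h>

/* With strict aliasing (the default at -O2/-O3), the compiler may assume an
 * int* and a float* never point at the same object, so it can cache *i
 * across the store through f and return the stale value. */
static int alias_demo(int *i, float *f) {
    *i = 1;
    *f = 2.0f;      /* may actually overwrite *i if the pointers alias */
    return *i;      /* legally foldable to 1 under strict aliasing */
}

int main(void) {
    int x = 0;
    /* UB: the same object accessed as int and float.  Typically prints 1
     * with optimisation on, and the bit pattern of 2.0f without. */
    printf("%d\n", alias_demo(&x, (float *)&x));
    return 0;
}
```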
```
u64_2 = _swab_64_(u64_2);
memcpy(outp + j - 8, &u64_2, 8);
memcpy(outp + i, &u64_1, 8);
i += 8;
```
Why not get rid of the `memcpy()`s and put in the textbook reference implementation:
```
/* textbook alignment prologue (sketch): branch on the pointer's low bits
 * and copy the leading bytes one at a time until ptr is word-aligned;
 * the per-case byte copies (and cases 7..4 for 8-byte alignment) are
 * left out of this sketch */
switch( ((size_t)ptr) & 0x7) {
case 3:
case 2:
case 1:
case 0:
    break;
}
/* NOW START HIGH ITERATION U64 LOOP */
```
The above is the universal university comp-sci-student textbook implementation of `memcpy()` and `memcmp()`. All libcs from all authors from all OSes do the above impl, unless the Foo LibC/OS authors care enough to actually optimize their toy weekend hobby OS/libc for production, profit-making desktop or server usage to put food on the table.
If performance is the #1 highest priority, this code will fail a benchmark vs a revision with that mask-off and switch(), since this code permanently keeps doing HW unaligned reads/writes through the entire inner loop, whether the SVPV* string is 16 bytes, 71 bytes, 7103 bytes, or 70 MBs + 5 bytes long.
I'll bet doing native i386 HW unaligned U64 math in a loop for 32 MBs on a 2020s Core i-series will always lose a benchmark vs the same code doing aligned U64 math in a loop for 32 MBs on the same CPU. My guess is the diff would be 5% to max 30% loss doing all 32 MBs with UA read/write on a brand new Intel Core. It's a small loss, nothing serious, but if you expect typically huge data strings to chug through, that are orders of magnitude larger than the C function's own machine code, then yeah, put in the 1/2 screen of new src code logic and 3-7 conditional branches to align that pointer.
update:
The above algo might not work unless the total string length is divisible by U16/U32/U64.

RA32 = read aligned 32
WU32 = write unaligned 32

```
1 2 3 4 5 6 7 8 9
RA32 RA32 RA32 RA32
WU32 WU32 WU32 WU32
```

Once the far end is unaligned, all reads and writes on the tail have to stay unaligned, because you will make bytes disappear trying to snap the tail end to become aligned. The other solution is a barrel-roll/shift-roller bitwise math op that +8s the roll on every tail-side U32 read or write, or in other words +8s the shift rollover factor on every iteration of the loop, on the tail/far side only.
There might be a 99% guarantee that the near/starting side of any SvPVX() ptr is aligned to 8/16, in the Perl C API only; `OOK_on` and `SvLEN()==0` are very rare.
Even if we copied the entire string to align it - adding an extra full copy if the source & destination buffers are different - how would we avoid an unaligned read/write at the tail end when the string isn't really conveniently sized?
I'm an auto mechanic, not a blacksmith. I can change your transmission. I can't make you a new one with a nail file.
Read this algo I invented in a fixed-width IDE so the columns line up. I don't like this algo, but I'm not a bitwise-math PhD; I know enough to diagnose and debug the algos if needed. But Google offers a lot better algos than I can invent from my head.
```
1 2 3 4 5 6 7 8 9 A B C D E
RA32A RA32B RA32C RA32D
WU32D WU32C WU32B WU32A
```
or this algo
```
32D 32C 32B 32A
>> 8
32C 32B 32A 32ZeroUB
do |= op, protect UB byte
WA32C WA32B WA32A WA32ZeroUB
(32D 32C 32B 32A) >> (8 * 3)
0 0 0 32Dold
!!!!!!DONE!!!!! NEXT CYCLE
RA32A RA32B RA32C RA32D
32D 32C 32B 32A
>> 8
== 32C 32B 32A 0
0 0 0 32Dold
|=
32C 32B 32A 32Dold
WA32C WA32B WA32A WA32Dold
(32D 32C 32B 32A) >> (8 * 3)
0 0 0 32Dold2
!!!!!!DONE!!!!! NEXT CYCLE
```
My guess is all this bit-math algebra will be slower than letting an x86 or ARM CPU do its native UA mem R/Ws. Feeding 4-9 opcodes into the front-end decoder/compiler of the CPU can't be faster than feeding 1 opcode into the front-end compiler of the CPU and letting the CPU secretly, internally expand the 1 public-API opcode to the 4-10 chop/slice/slide/glue microfusion private-API opcodes needed to do a UA memory read.
From the last time I had a UA-mem-vs-Perl design thread a few years ago, from reading the Verilog src code of a FOSS 32-bit ARM v4 CPU on GitHub: U32 UA memory reads are penalty-free if contained to the same 128-bit/16-byte cache line. The least significant bits of the mem addr in the load/store unit decide, in hardwired TTL/CMOS transistors, whether it's the same cache line or 2 cache lines.
If it's the same cache line, a 0-overhead barrel rotate is done to extract the U32 from the U128 L1 cache line. All aligned U32s have to do the 0-overhead barrel rotate anyway, since 3 out of 4 aligned U32s are actually unaligned memory reads @U128.
If the UA U32 read crosses a 16-byte cache line, the load/store unit injects a pause instruction into the ARM pipeline (a 1-cycle delay, which is basically nothing), since now 2 different L1 U128 registers need to be read and transferred to the load/store unit.
UA U32 writes have no runtime overhead in the FOSS CPU. The splicing of the UA U32 into memory is done asynchronously in the write-back L1 cache/L1 write-combining unit. The new unaligned U32 value to store has already left/exited/retired from the ARM CPU pipeline. The UA splicing into the L1 U128 registers is basically free. The FOSS ARM CPU's pipeline is 4 stages. The design of this FOSS 32-bit ARM CPU is such that all the L1 U128 registers are actually aliased/pinned/renamed/mapped to the public-API ARM general purpose registers in the instruction decoder/scheduler. You can't do 128-bit integer math with an ARM32 CPU, but this CPU's HW actually only has 128-bit registers; the types `byte`, `short` and 32-bit `int`/`size_t`/`void *` are emulated to meet the requirements of ARMv4 asm lang.
Because of the 4 stages, and the assignment/decoding of ARM asm GPRs to HW regs, the L1 register is locked at stage 1 or stage 2, and unlocked at stage 4/5. Because this is a RISC CPU, the CPU instruction decoder is always looking 1 instruction ahead of the current Instruction Pointer for a memory load opcode; that is when the L1 U128 lock takes place. By the time of the commit/store opcode, the CPU is -1 ahead of the next read/modify/write cycle to the same address. If there is an unaligned-@U128, aligned-@U32 value to write, it's not a conflict, because the last read/modify/write cycle already unlocked the U128 L1 register, allowing the splicing of the U32 into a U128 to take place.
If an aligned-@U8, truly unaligned U32 value crosses a 16-byte L1 cache line, that will cause the store opcode at stage 4/5 to issue a lock on the 2nd U128. If 0-3 cycles later, at stage 1, the instruction decoder does a -1 read-ahead looking for a load instruction, and that load instruction conflicts with the lock on the 2nd U128, the secret -1 read-ahead load instruction becomes a "do as written in ARM asm code by the developer" load instruction, and therefore isn't really a block or conflict.
Remember, all U8/U16/U32 writes are unaligned @U128, and therefore must get spliced (aka `&= ~`; `|=`) into a U128 register in the write-back cache.
In a real-life production CPU, if I google it, nobody can make a benchmark showing any speed difference between SSE's `movaps` and `movups` instructions for aligned addresses and L1/L2-cached memory addresses (aka the C stack or tiny malloc blocks <= 128 bytes).
Crossing a 4096 boundary is a different case: if you trip the OS page-fault handler, or make the CPU's MMU fetch a piece of the page table from a DDR stick, you get what you deserve.
So this FOSS ARM CPU's L1 cache, L2 cache, and DDR wires only work with aligned U128 units; U32s are unaligned memory accesses. In commercial CPUs, 32-bit and 64-bit integers don't exist if you have DDR2/DDR3/DDR4. Sorry, pedantic ISO C gurus: your DDR4 memory sticks only work with aligned U512 / 64-byte integers. DDR2 is a 32-byte MTU, DDR3 is a 64-byte MTU, DDR4 is also a 64-byte MTU.
Back to this `pp_reverse()` algorithm: doing this bit-math algebra I described above with SSE or AVX-256 or NEON registers might be faster than using x86/ARM native UA mem R/Ws and the x86/ARM GPRs. But that is beyond my skill set. IDK how to read SSE code. Reading Spanish, French, German, Polish, Czech is easier for me as an American than SSE code, and for the first 3 langs I have zero educational background to be able to read them.
Writing my bit-math algebra algo for `pp_reverse()` in SSE probably involves the keywords "shuffle", "saturate", "packed", "immediate", and IDK what SSE's buzzword or euphemism for i386's EFLAGS pseudo-register is: "immediate"? "status word"? "packed control"? I'm making things up now.
I don't see the point of spending the effort and trying to write this pp_reverse algo to use aligned R/W mem ops + SSE/NEON. If someone else has already invented the src code and an example/demo of how to do pp_reverse on unaligned U8 strings using aligned SSE/NEON opcodes, go ahead, I have no objection to copy-pasting that code into Perl's pp_reverse. But I don't see the effort/profit ratio to invent this algo in Perl 5 for the 1st time in the history of Comp Sci. Perl doesn't need to be the first to fly to Mars. Let another prog lang/SW VM with unlimited funding from a FAANG do it first.
Use the retval of `SvGROW()` as intended and save it to a C auto. Then DIY your own SvEND, skipping a couple of redundant mem reads to the struct member slot behind SvPVX().
edit: let's see what gh renders this as
update: failed, idk how to force gh to show a code sample inline in a comment without doing the three-apostrophe code block escape and copy-pasting with a mouse, then 3 more apostrophes.
```
STRLEN i = 0;
STRLEN j = len;
uint32_t u32_1, u32_2;
uint16_t u16_1, u16_2;
```
Use perl's native `U32`/`U16` types. Less to type, less screen-space waste, more obvious this is P5P/CPAN code and not a "what's the license and author-credit requirements of this FOSS file? Is this an Affero GPL or GPL 3 file?". If things "go wrong" on some CC/CPU/OS permutation, P5P can fix `U32`/`I32` very easily from `Configure` or `perl.h`. Fixing `uint32_t` might be hard to impossible to hook and replace.
Example: Perl's U32/U64 can be hooked and overridden by P5P, b/c remember, Microsoft says 2- and 3-word-long C types are satanic: `long long int`, `long long double`, etc., or `unsigned long long long long`, jkjk. MSVC/its toolchain always calls it `__int8163264128`, see https://learn.microsoft.com/en-us/cpp/cpp/int8-int16-int32-int64?view=msvc-170
In any case, ISO C of the ~2000s/~2010s still has your `uint32_t` listed as vendor-optional. See https://en.cppreference.com/w/cpp/types/integer.html
https://stackoverflow.com/questions/24929184/can-i-rely-on-sizeofuint32-t-4
On the sarcasm side: how do you know your `uint32_t` isn't 5 or 8 bytes long because each `uint32_t` has 4 extra bytes of Reed–Solomon coding? Yes, `uint32_t` only holds 0 to 4294967295, but `sizeof(uint32_t) == 8`. Perl is running on a control-rod servo controller at a nuclear power plant.
MSVC didn't like the U32/U16 that I originally used. What's the best solution there?
> MSVC didn't like the U32/U16 that I originally used. What's the best solution there?
Do you have a link to the bad GH runner log?
Fixed, see my patch at the bottom of this PR.
I'll also add: byte-order swapping, more specifically swapping the byte order of a 64-bit variable aka htonll()/ntohll(), was a patented technology until 2022, and perl has been doing some low-level SW piracy for decades at https://github.com/Perl/perl5/blob/blead/perl.h#L4601; see this patent by Intel and compare it to Perl's math operators above: https://patents.google.com/patent/US20040010676A1/en cough cough pro- or anti-patent troll cough cough
MSVC fix branch pushed at #23374. The reason for all the MSVC-specific code: here is the very poor code gen, round 1, after getting the syntax errors fixed: …
Here is round 2, after turning on inline memcpy, still bad code gen: …
After all of my tweaks it looks perfect, or identical to whatever GCC and Clang would emit: …