memset is terrible and not performant likely (for nonzeroing path) #265

Open
cartazio opened this issue Mar 9, 2020 · 20 comments
Labels: bug, enhancement, task (Something that needs to be done, not a bug fix)

Comments

@cartazio
Contributor

cartazio commented Mar 9, 2020

It falls back to a scalar C loop when the fill value is not zero, which seems disappointing. I'm also not sure whether it correctly handles -0 in the floating-point code.
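For context, the non-zero fill path ends up as a plain element-at-a-time loop. A minimal Haskell sketch of that shape (illustrative only; the actual fallback lives in C, and the names below are made up):

import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Primitive.ByteArray (MutableByteArray, writeByteArray)
import Data.Word (Word8)

-- Sketch of a scalar fill: one write per element, no unrolling and no
-- wide stores. This is the shape of the fallback being criticized above.
scalarFill :: PrimMonad m => MutableByteArray (PrimState m) -> Int -> Int -> Word8 -> m ()
scalarFill arr off len x = go off
  where
    end = off + len
    go i
      | i >= end  = return ()
      | otherwise = writeByteArray arr i x >> go (i + 1)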

@cartazio added the bug, enhancement, and task labels on Mar 9, 2020
@cartazio added this to the 0.7.1 milestone on Mar 9, 2020
@cartazio
Contributor Author

cartazio commented Mar 9, 2020

this is gonna be some fun hackery to do a sane portable performant thingy :)

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

@bgamari under what circumstances does the GHC code gen do something smart with memset/memmove? As best I can recall, the only instances are internally generated ones and via the LLVM codegen?

@andrewthad
Collaborator

GHC has a primop for memset (setByteArray#) that is not currently used by primitive. There is a PR that switches to using this. In GHC 8.10, GHC is going to become smarter about unrolling this when the size is known and small.
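For reference, a minimal sketch of what calling that primop directly looks like (illustrative names, not primitive's actual API; requires MagicHash and UnboxedTuples):

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int (..), MutableByteArray#, RealWorld, setByteArray#)
import GHC.IO (IO (..))

-- Fill bytes [off, off + len) of the array with the byte c, going through
-- setByteArray# so GHC can unroll small, statically known sizes.
fillViaPrimop :: MutableByteArray# RealWorld -> Int -> Int -> Int -> IO ()
fillViaPrimop mba (I# off) (I# len) (I# c) =
  IO (\s -> (# setByteArray# mba off len c s, () #))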

@andrewthad
Collaborator

Original issue on GHC is here and Artem's work that resolved the issue is here.

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

I’ll look at how ghc does that primop and figure out what the different choices are. Thanks for helping out :)

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

ok, so
https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/GHC/StgToCmm/Prim.hs#L2480-2496

-- ----------------------------------------------------------------------------
-- Setting byte arrays

-- | Takes a 'MutableByteArray#', an offset into the array, a length,
-- and a byte, and sets each of the selected bytes in the array to the
-- character.
doSetByteArrayOp :: CmmExpr -> CmmExpr -> CmmExpr -> CmmExpr
                 -> FCode ()
doSetByteArrayOp ba off len c = do
    dflags <- getDynFlags

    let byteArrayAlignment = wordAlignment dflags -- known since BA is allocated on heap
        offsetAlignment = cmmExprAlignment off
        align = min byteArrayAlignment offsetAlignment

    p <- assignTempE $ cmmOffsetExpr dflags (cmmOffsetB dflags ba (arrWordsHdrSize dflags)) off
    emitMemsetCall p c len align

and emitMemsetCall is defined as

-- | Emit a call to @memset@.  The second argument must fit inside an
-- unsigned char.
emitMemsetCall :: CmmExpr -> CmmExpr -> CmmExpr -> Alignment -> FCode ()
emitMemsetCall dst c n align = do
    emitPrimCall
        [ {- no results -} ]
        (MO_Memset (alignmentBytes align))
        [ dst, c, n ]

then for known stuff we get

https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/nativeGen/X86/CodeGen.hs#L2222-2293

genCCall' dflags _ (PrimTarget (MO_Memset align)) _
         [dst,
          CmmLit (CmmInt c _),
          CmmLit (CmmInt n _)]
         _
    | fromInteger insns <= maxInlineMemsetInsns dflags = do
        code_dst <- getAnyReg dst
        dst_r <- getNewRegNat format
        if format == II64 && n >= 8 then do
          code_imm8byte <- getAnyReg (CmmLit (CmmInt c8 W64))
          imm8byte_r <- getNewRegNat II64
          return $ code_dst dst_r `appOL`
                   code_imm8byte imm8byte_r `appOL`
                   go8 dst_r imm8byte_r (fromInteger n)
        else
          return $ code_dst dst_r `appOL`
                   go4 dst_r (fromInteger n)
  where
    maxAlignment = wordAlignment dflags -- only machine word wide MOVs are supported
    effectiveAlignment = min (alignmentOf align) maxAlignment
    format = intFormat . widthFromBytes $ alignmentBytes effectiveAlignment
    c2 = c `shiftL` 8 .|. c
    c4 = c2 `shiftL` 16 .|. c2
    c8 = c4 `shiftL` 32 .|. c4

    -- The number of instructions we will generate (approx). We need 1
    -- instructions per move.
    insns = (n + sizeBytes - 1) `div` sizeBytes

    -- The size of each move, in bytes.
    sizeBytes :: Integer
    sizeBytes = fromIntegral (formatInBytes format)

    -- Depending on size returns the widest MOV instruction and its
    -- width.
    gen4 :: AddrMode -> Integer -> (InstrBlock, Integer)
    gen4 addr size
        | size >= 4 =
            (unitOL (MOV II32 (OpImm (ImmInteger c4)) (OpAddr addr)), 4)
        | size >= 2 =
            (unitOL (MOV II16 (OpImm (ImmInteger c2)) (OpAddr addr)), 2)
        | size >= 1 =
            (unitOL (MOV II8 (OpImm (ImmInteger c)) (OpAddr addr)), 1)
        | otherwise = (nilOL, 0)

    -- Generates a 64-bit wide MOV instruction from REG to MEM.
    gen8 :: AddrMode -> Reg -> InstrBlock
    gen8 addr reg8byte =
      unitOL (MOV format (OpReg reg8byte) (OpAddr addr))

    -- Unrolls memset when the widest MOV is <= 4 bytes.
    go4 :: Reg -> Integer -> InstrBlock
    go4 dst left =
      if left <= 0 then nilOL
      else curMov `appOL` go4 dst (left - curWidth)
      where
        possibleWidth = minimum [left, sizeBytes]
        dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))
        (curMov, curWidth) = gen4 dst_addr possibleWidth

    -- Unrolls memset when the widest MOV is 8 bytes (thus another Reg
    -- argument). Falls back to go4 when all 8 byte moves are
    -- exhausted.
    go8 :: Reg -> Reg -> Integer -> InstrBlock
    go8 dst reg8byte left =
      if possibleWidth >= 8 then
        let curMov = gen8 dst_addr reg8byte
        in  curMov `appOL` go8 dst reg8byte (left - 8)
      else go4 dst left
      where
        possibleWidth = minimum [left, sizeBytes]
        dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))

Otherwise it falls back to the C memset call on the given platform.
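(Aside: the c2/c4/c8 definitions above are the usual byte-broadcast trick, replicating the fill byte across a wider word before storing it. A standalone sketch of the same computation:)

import Data.Bits (shiftL, (.|.))
import Data.Word (Word64, Word8)

-- Replicate one byte across all 8 bytes of a Word64,
-- e.g. 0xAB -> 0xABABABABABABABAB.
broadcastByte :: Word8 -> Word64
broadcastByte b =
  let c  = fromIntegral b
      c2 = c  `shiftL` 8  .|. c
      c4 = c2 `shiftL` 16 .|. c2
  in  c4 `shiftL` 32 .|. c4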

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

Now to understand which implementations run on which platforms!

https://github.com/freebsd/freebsd/blob/master/lib/libc/amd64/string/memset.S
is the amd64 implementation for FreeBSD (the other related string operations live alongside it).

glibc, at least on AMD64, has a whole mess of fallback cases for different microarchitectures,
as can be seen at https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/multiarch or here
https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=942a61db8fc4615bf1998c484d883720b82c1e70;hb=HEAD

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

The default cutoff for GHC-level unrolling in the native code generator is
https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/main/DynFlags.hs#L2121-2124

        maxInlineAllocSize = 128,
        maxInlineMemcpyInsns = 32,
        maxInlineMemsetInsns = 32,

So roughly, the GHC native code generator can/will unroll a call to memset whenever its arguments are constant and

(# of bytes set + native word size in bytes - 1) `div` (native word size in bytes) <= 32

i.e. ceiling(bytes / word size) <= maxInlineMemsetInsns. On 64-bit amd64 that is (n + 7) `div` 8 <= 32, so any n <= 256 bytes gets unrolled under the current scheme (using at most ~32 GPR-flavored MOVs); anything larger falls back to the libc call.
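A tiny sanity check of that condition (assuming 8-byte words and the default maxInlineMemsetInsns = 32):

-- The unrolling guard from the code above, specialized to amd64:
-- ceiling(n / 8) <= 32, which holds exactly for n <= 256.
willUnrollMemset :: Integer -> Bool
willUnrollMemset n = (n + 8 - 1) `div` 8 <= 32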

How was this threshold chosen? Fun aside: we could probably improve on it now that we assume >= SSE2 on all AMD64/x86 platforms.

There's also a question of whether the OS/libc implementation is acceptably fast on each platform.

@cartazio
Contributor Author

cartazio commented Mar 9, 2020

Interestingly, according to Godbolt, clang optimizes large constant-size cases into a plain libc call:

https://godbolt.org/z/JBhCJz

@cartazio
Contributor Author

https://godbolt.org/z/dxwawW here's how clang handles some smaller thresholds

@cartazio
Contributor Author

https://godbolt.org/z/6Fs3pA there's some interesting variation in code selection depending on which CPU architecture is targeted

@cartazio
Contributor Author

OK, so I think for small enough arguments (up to 256 bytes on amd64 and 128 bytes on x86), GHC does a reasonable thing here. THOUGH, I think we'd get a genuine win from using SSE registers instead of GPRs, roughly halving the code size. We seem to have roughly the right unrolling threshold, at least if the clang and gcc tuning is actually decent.

cc @andrewthad @bgamari @m37 (andreask?)

It does look like on platforms that aren't Linux we might not be able to assume a well-optimized fallback C symbol (though some benchmarking will be needed, I guess).

@cartazio
Contributor Author

My current thinking:
write the primitive code so that for statically known arguments there's a branch that goes through GHC's unrolling path, and we have a fallback for large sizes that is fast and simple.

@cartazio
Contributor Author

cartazio commented Mar 10, 2020

As best I can tell, the only OSS libc with any SSE/AVX/SIMD implementation is Linux/glibc; everything else just uses hand-written assembly, probably to keep the C compiler from being too clever and replacing the mem* implementation with a call to itself!

edit: some implementations use a tiny C loop instead

@cartazio
Contributor Author

zooming out: the real answer is benchmarks! :)

@andrewthad
Collaborator

I think the only reasonable thing to do in primitive is to use the primop that GHC provides. Optimizing the implementation further is a GHC issue, not a primitive issue. Having GHC use wider registers when possible might be a win, but then you won't be doing aligned writes (which I think doesn't really matter on more recent Intel processors), and you have to load the 128- or 256-bit word from memory rather than including it in the instruction, so for writing 16 or 24 bytes, this may not actually be better.

@cartazio
Contributor Author

cartazio commented Mar 10, 2020 via email

@cartazio
Contributor Author

Andrew: to clarify, I'm not concerned with the GHC unrolling case; I'm concerned with the performance characteristics for input sizes that fall back to the libc / system-provided memset and friends. (The conclusion may very well be that the simple libc code is fine on all CPUs we care about, or not.) There's also a subtle mismatch between how we use the mem*-related operations in GHC/Haskell today and how they're actually specified.

Also, there are some nontrivial ways in which a more Haskell-specific flavor could make a 2x difference for storable vectors in some cases (at least asymptotically; not sure whether the constant factors work out in practice).

@cartazio
Contributor Author

As a bit of context for why this should be done as performantly as possible for unknown-size / large buffers, here is how it comes into storable vector; see
https://github.com/haskell/vector/blob/eeb42ad42aa345ce192086baed80c805bcfc3e72/Data/Vector/Storable/Mutable.hs#L202-L256
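If I recall correctly, the strategy there for element types where a raw memset doesn't apply is roughly "write one element, then keep doubling the initialized prefix with block copies". A hedged sketch of that shape using plain Foreign pointers (illustrative, not vector's actual code):

import Foreign (Ptr, Storable, copyArray, plusPtr, pokeElemOff, sizeOf)

-- Fill n elements starting at p with x: poke the first element, then
-- repeatedly copy the already-filled prefix to double it, so only
-- O(log n) copy calls are made instead of n individual pokes.
fillDoubling :: Storable a => Ptr a -> Int -> a -> IO ()
fillDoubling p n x
  | n <= 0    = return ()
  | otherwise = do
      pokeElemOff p 0 x
      let elemSize = sizeOf x
          go done
            | done >= n = return ()
            | otherwise = do
                let chunk = min done (n - done)
                copyArray (p `plusPtr` (done * elemSize)) p chunk
                go (done + chunk)
      go 1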

@andrewthad removed this from the 0.7.1 milestone on Jun 12, 2020
@dcoutts

dcoutts commented Mar 26, 2024

I think the only reasonable thing to do in primitive is to use the primop that GHC provides.

Yes! And currently it doesn't. It uses memset directly, so doesn't benefit from any unrolling or from -fcheck-prim-bounds debug support.

I take it back, shimmedSetWord8Array# does use GHC.Exts.setByteArray#.
