
Make fromListN functions good consumers #424

Merged: 1 commit into haskell:master from fromListN-fusion on Nov 19, 2024

Conversation

@meooow25 (Contributor)

...in terms of list fusion.

Closes #418.

Note: Compared to the proposed change in #418, this PR also uses GHC.Exts.oneShot. GHC doesn't need this help if the input is just a list, but I have found that it can improve performance after fusion reduces it to a foldr on another structure, such as a Set. Benchmarks showing this effect: gist
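
The shape of the change is roughly the following (a simplified, self-contained sketch rather than the exact patch; error stands in for the library's internal die helper):

import Control.Monad.ST (runST)
import Data.Primitive.Array (Array, newArray, unsafeFreezeArray, writeArray)
import GHC.Exts (oneShot)

-- Sketch: fromListN expressed as a single foldr so it can fuse with a good
-- producer, with oneShot wrapped around the index-taking continuation so GHC
-- still eta-expands it after fusion.
arrayFromListN :: Int -> [a] -> Array a
arrayFromListN n xs = runST $ do
  marr <- newArray n (error "fromListN: uninitialized element")
  let f x k = oneShot $ \ix ->
        if ix < n
          then writeArray marr ix x >> k (ix + 1)
          else error "fromListN: list length greater than specified size"
      z ix =
        if ix == n
          then unsafeFreezeArray marr
          else error "fromListN: list length less than specified size"
  foldr f z xs 0
{-# INLINE arrayFromListN #-}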

@meooow25 changed the title from "Make fromListArrayN functions good consumers" to "Make fromListN functions good consumers" on Nov 17, 2024
@meooow25 (Contributor Author)

Looks like oneShot is not exported from GHC.Exts before 8.2 (GHC #12011). How should we handle this?

@andrewthad (Collaborator)

I'll drop support for 8.0 to deal with oneShot. Rebase on top of that after I drop it from CI.

In PR 425, I've adopted the benchmark that you linked to, but my copy of it just measures the performance of the implementation that is currently in primitive. Before your change (with GHC 9.4.8):

    arrayFromListN
      set-to-list-to-array: OK
        9.79 μs ± 347 ns

After your change:

    arrayFromListN
      set-to-list-to-array: OK
        11.6 μs ± 990 ns

And then after your change but with GHC 9.8.2:

    arrayFromListN
      set-to-list-to-array: OK
        12.3 μs ± 1.2 μs

Maybe the improved performance only shows up when lists are much larger? My example set in the benchmark suite only has 1024 elements.

@chessai (Member) commented Nov 18, 2024

Maybe the improved performance only shows up when lists are much larger? My example set in the benchmark suite only has 1024 elements.

If that's true, is it worth it to branch on the Int, providing the old implementation if it's sufficiently small?

@meooow25 (Contributor Author)

I'll drop support for 8.0 to deal with oneShot. Rebase on top of that after I drop it from CI.

Thanks, rebased.

Maybe the improved performance only shows up when lists are much larger? My example set in the benchmark suite only has 1024 elements.

No, it's just that I benchmarked PrimArray and you benchmarked Array.
But thanks for noticing this, because on inspection I found that the index Int is not unboxed. This is something GHC is currently not able to optimize (GHC #24628), so I've changed the implementations to use unboxed Int# with an explanatory note.
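
Roughly, the folding function now threads an unboxed index, along these lines (again a sketch rather than the exact diff):

{-# LANGUAGE MagicHash #-}
import Control.Monad.ST (runST)
import Data.Primitive.Array (Array, newArray, unsafeFreezeArray, writeArray)
import GHC.Exts (Int (I#), isTrue#, oneShot, (+#), (<#), (==#))

-- Sketch: the same foldr/oneShot consumer, but threading the write index as
-- an unboxed Int# through the continuation, so no Int box is allocated per
-- element (GHC currently won't do this unboxing itself, see GHC #24628).
arrayFromListN :: Int -> [a] -> Array a
arrayFromListN n@(I# n#) xs = runST $ do
  marr <- newArray n (error "fromListN: uninitialized element")
  let f x k = oneShot $ \ix# ->
        if isTrue# (ix# <# n#)
          then writeArray marr (I# ix#) x >> k (ix# +# 1#)
          else error "fromListN: list length greater than specified size"
      z ix# =
        if isTrue# (ix# ==# n#)
          then unsafeFreezeArray marr
          else error "fromListN: list length less than specified size"
  foldr f z xs 0#
{-# INLINE arrayFromListN #-}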

I've also updated the gist, and we are now faster for Array.

  PrimArray
    primArrayFromSet:              OK
      4.18 ms ± 293 μs,  13 MB allocated, 1.8 MB copied,  24 MB peak memory
    ...
    primArrayFromSetNewOneShotUnb: OK
      657  μs ±  42 μs, 5.3 MB allocated,  87 KB copied,  37 MB peak memory
  Array
    arrayFromSet:                  OK
      1.43 ms ±  40 μs, 6.1 MB allocated, 218 KB copied,  37 MB peak memory
    ...
    arrayFromSetNewOneShotUnb:     OK
      1.12 ms ±  59 μs, 5.3 MB allocated,  88 KB copied,  38 MB peak memory

@meooow25 (Contributor Author)

I'm wondering why the new arrayFromListN takes twice as long as primArrayFromListN; I would expect them to take around the same time. They even allocate the same amount, as expected. I'll take another look.

@meooow25 (Contributor Author)

As far as I can tell, this is due to GC. GHC must be traversing the large Array to perform GC, but doesn't need to for PrimArray.

If I reduce n to 1000, the difference reduces.

All
  PrimArray
    primArrayFromSetNewOneShotUnb: OK
      5.76 μs ± 349 ns,  55 KB allocated,   4 B  copied, 7.0 MB peak memory
  Array
    arrayFromSetNewOneShotUnb:     OK
      6.87 μs ± 371 ns,  55 KB allocated,   3 B  copied, 7.0 MB peak memory

Bumping n to 10^6 and collecting RTS stats for one iteration shows

PrimArray
gcs=12
mutator_cpu_ns=9.482836ms
gc_cpu_ns=3.727663ms

Array
gcs=12
mutator_cpu_ns=9.354649ms
gc_cpu_ns=11.735041ms
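
For reference, a sketch of one way to gather figures like these around a single run; it assumes the program is run with +RTS -T and is not necessarily the exact harness used here:

import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Reports the RTS counters accumulated while running one action.  The
-- program must be run with +RTS -T, otherwise the stats are unavailable.
withRTSStats :: IO a -> IO a
withRTSStats act = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then act
    else do
      before <- getRTSStats
      r <- act
      after <- getRTSStats
      putStrLn $ "gcs=" ++ show (gcs after - gcs before)
      putStrLn $ "mutator_cpu_ns=" ++ show (mutator_cpu_ns after - mutator_cpu_ns before)
      putStrLn $ "gc_cpu_ns=" ++ show (gc_cpu_ns after - gc_cpu_ns before)
      pure r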

mutator_cpu_ns being almost the same matches what we would expect. The difference is solely due to GC.

So I think the PR is fine, nothing strange going on with the code.

@andrewthad (Collaborator)

Now I'm seeing better performance:

    arrayFromListN
      set-to-list-to-array: OK
        9.39 μs ± 242 ns

Weird, GC should not be happening at all while the array of boxed values is being initialized. Nothing should be getting allocated (other than the array itself, but the PrimArray is the same size), so there is no yield point for the GC. We are just copying pointers from a source into a destination. Regardless, this is certainly an improvement. I'll poke around a little more to see if anything seems odd, and then I'll merge. Thanks!

@andrewthad (Collaborator)

Oh, I was mistaken. There are allocations. Here's a bit of GHC Core from the benchmark suite:

$warrayFromSet_r5Wq
  = \ s_s5GS ->
      join {
        $w$j_s5GQ ds_s5GK ww_s5GN
          = case ds_s5GK of ds1_i4u0 {
              __DEFAULT ->
                runRW#
                  (\ s1_i4u8 ->
                     case newArray# ds1_i4u0 lvl6_r5Wn (s1_i4u8 `cast` <Co:4> :: ...) of
                     { (# ipv_i4uj, ipv1_i4uk #) ->
                     letrec {
                       go5_s5Cf
                         = \ z'_a5BQ ds4_a5BR eta_B0 eta7_B1 ->
                             case ds4_a5BR of {
                               Bin bx_a5BT x_a5BU l_a5BV r_a5BW ->
                                 go5_s5Cf
                                   ((\ v_i4ui eta8_X1F ->
                                       case <# v_i4ui ww_s5GN of {
                                         __DEFAULT -> case lvl8_r5Wp of wild1_00 { };
                                         1# ->
                                           case writeArray# ipv1_i4uk v_i4ui x_a5BU (eta8_X1F `cast` <Co:4> :: ...)
                                           of s'#_i4ur
                                           { __DEFAULT -> go5_s5Cf z'_a5BQ r_a5BW (+# v_i4ui 1#) (s'#_i4ur `cast` <Co:3> :: ...)
                                           }
                                       })
                                    `cast` <Co:7> :: ...)
                                   l_a5BV
                                   eta_B0
                                   eta7_B1;
                               Tip -> ((z'_a5BQ eta_B0) `cast` <Co:3> :: ...) eta7_B1
                             }; } in
                     case go5_s5Cf
                            ((\ ix#_i4us eta_B0 ->
                                case ==# ix#_i4us ww_s5GN of {
                                  __DEFAULT -> case lvl7_r5Wo of wild_00 { };
                                  1# -> (# eta_B0, () #)
                                })
                             `cast` <Co:7> :: ...)
                            s_s5GS
                            0#
                            (ipv_i4uj `cast` <Co:3> :: ...)
                     of
                     { (# ipv2_i4uB, ipv3_i4uC #) ->
                     case unsafeFreezeArray# (ipv1_i4uk `cast` <Co:5> :: ...) ipv2_i4uB
                     of
                     { (# ipv4_i4uF, ipv5_i4uG #) ->
                     ipv5_i4uG
                     }
                     }
                     });
              0# -> emptyArray# (##)
            } } in
      case s_s5GS of {
        Bin bx_a5xP ds1_a5xQ ds2_a5xR ds3_a5xS ->
          jump $w$j_s5GQ bx_a5xP bx_a5xP;
        Tip -> jump $w$j_s5GQ 0# 0#
      }

We have to repeatedly allocate a closure for the first argument to the recursively defined go5_s5Cf as we walk through the set, so the GC ends up scanning the array.

Merging.

@andrewthad merged commit 75c5365 into haskell:master on Nov 19, 2024. 13 checks passed.
@meooow25 (Contributor Author)

Yes, that's the source of the allocations. Thanks for merging!

@meooow25 deleted the fromListN-fusion branch on November 19, 2024 at 21:12.
@lehins (Contributor) left a comment

Was a bit too late with my comments. I'll submit my review anyways, maybe it will be useful, but feel free to ignore it.

-- We want arrayFromListN to be a "good consumer" in list fusion, so we define
-- the function using foldr and inline it to help fire fusion rules.
-- If fusion occurs with a "good producer", it may reduce to a fold on some
-- structure. In certain cases (such as for Data.Set) GHC is not be able to
@lehins (Contributor)

I think it is best to move this sentence ("In certain cases (such as for Data.Set) GHC is not be able to ...") into a comment in the body of the function, since that is not relevant information for the user.

Suggested change
-- structure. In certain cases (such as for Data.Set) GHC is not be able to
-- structure. In certain cases (such as for Data.Set) GHC is not able to

@meooow25 (Contributor Author)

Which "user" do you mean? This comment block is not part of the Haddocks.

@lehins (Contributor)

Oh yeah, you are right. I did not notice an empty new line on line number 589 between the haddock and the comment.

then return ()
else die "fromListN" "list length less than specified size"
go !ix (x : xs) = if ix < n
f x k = GHC.Exts.oneShot $ \ix# -> if I# ix# < n
@lehins (Contributor)

@andrewthad You didn't need to drop support for ghc-8.0 in #426, since oneShot has been available from ghc-prim for a very long time, so it could have been imported from GHC.Magic for all GHC versions instead of relying on CPP, especially since the test suite already depends on ghc-prim. It seems like a very minor thing to drop support for a whole GHC version over, but I'm personally not gonna cry over it 🥲
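
For reference, the alternative import would be simply the following, at the cost of a ghc-prim dependency:

-- Alternative discussed above: take oneShot from ghc-prim's GHC.Magic,
-- which predates its re-export from GHC.Exts in base (GHC 8.2).
import GHC.Magic (oneShot)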

@andrewthad (Collaborator)

Yes, but I would've needed to add the dependency on ghc-prim back to primitive, and it was so nice to finally have it gone. Dropping GHC 8.0 has the additional advantage of letting me delete several shims in #427. I'll probably end up dropping support for all GHCs earlier than 8.6 soon.

@meooow25 (Contributor Author)

Here's something which GHC optimizes well and it works with zero heap-allocations (other than the array).

arrayFromFoldableN :: Foldable f => Int -> f a -> Array a
arrayFromFoldableN n xs = createArray n undefined $ \arr -> do
  j <- ($ 0) $ runS $ flip foldMap xs $ \x -> S $ \i ->
    if i < n
    then do
      writeArray arr i x
      pure $! i+1
    else die "fromListN" "list length greater than specified size"
  if j == n
  then pure ()
  else die "fromListN" "list length less than specified size"
{-# INLINE arrayFromFoldableN #-}

-- Might want to use oneShot, but works well enough with Set
newtype S s = S { runS :: Int -> ST s Int }

instance Semigroup (S s) where
  (<>) = coerce ((>=>) @(ST _) @Int @Int @Int)

instance Monoid (S s) where
  mempty = S pure

----------

arrayFromSetFoldable :: S.Set a -> Array a
arrayFromSetFoldable s = arrayFromFoldableN (S.size s) s
    arrayFromSetFoldable:          OK
      403  μs ±  38 μs, 1.5 MB allocated,  50 B  copied,  38 MB peak memory

The catch is that we're using foldMap, which makes it a different function and a bigger ask. I'm not convinced that this is worth adding to the API, but I still wanted to share it.

This function is also fusion-friendly, because []'s foldMap uses foldr, so that's nice.

@andrewthad (Collaborator)

This is pretty clever. I believe that it should be possible to accomplish the same thing with foldr instead of foldMap. In Data.Foldable, there is a foldlM defined with foldr:

foldlM :: (Foldable t, Monad m) => (b -> a -> m b) -> b -> t a -> m b
foldlM f z0 xs = foldr c return xs z0
  -- See Note [List fusion and continuations in 'c']
  where c x k z = f z x >>= k
        {-# INLINE c #-}

Setting b to Int (and incrementing in the callback) gives us a monadic fold that exposes the index. If this approach works (I'm not 100% sure it does), it has the advantage of always being defined with foldr, so it's always fusion friendly.
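
A sketch of that idea, with illustrative names (not something in the library):

import Control.Monad.ST (runST)
import Data.Foldable (foldlM)
import Data.Primitive.Array (Array, newArray, unsafeFreezeArray, writeArray)

-- Sketch of the foldlM-based idea: the Int accumulator is the write index.
-- Since foldlM is itself defined via foldr, the list case stays a good consumer.
arrayFromFoldableN' :: Foldable f => Int -> f a -> Array a
arrayFromFoldableN' n xs = runST $ do
  marr <- newArray n (error "fromFoldableN: uninitialized element")
  let step ix x =
        if ix < n
          then do
            writeArray marr ix x
            pure (ix + 1)
          else error "fromFoldableN: collection larger than specified size"
  end <- foldlM step 0 xs
  if end == n
    then unsafeFreezeArray marr
    else error "fromFoldableN: collection smaller than specified size"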

I would take another PR with an arrayFromFoldableN. Building arrays from other foldable data structures isn't extremely common, but I've needed something like this more than once, and I've always rolled it by hand. I think that this library is a reasonable place for it. I think the most serious performance gains would show up for Array and SmallArray. If you are building a PrimArray from a Set, you've already boxed all the elements when you built the set, which leaves a lot of performance on the table. Anyway, just rambling now. Thanks for working on these and thinking about all this!

@meooow25 (Contributor Author)

Here's something which GHC optimizes well and it works with zero heap-allocations (other than the array).

A correction: it allocates Ints. GHC only gets it down to Int# -> Set a -> State# RealWorld -> (# State# RealWorld, Int #). Note the last boxed Int. Oh well, we could drop down to unboxed Int# if we wanted to.

Setting b to Int (and incrementing in the callback) gives us a monadic fold that exposes the index. If this approach works (I'm not 100% sure it does), it has the advantage of always being defined with foldr, so it's always fusion friendly.

Yeah, this does not work as well. It is essentially what we already have after this PR, and it requires allocating closures, as you saw above.

  foldr fromListN_f fromListN_z (Set.toList s)                  -- fromListN_f fromListN_z added in this PR
= foldr fromListN_f fromListN_z (build (\c n -> Set.foldr c n s))
= Set.foldr fromListN_f fromListN_z s                           -- after build/fold
= Set.foldr (\x k i -> ...writeArray... k (i+1)) (\i -> ...pure...) s 0

Which is the foldlM definition.

Being able to use Set's foldMap is the key.

go Tip = mempty
go (Bin _ x l r) = go l <> f x <> go r

becomes

 $sgo3_s926 [Occ=LoopBreaker, Dmd=SC(S,C(1,C(1,!P(L,L!P(L)))))]
   :: Int# -> Set a_s8Tg -> State# RealWorld -> (# State# RealWorld, Int #)
 [LclId[StrictWorker([~, !])], Arity=3, Str=<L><1L><L>, Unf=OtherCon []]
 $sgo3_s926
   = \ (sc2_s923 :: Int#) (sc3_s922 :: Set a_s8Tg) (eta1_X5 [OS=OneShot] :: State# RealWorld) ->
       case sc3_s922 of {
         Bin bx2_X7 k1_X8 ds10_X9 ds11_Xa ->
           case bx2_X7 of {
             __DEFAULT ->
               case $sgo3_s926 sc2_s923 ds10_X9 eta1_X5 of { (# ipv4_Xd, ipv5_Xe #) ->
               case ipv5_Xe of { I# x1_Xg ->
               case <# x1_Xg bx1_a81f of {
                 __DEFAULT -> case lvl67_r98e of {};
                 1# ->
                   case writeArray#
                          @Lifted
                          @(PrimState (ST RealWorld))
                          @a_s8Tg
                          ipv1_a8L4
                          x1_Xg
                          k1_X8
                          (ipv4_Xd `cast` <Co:4> :: ...)
                   of s'#1_Xi
                   { __DEFAULT ->
                   $sgo3_s926 (+# x1_Xg 1#) ds11_Xa (s'#1_Xi `cast` <Co:3> :: ...)
                   }
               }
               }
               };
             1# -> <special leaf case, not relevant>
               }
           };
         Tip -> (# eta1_X5, I# sc2_s923 #)
       }; } in

And no closures necessary! (unlike with foldr)

I would take another PR with an arrayFromFoldableN. Building arrays from other foldable data structures isn't extremely common, but I've needed something like this more than once, and I've always rolled it by hand. I think that this library is a reasonable place for it.

In that case I can look into it. Would you keep both fromListN and fromFoldableN? If so, how do we explain the reason for this easily? Another option is to have

fromFoldMapN :: (forall m. Monoid m => (a -> m) -> f -> m) -> Int -> f -> Array a

Which is perhaps a little intimidating but clearer in intent. It can also be used with monomorphic collections.
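
To make the intent concrete, hypothetical usage could look like this (fromFoldMapN is stubbed out below since it doesn't exist yet; fromSet and fromIntSet are illustrative names):

{-# LANGUAGE RankNTypes #-}
import Data.Primitive.Array (Array)
import qualified Data.IntSet as IntSet
import qualified Data.Set as Set

-- Placeholder for the proposed function so the examples stand alone; the
-- real thing would build the array much like arrayFromFoldableN above.
fromFoldMapN :: (forall m. Monoid m => (a -> m) -> f -> m) -> Int -> f -> Array a
fromFoldMapN _ _ _ = error "sketch only: proposed API, not implemented here"

-- With a Foldable container, the polymorphic foldMap can be passed directly.
fromSet :: Set.Set a -> Array a
fromSet s = fromFoldMapN foldMap (Set.size s) s

-- With a monomorphic container such as IntSet, a foldMap-shaped traversal
-- can be written by hand on top of its foldr.
fromIntSet :: IntSet.IntSet -> Array Int
fromIntSet s =
  fromFoldMapN (\g -> IntSet.foldr (\x acc -> g x <> acc) mempty) (IntSet.size s) s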

I think the most serious performance gains would show up for Array and SmallArray. If you are building a PrimArray from a Set, you've already boxed all the elements when you built the set, which leaves a lot of performance on the table.

That's true, but there are unboxed tree-like structures (IntSet comes to mind; there must be others) that could be converted to PrimArray faster this way. Though I don't know for certain; I haven't tried it yet.

Successfully merging this pull request may close these issues: Fusion-friendly *fromListN (#418).