Index out of range in `ghc-events show` and dependencies of `ghc-events` #109

MangoIV · 2024-08-13T14:06:58Z

Hi! I have had a couple of problems with eventlog2html and hs-speedscope recently and they seem to be a problem either with the library or the eventlog that the ghc RTS emits. I can more or less (sometimes it doesn't happen) reliably reproduce this error with the following program:

{-# LANGUAGE BlockArguments, NumericUnderscores #-}
module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Foldable (for_)
import Data.Monoid (Endo (..))

largeEndo :: Endo ByteString
largeEndo = Endo \x -> BS.replicate 100 97 <> x

manyEndos :: Endo ByteString -> Endo ByteString
manyEndos = mconcat . replicate 10_000

main :: IO ()
main = for_ [1 .. 30] \_ ->
  BS.putStr $ appEndo (manyEndos largeEndo) mempty

compile with
ghc -rtsopts -prof -fprof-late -O0 ./bla.hs
run with
./bla +RTS -hc -p -l-au

I have not tried to further reduce the example.

This happens on at least GHC 9.6, 9.8 and 9.10. According to @TeofilC it reproduces on ghc-events HEAD, and it also happens on ghc-events 0.19.0.1 at least.

It affects not only ghc-events but also hs-speedscope and eventlog2html.

related issue on eventlog2html issue tracker:
mpickering/eventlog2html#136

MangoIV · 2024-08-13T14:34:58Z

I think we're failing in GHC.RTS.Events.Binary.getEvent :: EventParsers -> Get (Maybe Event) where we try to index into the parsers and instead of getting an EventTypeNum within range, we get a really big one.
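To illustrate the failure mode, here is a minimal sketch of a bounds-checked lookup into a parser table. It assumes the table is an `Array` indexed by event type number; `ParserTable`, `lookupParser` and `demoTable` are hypothetical names for this example, not the ghc-events API.

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.Ix (inRange)
import Data.Word (Word16)

-- Hypothetical stand-in for the parser table; not the ghc-events type.
type ParserTable a = Array Int a

-- Look up the parser for an event type number, reporting an
-- out-of-range number instead of crashing with an index error.
lookupParser :: ParserTable a -> Word16 -> Either String a
lookupParser table etRef
  | inRange (bounds table) ix = Right (table ! ix)
  | otherwise =
      Left ("unknown event type " ++ show etRef
            ++ ", table covers " ++ show (bounds table))
  where
    ix = fromIntegral etRef :: Int

-- Made-up table contents, only for demonstration.
demoTable :: ParserTable String
demoTable = listArray (0, 2) ["CreateThread", "RunThread", "StopThread"]
```

With this shape, a corrupted event type number such as 65000 yields a `Left` with a readable message rather than an "index out of range" exception. It would not fix the corruption itself, it would only make it easier to diagnose.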

TeofilC · 2024-08-13T15:22:25Z

Running ghc-events with an eventlog from the reproducer triggers an assertion failure at this line:

ghc-events/src/GHC/RTS/Events/Binary.hs

Lines 758 to 765 in 2168f61

    
      assert
        ((fromIntegral payloadLen :: Int) == sum
          [ 4
          , 8 -- ticks
          , 1 -- stack depth
          , fromIntegral profStackDepth * 4
          ])
        (return ())

It seems like the payloadLen is wrong for certain EVENT_PROF_SAMPLE_COST_CENTRE events
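For reference, the arithmetic in that assertion can be written as a standalone check. This is only a sketch: `expectedPayloadLen` and `payloadLenOk` are hypothetical helpers, the `Word16` type for the payload length is an assumption, and the label on the first 4-byte field is a guess based on the `profCapset` field of the parsed event.

```haskell
import Data.Word (Word16, Word8)

-- Expected payload size in bytes for a time-profile sample with the
-- given stack depth, mirroring the sum in the assertion.
expectedPayloadLen :: Word8 -> Int
expectedPayloadLen profStackDepth =
  sum
    [ 4 -- presumably the capset (assumption)
    , 8 -- ticks
    , 1 -- stack depth
    , fromIntegral profStackDepth * 4 -- 4 bytes per stack entry
    ]

-- True when the length found in the event header matches the stack depth.
payloadLenOk :: Word16 -> Word8 -> Bool
payloadLenOk payloadLen depth =
  fromIntegral payloadLen == expectedPayloadLen depth
```

So an empty stack should give a payload of 13 bytes and a stack of depth 1 should give 17. Any other length for those depths means the header and the payload disagree.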

MangoIV · 2024-08-13T15:28:23Z

Yup, the last four parses before it blows up look like this:

evSpec: HeapProfSampleCostCentre {heapProfId = 0, heapProfResidency = 912, heapProfStackDepth = 1, heapProfStack = [78]}, etRef: 163
evSpec: ProfSampleCostCentre {profCapset = 369098752, profTicks = 4333765376, profStackDepth = 0, profCcsStack = []}, etRef: 167
evSpec: HeapProfSampleString {heapProfId = 0, heapProfResidency = 2029359963648, heapProfLabel = "\SO"}, etRef: 164
evSpec: CreateThread {thread = 22784}, etRef: 0

It looks like the parse of ProfSampleCostCentre introduces the corruption.

MangoIV · 2024-08-13T15:34:10Z

@TeofilC is it possible that this is the Ccs stack being too deep for the profStackDepth Word8? And then it overflows and we don't read the Vector to the end?

MangoIV · 2024-08-13T15:43:30Z

That doesn't seem to be it, you're right, the payloadLen before that is already completely off.

TeofilC · 2024-08-13T15:51:11Z

Yes on the GHC side we truncate to 255, which should be fine

MangoIV · 2024-08-13T16:19:20Z

So afaiu this means that either the parsing of the header on the ghc-events side or the writing of the header on ghc RTS side goes wrong. (At least the previous events look fine, so I'm not assuming that they're already introducing the corruption)

TeofilC · 2024-08-13T17:07:31Z

Strangely enough I can't seem to reproduce this with -threaded. I get the impression that an EVENT_PROF_SAMPLE_COST_CENTRE and an EVENT_HEAP_PROF_SAMPLE_COST_CENTRE event being written at the same time is the cause of this. Yet, I'm not sure how that would be possible with the non-threaded runtime!

MangoIV · 2024-08-13T17:15:10Z

Also don’t they acquire the global eventBuf lock before writing?

TeofilC · 2024-08-13T17:15:46Z

I also can't reproduce if I only do heap profiling or time profiling with the non-threaded RTS.

So it seems like we need to be doing both with the non-threaded RTS.

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

TeofilC · 2024-08-13T17:16:02Z

> Also don’t they acquire the global eventBuf lock before writing?

Ah but that only exists for the threaded RTS

MangoIV · 2024-08-13T17:18:13Z

> I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

And -threaded safeguards against that because it does proper locking, while the non-threaded RTS doesn’t but is still somehow concurrent? That’s weird.

TeofilC · 2024-08-13T17:25:27Z

What seems to be happening is:

1. we are running a heap profile
2. we write half of a heap sample event
3. the time profile timer triggers
4. we pause what we are doing, and write a time sample event in the middle of our heap event
5. we unpause and finish writing our heap event

So we end up with something garbled.

This story is backed up by putting a bunch of traces inside the eventlog printing functions in the RTS. This is the order of events they suggest
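The effect of that interleaving can be reproduced with a toy model. The sketch below uses a made-up framing (tag byte, length byte, payload), which is NOT the real eventlog format; `Record`, `encode`, `decodeAll` and the tag numbers are all hypothetical and only serve to show why a record written in the middle of another one derails the parser.

```haskell
import Data.Word (Word8)

-- Toy framing, not the real eventlog format: tag, length, payload.
data Record = Record { recTag :: Word8, recPayload :: [Word8] }
  deriving (Eq, Show)

encode :: Record -> [Word8]
encode (Record tag payload) = tag : fromIntegral (length payload) : payload

-- Made-up tag numbers: 1 = "heap sample", 2 = "time sample".
knownTags :: [Word8]
knownTags = [1, 2]

-- Decode a stream of records, failing on the first malformed one.
decodeAll :: [Word8] -> Either String [Record]
decodeAll [] = Right []
decodeAll (tag : len : rest)
  | tag `notElem` knownTags = Left ("unknown tag " ++ show tag)
  | length payload < n = Left "truncated payload"
  | otherwise = (Record tag payload :) <$> decodeAll rest'
  where
    n = fromIntegral len :: Int
    (payload, rest') = splitAt n rest
decodeAll [_] = Left "truncated header"

heapSample, timeSample :: Record
heapSample = Record 1 [0, 0, 3, 144, 78, 0]
timeSample = Record 2 [9, 9]

-- Both records written one after the other.
sequential :: [Word8]
sequential = encode heapSample ++ encode timeSample

-- The time sample is written after the first 4 bytes of the heap sample.
interleaved :: [Word8]
interleaved = firstHalf ++ encode timeSample ++ secondHalf
  where
    (firstHalf, secondHalf) = splitAt 4 (encode heapSample)
```

Here `decodeAll sequential` gives back both records, while `decodeAll interleaved` swallows the time sample as part of the heap sample's payload and then fails on the leftover bytes. That matches what we see: one plausible-looking but wrong event, followed by garbage event type numbers.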

MangoIV · 2024-08-13T17:32:05Z

So there must be a context switch somewhere in dumpCensus, even with the non-threaded RTS?

MangoIV · 2024-08-13T18:05:54Z

Is it possible that this is happening because the time profile is running asynchronously? (see initTimer -> initTicker -> createAttachedOSThread with handle_tick -> handleProfTick -> traceProfSampleCostCentre) maybe there should be a lock for writing to the eventlog?

MangoIV · 2024-08-13T18:13:27Z

so maybe it would work if we'd just keep the ACQUIRE_LOCK stuff in existence, even in the non-threaded runtime? 👀
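As a rough analogy in Haskell (the RTS itself is C, so this is not the actual fix), holding a lock for the whole duration of a record write keeps two writers from interleaving their bytes. All names here (`writeInTwoSteps`, `writeLocked`, `runWriters`) are hypothetical.

```haskell
import Control.Concurrent (forkIO, yield)
import Control.Concurrent.MVar
  (MVar, newEmptyMVar, newMVar, putMVar, takeMVar, withMVar)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import Data.Word (Word8)

-- Append a record in two steps, yielding in between so that another
-- thread gets a chance to run in the middle of the write.
writeInTwoSteps :: IORef [Word8] -> [Word8] -> IO ()
writeInTwoSteps buf bytes = do
  let (a, b) = splitAt (length bytes `div` 2) bytes
  append a
  yield
  append b
  where
    append xs = atomicModifyIORef' buf (\old -> (old ++ xs, ()))

-- Hold the lock for the whole record, not just for each append.
writeLocked :: MVar () -> IORef [Word8] -> [Word8] -> IO ()
writeLocked lock buf bytes = withMVar lock (\_ -> writeInTwoSteps buf bytes)

-- Run one writer thread per record and return the final buffer.
runWriters :: [[Word8]] -> IO [Word8]
runWriters records = do
  lock <- newMVar ()
  buf <- newIORef []
  dones <- mapM (spawn lock buf) records
  mapM_ takeMVar dones
  readIORef buf
  where
    spawn lock buf r = do
      done <- newEmptyMVar
      _ <- forkIO (writeLocked lock buf r >> putMVar done ())
      pure done
```

With the lock, `runWriters [[1,1,1,1],[2,2,2,2]]` returns the two records in some order but never mixed. Whether the same approach is safe in the non-threaded RTS depends on how the ticker actually runs, which is exactly the open question here.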

MangoIV · 2024-08-14T12:49:41Z

@TeofilC I don't think that your observation holds in general. I wanted to see whether I could completely circumvent the issue by using -threaded or by using only one of the two traces, but I cannot: with a more elaborate example, which is probably a bit too bulky to share, I still get this problem quite often (~30% of the time).

mpickering · 2024-08-14T13:45:00Z

Is there perhaps some similar issue where the initialisation events (cost centre definitions) are being posted to the output, and that is interrupted by a heap profile event before all of the definitions are dumped?

Does it happen if you are not using a profiled executable? (ie, don't compile with -prof and using -hT -l)

mpickering · 2024-08-14T13:52:11Z

Another thing to try is a longer profiling interval (-i10), does that fix the issue?

TeofilC · 2024-08-14T16:47:56Z

Interesting @MangoIV . It sounds like there's potentially multiple bugs. In your larger example, maybe you could try to find the last few events before the eventlog gets corrupted. That might help suggest which event is going wrong

TeofilC · 2024-08-15T11:01:22Z

I've written up the bug we found here: https://gitlab.haskell.org/ghc/ghc/-/issues/25165
We should keep looking for the other issues, but I wanted to make sure we didn't forget about this one

MangoIV · 2024-08-20T08:27:56Z

@mpickering

- `-i10` doesn't fix the issue. With it I got one of the eventlogs through, so I guess it decreases the probability of failure, but most still fail.
- Another thing to note is that I'm killing the application in question with a keyboard signal, so that may be something to consider as well.
- profiling-detail is late.
- It seems to happen at least a lot less often, if at all, with -hT -l.

mpickering · 2024-08-20T09:08:41Z

This seems to confirm what @TeofilC suggested about time profiling events interrupting the writing of other events and leading to corruption

MangoIV · 2024-08-20T09:13:29Z

it also happens when only using -hy, no time profiling. Is time profiling still doing things when we only request heap profiling with -l?
