cockroachdb crashed in Go runtime during test run: s.allocCount != s.nelems
#1146
(part 1/5) The error described above is the most common of several different failure modes that I've suspected (hoped?) were the same root cause. I think I've finally got this one nailed. My write-up is pretty long and I've split it across five comments:
Background on this problem

Recall that Omicron's test suite has 160+ tests that spin up transient instances of CockroachDB. In some fraction of these attempts, CockroachDB exits unexpectedly with one of several errors from the Go runtime:
We've seen a number of issues that seemed like they could also be related (mentioned below). These three are by far the most common that I've seen, and in that order (i.e., the first is the most common). Sometimes these failures happen after the database has already opened up its listening port; when we're lucky, it happens earlier, during startup.

Most of the testing has been with CockroachDB 22.1.9 using Go 1.17.13. We've also seen these failures from the Go test suite on Go versions from 1.16 up through at least 1.19.2, and I've seen them with CockroachDB v22.2.0 (recently released) using Go 1.19.

I fumbled around for a while doing what I'd call heuristic debugging: looking for patterns and commonalities and guessing at possible causes. What really broke this open was taking a specific failure, diving into the code (the Go runtime memory subsystem), adding instrumentation, poring over data, and iterating. I chose this failure in particular since it was the most common.

Background on the Go memory subsystem

This is of course a gross simplification, but here are some highlights:
That's a little unintuitive. Let's walk through an example.
Importantly: "allocBits" is not updated after an allocation! It's only updated during GC, and then it's set directly to "gcmarkBits". This works because we never look at "allocBits" for items with index less than "freeIndex".
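To make that concrete, here's a deliberately simplified model of this in Go. This is my own sketch, not the runtime's actual code: real field names, the allocCache optimization, and many details differ. It just shows how freeIndex, allocBits, and allocCount relate, and where the fatal check fires:

```go
// Simplified sketch of a Go runtime mspan, for illustration only.
// The real runtime uses different layouts, an allocCache, and assembly
// helpers; this just shows how freeIndex, allocBits, and allocCount relate.
package main

import "fmt"

type mspan struct {
	nelems     int
	freeIndex  int    // next slot to consider; slots below it are never re-examined
	allocCount int    // how many slots this span believes are allocated
	allocBits  []byte // 1 bit per slot; only rewritten at sweep time, never on alloc
}

// nextFree hands out the next slot at or after freeIndex whose allocBits bit
// is 0. Allocation does NOT set the bit; it only advances freeIndex and bumps
// allocCount. Slots whose bit is already 1 are silently skipped.
func (s *mspan) nextFree() (int, bool) {
	for s.freeIndex < s.nelems {
		idx := s.freeIndex
		s.freeIndex++
		if s.allocBits[idx/8]&(byte(1)<<(idx%8)) == 0 {
			s.allocCount++
			return idx, true
		}
	}
	// Span exhausted: the real runtime asserts the invariant here.
	if s.allocCount != s.nelems {
		panic("s.allocCount != s.nelems")
	}
	return 0, false
}

func main() {
	// A fresh span has allocBits all zero, so slots come out in order.
	s := &mspan{nelems: 8, allocBits: make([]byte, 1)}
	for {
		idx, ok := s.nextFree()
		if !ok {
			break
		}
		fmt.Println("allocated slot", idx)
	}
	fmt.Println("allocCount:", s.allocCount) // 8, matching nelems
}
```

Note that a fresh span is supposed to start with allocBits all zero, so allocation just hands out slots in order. That assumption matters later.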
(part 2/5) Digging into this specific failure

The message reported is:
That's coming from this Go code. What's happening here is that we're trying to do an allocation from this mspan, and we found that it's full (there are no free slots left), but allocCount doesn't match nelems.

So, how many things were allocated from this mspan? From the initial failures, we couldn't tell. When this problem occurs, Go dumps out the message shown above and the Goroutine stack traces, but we get no core file and no information about the history of this mspan.

The Go runtime provides the GODEBUG=allocfreetrace=1 environment variable setting to trace allocations. Unfortunately, I found it way too expensive to use here, so I traced allocations with DTrace instead. Tracing frees was a little tricky, since there's not normally a "free" operation per se. The result is this D script.

After a bunch of iteration and loops to try to reproduce, I managed to catch this failure mode while Go was instrumented with DTrace and got a core file from it:
Okay! What does the mspan look like? First, we have to find it. The stack trace that Go printed includes this frame:
Now, we can print that mspan:
That looks plausible -- it's got the right spanclass (from the stack trace), allocCount and nelems (from the error message). It's got the right freeIndex, too. I looked through the DTrace output for this failure, looking for sweeps of this mspan:
It looks like an mspan with this address has been swept twice, but both times it was a different mspan (different range, element size, etc.). It's never been swept in its current state. Just to be sure, I checked for any errors or drops from DTrace:
and I believe the DTrace output file is complete (not truncated) because it contains all the stderr writes of Go dumping its stack traces (which happens at the end).

So, the assertion is complaining that we've got a span with no free items but an allocCount that's too low. So what is allocated? There are two ways to look at it. First, I enumerated the addresses covered by the mspan and, for each one, checked whether there's an allocation and/or free for that address in the DTrace output. The easiest way to do this was to tell mdb about a "my_buffer" type with the same size as the elements in this mspan and have it enumerate the addresses in an array of those starting at the mspan's base address:
Then I searched for each one in the trace output:
The very first address has some false positives. We have an 8112-byte allocation that returned c000850000 -- I infer that this is the allocation for the memory that became the mspan we're inspecting. Then we swept fffffc7fee310c70, which appears to be that single-element 8192-byte mspan. Then we swept an unrelated span that just happened to end at c000850000. I think we can ignore all of those -- this is essentially saying that c000850000 was never allocated from the mspan we're interested in.

Then notice that we didn't allocate a bunch of other addresses (e.g., c000850090, c000850120), but we did allocate some later ones. This seems weird. We never freed any of the addresses and, again, we don't seem to have ever swept this mspan. I summarized it like this:
I also confirmed by hand that the addresses were allocated in increasing address order:
I decided to take a look at allocBits for this span. I'd expected these bits to be all zeroes because, again, it seems like this span has never been swept, and it looks to me like these are only ever set during sweep. But what I found is that the allocBits exactly match what the DTrace output shows about which of these are allocated -- except for one thing:
Now, I expected bits 56-63 to be 0, but they shouldn't matter anyway. The rest of these bits align exactly with the unallocated items. This is surprising to me on two levels: if this mspan has never been swept, I'd expect these to be all zeroes; and if for some reason it has been swept and these accurately reflect what's allocated, they appear to be inverted, right? I also checked the span's allocCache:
So that's pretty self-consistent (though I'm not sure why it took 55 shifts and not 56). There's a lot that's confusing here:
I found one more compelling piece of information, which is that the addresses for allocBits and gcmarkBits are adjacent:
That's significant because under normal operation, these two bitfields are allocated from different arenas, which come from different mappings. The only time they'd be adjacent is after an mspan is initialized but before the first time it's swept. Between the DTrace output showing no sweep, the DTrace output showing all allocations in address order, and this data point, I think we can be confident that this mspan has never been swept.

So it seems as though we allocated from this span sparsely, which is strange, given the way we said the allocator works. Why would that be? Maybe our tracing somehow missed the allocations? I wonder if these supposedly unallocated things were referenced anywhere. Here's one way to dump the whole address space to a text file so we can grep it:
Now, indeed, some of those things that were never allocated are not referenced anywhere:
By contrast, this one that was allocated is referenced elsewhere:
This is all consistent with the idea that these just were skipped over. But why? And what's the deal with allocBits being so close to consistent with the allocated addresses, but backwards?

The solution may be obvious to the reader, but it wasn't to me. I'd been assuming that the mspan was correctly initialized, that some behavior caused us to skip some elements, and that I'd misunderstood the code that manages allocBits. While discussing this with @bnaecker, we realized that if allocBits were already set to this pattern when the mspan was initialized, then everything else makes sense, including both the pattern of allocations and the allocCount being 27 even though we're apparently out of free slots. In other words, these bits aren't a record of the allocations that somehow got inverted -- they caused the allocations to look like that. To play this out, suppose the initial allocBits for some reason were 01010101... Then we'd expect to allocate slot 0, then slot 2, then slot 4, etc. By the end, the slots allocated would be exactly the 0 bits.

So at this point, it seems likely that one of these things happened: either allocBits contained this garbage pattern when the mspan was initialized (i.e., memory that should have been zeroed wasn't), or something clobbered allocBits after initialization.
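Here's a tiny simulation of that first possibility (my own sketch, not from the actual investigation): if the bits backing allocBits start out as garbage rather than zeroes, the allocator silently skips every slot whose bit happens to be 1, and when freeIndex reaches nelems, allocCount comes up short -- exactly the assertion we're hitting.

```go
// Toy simulation: allocate from a span whose allocBits were never zeroed.
// Illustrative only; it mirrors the simplified model sketched earlier.
package main

import "fmt"

func main() {
	const nelems = 8
	// Pretend the memory backing allocBits contained garbage: bits 1, 3, 5, 7 set.
	allocBits := byte(0b10101010)

	freeIndex, allocCount := 0, 0
	for freeIndex < nelems {
		idx := freeIndex
		freeIndex++
		if allocBits&(byte(1)<<idx) == 0 {
			allocCount++
			fmt.Println("allocated slot", idx) // slots 0, 2, 4, 6
		}
	}
	// freeIndex == nelems, but only half the slots were ever handed out.
	if allocCount != nelems {
		fmt.Printf("fatal error: s.allocCount != s.nelems (%d != %d)\n",
			allocCount, nelems)
	}
}
```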
If they were clobbered, there might be other references to the allocBits address in the core file. Are there?
Nope -- that's the reference within the mspan itself. What about the adjacent word?
Nope. The first hit there is the reference within the mspan, and the second is the line describing what's at the address we grep'd for. (There's no corresponding line in the previous example because of the way we dumped this: only every other word-aligned address is labeled.) What about the bit pattern? It doesn't look like ASCII or any other obvious value. Does it exist elsewhere in the core file?
My jaw dropped. There's a ton of this pattern! 1,920 elements to be exact:
That's pretty surprising for a bit pattern that supposedly represents the set of objects allocated from one particular 8KiB page. Looking closer, it's really a 32-byte pattern:
So that's almost 64KiB (32 bytes times 1920 rows) -- 61,440 bytes, to be more precise. So...who's referencing the start of that range?
So it's referred to by fffffc7fee311540. But I don't know what that is either:
but clearly this bit pattern is present in a lot of places.

Let's take another look at where this memory comes from. These bitfields are allocated within the runtime and zeroed by memclrNoHeapPointers. I see that this function seems to work by copying data from %xmm15 all over the buffer it's given. This sounds suspiciously like a bug that @mknyszek had mentioned was found on Linux some time ago. If these registers were somehow non-zero, that could totally explain a bunch of stuff having the same bit pattern. (I had previously found that setting GODEBUG=asyncpreemptoff=1, which stops Go from sending itself frequent preemption signals, made the problem much less frequent.)

At this point I wondered: are we somehow not preserving this register across signals? @rmustacc and I explored this. %xmm isn't a problem, but the code also uses %ymm0, and that turns out to be a problem. Robert helped me write a C program to test this behavior. We confirmed the bug and filed illumos#15254.

So we've confirmed that this is a problem, and it can explain all the data here: if, while clearing the newly allocated allocBits during span initialization, the program took a signal at just the wrong time and the upper bits of %ymm0 were clobbered, then memclrNoHeapPointers would fill allocBits with this non-zero garbage instead of zeroes. That sounds like a lot of "if"s, but then again, this problem isn't very reproducible to begin with. Still, can we convince ourselves that's what did happen here?
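As an aside: another way to watch for this class of corruption, without DTrace, is to hammer directly on the invariant that freshly allocated Go memory is zero. This is my own crude sketch, not something from the investigation above, and there's no guarantee it reproduces anything; it only checks the property that gets violated when the runtime's clearing code writes garbage:

```go
// Crude detector sketch: repeatedly allocate memory that Go guarantees is
// zeroed and verify that it really is. On a system affected by this bug,
// corruption of the registers used by the runtime's clearing code could show
// up here as non-zero bytes in a fresh allocation.
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	const size = 1 << 20 // 1 MiB per allocation
	for i := 0; ; i++ {
		buf := make([]byte, size) // must come back zeroed
		for j, b := range buf {
			if b != 0 {
				fmt.Printf("iteration %d: non-zero byte 0x%02x at offset %d\n",
					i, b, j)
				os.Exit(1)
			}
		}
		if i%1000 == 0 {
			runtime.GC() // keep the allocator recycling and re-clearing spans
		}
	}
}
```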
(part 3/5) Trying to prove this problem is the same problem

Robert had the idea to trace the point where the kernel delivers a signal and examine %rip. This could tell us whether it's at least possible that we're interrupting this function. I used this D script:
There are two histograms back-to-back there: the first is a count of actual instructions interrupted, while the second is a count by function name. We did interrupt memclrNoHeapPointers some of the time, so this is at least plausible. I thought I'd try the experiment of using DTrace to instrument an instruction inside memclrNoHeapPointers and raise a signal right at that point. The spot I picked is the one
which corresponds to these instructions, as viewed in mdb:
I initially tried to catch it partway through this loop after some number of iterations, then I just tried to get it on entry to the loop:
and eventually I wound up instrumenting the instruction immediately after we cleared %ymm0:
which is a few instructions earlier than the above:
I was hoping this would reliably reproduce the problem...but it doesn't. I did manage to get some hits (failures with exactly this error message) -- probably 2-3 times out of a hundred runs. That's much more often than it usually happens, but it wasn't the satisfying dead-on reproducibility I'd hoped for. I relaxed the script even further:
That triggered quite a few instances of a new failure mode:
which, based on the message alone, looks pretty expected for a problem where supposedly zeroed memory is not zeroed. While doing this, I also saw an instance of the "marked free object in span" error. This too is expected -- see my separate comment about other failure modes below.

Still, it'd be nice to convince ourselves that there's no other problem here. On Robert's suggestion, I tried disabling the kernel's support for avx2. I also thought about building a new Go runtime with a modified memclrNoHeapPointers, but it seemed easier to patch the existing binary so that memclrNoHeapPointers never takes the %ymm path. The key instruction is the jump into the AVX2-based loop,
which is +0x7e here:
If I just nop out the six bytes at that offset, the function should never take the %ymm path.
Now it looks like this:
With this in place, if there are no other issues here, then the test suite should run indefinitely. Recall that on my test machine, running this chunk of the Omicron test suite in a loop reliably fails within 3-4 minutes. I confirmed this with the usual binary just to be sure, then switched to my modified binary. It ran for 98m before dying on #1936. I reran it overnight and it ran for 736m (12h+) and then died again on the same issue. So I'm feeling pretty good about this! (And also eager to fix #1936 and see how long the test suite can run in a loop.)
(part 4/5) Explaining other failure modes

Above, I dissected a particular instance of this failure mode (s.allocCount != s.nelems).
(part 5/5) Explaining other data points

While debugging this problem, I took a lot of wrong turns. But this resulted in some data that's worth reviewing in light of this explanation.

We had trouble reproducing this problem on Linux. Well, that checks out: we now think this is an illumos bug.

The problem reproduces on Go 1.19. Go 1.19 appears to still use %ymm0 in memclrNoHeapPointers.

We don't seem to see this problem in CI. Helios CI runs on AWS hosts, which tend to be older instance types that may well not support avx2. (Thanks, @jclulow.)

We had trouble reproducing this problem on Intel systems. I don't have much data about this except what's reported in golang/go#53289, which does not say much about the systems involved. It's quite possible they don't support avx2.

Thanks!

It feels cheesy to say, but debugging this one has been quite a saga, and I had a lot of help over the last several months from talking this over with so many people. Thanks to everyone who helped!
The signal memory corruption reminds me of golang/go#35777. Great write-up and very informative!
To help look for other instances of this, I wrote a Rust program to scan through a memory dump (created as described above) looking for runs consistent with illumos issue 15254: https://github.com/oxidecomputer/illumos-15254-look-for-runs. On this dump, it emits:
That 1920-count run is exactly what we'd hope it would find, and there are few false positives.
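For reference, here's a rough sketch in Go of the same kind of scan. The linked tool is the real thing (it's written in Rust and works from the mdb text dump described above); this simplified version assumes a raw binary dump, assumes the pattern is 32-byte aligned in the file, and hard-codes the "low 16 bytes zero, high 16 bytes not all zero" signature:

```go
// Simplified scanner (not the linked Rust tool): read a raw memory dump and
// report runs of a repeating 32-byte pattern whose low 16 bytes are zero and
// whose high 16 bytes are not all zero -- the signature discussed here.
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

func isZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <raw-memory-dump>", os.Args[0])
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	const chunk = 32
	const minRun = 8 // require a few repeats before reporting anything

	for off := 0; off+chunk <= len(data); {
		c := data[off : off+chunk]
		if !isZero(c[:16]) || isZero(c[16:]) {
			off += chunk
			continue
		}
		// Count how many consecutive chunks repeat this exact pattern.
		run := 1
		for next := off + chunk; next+chunk <= len(data) && bytes.Equal(data[next:next+chunk], c); next += chunk {
			run++
		}
		if run >= minRun {
			fmt.Printf("offset 0x%x: %d repeats of suspicious 32-byte pattern % x\n",
				off, run, c[16:])
		}
		off += run * chunk
	}
}
```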
Robert discovered that in order to trigger the problem where %ymm0 is corrupted, there needs to be another context switch (e.g., going off-CPU in the signal handler). This may explain why it wasn't so readily reproducible by just raising a signal during the vulnerable instruction window.
Since my previous comments included quite a lot of detail, here's a more concise, organized summary. For more details, see above.

Symptoms

The most common failure modes for this issue are fatal errors from the Go runtime reporting:
The underlying problem can in principle cause all kinds of memory corruption, particularly in Go-allocated / Go-managed memory. As with other kinds of memory corruption, the first report of trouble may be far away from where the problem happened and a long time later. So all sorts of failure modes are possible, including SIGSEGV, other fatal runtime errors, and non-fatal errors.

Diagnosing a failure as this issue

Although this bug could be responsible for almost any unexpected program behavior, it's not clear how common the issue is, so it's not necessarily a safe bet that any baffling behavior can be attributed to it. If you're running a Go program, you probably want to err on the side of applying the workaround below and seeing if the problem goes away.

If you have a core dump from a specific failure and want to determine whether it's this issue, you can look for large regions of memory that you'd expect to be zeroed that instead contain a repeating 32-byte pattern in which the low 16 bytes are zeroes but the high 16 bytes are not. I wrote a very simple tool that accepts a memory dump in the form described above and looks for this pattern. If you find this, and particularly if it was a pointer in such a region that led to the fatal failure (e.g., a SIGSEGV, or the allocBits/gcmarkBits of whatever mspan triggered the fatal errors above), that's a good sign that this problem is to blame.

There are other uses of %ymm registers in the runtime, too, and it's possible there are other failures that could be caused by this without that signature. I don't have any shortcuts for diagnosing those.

Root cause

The underlying problem relates to two issues. The first is illumos bug 15254, which causes corruption of the %ymm (AVX) CPU registers under some pretty specific conditions. You have to have:
If you have all of these things, then the regular (non-signal) code can find the high 128 bits of a %ymm register clobbered by bits stored in that register by the signal handler. This manifests in Go because memclr() stores 0 into the 256-bit register and then copies that register over the buffer to be cleared; if the register is clobbered partway through, the "cleared" memory ends up containing non-zero garbage. These registers are also used in the Go runtime's analog of memcpy (memmove), so copied data can be corrupted the same way.

A second issue (that likely affects fewer functions in the Go runtime, but does affect the code paths here) involves the %xmm registers rather than %ymm; see the follow-up comments below.

Workaround

An easy workaround that appears pretty safe is to set GODEBUG=asyncpreemptoff=1 in the environment of the Go program. Note that this does not guarantee that these corruption problems can't happen. Both issues can still happen if the Go program takes a signal at the wrong time. But the flag makes it much less likely by stopping Go from sending itself multiple signals per second.

Fix

The proper fix will involve an illumos update (which includes the kernel and userland libraries). This work is in progress as of this writing.
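To make the workaround concrete, here's a minimal sketch in Go of launching a Go program (CockroachDB here) with that setting applied to its environment. The command-line flags are just an example, and the actual Omicron tooling that launches CockroachDB is not written in Go; this only illustrates where the variable needs to end up:

```go
// Hypothetical illustration: launch a Go program (here, cockroach) with the
// workaround applied via its environment.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("cockroach", "start-single-node", "--insecure")
	// Inherit the current environment and add the workaround: disable Go's
	// signal-based asynchronous preemption in the child, which greatly reduces
	// how often the process sends itself signals.
	cmd.Env = append(os.Environ(), "GODEBUG=asyncpreemptoff=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("cockroach exited with error: %v", err)
	}
}
```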
Should we force GODEBUG=asyncpreemptoff=1 in .envrc and in the CockroachDB SMF manifest? Otherwise, it seems very likely we'll keep hitting this issue, especially since the underlying illumos bug has not been resolved yet.
This is a good question. When we believe a fix is forthcoming and the interim cost isn't too painful, I usually don't commit a workaround because it's one more thing we have to remember to remove before we ship. If we forget, we might wind up with different problems (like Goroutine starvation) or leave a time bomb for ourselves (if for some reason the fix is not correct but we don't notice because of the workaround). In this case, I think the fix is expected on the order of weeks from now. So for me the problem isn't painful enough to warrant committing the workaround. I don't feel strongly about it.

If we do decide to commit the workaround, I think we should put that into the SMF manifest, yes, and into test-utils/dev/db rather than .envrc. As far as I know, .envrc is only used when individual developers set up direnv for their interactive shells. I think that's just for convenience -- we can't really assume it's being used.
More to the problem

With a fix in hand for illumos#15254, @rmustacc and I set out to confirm that it would address this issue. To our (deep) dismay, it did not! Though we didn't do any sort of statistical comparison, the problem appeared as readily reproducible as before. The problem we'd found was real, and very near the scene of the crime, but there was something we were still missing.

After several more rounds of tracing and reproduction, Robert identified a separate bug: illumos#15367. While very much in the same area, that bug predates the work on 15254. One big plus: the behavior of the system after hitting #15367 differs in a meaningful way between the AMD and Intel CPUs that Robert tested, which means this problem explains the data point that we were only able to reproduce this on AMD systems. (The reason has to do with how AMD and Intel CPUs decide whether to save the XMM region of the FPU state when handling a signal.)

To summarize #15367 as I understand it, if any thread:
then after the signal handler returns, the thread will find that the xmm registers contain the last-saved non-zero values of those registers. This complex set of conditions explains why this wasn't readily reproducible when I tried to just raise a signal at a time I thought would result in corruption. We have not yet confirmed that the fix for #15367 makes this problem go away. That's one of the next steps.
We remain hopeful that #15367 was a big part of the underlying cause here:
It'd be nice to explain why my previous workaround (sending memclrNoHeapPointers down the XMM path instead of the YMM one) worked. Note that the %ymm path uses VZEROUPPER, which zeroes the %ymm and %zmm ranges of all the AVX registers. If %xmm1 through %xmm15 were zero before we entered this
Closing this because it appears to be resolved by the above fixes. We can reopen if we start seeing it again.
Great write-up! I just listened to the podcast episode that references this issue. If I've understood correctly, shouldn't allocCount = 3 in step 6 of this comment? I.e. allocCount is incremented after an allocation, even though allocBits is not. |
Thanks! Yes, I think that's right. I've updated the comment. Thanks for the correction.
There's a lot of detail in this report. For a summary of this problem, the root cause, and a workaround, see this comment below.
Again trying to reproduce #1130, I found a different issue that caused CockroachDB to exit early:
Test log file:
The CockroachDB output:
This appears to be dying inside the Go runtime memory allocator. It looks like golang/go#45775, which unfortunately was closed last year as not-reproducible.