Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package deadlock the whole Go process, 90% of the time, under recent macOS #3

Open
folays opened this issue Sep 14, 2023 · 3 comments

Comments

@folays
Copy link

folays commented Sep 14, 2023

FYI, This package nearly doesn't work since some recent versions of MacOS (at least Intel ones).

The symptom is : 90% of the runs, the progression of the Go programs stops, yet use 100% CPU. Like a deadlock, or livelock more so.

It probably depends on the "activity" of the Go program, but as an example, in my case, on a 20 seconds run, I have ~90 % of failure. On ~10% of my runs, it gets to completion, and when it does complete, I get the deep C symbols in my profiles.

See golang/go#45558 (comment)

[...] deadlock is happening because the program is trying to fetch a stack trace while calling the C function calloc. That can happen if a signal is received during the call to calloc. That will cause a deadlock if the cgosymbolizer code itsef calls calloc or malloc. The github.com/benesch/cgosymbolizer code uses macOS libunwind, and at least some versions of macOS libunwind call malloc. So it's not surprising that a deadlock can occur.

Also golang/go#45558 (comment)

On macOS, unwind.h is an Apple/LLVM-provided implementation that was introduced with Big Sur according to this comment in unwind.h [...]

The above issue also suggests :

You might be able to fix the problem by changing the file cgosymbolizer_darwin.c, the function cgo_traceback, to simply return if arg->sig_ctx != 0.

Which indeed fix the symptoms, but the traces (at least in my case) won't have C symbols, as if I didn't use this package.

I don't have any solution, I just figured out it should be mentioned somewhere.

This package is still useful, but sadly "luck" is involved, have to run it multiple times and avoid the "hang" symptom...

FWIW I'm on MacOS Ventura 13.4.1 (Intel CPU).

@benesch
Copy link
Owner

benesch commented Sep 16, 2023 via email

@folays
Copy link
Author

folays commented Sep 18, 2023

I just come back to trying multiple hours to find a solution, but I failed.

Hope you don't mind me layout here what I've read / tried ;
I'm ok to talk here "alone" and not expect an answer from you, since you're no longer using Go as much.
It's for future readers and hope someone other will find a solution.

https://maskray.me/blog/2020-11-08-stack-unwinding

  • Learnt me that
  • Learnt me that there is multiple implem oflibunwind :
    • libgcc (for gcc compiler)
    • libunwind (for llvm/clang)
    • glibc's internal
    • maybe HP
  • I've tried tons of combinations involving -fno-omit-frame-pointer / -f(no)asynchronous-unwind-tables / -fexceptions
    • Also tried --rtlib=compiler-rt --unwindlib=libunwind
  • I've tried #define UNW_LOCAL_ONLY before your #include <linunwind.h>
  • For ELF (mach uses Mach-O), a .eh_frame section is used.
  • Sometimes the section is `.debug_frame'
  • .eh_frame needs a linear scan to convert PC to the entry of the debug symbols
    • .eh_frame_hdr provides a binary search index table

/!\ I've read somewhere that .eh_frame_hdr could prevent malloc() from libunwind, because .eh_frame_hdr enable optimisations (binary search tree) that would let libunwind not needing to allocate anymore.

Of course clang (a least MacOS's default one) does not support --eh-frame-hdr :(

I think I understood that it's the linker which do all the work of generating those debug sections. I'm not even sure.

https://clang.llvm.org/docs/UsersManual.html

-femit-dwarf-unwind=
When to emit DWARF unwind (EH frame) info. This is a Mach-O-specific option

  • I've also tried -femit-dwarf-unwind=always

I'm not even sure IF llvm/libunwind is == to https://www.nongnu.org/libunwind/

https://reviews.llvm.org/D11897

  • I 've tried #define _LIBUNWIND_NO_HEAP before including libunwind.h

All compiler flags I told above, I've tried to put them as much as possible in CGO_CFLAGS and the rest in CGO_LDFLAGS. With a go build -v -x which made me see that everything was recompiled each time.

https://faultlore.com/blah/compact-unwinding/

https://lists.nongnu.org/archive/html/libunwind-devel/2016-02/msg00027.html

The only known work around is to implement your own dl_iterate_phdr that doesn't call malloc by hooking into low level APIs that notify you every time a new shared object is loaded.

The attached patch adds the function 'unw_set_iterate_phdr_function' to the libunwind API, allowing a custom implementation of dl_iterate_phdr() to be hooked in. That implementation would need to maintain a local cache (or however else you wanted to manage it) to remove the need for taking the glibc loader lock.

  • plus 2 people which for this dl_iterate_phdr cached the results (using the patch)

https://www.mail-archive.com/[email protected]/msg02305.html

We recently ran into an issue on where a signal handler was calling unw_step() and deadlocked because the thread it interrupted was inside of dl_iterate_phdr (holding its lock) and unw_step ended up calling dl_iterate_phdr too, causing a deadlock. We found a few threads mentioning this in the past: [...]

I confirm that "returning early" in cgo_traceback(), just after either unw_getcontext() and unw_init_local() and the first 3 unw_step() do not expose the bug. Some more unw_step() triggers it, so I guess that the diagnosis above is correct :

  • unw_step() probably loads dl_iterate_phdr() which mallocs.
  • I also observe that sigCtx is non-zero in the cases of the bug apparition, since a if (arg->six_ctx != 0) return; makes the bug disappear.

As said in the link above :

It's not resolved as far as I know. The only known work around is to implement your own dl_iterate_phdr that doesn't call malloc by hooking into low level APIs that notify you every time a new shared object is loaded.
The main trade-off is upfront/fixed memory cost vs pay-for-what-you-use (current glibc approach). It's been a while, so I may not be remembering all the details right.

So I think that the libunwind choose to not pay an upfront cost, and will at run-time call dl_iterate_phdr only when needed.

Which makes me realise that maaaaybe at Go init() time, call a C function which would look over all dynamic librairies linked to the program, and call some unw_*() functions to force an upfront call to dl_iterate_phdr()... ?

Other random ideas / things to try

  • try to link with something other than the default clang on macOS. Maybe upstream clang, maybe gcc's ld.
  • try to do the equivalent of LD_PRELOAD'ing malloc() to detect a reentry and forces a return NULL (fail allocation), I think I've read libunwind would not mind
  • try to find the address of the malloc()'s mutex, and make cgo_traceback trylock it, and early return if it's locked
  • try to find if at Go init() time, we can call a C function which would call some unw_*() function with a fake PC pointing to each of those C libraries, to force an upfront call to dl_iterate_phdr() ?

My computer : MacOS Ventura 13.4.1, Intel CPU, clang 14.0.3.

@folays
Copy link
Author

folays commented Sep 18, 2023

(on macOS Intel, latest Ventura, Xcode/clang are the latest official from Apple)

Sorry, still laying out here some random attempts and discussion here.

I'm still showing at the bottom how to recompile and link with a locally git fetched uptream llvm's libunwind there, with sources then showing in the lldb debugger.

attempt to hook and divert reentrant calls to malloc() to a mmap(NULL)

I've worked on https://github.com/folays/cgosymbolizer_macos/tree/test which could be used on a local build with a :

git clone https://github.com/folays/cgosymbolizer_macos.git
cd cgosymbolizer_macos/
git checkout test

and a go work use /path/to/cgosymbolizer_macos/

The intention was to hook *alloc() calls at RUNTIME (WHICH WORK) without LD_PRELOAD equivalent.
I've used a (roughly speaking) static array of spinlocks[65536] :

  • trylock()
    • if trylock() failed, print some message to stdout about reentrancy + normal_lock()
  • calls system *alloc() (and return it at then end of the hook)
  • unlock()

The intention was that if an reentrant *alloc() was ongoing, and if it was "rare" (only for a few unlucky calls to malloc() from deep dl_iterate_phdr() , to return some allocations from a mmap(NULL) ...

lldb (macOS equiv. to gdb) shown me some SEGFAULT in libunwind at address 0x0 when unwinding deep from LuaJIT calls.
I guess there is also maybe with LuaJIT not correctly generating unwinding data for JITed code.

(I'm sure I get my calls to LuaJIT right tho)

attempt to recompile libunwind to have debug symbols of it, when using lldb

cd /tmp/ # please do it there, some path below reuses it
git clone https://github.com/llvm/llvm-project.git
cd llvm-project/

cmake -G Ninja -S runtimes -B build -DLLVM_ENABLE_RUNTIMES="libunwind" -DCMAKE_BUILD_TYPE="Debug" # Configure
ninja -C build unwind # Build
ninja -C build check-unwind # Test

Build the Go program with this libunwind instead of clang's apple's distributed one :

# probably some flags are not needed here. I'm beginning to be tired.
#  it mainly uses -rpath and -L and -l to try to force /tmp/llvm-project/build/lib/libunwind
# it compiles ./cmd/your-go-package-main/ to /tmp/program.bin
CGO_CFLAGS="-g -fno-omit-frame-pointer" CGO_LDFLAGS="-unwindlib=libunwind -Wl,-rpath,/tmp/llvm-project/build/lib/ -L/tmp/llvm-project/build/lib/ -lunwind" go build -gcflags=all='-N -l' -v -x -o /tmp/program.bin ./cmd/your-go-package-main/

# now run /tmp/program.bin ; Personally I run it like this :
GODEBUG=asyncpreemptoff=1,cgocheck=0 /tmp/program.bin

I get a little more output with this locally rebuilt libunwind. I've got on stdout (or stderr, didn't checked) some :

libunwind: malformed __unwind_info at 0x1019FFE68 bad second level page

  • The program gets execution farther than with default libunwind (it still hangs after some time, but works longer, and sometimes I get lucky and it completes before hanging)
    • I get the full C stack in the profiles output when it completes
  • I have roughly 25 of those messages

Also, when it hangs (which still happens regularly, but less often), lldb nows show me the source code :

# note : not always thread #1.
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)

# top of the stack
    frame #0: 0x000000010214f460 libunwind.1.0.dylib`libunwind::LocalAddressSpace::get32(this=0x000000010215c9b9, addr=0) at AddressSpace.hpp:175:5
   156 	/// LocalAddressSpace is used as a template parameter to UnwindCursor when
   157 	/// unwinding a thread in the same process.  The wrappers compile away,
   158 	/// making local unwinds fast.
   159 	class _LIBUNWIND_HIDDEN LocalAddressSpace {
   160 	public:
   [...]
   173 	  uint32_t         get32(pint_t addr) { <<< in this function of the class
   174 	    uint32_t val;
-> 175 	    memcpy(&val, (void *)addr, sizeof(val));
   176 	    return val;
   177 	  }

# called from :
frame #1: 0x000000010214a908 libunwind.1.0.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(addressSpace=0x000000010215c9b9, fdeStart=0, fdeInfo=0x000000c000010d68, cieInfo=0x000000c000010d30, useCIEInfo=false) at DwarfParser.hpp:177:43
   168 	/// Parse a FDE into a CIE_Info and an FDE_Info. If useCIEInfo is
   169 	/// true, treat cieInfo as already-parsed CIE_Info (whose start offset
   170 	/// must match the one specified by the FDE) rather than parsing the
   171 	/// one indicated within the FDE.
   172 	template <typename A>
   173 	const char *CFI_Parser<A>::decodeFDE(A &addressSpace, pint_t fdeStart,
   174 	                                     FDE_Info *fdeInfo, CIE_Info *cieInfo,
   175 	                                     bool useCIEInfo) {
   176 	  pint_t p = fdeStart;
-> 177 	  pint_t cfiLength = (pint_t)addressSpace.get32(p);
   178 	  p += 4;
   179 	  if (cfiLength == 0xffffffff) {
   180 	    // 0xffffffff means length is really next 8 bytes
   181 	    cfiLength = (pint_t)addressSpace.get64(p);
   182 	    p += 8;
   183 	  }
   [...]

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000010214f460 libunwind.1.0.dylib`libunwind::LocalAddressSpace::get32(this=0x000000010215c9b9, addr=0) at AddressSpace.hpp:175:5
    frame #1: 0x000000010214a908 libunwind.1.0.dylib`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(addressSpace=0x000000010215c9b9, fdeStart=0, fdeInfo=0x000000c000010d68, cieInfo=0x000000c000010d30, useCIEInfo=false) at DwarfParser.hpp:177:43
    frame #2: 0x000000010214cc32 libunwind.1.0.dylib`libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf(addressSpace=0x000000010215c9b9, pc=140703308999942, fdeStart=0, registers=0x000000c0000114a8, isSignalFrame=0x000000c000011599, stage2=false) at DwarfInstructions.hpp:196:7
    frame #3: 0x000000010214cb0d libunwind.1.0.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarfFDE(this=0x000000c000011498, stage2=false) at UnwindCursor.hpp:1020:12
    frame #4: 0x000000010214ca1c libunwind.1.0.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithCompactEncoding(this=0x000000c000011498, stage2=false) at UnwindCursor.hpp:1032:14
    frame #5: 0x000000010214bb46 libunwind.1.0.dylib`libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step(this=0x000000c000011498, stage2=false) at UnwindCursor.hpp:2903:20
    frame #6: 0x000000010214a29c libunwind.1.0.dylib`::__unw_step(cursor=0x000000c000011498) at libunwind.cpp:182:14
    frame #7: 0x0000000100e78fda loop.bin`cgo_traceback + 394
    frame #8: 0x0000000100e79f57 loop.bin`x_cgo_callers + 71
    frame #9: 0x00007ff80ade75ed libsystem_platform.dylib`_sigtramp + 29
    frame #10: 0x00007ff80ac06d06 libsystem_malloc.dylib`_malloc_zone_malloc + 1
    frame #11: 0x0000000100f8861b loop.bin`sqlite3MemMalloc + 27
    frame #12: 0x0000000100ebefe6 loop.bin`mallocWithAlarm + 230
    frame #13: 0x0000000100e9884a loop.bin`sqlite3Malloc + 90
    frame #14: 0x0000000100e987d4 loop.bin`sqlite3_malloc + 68
    frame #15: 0x0000000100e9402a loop.bin`decNumberToSQLite3Blob + 26
    frame #16: 0x0000000100e95f17 loop.bin`decimalAdd + 263
    frame #17: 0x0000000100e90631 loop.bin`decimalAddFunc + 113
    frame #18: 0x0000000100eec958 loop.bin`sqlite3VdbeExec + 41368
    frame #19: 0x0000000100ea443c loop.bin`sqlite3Step + 588
    frame #20: 0x0000000100e9f387 loop.bin`sqlite3_step + 103
    frame #21: 0x0000000100e7b63d loop.bin`_lfsq_sqlite3_step + 29
    frame #22: 0x0000000100e7af15 loop.bin`l_lfsq_stmt_execute + 133
    frame #23: 0x00000001021ff3d6 libluajit-5.1.2.1.1694940244.dylib`___lldb_unnamed_symbol343 + 68
    frame #24: 0x000000010221926a libluajit-5.1.2.1.1694940244.dylib`lua_pcall + 154
    frame #25: 0x0000000100e7c697 loop.bin`ylua_pcall + 39
    frame #26: 0x0000000100e7cbde loop.bin`_cgo_7e8f3114fa7f_Cfunc_ylua_pcall + 62
    frame #27: 0x000000010007ed08 loop.bin`runtime.asmcgocall.abi0 + 104
    frame #28: 0x000000010000d77f loop.bin`runtime.cgocall + 127
    frame #29: 0x0000000100b3390b loop.bin`github.com/folays/luajit-go._Cfunc_ylua_pcall.abi0 + 75
    frame #30: 0x0000000100b3b8b8 loop.bin`github.com/folays/luajit-go.(*State).Y_lua_pcall + 88
    frame #31: 0x0000000100b3d679 loop.bin`github.com/folays/luajit-go.(*State)._run + 569
    frame #32: 0x0000000100b3d1f2 loop.bin`github.com/folays/luajit-go.(*State)._runFuncWithResults + 242
    frame #33: 0x0000000100b3d065 loop.bin`github.com/folays/luajit-go.(*State).RunFuncWithResults + 165
[... my Go program stack...]
    frame #41: 0x000000010029eeed loop.bin`github.com/spf13/cobra.(*Command).execute + 3341
    frame #42: 0x000000010029ff4f loop.bin`github.com/spf13/cobra.(*Command).ExecuteC + 2511
    frame #43: 0x000000010029f3f2 loop.bin`github.com/spf13/cobra.(*Command).Execute + 50
    frame #44: 0x0000000100e78dc6 loop.bin`main.main + 38
    frame #45: 0x000000010004a107 loop.bin`runtime.main + 615
    frame #46: 0x000000010007f061 loop.bin`runtime.goexit.abi0 + 1

So... two mixed problems there I think :

  1. lots of Go projects (which do not use LuaJIT) stopped to use https://github.com/benesch/cgosymbolizer/ due to some deadlock issues (or more so CPU hang 100%)
  2. I may have an additional program with LuaJIT which, even if I were to find a solution to (1), would still exhibit the program due to using LuaJIT with JIT'ed code in the stack

Maybe the rebuilt llvm which outputs some libunwind: malformed __unwind_info at 0x.... bad second level page could be of interest (skip continuing on the stack ?).

I also don't know why the CPU goes to 100% whereas the symptoms seems to be libunwind catching a SEGFAULT due to doing a memcpy(NULL, ...) + segfault. I guess it don't die for some reasons, whilst other threads are maybe in a retry loop.

Note that every time that I had the problem, lldb shown me a EXC_BAD_ACCESS (code=1, address=0x0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants