Openblas_nolapack hangs when loaded using JavaCPP in a Java+SBT+Play Project on Win11 #5109

Open
jxtps opened this issue Feb 5, 2025 · 1 comment

@jxtps

jxtps commented Feb 5, 2025

This is the bug report from hell. It occurs only when the moons (openblas_nolapack, JavaCPP) and the stars (Win11 + Java + SBT + Play Framework) align.

I originally filed this against JavaCPP as bytedeco/javacpp-presets#1203, where the workaround was "don't use openblas". OK. That worked while MKL was available, but apparently MKL is now being dropped by JavaCPP (see the comment in bytedeco/javacpp-presets#1575), so I'm rapidly heading up that creek and have no idea where I left the paddle.

I can readily reproduce the issue on my local Win11 machine (and there's a small repro project in the JavaCPP issue). Great. It hangs on:

Debug: Loading C:\Users\admin\.javacpp\cache\openblas-0.3.28-1.5.12-20250124.032029-44-windows-x86_64.jar\org\bytedeco\openblas\windows-x86_64\libopenblas_nolapack.dll

I have even attached WinDbg to the process, identified the relevant thread, and taken a stack dump:

0:152> k
 # Child-SP          RetAddr               Call Site
00 00000054`a8ffa5d8 00007ffa`358c66a1     ntdll!NtFsControlFile+0x14
01 00000054`a8ffa5e0 00007ffa`357624ba     KERNELBASE!PeekNamedPipe+0xf1
02 00000054`a8ffa6b0 00007ffa`357044a0     ucrtbase!common_stat_handle_file_opened<_stat64>+0x15e
03 00000054`a8ffa760 00007ffa`357618a8     ucrtbase!<lambda_3e61fc1153d2eec3991e8733eecb5419>::operator()+0x58
04 00000054`a8ffa7d0 00007ffa`35761b58     ucrtbase!__crt_seh_guarded_call<int>::operator()<<lambda_d6a03b27cb314eb65d447ab85fffcbf2>,<lambda_3e61fc1153d2eec3991e8733eecb5419> &,<lambda_8d9723598c44aced2bc47669cc68e4e1> >+0x44
05 00000054`a8ffa800 00007ff9`8a480143     ucrtbase!common_fstat<_stat64>+0xc0
06 00000054`a8ffa880 00007ff9`8a3ee4ae     libgfortran_5!gfortrani_xrealloc+0xb273
07 00000054`a8ffa920 00007ff9`8a59034e     libgfortran_5!gfortrani_init_units+0x5e
08 00000054`a8ffa960 00007ff9`8a2dc7f2     libgfortran_5!ynf+0x6e
09 00000054`a8ffa990 00007ff9`8a2d12dd     libgfortran_5!backtrace_vector_release+0x202
0a 00000054`a8ffa9d0 00007ffa`38418b8f     libgfortran_5+0x12dd
0b 00000054`a8ffaa20 00007ffa`3845d63d     ntdll!LdrpCallInitRoutine+0x6b
0c 00000054`a8ffaa90 00007ffa`3845d3ee     ntdll!LdrpInitializeNode+0x1c9
0d 00000054`a8ffabe0 00007ffa`3845d460     ntdll!LdrpInitializeGraphRecurse+0x42
0e 00000054`a8ffac20 00007ffa`3841db1d     ntdll!LdrpInitializeGraphRecurse+0xb4
0f 00000054`a8ffac60 00007ffa`38418e30     ntdll!LdrpPrepareModuleForExecution+0xc5
10 00000054`a8ffaca0 00007ffa`384090cc     ntdll!LdrpLoadDllInternal+0x20c
11 00000054`a8ffad40 00007ffa`3841a74a     ntdll!LdrpLoadDll+0xb0
12 00000054`a8ffaf00 00007ffa`3587b732     ntdll!LdrLoadDll+0xfa
13 00000054`a8ffaff0 00007ffa`358777d1     KERNELBASE!LoadLibraryExW+0x172
14 00000054`a8ffb060 00007ffa`358d20ef     KERNELBASE!LoadLibraryExA+0x31
15 00000054`a8ffb0a0 00007ff9`8f01225a     KERNELBASE!LoadLibraryA+0x3f
16 00000054`a8ffb0d0 00007ff9`8edcecc0     jvm!c2v_getFlagValue+0x214cca
17 00000054`a8ffb130 00007ffa`20a23fa5     jvm!JVM_LoadLibrary+0xd0
18 00000054`a8ffb9a0 000001bd`368602d6     java!Java_jdk_internal_loader_NativeLibraries_load+0x89
19 00000054`a8ffbb30 000001cd`c28fcc41     0x000001bd`368602d6
1a 00000054`a8ffbb38 000001cd`c2066de0     0x000001cd`c28fcc41
1b 00000054`a8ffbb40 000001cd`c220d6a0     0x000001cd`c2066de0
1c 00000054`a8ffbb48 000001ce`1dcae438     0x000001cd`c220d6a0
1d 00000054`a8ffbb50 000001bd`00000000     0x000001ce`1dcae438
1e 00000054`a8ffbb58 000001cd`00000001     0x000001bd`00000000
1f 00000054`a8ffbb60 00000000`00000000     0x000001cd`00000001

o3-micro-high has this to say about that:

The stack trace shows that the hang occurs during the native library’s initialization—specifically, within the Fortran runtime code (libgfortran) that OpenBLAS uses. Key points include:

Where the hang occurs:
The thread is stuck in a Windows system call (NtFsControlFile via PeekNamedPipe) that’s invoked during a file-status check (common_stat_handle_file_opened). This call is part of the Fortran runtime’s unit initialization (e.g. in gfortrani_init_units).

Implication:
The Fortran runtime (libgfortran_5) appears to be performing some file/pipe I/O (likely to initialize Fortran I/O units) that hangs when it calls into the Windows API. This suggests that something in the initialization code is either incompatible with Windows 11 or is misbehaving in the context of your Java process.

Where the library is loaded:
The hang happens during the JVM’s library loading process (via JVM_LoadLibrary), so it’s not your Java code per se but the native library’s (OpenBLAS’s) initialization that is problematic.

In summary, the dump indicates that the native library (openblas_nolapack), via its Fortran runtime initialization, is hanging on a Windows file-system call. This points to a potential issue with the library’s (or Fortran runtime’s) initialization code on Windows 11, such as waiting indefinitely for a file/pipe status that never returns. You might need to check for known compatibility issues with OpenBLAS on Windows 11 or consider using a different build/configuration of the library.

Any suggestions? Thanks!

Versions: it (still) happens with openblas-0.3.28.
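
For anyone trying to narrow this down outside the full SBT + Play setup, a small harness can turn the silent hang into an explicit timeout. The following is only a sketch, not something from the repro project: `LoadWatchdog`, `tryLoad`, and the 10-second timeout are hypothetical names and values. It runs the load on a daemon thread and reports whether it finished, failed, or was still blocked when the timeout expired. The JavaCPP call shown in the comment is an assumption about how the load is triggered in the real project.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a native-library load with a timeout so a hang
// inside the DLL's initialization shows up as TIMED_OUT instead of a frozen app.
class LoadWatchdog {

    /** Outcome of a guarded load attempt. */
    enum Result { LOADED, FAILED, TIMED_OUT }

    static Result tryLoad(Runnable loader, long timeoutMillis) {
        // Daemon thread: a load that never returns does not by itself keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "native-load-watchdog");
            t.setDaemon(true);
            return t;
        });
        Future<?> pending = pool.submit(loader);
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Result.LOADED;
        } catch (TimeoutException e) {
            // The loader is still blocked, e.g. inside LoadLibrary.
            return Result.TIMED_OUT;
        } catch (ExecutionException e) {
            // The loader threw, e.g. UnsatisfiedLinkError.
            return Result.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Result.FAILED;
        } finally {
            // Cannot unblock a thread stuck in native code; this only stops the pool.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Assumption: with JavaCPP on the classpath the loader would look like
        //   () -> org.bytedeco.javacpp.Loader.load(
        //             org.bytedeco.openblas.global.openblas_nolapack.class)
        // Here a plain System.loadLibrary stands in so the sketch has no dependencies.
        String name = args.length > 0 ? args[0] : "openblas_nolapack";
        Result result = tryLoad(() -> System.loadLibrary(name), 10_000);
        System.out.println("load result: " + result);
    }
}
```

Running the same harness from a plain `java` command and from inside SBT could help show whether the launching environment is what makes the difference. A thread stuck inside the native loader cannot be interrupted from Java, so the watchdog only detects the hang; it does not recover from it.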

@martin-frbg
Collaborator

Not sure what to make of that - gfortran is "only" needed for LAPACK, and I'm not aware that it would be trying to open any files on initialization (except perhaps pseudo-files for reading environment variables or other system parameters). If you can build OpenBLAS yourself in this context, you could try building with NOFORTRAN=1 (which would result in an older version of LAPACK - but one translated to C - being used). Or if you do not expect to call any LAPACK function at all, only using BLAS, you could even compile with NO_LAPACK=1.
What is known to have caused problems in the past is the small stack size that Java provides by default (or used to provide in the past), so perhaps the trace you got is misleading and you are seeing some kind of heap-stack collision as OpenBLAS tries to allocate memory buffers? In that case you could try starting your Java environment with a larger -M (if that still makes sense today).
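
One way to test the stack-size theory without changing how the whole JVM is started is to perform the load on a dedicated thread with an explicitly requested stack size. This is a sketch under assumptions: `BigStackLoader`, `runWithStack`, and the 64 MiB figure are hypothetical, and the `stackSize` argument of the `Thread` constructor is documented as a hint that the JVM may ignore. The JVM-wide default for new Java threads is normally set with `-Xss`, which may or may not be what the `-M` above refers to.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task on a thread with a requested stack size.
class BigStackLoader {

    /**
     * Runs the task on a new daemon thread and waits up to timeoutMillis.
     * Returns null on success, otherwise the Throwable that was raised
     * (a TimeoutException if the task was still running at the deadline).
     */
    static Throwable runWithStack(Runnable task, long stackBytes, long timeoutMillis) {
        final Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(null, () -> {
            try {
                task.run();
            } catch (Throwable t) {
                failure[0] = t; // visible to the caller after join()
            }
        }, "big-stack-loader", stackBytes);
        worker.setDaemon(true);
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
        if (worker.isAlive()) {
            return new TimeoutException("still running after " + timeoutMillis + " ms");
        }
        return failure[0];
    }

    public static void main(String[] args) {
        long stackBytes = 64L * 1024 * 1024; // arbitrary illustrative value
        // Stand-in for the real JavaCPP load call, which is not shown here.
        Throwable err = runWithStack(
                () -> System.loadLibrary("openblas_nolapack"), stackBytes, 10_000);
        System.out.println(err == null ? "loaded" : "not loaded: " + err);
    }
}
```

If the load still blocks at the same place with a much larger stack, that would point away from a stack-size problem and back toward the file-status call seen in the trace.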
