[regression] 0.3.29 test failure on ARMV6 #5104
Comments
It seems to be a threading issue:
But the test can pass in serial mode:
Update: seems to be pthread related. OpenMP is fine:
|
Curious that you would get an error in tbmv (only). Did you change compilers between 0.3.28 and 0.3.29, perchance? |
I did not change the compiler. It's gcc-14 |
I see - still seems weird that it is STBMV of all functions. I won't be able to reproduce this or eventually locate the commit that caused this until late next week. (Nothing immediately obvious that wouldn't also mess up all of GEMM) |
I'm increasingly certain that none of the files/functions relevant for multithreaded TBMV were changed between 0.3.28 and 0.3.29... this could be an older bug, such as a missing memory barrier somewhere, that has only a low probability of triggering. Is the armhf machine actual hardware or something like qemu? |
The Recently I have seen some regression bugs in gcc-14 that lead to internal compiler errors when compiling pytorch (downgrading to gcc-13 solves the issue). Suspecting that the toolchain might be at fault here as well, I tested with gcc-13, but the same issue persists.
|
A Docker container running a 32-bit OS on armv8 hardware? |
It's I'm trying to do a bisect. Will update later. |
I'm on a train with variable network quality, but from checking file dates and commit logs anything remotely relevant would have happened before 0.3.28. |
My local git bisect result is
By looking at d9f368d, I agree with you that the problem is introduced somewhere in the past. |
Might be unrelated, but: OpenBLAS was recently updated to 0.3.29 in FreeBSD ports as well, and I'm experiencing a similar threading/timing test failure when building it locally (via the FBSD ports infrastructure), but it's
The compiler is |
@xmirya probably unrelated given the difference in architectures, but what is your hardware please (the optimized BLAS kernels are different for individual cpu models) ? (SIGBUS instead of simply producing bad results could mean an access to unaligned data where the instruction used requires data alignment) |
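To illustrate the alignment point above: on targets where some load instructions require aligned addresses, dereferencing a misaligned pointer can fault, whereas copying the bytes with `memcpy` leaves the choice of a legal access to the compiler. This is only a minimal sketch; the helper name `load_f32_unaligned` is hypothetical and does not come from OpenBLAS, and nothing here is a claim about where the failing access actually is.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of OpenBLAS): read a float from an
 * address that may not be 4-byte aligned. memcpy has no alignment
 * requirement on its arguments, so the compiler has to emit an access
 * that is valid for the target instead of a plain aligned load. */
float load_f32_unaligned(const void *p)
{
    float v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

By contrast, casting such an address to `const float *` and dereferencing it is undefined behaviour in C, and vectorized kernels that use alignment-requiring instructions have the same constraint.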
It's i5-2410M, it has SSE* and AVX, but no AVX2 or higher, the build is done with
added to |
@xmirya that would appear to be a standard Sandy Bridge target - I cannot reproduce the failure on Ryzen5 but can take an actual SandyBridge system out of storage sometime next week |
Thanks, would be grateful if you could |
openblas 0.3.29 fails to build on Debian's armhf machine: https://buildd.debian.org/status/fetch.php?pkg=openblas&arch=armhf&ver=0.3.29%2Bds-1&stamp=1738175608&raw=0
It seems to be a regression in the cblas_stbmv function:
Debian uses a modified source tree. But I can reproduce this problem with the upstream git repo:
The error message reads:
This is a regression because 0.3.28 does not fail the test. I checked the diff between 0.3.28 and 0.3.29 but did not find anything straightforward. Any idea?
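For anyone who wants to cross-check results outside the bundled test suite, here is a naive single-threaded reference for one STBMV case. It is a sketch under stated assumptions: it covers only the upper-triangular, non-transposed, non-unit-diagonal variant with column-major band storage and unit stride, and the function name `ref_stbmv_upper` is hypothetical, not an OpenBLAS symbol.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of OpenBLAS).
 * Computes x := A*x in place, where A is an n-by-n upper-triangular
 * band matrix with k superdiagonals, stored in BLAS band format:
 * element A(i,j) with max(0, j-k) <= i <= j is found at
 * a[(k + i - j) + j*lda], and lda must be at least k + 1. */
void ref_stbmv_upper(int n, int k, const float *a, int lda, float *x)
{
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    for (int i = 0; i < n; i++) {
        /* Row i only has nonzeros in columns i .. min(i + k, n - 1). */
        int jmax = (i + k < n - 1) ? i + k : n - 1;
        float sum = 0.0f;
        for (int j = i; j <= jmax; j++)
            sum += a[(size_t)(k + i - j) + (size_t)j * (size_t)lda] * x[j];
        /* Safe in place: later rows never read x[i] again. */
        x[i] = sum;
    }
}
```

The output can be compared against `cblas_stbmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit, n, k, a, lda, x, 1)` on a copy of the same input. The summation order in an optimized or multithreaded kernel may differ, so a small tolerance is more appropriate than exact equality.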