-
Notifications
You must be signed in to change notification settings - Fork 13.3k
significant mul_add perf regression since nightly-2025-03-06 #140452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Could you clarify what flags this was built with if any? The idea was that LLVM should usually be able to lower the intrinsics directly to asm if known to be available. If it just happens to be missing a case that it should know about, then we can add it with asm in libm for now. But if this is one of the cases where glibc is using ifuncs to select features at runtime, things get a bit less clear. Here is the full list of functions that will likely be using our version now, for reference https://github.com/rust-lang/compiler-builtins/blob/fdbefb39d5bb0b95b29b821247044c8aaf436160/compiler-builtins/src/math/mod.rs#L21 |
It's using ifunc to select an implementation: https://github.com/bminor/glibc/blob/77930e0447e0b37a129db0e13c6c6f5e60a3019e/sysdeps/x86_64/fpu/multiarch/s_fma.c#L49 |
@Amanieu is there any reasonable way to do this in our libm? We could do runtime feature detection via cpuid, I’m not sure of any easy way to emulate ifuncs. |
Double checking, with +fma the single instruction is used https://rust.godbolt.org/z/YzTfT5Poa |
what one could do is make subsequent calls will use the updated pointer value which can go directly to the |
no specific target feature flags. i don't want to rely on target-cpu=native |
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Fixes: rust-lang/rust#140452 once synced
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Fixes: rust-lang/rust#140452 once synced
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Candidate patch so hopefully we can avoid reverting anything rust-lang/compiler-builtins#896 |
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Get performance closer to the glibc implementations by adding assembly fma routines, with runtime feature detection so they are used even if not compiled with `+fma` (as the distributed standard library is often not). Glibc uses ifuncs, this implementation stores a function pointer in an atomic. Results of CPU flags are also cached in order to avoid repeating the startup time in calls to different functions. The feature detection code is a slightly simplified version of `std-detect`. Musl sources were used as a reference [1]. Fixes: rust-lang/rust#140452 once synced [1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
Includes the following changes: * Use runtime feature detection for fma routines on x86 [1] Fixes: rust-lang#140452 [1]: rust-lang/compiler-builtins#896
Builtins version sync #140648 |
Update `compiler-builtins` to 0.1.157 Includes the following changes: * Use runtime feature detection for fma routines on x86 [1] Fixes: rust-lang#140452 [1]: rust-lang/compiler-builtins#896
rust uses compiler_builtins fma instead of the one from libc since nightly 2025-03-06 (2025-03-05 is fine)
my suspicion is linkage changes somewhere but i haven't been following the changes in compiler_builtins
code
i ran this code under
perf
on linux x86_64 (perf record -g cargo run
)the code loops forever so i interrupt it manually after a couple seconds
on 2025-03-05, the profiling (
perf report -g
) shows that most of the time is spent in__fma_fma3
. on 2025-03-06 it's now spending most of the time incompiler_builtins::math::libm::fma::fma
, which is a lot slower since it emulatesfma
with the limited sse instructions instead ofvfmadd231sd
version it worked on
nightly-2025-03-05
version with regression
nightly-2025-03-06
@rustbot modify labels: +regression-from-stable-to-nightly -regression-untriaged
The text was updated successfully, but these errors were encountered: