Skip to content

significant mul_add perf regression since nightly-2025-03-06 #140452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
sarah-quinones opened this issue Apr 29, 2025 · 9 comments · May be fixed by #140648
Open

significant mul_add perf regression since nightly-2025-03-06 #140452

sarah-quinones opened this issue Apr 29, 2025 · 9 comments · May be fixed by #140648
Labels
C-bug Category: This is a bug. P-high High priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@sarah-quinones
Copy link

sarah-quinones commented Apr 29, 2025

rust uses compiler_builtins fma instead of the one from libc since nightly 2025-03-06 (2025-03-05 is fine)

my suspicion is linkage changes somewhere but i haven't been following the changes in compiler_builtins

code

i ran this code under perf on linux x86_64 (perf record -g cargo run)

the code loops forever so i interrupt it manually after a couple seconds

use core::hint::black_box as bb;

fn main() {
    loop {
        bb(f64::mul_add(1.0, 1.0, bb(1.0)));
    }
}

on 2025-03-05, the profiling (perf report -g) shows that most of the time is spent in __fma_fma3. on 2025-03-06 it's now spending most of the time in compiler_builtins::math::libm::fma::fma, which is a lot slower since it emulates fma with the limited sse instructions instead of vfmadd231sd

version it worked on

nightly-2025-03-05

version with regression

nightly-2025-03-06

@rustbot modify labels: +regression-from-stable-to-nightly -regression-untriaged

@sarah-quinones sarah-quinones added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Apr 29, 2025
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. and removed regression-untriaged Untriaged performance or correctness regression. labels Apr 29, 2025
@nikic
Copy link
Contributor

nikic commented Apr 29, 2025

@tgross35

@jieyouxu jieyouxu added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Apr 29, 2025
@tgross35
Copy link
Contributor

Could you clarify what flags this was built with if any?

The idea was that LLVM should usually be able to lower the intrinsics directly to asm if known to be available. If it just happens to be missing a case that it should know about, then we can add it with asm in libm for now. But if this is one of the cases where glibc is using ifuncs to select features at runtime, things get a bit less clear.

Here is the full list of functions that will likely be using our version now, for reference https://github.com/rust-lang/compiler-builtins/blob/fdbefb39d5bb0b95b29b821247044c8aaf436160/compiler-builtins/src/math/mod.rs#L21

@Amanieu
Copy link
Member

Amanieu commented Apr 29, 2025

@tgross35
Copy link
Contributor

@Amanieu is there any reasonable way to do this in our libm? We could do runtime feature detection via cpuid, I’m not sure of any easy way to emulate ifuncs.

@tgross35
Copy link
Contributor

Double checking, with +fma the single instruction is used https://rust.godbolt.org/z/YzTfT5Poa

@sarah-quinones
Copy link
Author

what one could do is make fma a static AtomicPtr that gets initialized with a function that updates the pointer, then returns the result

subsequent calls will use the updated pointer value which can go directly to the vfmadd path. i don't know how this compares to ifuncs in terms of perf but it should be better than the status quo

@sarah-quinones
Copy link
Author

Could you clarify what flags this was built with if any?

no specific target feature flags. i don't want to rely on target-cpu=native

tgross35 added a commit to tgross35/compiler-builtins that referenced this issue Apr 30, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Fixes: rust-lang/rust#140452 once synced
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue Apr 30, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Fixes: rust-lang/rust#140452 once synced
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue Apr 30, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue Apr 30, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue Apr 30, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue Apr 30, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
@tgross35
Copy link
Contributor

Candidate patch so hopefully we can avoid reverting anything rust-lang/compiler-builtins#896

@jieyouxu jieyouxu removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 30, 2025
@Amanieu Amanieu added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Apr 30, 2025
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 1, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 1, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 1, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 1, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 1, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 1, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
tgross35 added a commit to tgross35/compiler-builtins that referenced this issue May 3, 2025
Get performance closer to the glibc implementations by adding assembly
fma routines, with runtime feature detection so they are used even if
not compiled with `+fma` (as the distributed standard library is often
not). Glibc uses ifuncs, this implementation stores a function pointer
in an atomic.

Results of CPU flags are also cached in order to avoid repeating the
startup time in calls to different functions. The feature detection code
is a slightly simplified version of `std-detect`.

Musl sources were used as a reference [1].

Fixes: rust-lang/rust#140452 once synced

[1]: https://github.com/bminor/musl/blob/c47ad25ea3b484e10326f933e927c0bc8cded3da/src/math/x32/fma.c
@tgross35 tgross35 reopened this May 4, 2025
tgross35 added a commit to tgross35/rust that referenced this issue May 4, 2025
Includes the following changes:

* Use runtime feature detection for fma routines on x86 [1]

Fixes: rust-lang#140452

[1]: rust-lang/compiler-builtins#896
@tgross35 tgross35 linked a pull request May 4, 2025 that will close this issue
@tgross35
Copy link
Contributor

tgross35 commented May 4, 2025

Builtins version sync #140648

tgross35 added a commit to tgross35/rust that referenced this issue May 5, 2025
Update `compiler-builtins` to 0.1.157

Includes the following changes:

* Use runtime feature detection for fma routines on x86 [1]

Fixes: rust-lang#140452

[1]: rust-lang/compiler-builtins#896
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. P-high High priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants