Skip to content

Add support for Ampere AmpereOne processors #5309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

davidz-ampere
Copy link

Add support for Ampere processors(AmpereOne AC3, AC4), which are arm8.6+ ISA combability.

@martin-frbg
Copy link
Collaborator

Thank you. I expect that the compiler options will offer some performance benefit over treating them the same as the NeoverseN1 reference design, but it should be possible to simplify the KERNEL files to an include $(KERNELDIR)/KERNEL.NEOVERSEN1 (and build on that if necessary) ? And furthermore it might be acceptable to have these cpus mapped to NEOVERSEN1 in DYNAMIC_ARCH builds, rather than have another set of near-identical BLAS kernels in the multi-cpu library ? (And lastly, are the One and 1A different enough in features relevant to OpenBLAS to warrant two types - my (limited) understanding so far was that the next big change would come with 1B ?)

@davidz-ampere
Copy link
Author

I compared the HPL test with the library build from the default develop branch and with this patch on the same AmpereOne 192c system:
default:
Column=000000192 Fraction= 0.1% Gflops=1.901e+03
Column=000000384 Fraction= 0.2% Gflops=1.901e+03
Column=000000576 Fraction= 0.2% Gflops=1.914e+03
Column=000000768 Fraction= 0.3% Gflops=1.924e+03
Column=000000960 Fraction= 0.4% Gflops=1.929e+03
Column=000001152 Fraction= 0.5% Gflops=1.934e+03
Column=000001344 Fraction= 0.5% Gflops=1.937e+03
Column=000001536 Fraction= 0.6% Gflops=1.939e+03
Column=000001728 Fraction= 0.7% Gflops=1.942e+03
Column=000001920 Fraction= 0.8% Gflops=1.944e+03
Column=000002112 Fraction= 0.8% Gflops=1.946e+03
Column=000002304 Fraction= 0.9% Gflops=1.948e+03
Column=000002496 Fraction= 1.0% Gflops=1.949e+03

with this patch:
Column=000000192 Fraction= 0.1% Gflops=3.140e+03
Column=000000384 Fraction= 0.2% Gflops=3.068e+03
Column=000000576 Fraction= 0.2% Gflops=3.081e+03
Column=000000768 Fraction= 0.3% Gflops=3.088e+03
Column=000000960 Fraction= 0.4% Gflops=3.091e+03
Column=000001152 Fraction= 0.5% Gflops=3.088e+03
Column=000001344 Fraction= 0.5% Gflops=3.089e+03
Column=000001536 Fraction= 0.6% Gflops=3.090e+03
Column=000001728 Fraction= 0.7% Gflops=3.093e+03
Column=000001920 Fraction= 0.8% Gflops=3.094e+03
Column=000002112 Fraction= 0.8% Gflops=3.095e+03
Column=000002304 Fraction= 0.9% Gflops=3.095e+03
Column=000002496 Fraction= 1.0% Gflops=3.094e+03

Yes, Martin, I agree with you, I should use include $(KERNELDIR)/KERNEL.NEOVERSEN1 to avoid duplicate code. Let me fix it.

@martin-frbg
Copy link
Collaborator

Thanks, the speedup is quite impressive, but then I guess a purely default build would currently treat it as a very basic ARMV8 (-march=armv8-a only). Could you comment on performance relative to NEOVERSEN1 target ? (No exact numbers needed if they're not out in the open yet, basically just if they need to be included in the multiple-cpu DYNAMIC_ARCH builds in their own right, or if it would be sufficient to have NEOVERSEN1 stand in for them - I can do that part of the code, but I can't do the benchmarks to see if it's worth making that build of the library bigger)

@davidz-ampere
Copy link
Author

Yes, the default build will use "-march=armv8-a" and the kernel file might also be the general one.
I force build a library from NEOVERSEN1 by adding #define FORCE_NEOVERSEN1 in to getarch.c.
The performance is almost the same as the lib with this patch.
That makes sense because the kernel file is the same.

As you mentioned, there will be next gen Ampere processor, is there any potential complexity if putting AMPERE1/AMPERE1A into NEOVERSEN1 DYNAMIC_ARCH?

@martin-frbg
Copy link
Collaborator

My thinking is simply that if these two AmpereOne processors are basically a faster NeoverseN1 compatible and there is no compelling reason to write dedicated BLAS kernels for them in the near future (?), we could avoid code duplication and just match the new cpuids to NEOVERSEN1 like we do for Altra. For DYNAMIC_ARCH, this would mean keeping the library relatively small. For dedicated builds, it would avoid the extra sets of GEMM parameters and having to check that we add "ifdef AMPEREONE" whereever there is conditional code for Neoverse already - the only user-visible drawback would be that the library will not be named libopenblas_ampereone

@davidz-ampere
Copy link
Author

Although we are using the kernel file of NeoverseN1, but these processors are quite different from the NeoverseN1 core. Altra is based on N1 core, which is armv8.2 based. And ampere1 and ampere1a is armv8.6 based. So it is not a good idea to simply map the ampere cpu id to NEOVERSEN1.

I didn't consider the DYNAMIC_ARCH in these commit. If that is necessary, I can make another PR to support it.

@martin-frbg
Copy link
Collaborator

ampere1 and ampere1a is armv8.6 based. So it is not a good idea to simply map the ampere cpu id to NEOVERSEN1.

right, but 8.6 probably won't do much good when the code doesn't make use of it. That was why I asked how much of the performance gain vs. generic ARMV8 was from treating it like N1 compared to what the correct -march/mtune added - and I understood your reply to be that the vast majority of the speedup came from claiming it's N1.
I'm aware that 8.6 should provide opportunities for BFLOAT16 and FP16 kernels, which would make the case for at least AmpereOne, but I'm still in the dark whether we need 1A as a separate target - the impression I get from my sources is that the 1a is a refresh with no dramatic changes in fp instruction set or latencies ? (Here I'm just trying to avoid code bloat and an inflation of very similar targets - we made that mistake in the early days of arm64 support with all the Cortex A models (compare how on x86_64 all the AVX2-only refreshes are handled by one HASWELL target)

I didn't consider the DYNAMIC_ARCH in these commit. If that is necessary, I can make another PR to support it.

DYNAMIC_ARCH is what typically gets used by third-party distributors - EasyBuild/EESSI in the HPC environment or the Python "wheels" for NumPy/SciPy, and anybody who does not want to build OpenBLAS on each target in a heterogeneous (though arm64) environment. This is why I put the simple forwarding to N1 into 0.3.30 as a last- minute patch.

@martin-frbg martin-frbg added this to the 0.3.31 milestone Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants