MDEV-34699 - mhnsw: support aarch64 SIMD instructions #3671

Open
wants to merge 1 commit into base: 11.7

Conversation

@svoj (Contributor) commented Dec 1, 2024:

  • The Jira issue number for this PR is: MDEV-34699

Description

SIMD implementations of bloom filters and dot product calculation.

Release Notes

None.

How can this PR be tested?

mtr

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@svoj requested a review from vuvova on December 1, 2024, 17:03
@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

#define DEFAULT_IMPLEMENTATION
#ifdef __aarch64__
#include <arm_neon.h>
#undef DEFAULT_IMPLEMENTATION

@vuvova (Member):
I think a default implementation must always exist (that's why it's called "default") for the case when the CPU doesn't support the necessary SIMD instructions.

@svoj (Contributor, Author):
Function multiversioning works only for x86; there's no "default" target on other platforms:
https://gcc.gnu.org/wiki/FunctionMultiVersioning

This support is available in GCC 4.8 and later. Support is only available in C++ for i386 targets.

The code is under #ifdef __aarch64__, which guarantees NEON presence (it presumes at least ARMv8). If we need to support this optimisation on earlier platforms (ARMv7?), it could get tricky.

There are also the more advanced SVE/SVE2 extensions, but they don't seem to be widely available, e.g. there's none on our buildbot machine. And where they are available, they don't seem to offer registers wider than NEON's.
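
For illustration, a minimal C++ sketch of what gcc function multiversioning looks like on x86 (hypothetical function and names, not code from this patch); with the GCC versions discussed here this attribute-based dispatch is not available on aarch64, hence the single NEON implementation:

/* Sketch only: gcc function multiversioning, available on x86 (C++).
   Function name and body are illustrative, not from the patch. */
#include <stddef.h>

__attribute__((target("default")))
int sum_ints(const int *v, size_t n)      /* fallback version */
{
  int s= 0;
  for (size_t i= 0; i < n; i++)
    s+= v[i];
  return s;
}

__attribute__((target("avx2")))
int sum_ints(const int *v, size_t n)      /* chosen at run time on AVX2-capable CPUs */
{
  int s= 0;
  for (size_t i= 0; i < n; i++)           /* gcc can auto-vectorize this with AVX2 */
    s+= v[i];
  return s;
}

/* A call to sum_ints() goes through a compiler-generated resolver that picks
   the best version for the running CPU. On aarch64 gcc this mechanism is not
   available, so the patch provides a single NEON implementation instead. */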

@vuvova (Member):
Oh, okay. Please add a comment saying "Function multiversioning works only for x86" somewhere there, or maybe something more verbose, like

/*
  Use gcc function multiversioning to optimize for a specific CPU with run-time detection.
  Works only for x86; for other architectures we provide only one implementation for now.
*/

There's MDEV-34804, which should lift the above limitation.

uint64x2_t step4= veorq_u64(step1, veorq_u64(step2, step3));
uint64x2_t step5;
step5= vsetq_lane_u64(vgetq_lane_u64(step4, 0) * prime_mx2, step4, 0);
step5= vsetq_lane_u64(vgetq_lane_u64(step4, 1) * prime_mx2, step5, 1);

@vuvova (Member):
Strange. It doesn't have 64-bit vector multiplication at all?

@svoj (Contributor, Author):
Correct, I couldn't find any. And I don't seem to be alone; google "neon multiply 64".
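
For context, a minimal self-contained sketch (hypothetical helper name, not the patch verbatim) of the workaround used above: extract each 64-bit lane, multiply it as a scalar, and insert the result back:

#include <arm_neon.h>
#include <stdint.h>

/* Multiply both 64-bit lanes of v by the scalar constant k.  NEON has no
   64x64->64 vector multiply, so each lane is extracted, multiplied as a
   scalar, and inserted back, as in the quoted step4/step5 code. */
static inline uint64x2_t mul_lanes_u64(uint64x2_t v, uint64_t k)
{
  uint64x2_t r= vsetq_lane_u64(vgetq_lane_u64(v, 0) * k, v, 0);
  return vsetq_lane_u64(vgetq_lane_u64(v, 1) * k, r, 1);
}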

uint64x2_t vecMaskByteIdx= vshrq_n_u64(vecMaskIdx, 3);
int64x2_t vecMaskBitIdx=
  vsubq_s64(vdupq_n_s64(0),
            vreinterpretq_s64_u64(vandq_u64(vecMaskIdx, vdupq_n_u64(0x7))));

@vuvova (Member):
could you explain this one, please?

@svoj (Contributor, Author):
That's easy: it is equivalent to -(int64_t) (maskIdx & 7), which is almost what the default implementation does.

The negation turns the left shift into a right shift (vshlq_u64(uint64x2_t a, int64x2_t b) shifts right for negative counts). There's no direct variable right-shift intrinsic; the right shift only supports a scalar immediate count: vshrq_n_u64(uint64x2_t a, const int n).
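
A minimal sketch of the idiom (hypothetical helper name, not from the patch): since vshlq_u64 shifts right when given negative per-lane counts, a variable right shift is done as a left shift by the negated amounts:

#include <arm_neon.h>

/* Shift each 64-bit lane of v right by the per-lane count in cnt.
   There is no per-lane variable right-shift intrinsic, so negate the
   counts and use vshlq_u64, which shifts right for negative values. */
static inline uint64x2_t shr_var_u64(uint64x2_t v, uint64x2_t cnt)
{
  int64x2_t neg= vsubq_s64(vdupq_n_s64(0), vreinterpretq_s64_u64(cnt));
  return vshlq_u64(v, neg);
}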

@vuvova (Member):
Add a comment please, something like

/* Shift right in NEON is implemented as shift left by a negative value; do the negation here. */

uint64x2_t vecRawMasks= vdupq_n_u64(*reinterpret_cast<const uint64_t*>
                                    (masks + vgetq_lane_u64(vecMaskByteIdx, 0)));
vecRawMasks= vsetq_lane_u64(*reinterpret_cast<const uint64_t*>
                            (masks + vgetq_lane_u64(vecMaskByteIdx, 1)), vecRawMasks, 1);

@vuvova (Member):
Hmm, does this NEON-specific implementation even make sense? It only handles two 64-bit values at a time, and the multiplication and array lookups are done per value, not with SIMD. Perhaps a default implementation of the bloom filter would be just as fast? It'd make the code simpler.

@svoj (Contributor, Author):
I did ask myself the same question in this commit's comment:

Performance improvement (microbenchmark) for bloom filters is less exciting,
within 10-30% ballpark depending on compiler options and load.

I could try shuffling intrinsics around to get a better improvement if you feel it is worth it (considering how bloom filters are used and whether they're a bottleneck anyway). Generally I tend to agree we should put it on the shelf, unless you see a good reason to keep it.

@vuvova (Member):
10-30% isn't that bad, and the change is highly local (one header file only). Maybe it's worth it after all.


There seem to be no performance difference between vmull+vmull_high and
vmull+vmlal2_high implementations. The second implementation is preserved
just for reference.

@vuvova (Member):
Let's drop the other implementation :) and just keep the comment "There seem to be ... implementation".

@svoj (Contributor, Author):
Alright, I'll let the other implementation gather dust as a Jira comment.
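
For reference, a self-contained sketch (not the patch verbatim) of the two variants being discussed; they correspond to the smull/smull2 and smull/smlal2 instruction sequences shown in the assembly further below:

#include <arm_neon.h>
#include <stdint.h>

/* Variant (a): two widening multiplies (vmull + vmull_high),
   then a widening horizontal add of each half. */
static inline int64_t dot8_vmull_pair(int16x8_t a, int16x8_t b)
{
  int32x4_t lo= vmull_s16(vget_low_s16(a), vget_low_s16(b));
  int32x4_t hi= vmull_high_s16(a, b);
  return vaddlvq_s32(lo) + vaddlvq_s32(hi);
}

/* Variant (b): widening multiply plus widening multiply-accumulate
   (vmull + vmlal2), then a single widening horizontal add. */
static inline int64_t dot8_vmlal2(int16x8_t a, int16x8_t b)
{
  int32x4_t acc= vmull_s16(vget_low_s16(a), vget_low_s16(b));
  acc= vmlal_high_s16(acc, a, b);
  return vaddlvq_s32(acc);
}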

#endif
v1+= NEON_dims;
v2+= NEON_dims;
}

@vuvova (Member):
Did you look at the assembly code for this function? Can you paste it here, please? (Not in the code, in a comment on the PR.)

@vuvova (Member) commented Dec 3, 2024:
Another question: it would be good to compare with the code the compiler would generate without explicit use of intrinsics, but with explicit vectorization. That is, with something like

typedef float v128i __attribute__((vector_size(NEONbytes)));
v128i *p1= (v128i*)v1, *p2=(v128i*)v2, d=0;
for (size_t i= 0; i < (len + NEON_dims - 1) / NEON_dims; i++, p1++, p2++)
  d+=*p1 * *p2;

This compiles into (https://godbolt.org/z/GjnsqoEhq)

dot_product(void const*, void const*, int):
        add     w4, w2, 14
        adds    w3, w2, 7
        csel    w4, w4, w3, mi
        asr     w4, w4, 3
        cmp     w2, 0
        ble     .L4
        movi    v0.4s, 0
        mov     x2, 0
        mov     w3, 0
.L3:
        ldr     q2, [x0, x2]
        add     w3, w3, 1
        ldr     q1, [x1, x2]
        add     x2, x2, 16
        mla     v0.4s, v2.4s, v1.4s
        cmp     w3, w4
        blt     .L3
        dup     s1, v0.s[1]
        add     v0.2s, v1.2s, v0.2s
        fmov    w0, s0
        ret
.L4:
        mov     w0, 0
        ret

So, it seems the compiler decided not to vectorize it?
Your explicit code is

dot_product(short const*, short const*, int):
        add     w5, w2, 14
        adds    w3, w2, 7
        csel    w5, w5, w3, mi
        mov     x6, x0
        asr     w5, w5, 3
        cmp     w2, 0
        ble     .L4
        mov     x3, 0
        mov     w4, 0
        mov     w0, 0
.L3:
        ldr     q0, [x6, x3]
        add     w4, w4, 1
        ldr     q2, [x1, x3]
        add     x3, x3, 16
        smull   v1.4s, v0.4h, v2.4h
        smull2  v0.4s, v0.8h, v2.8h
        saddlv  d1, v1.4s
        saddlv  d0, v0.4s
        add     d0, d1, d0
        fmov    x2, d0
        add     w0, w0, w2
        sxth    w0, w0
        cmp     w4, w5
        blt     .L3
        ret
.L4:
        mov     w0, 0
        ret

The disabled (#if 0) implementation differs in:

        smull   v0.4s, v2.4h, v1.4h
        smlal2  v0.4s, v2.8h, v1.8h
        saddlv  d0, v0.4s
        fmov    x2, d0

@svoj (Contributor, Author):
Dunno. I tried adding dot_product() without explicit vectorization and compiling with -ftree-vectorize. It produced something very complex that did contain vector instructions. Probably the compiler couldn't find a way to vectorize your example due to the widening or whatnot.
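
As a point of comparison, a plain scalar version (a hypothetical test harness, not the patch) that can be fed to gcc with -O3 or -O2 -ftree-vectorize on aarch64 to see how the widening int16-to-int32 accumulation gets auto-vectorized:

#include <stdint.h>
#include <stddef.h>

/* Scalar reference dot product over int16_t vectors with int32_t
   accumulation; compare the code gcc generates for this against the
   intrinsic version discussed above. */
int dot_product_scalar(const int16_t *v1, const int16_t *v2, size_t len)
{
  int32_t d= 0;
  for (size_t i= 0; i < len; i++)
    d+= (int32_t) v1[i] * v2[i];
  return d;
}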

@vuvova (Member) left a comment:
It seems my only request here is to add a couple of comments. Presuming that will be done, I'll approve the PR right away to reduce back-and-forth.

@vuvova (Member) commented Dec 3, 2024:
Just to clarify, "approved" means (see https://mariadb.com/kb/en/mariadb-quality-development-rules/) that you push into, say, bb-11.8-MDEV-34699-vector-arm and change the MDEV status to in-testing.

Commit message:

SIMD implementations of bloom filters and dot product calculation.

A microbenchmark shows 1.7x dot product performance improvement compared to
regular -O2/-O3 builds and 2.4x compared to builds with auto-vectorization
disabled.

Performance improvement (microbenchmark) for bloom filters is less exciting,
within 10-30% ballpark depending on compiler options and load.

Misc implementation notes:
CalcHash: no _mm256_shuffle_epi8(), use explicit XOR/shift.
CalcHash: no 64bit multiplication, do scalar multiplication.
ConstructMask/Query: no _mm256_i64gather_epi64, access array elements explicitly.
Query: no _mm256_movemask_epi8, accumulate bits manually.
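
To illustrate the last note above (a sketch under assumptions, not the patch's code): NEON has no _mm256_movemask_epi8 counterpart, so a comparison result, where each lane is all ones or all zeros, is collapsed into a small bitmask by extracting one bit per lane and combining the bits manually:

#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical helper: turn a 2-lane 64-bit comparison result
   (each lane all ones or all zeros) into a 2-bit mask. */
static inline unsigned mask_from_cmp_u64(uint64x2_t cmp)
{
  return (unsigned) ((vgetq_lane_u64(cmp, 0) & 1) |
                     ((vgetq_lane_u64(cmp, 1) & 1) << 1));
}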

@svoj (Contributor, Author) commented Dec 3, 2024:
@vuvova I think I resolved all your requests. I don't think development rules apply to me, since I have no write access (which I'm very happy about in moments like this).
