MDEV-34699 - mhnsw: support aarch64 SIMD instructions #3671

Open
wants to merge 1 commit into base: 11.7

Conversation

@svoj (Contributor) commented Dec 1, 2024:

  • The Jira issue number for this PR is: MDEV-34699

Description

SIMD implementations of bloom filters and dot product calculation.

Release Notes

None.

How can this PR be tested?

mtr

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@svoj requested a review from vuvova on December 1, 2024, 17:03
@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

#define DEFAULT_IMPLEMENTATION
#ifdef __aarch64__
#include <arm_neon.h>
#undef DEFAULT_IMPLEMENTATION

@vuvova (Member):
I think a default implementation must always exist (that's why it's called "default") for the case when the CPU doesn't support the necessary SIMD instructions.

@svoj (Contributor, Author):
Function multiversioning works only for x86; there's no "default" target on other platforms:
https://gcc.gnu.org/wiki/FunctionMultiVersioning

This support is available in GCC 4.8 and later. Support is only available in C++ for i386 targets.

The code is under #ifdef __aarch64__, which guarantees NEON presence (it presumes at least ARMv8). If we need to support this optimisation on earlier platforms (ARMv7?), it could get tricky.

There are also the more advanced SVE/SVE2 extensions, but they don't seem to be widely available, e.g. there's none on our buildbot machine. And where they are available, they don't seem to offer registers wider than NEON's.
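
For illustration, a minimal C++ sketch of what gcc function multiversioning looks like on x86 (hypothetical function and names, not code from this patch); with the GCC versions discussed here this attribute-based dispatch is not available on aarch64, hence the single NEON implementation:

/* Sketch only: gcc function multiversioning, available on x86 (C++).
   Function name and body are illustrative, not from the patch. */
#include <stddef.h>

__attribute__((target("default")))
int sum_ints(const int *v, size_t n)      /* fallback version */
{
  int s= 0;
  for (size_t i= 0; i < n; i++)
    s+= v[i];
  return s;
}

__attribute__((target("avx2")))
int sum_ints(const int *v, size_t n)      /* chosen at run time on AVX2-capable CPUs */
{
  int s= 0;
  for (size_t i= 0; i < n; i++)           /* gcc can auto-vectorize this with AVX2 */
    s+= v[i];
  return s;
}

/* A call to sum_ints() goes through a compiler-generated resolver that picks
   the best version for the running CPU. On aarch64 gcc this mechanism is not
   available, so the patch provides a single NEON implementation instead. */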

@vuvova (Member):
Oh, okay. Please add a comment saying "Function multiversioning works only for x86" somewhere there, or maybe something more verbose, like

/*
  Use gcc function multiversioning to optimize for a specific CPU with run-time detection.
  Works only for x86; for other architectures we provide only one implementation for now.
*/

There's MDEV-34804, which should lift the above limitation.

uint64x2_t step4= veorq_u64(step1, veorq_u64(step2, step3));
uint64x2_t step5;
step5= vsetq_lane_u64(vgetq_lane_u64(step4, 0) * prime_mx2, step4, 0);
step5= vsetq_lane_u64(vgetq_lane_u64(step4, 1) * prime_mx2, step5, 1);

@vuvova (Member):
Strange. It doesn't have 64-bit vector multiplication at all?

@svoj (Contributor, Author):
Correct, I couldn't find any. And I don't seem to be alone; google "neon multiply 64".
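
For context, a minimal self-contained sketch (hypothetical helper name, not the patch verbatim) of the workaround used above: extract each 64-bit lane, multiply it as a scalar, and insert the result back:

#include <arm_neon.h>
#include <stdint.h>

/* Multiply both 64-bit lanes of v by the scalar constant k.  NEON has no
   64x64->64 vector multiply, so each lane is extracted, multiplied as a
   scalar, and inserted back, as in the quoted step4/step5 code. */
static inline uint64x2_t mul_lanes_u64(uint64x2_t v, uint64_t k)
{
  uint64x2_t r= vsetq_lane_u64(vgetq_lane_u64(v, 0) * k, v, 0);
  return vsetq_lane_u64(vgetq_lane_u64(v, 1) * k, r, 1);
}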

uint64x2_t vecMaskByteIdx= vshrq_n_u64(vecMaskIdx, 3);
int64x2_t vecMaskBitIdx=
  vsubq_s64(vdupq_n_s64(0),
            vreinterpretq_s64_u64(vandq_u64(vecMaskIdx, vdupq_n_u64(0x7))));

@vuvova (Member):
could you explain this one, please?

@svoj (Contributor, Author):
That's easy: it is equivalent to -(int64_t) (maskIdx & 7), which is almost what the default implementation does.

The negation turns the left shift into a right shift (vshlq_u64(uint64x2_t a, int64x2_t b) shifts right for negative counts). There's no direct variable right-shift intrinsic; the right shift only supports a scalar immediate count: vshrq_n_u64(uint64x2_t a, const int n).
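
A minimal sketch of the idiom (hypothetical helper name, not from the patch): since vshlq_u64 shifts right when given negative per-lane counts, a variable right shift is done as a left shift by the negated amounts:

#include <arm_neon.h>

/* Shift each 64-bit lane of v right by the per-lane count in cnt.
   There is no per-lane variable right-shift intrinsic, so negate the
   counts and use vshlq_u64, which shifts right for negative values. */
static inline uint64x2_t shr_var_u64(uint64x2_t v, uint64x2_t cnt)
{
  int64x2_t neg= vsubq_s64(vdupq_n_s64(0), vreinterpretq_s64_u64(cnt));
  return vshlq_u64(v, neg);
}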

@vuvova (Member):
Add a comment please, something like

/* Shift right in NEON is implemented as shift left by a negative value; do the negation here. */

uint64x2_t vecRawMasks= vdupq_n_u64(*reinterpret_cast<const uint64_t*>
                                    (masks + vgetq_lane_u64(vecMaskByteIdx, 0)));
vecRawMasks= vsetq_lane_u64(*reinterpret_cast<const uint64_t*>
                            (masks + vgetq_lane_u64(vecMaskByteIdx, 1)), vecRawMasks, 1);

@vuvova (Member):
Hmm, does this NEON-specific implementation even make sense? It only handles two 64-bit values at a time, and the multiplication and array lookups are done per value, not with SIMD. Perhaps a default implementation of the bloom filter would be just as fast? It'd make the code simpler.

@svoj (Contributor, Author):
I did ask myself the same question in this commit's comment:

Performance improvement (microbenchmark) for bloom filters is less exciting,
within 10-30% ballpark depending on compiler options and load.

I could try shuffling intrinsics around to get a better improvement if you feel it is worth it (considering how bloom filters are used and whether they're a bottleneck anyway). Generally I tend to agree we should put it on the shelf, unless you see a good reason to keep it.

@vuvova (Member):
10-30% isn't that bad, and the change is highly local (one header file only). Maybe it's worth it after all.


There seem to be no performance difference between vmull+vmull_high and
vmull+vmlal2_high implementations. The second implementation is preserved
just for reference.

@vuvova (Member):
Let's drop the other implementation :) and just keep the comment "There seem to be ... implementation".

@svoj (Contributor, Author):
Alright, I'll let the other implementation gather dust as a Jira comment.
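
For reference, a self-contained sketch (not the patch verbatim) of the two variants being discussed; they correspond to the smull/smull2 and smull/smlal2 instruction sequences shown in the assembly further below:

#include <arm_neon.h>
#include <stdint.h>

/* Variant (a): two widening multiplies (vmull + vmull_high),
   then a widening horizontal add of each half. */
static inline int64_t dot8_vmull_pair(int16x8_t a, int16x8_t b)
{
  int32x4_t lo= vmull_s16(vget_low_s16(a), vget_low_s16(b));
  int32x4_t hi= vmull_high_s16(a, b);
  return vaddlvq_s32(lo) + vaddlvq_s32(hi);
}

/* Variant (b): widening multiply plus widening multiply-accumulate
   (vmull + vmlal2), then a single widening horizontal add. */
static inline int64_t dot8_vmlal2(int16x8_t a, int16x8_t b)
{
  int32x4_t acc= vmull_s16(vget_low_s16(a), vget_low_s16(b));
  acc= vmlal_high_s16(acc, a, b);
  return vaddlvq_s32(acc);
}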

#endif
v1+= NEON_dims;
v2+= NEON_dims;
}

@vuvova (Member):
Did you look at the assembly code for this function? Can you paste it here, please? (Not in the code, in a comment on the PR.)

@vuvova (Member) commented Dec 3, 2024:
Another question: it would be good to compare with the code the compiler would generate without explicit use of intrinsics, but with explicit vectorization. That is, with something like

typedef float v128i __attribute__((vector_size(NEONbytes)));
v128i *p1= (v128i*)v1, *p2=(v128i*)v2, d=0;
for (size_t i= 0; i < (len + NEON_dims - 1) / NEON_dims; i++, p1++, p2++)
  d+=*p1 * *p2;

This compiles into (https://godbolt.org/z/GjnsqoEhq)

dot_product(void const*, void const*, int):
        add     w4, w2, 14
        adds    w3, w2, 7
        csel    w4, w4, w3, mi
        asr     w4, w4, 3
        cmp     w2, 0
        ble     .L4
        movi    v0.4s, 0
        mov     x2, 0
        mov     w3, 0
.L3:
        ldr     q2, [x0, x2]
        add     w3, w3, 1
        ldr     q1, [x1, x2]
        add     x2, x2, 16
        mla     v0.4s, v2.4s, v1.4s
        cmp     w3, w4
        blt     .L3
        dup     s1, v0.s[1]
        add     v0.2s, v1.2s, v0.2s
        fmov    w0, s0
        ret
.L4:
        mov     w0, 0
        ret

So, it seems the compiler decided not to vectorize it?
Your explicit code is

dot_product(short const*, short const*, int):
        add     w5, w2, 14
        adds    w3, w2, 7
        csel    w5, w5, w3, mi
        mov     x6, x0
        asr     w5, w5, 3
        cmp     w2, 0
        ble     .L4
        mov     x3, 0
        mov     w4, 0
        mov     w0, 0
.L3:
        ldr     q0, [x6, x3]
        add     w4, w4, 1
        ldr     q2, [x1, x3]
        add     x3, x3, 16
        smull   v1.4s, v0.4h, v2.4h
        smull2  v0.4s, v0.8h, v2.8h
        saddlv  d1, v1.4s
        saddlv  d0, v0.4s
        add     d0, d1, d0
        fmov    x2, d0
        add     w0, w0, w2
        sxth    w0, w0
        cmp     w4, w5
        blt     .L3
        ret
.L4:
        mov     w0, 0
        ret

The disabled (#if 0) implementation differs in:

        smull   v0.4s, v2.4h, v1.4h
        smlal2  v0.4s, v2.8h, v1.8h
        saddlv  d0, v0.4s
        fmov    x2, d0

@svoj (Contributor, Author):
Dunno. I tried adding dot_product() without explicit vectorization and compiling with -ftree-vectorize. It produced something very complex that did contain vector instructions. Probably the compiler couldn't find a way to vectorize your example due to the widening or whatnot.
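
As a point of comparison, a plain scalar version (a hypothetical test harness, not the patch) that can be fed to gcc with -O3 or -O2 -ftree-vectorize on aarch64 to see how the widening int16-to-int32 accumulation gets auto-vectorized:

#include <stdint.h>
#include <stddef.h>

/* Scalar reference dot product over int16_t vectors with int32_t
   accumulation; compare the code gcc generates for this against the
   intrinsic version discussed above. */
int dot_product_scalar(const int16_t *v1, const int16_t *v2, size_t len)
{
  int32_t d= 0;
  for (size_t i= 0; i < len; i++)
    d+= (int32_t) v1[i] * v2[i];
  return d;
}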

@vuvova (Member) left a comment:
It seems my only request here is to add a couple of comments. Presuming that will be done, I'll approve the PR right away to reduce back-and-forth.

@vuvova (Member) commented Dec 3, 2024:
Just to clarify, "approved" means (see https://mariadb.com/kb/en/mariadb-quality-development-rules/) that you push into, say, bb-11.8-MDEV-34699-vector-arm and change the MDEV status to in-testing.

Commit message:

SIMD implementations of bloom filters and dot product calculation.

A microbenchmark shows 1.7x dot product performance improvement compared to
regular -O2/-O3 builds and 2.4x compared to builds with auto-vectorization
disabled.

Performance improvement (microbenchmark) for bloom filters is less exciting,
within 10-30% ballpark depending on compiler options and load.

Misc implementation notes:
CalcHash: no _mm256_shuffle_epi8(), use explicit XOR/shift.
CalcHash: no 64bit multiplication, do scalar multiplication.
ConstructMask/Query: no _mm256_i64gather_epi64, access array elements explicitly.
Query: no _mm256_movemask_epi8, accumulate bits manually.
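
To illustrate the last note above (a sketch under assumptions, not the patch's code): NEON has no _mm256_movemask_epi8 counterpart, so a comparison result, where each lane is all ones or all zeros, is collapsed into a small bitmask by extracting one bit per lane and combining the bits manually:

#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical helper: turn a 2-lane 64-bit comparison result
   (each lane all ones or all zeros) into a 2-bit mask. */
static inline unsigned mask_from_cmp_u64(uint64x2_t cmp)
{
  return (unsigned) ((vgetq_lane_u64(cmp, 0) & 1) |
                     ((vgetq_lane_u64(cmp, 1) & 1) << 1));
}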

@svoj (Contributor, Author) commented Dec 3, 2024:
@vuvova I think I resolved all your requests. I don't think development rules apply to me, since I have no write access (which I'm very happy about in moments like this).
