MDEV-34699 - mhnsw: support aarch64 SIMD instructions #3671
base: 11.7
Conversation
Force-pushed from cb7f363 to 49be3ad.
#define DEFAULT_IMPLEMENTATION
#ifdef __aarch64__
#include <arm_neon.h>
#undef DEFAULT_IMPLEMENTATION
I think a default implementation must always exist; that's why it's called "default": for the case when the CPU doesn't support the necessary SIMD instructions.
Function multiversioning works only for x86; there's no default target on other platforms:
https://gcc.gnu.org/wiki/FunctionMultiVersioning
"This support is available in GCC 4.8 and later. Support is only available in C++ for i386 targets."
It is under #ifdef __aarch64__, which guarantees NEON presence. It presumes ARMv8 at least. If we need to support this optimisation on earlier platforms (ARMv7?), it can be tricky.
There are also more advanced SVE/SVE2, but they don't seem to be widely available, e.g. there's none on our bb machine. And if they're available, they don't seem to offer registers larger than NEON does.
Oh, okay. Please add a comment that "Function multiversioning works only for x86" somewhere there, or maybe something more verbose, like
/*
  Use gcc function multiversioning to optimize for a specific CPU with
  run-time detection. It works only for x86; for other architectures we
  provide only one implementation for now.
*/
There's MDEV-34804, which should lift the above limitation.
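For context, a minimal standalone sketch of the pattern under discussion (names and loop shape are illustrative, not the exact MariaDB code): the scalar code stays the default, and a NEON variant is compiled only when __aarch64__ guarantees NEON is present, since function multiversioning can't pick it at run time on that platform.

#include <cstddef>
#include <cstdint>

#define DEFAULT_IMPLEMENTATION            /* assume the scalar code is needed */
#ifdef __aarch64__
#include <arm_neon.h>
#undef DEFAULT_IMPLEMENTATION             /* the NEON variant replaces it */

/* len is assumed to be a multiple of 8 here for brevity */
static int32_t dot_product(const int16_t *v1, const int16_t *v2, size_t len)
{
  int32x4_t acc= vdupq_n_s32(0);
  for (size_t i= 0; i < len; i+= 8)
  {
    int16x8_t a= vld1q_s16(v1 + i), b= vld1q_s16(v2 + i);
    acc= vmlal_s16(acc, vget_low_s16(a), vget_low_s16(b));  /* low 4 lanes  */
    acc= vmlal_high_s16(acc, a, b);                         /* high 4 lanes */
  }
  return vaddvq_s32(acc);                                   /* horizontal sum */
}
#endif

#ifdef DEFAULT_IMPLEMENTATION
static int32_t dot_product(const int16_t *v1, const int16_t *v2, size_t len)
{
  int32_t d= 0;
  for (size_t i= 0; i < len; i++)
    d+= int32_t(v1[i]) * v2[i];
  return d;
}
#endif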
uint64x2_t step4= veorq_u64(step1, veorq_u64(step2, step3));
uint64x2_t step5;
step5= vsetq_lane_u64(vgetq_lane_u64(step4, 0) * prime_mx2, step4, 0);
step5= vsetq_lane_u64(vgetq_lane_u64(step4, 1) * prime_mx2, step5, 1);
Strange. It doesn't have 64-bit vector multiplication at all?
Correct, I couldn't find any. And I don't seem to be alone, google for "neon multiply 64".
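For illustration, a standalone sketch of the workaround used in the snippet above: since NEON has no 64x64->64 vector multiply, each lane is multiplied as a scalar and reinserted. The helper name is hypothetical.

#include <arm_neon.h>
#include <cstdint>

static inline uint64x2_t mul_u64x2_by_scalar(uint64x2_t v, uint64_t c)
{
  uint64x2_t r;
  r= vsetq_lane_u64(vgetq_lane_u64(v, 0) * c, v, 0);  /* lane 0 as a scalar */
  r= vsetq_lane_u64(vgetq_lane_u64(v, 1) * c, r, 1);  /* lane 1 as a scalar */
  return r;
}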
uint64x2_t vecMaskByteIdx= vshrq_n_u64(vecMaskIdx, 3);
int64x2_t vecMaskBitIdx=
  vsubq_s64(vdupq_n_s64(0),
            vreinterpretq_s64_u64(vandq_u64(vecMaskIdx, vdupq_n_u64(0x7))));
Could you explain this one, please?
That's easy: it is equivalent to -(int64_t) (maskIdx & 7), almost what the default implementation does. The value is negated to turn the left shift into a right shift (vshlq_u64(uint64x2_t a, int64x2_t b)). There's no direct variable right-shift call; NEON only supports right shift by a scalar immediate: vshrq_n_u64(uint64x2_t a, const int n).
add a comment please, something like
/* shift right in NEON is implemented as shift left by a negative value. do the negation here */
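A standalone illustration of that trick (the helper name is hypothetical): vshlq_u64 takes a signed per-lane shift count, and a negative count shifts right.

#include <arm_neon.h>

static inline uint64x2_t shift_right_u64x2(uint64x2_t v, uint64x2_t counts)
{
  /* 0 - count gives the negative per-lane count required for a right shift */
  int64x2_t neg= vsubq_s64(vdupq_n_s64(0), vreinterpretq_s64_u64(counts));
  return vshlq_u64(v, neg);
}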
uint64x2_t vecRawMasks= vdupq_n_u64(*reinterpret_cast<const uint64_t*>
                                    (masks + vgetq_lane_u64(vecMaskByteIdx, 0)));
vecRawMasks= vsetq_lane_u64(*reinterpret_cast<const uint64_t*>
                            (masks + vgetq_lane_u64(vecMaskByteIdx, 1)), vecRawMasks, 1);
Hmm, does this NEON-specific implementation even make sense? It handles only two 64-bit values at a time, and the multiplication and array lookups are done per value, not with SIMD. Perhaps a default implementation of the bloom filter would be just as fast? It'd make the code simpler.
I did ask myself the same question in this commit comment:
Performance improvement (microbenchmark) for bloom filters is less exciting,
within 10-30% ballpark depending on compiler options and load.
I could try shuffling intrinsics to gain a better improvement if you feel it is worth it (considering how BF are used and whether they're a bottleneck anyway). Generally I tend to agree we should shelve it, unless you see a good reason to have it.
10-30% isn't that bad, and the change is highly local, one header file only. Maybe it's worth it after all.
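For comparison, a minimal generic sketch of what a purely scalar ("default") bloom filter probe amounts to. The double-hashing scheme, struct layout and names here are illustrative assumptions, not MariaDB's actual filter code.

#include <cstdint>
#include <cstddef>

struct BloomFilter
{
  uint64_t *bits;    /* bit array */
  size_t    nbits;   /* number of bits, assumed > 0 */
};

/* Classic double-hashing probe: the value may be present only if all k
   derived bit positions are set. */
static inline bool bf_may_contain(const BloomFilter &bf,
                                  uint64_t h1, uint64_t h2, unsigned k)
{
  for (unsigned i= 0; i < k; i++)
  {
    uint64_t idx= (h1 + i * h2) % bf.nbits;
    if (!(bf.bits[idx >> 6] & (uint64_t(1) << (idx & 63))))
      return false;
  }
  return true;
}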
sql/vector_mhnsw.cc (outdated)
There seem to be no performance difference between vmull+vmull_high and
vmull+vmlal2_high implementations. The second implementation is preserved
just for reference.
Let's drop the other implementation :) Just keep the comment "There seem to be ... implementation".
Alright, I'll let the other implementation gather dust as jira comment.
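For reference, the two formulations discussed above written as a standalone sketch (ACLE intrinsic names; the actual loop in the patch differs, and the helper names are hypothetical). Both produce the same int32x4_t of pairwise products, which is why they perform identically.

#include <arm_neon.h>

/* vmull + vmull_high: two widening multiplies, then a vector add */
static inline int32x4_t mul_sum_a(int16x8_t a, int16x8_t b)
{
  int32x4_t lo= vmull_s16(vget_low_s16(a), vget_low_s16(b));
  int32x4_t hi= vmull_high_s16(a, b);
  return vaddq_s32(lo, hi);
}

/* vmull + vmlal_high (smull + smlal2): multiply-accumulate the high half */
static inline int32x4_t mul_sum_b(int16x8_t a, int16x8_t b)
{
  int32x4_t acc= vmull_s16(vget_low_s16(a), vget_low_s16(b));
  return vmlal_high_s16(acc, a, b);
}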
#endif
v1+= NEON_dims;
v2+= NEON_dims;
}
Did you look at the assembly code for this function? Can you paste it here, please? (Not in the code, in a comment on the PR.)
Another question: it would be good to compare with the code the compiler generates without explicit use of intrinsics, but with explicit vectorization. That is, with something like
typedef float v128i __attribute__((vector_size(NEONbytes)));
v128i *p1= (v128i*)v1, *p2= (v128i*)v2, d= 0;
for (size_t i= 0; i < (len + NEON_dims - 1) / NEON_dims; i++, p1++, p2++)
  d+= *p1 * *p2;
This compiles into (https://godbolt.org/z/GjnsqoEhq)
dot_product(void const*, void const*, int):
add w4, w2, 14
adds w3, w2, 7
csel w4, w4, w3, mi
asr w4, w4, 3
cmp w2, 0
ble .L4
movi v0.4s, 0
mov x2, 0
mov w3, 0
.L3:
ldr q2, [x0, x2]
add w3, w3, 1
ldr q1, [x1, x2]
add x2, x2, 16
mla v0.4s, v2.4s, v1.4s
cmp w3, w4
blt .L3
dup s1, v0.s[1]
add v0.2s, v1.2s, v0.2s
fmov w0, s0
ret
.L4:
mov w0, 0
ret
So, it seems the compiler decided not to vectorize it?
Your explicit code is
dot_product(short const*, short const*, int):
add w5, w2, 14
adds w3, w2, 7
csel w5, w5, w3, mi
mov x6, x0
asr w5, w5, 3
cmp w2, 0
ble .L4
mov x3, 0
mov w4, 0
mov w0, 0
.L3:
ldr q0, [x6, x3]
add w4, w4, 1
ldr q2, [x1, x3]
add x3, x3, 16
smull v1.4s, v0.4h, v2.4h
smull2 v0.4s, v0.8h, v2.8h
saddlv d1, v1.4s
saddlv d0, v0.4s
add d0, d1, d0
fmov x2, d0
add w0, w0, w2
sxth w0, w0
cmp w4, w5
blt .L3
ret
.L4:
mov w0, 0
ret
The #if 0 implementation differs in
smull v0.4s, v2.4h, v1.4h
smlal2 v0.4s, v2.8h, v1.8h
saddlv d0, v0.4s
fmov x2, d0
Dunno, I tried adding dot_product() without explicit vectorization and with -ftree-vectorize. It produced something very complex that had vector instructions. Probably it couldn't find a way to get your example vectorized due to the widening or whatnot.
It seems my only request here is to add a couple of comments.
Presuming it'll be done, I'll approve the PR right away to reduce back-and-forth.
Just to clarify, "approved" means (see https://mariadb.com/kb/en/mariadb-quality-development-rules/) that you push into, say, bb-11.8-MDEV-34699-vector-arm and change the MDEV status to in-testing.
Force-pushed from 49be3ad to 77c14e7.
SIMD implementations of bloom filters and dot product calculation.
A microbenchmark shows 1.7x dot product performance improvement compared to regular -O2/-O3 builds and 2.4x compared to builds with auto-vectorization disabled. Performance improvement (microbenchmark) for bloom filters is less exciting, within 10-30% ballpark depending on compiler options and load.
Misc implementation notes:
CalcHash: no _mm256_shuffle_epi8(), use explicit XOR/shift.
CalcHash: no 64bit multiplication, do scalar multiplication.
ConstructMask/Query: no _mm256_i64gather_epi64, access array elements explicitly.
Query: no _mm256_movemask_epi8, accumulate bits manually.
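To illustrate two of the notes above in isolation (hypothetical helper names, not the patch itself): NEON has no gather load and no movemask, so table lookups go through scalar loads plus lane inserts, and a comparison result is turned into a bitmask with lane extracts.

#include <arm_neon.h>
#include <cstdint>

/* "Gather" two 64-bit table elements by doing the loads per lane */
static inline uint64x2_t gather2_u64(const uint64_t *table, uint64x2_t idx)
{
  uint64x2_t r= vdupq_n_u64(table[vgetq_lane_u64(idx, 0)]);
  return vsetq_lane_u64(table[vgetq_lane_u64(idx, 1)], r, 1);
}

/* Accumulate a 2-bit mask from a per-lane comparison result (lanes are 0 or ~0) */
static inline unsigned movemask2_u64(uint64x2_t cmp)
{
  return (unsigned) ((vgetq_lane_u64(cmp, 0) & 1) |
                     ((vgetq_lane_u64(cmp, 1) & 1) << 1));
}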
Force-pushed from 77c14e7 to beee055.
@vuvova I think I resolved all your requests. I don't think the development rules apply to me, since I have no write access (which I'm very happy about in moments like this).
Description
SIMD implementations of bloom filters and dot product calculation.
Release Notes
None.
How can this PR be tested?
mtr
Basing the PR against the correct MariaDB version
The PR is based against the main branch.
PR quality check