optimize has_tabs_or_newline for NEON #639
Conversation
src/unicode.cpp
Outdated
    running = vorrq_u8(vorrq_u8(running, vorrq_u8(vceqq_u8(word, mask1),
                                                  vceqq_u8(word, mask2))),
                       vceqq_u8(word, mask3));
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return vmaxvq_u8(running) != 0;
I don't know if that's important, but for an any-check you don't need the u8 max; a u64 max will do.
According to @dougallj, on Apple Silicon, this would make no difference, at least as far as the latency of the SIMD instruction is concerned. I assume that the same is true on other platforms.
Unfortunately some other cores have much less consistent reductions, for example the Cortex-A78: https://developer.arm.com/documentation/102160/latest/
"ASIMD max/min, reduce, 16B" is listed as latency 4, throughput 1/2,
"ASIMD max/min, reduce, 4H/4S" is listed as latency 2, throughput 1, and
"ASIMD max/min, basic and pairwise" is listed as latency 2, throughput 2.
Same with Neoverse. Ok. I will change the code.
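A minimal sketch of what the change amounts to (an illustration, not necessarily the committed code; the helper name is mine): the byte mask is reinterpreted as 32-bit lanes so the reduction can use the cheaper 4S form instead of the 16B one.

```cpp
#include <arm_neon.h>

// Sketch: equivalent "is any byte set?" test using a 4S reduction instead of
// a 16B reduction; any non-zero byte makes at least one u32 lane non-zero.
static inline bool any_byte_set(uint8x16_t running) {
  return vmaxvq_u32(vreinterpretq_u32_u8(running)) != 0;
}
```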
src/unicode.cpp
Outdated
  const uint8x16_t mask1 = vmovq_n_u8('\r');
  const uint8x16_t mask2 = vmovq_n_u8('\n');
  const uint8x16_t mask3 = vmovq_n_u8('\t');
  static uint8_t rnt_array[16] = {1, 0, 0, 0, 0, 0, 0, 0,
For read-only algorithms, I tend to align data accesses.
That also allows handling of tails for free (an aligned 16-byte load never crashes, and there is a way to convince ASAN to go away).
Here is some code in folly that does it; it's Apache-2. I don't know licenses, but you probably can use it.
Here is the same code in eve; it's under the Boost license. You can 100% use it, but it has eve's idiosyncrasies.
https://github.com/jfalcou/eve/blob/a2e2cf539e36e9a3326800194ad5206a8ef3f5b7/include/eve/module/algo/algo/for_each_iteration.hpp#L148
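For illustration, a minimal sketch of the aligned-load idea (this is not the folly or eve code; the helper name and the lane-mask construction are my own): round the pointer down to a 16-byte boundary, do one aligned load, and mask off the lanes that fall outside the string. The aligned load cannot cross a page boundary, so it cannot fault, although sanitizers still have to be silenced.

```cpp
#include <arm_neon.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch of an aligned first load: the load starts at the enclosing 16-byte
// boundary, and lanes outside [offset, offset + len) are masked away.
static bool first_chunk_has_tab_cr_lf(const char* src, size_t len) {
  const uint8_t* p = reinterpret_cast<const uint8_t*>(src);
  const uint8_t* aligned = reinterpret_cast<const uint8_t*>(
      reinterpret_cast<uintptr_t>(p) & ~uintptr_t{15});
  const uint8_t offset = static_cast<uint8_t>(p - aligned);
  const uint8_t end = static_cast<uint8_t>(std::min<size_t>(offset + len, 16));
  uint8x16_t word = vld1q_u8(aligned);  // aligned: never crosses a page
  uint8x16_t matches =
      vorrq_u8(vorrq_u8(vceqq_u8(word, vmovq_n_u8('\r')),
                        vceqq_u8(word, vmovq_n_u8('\n'))),
               vceqq_u8(word, vmovq_n_u8('\t')));
  // Lane mask selecting only the bytes that belong to the string.
  static const uint8_t lane_index[16] = {0, 1, 2,  3,  4,  5,  6,  7,
                                         8, 9, 10, 11, 12, 13, 14, 15};
  uint8x16_t idx = vld1q_u8(lane_index);
  uint8x16_t in_range = vandq_u8(vcgeq_u8(idx, vmovq_n_u8(offset)),
                                 vcltq_u8(idx, vmovq_n_u8(end)));
  return vmaxvq_u8(vandq_u8(matches, in_range)) != 0;
}
```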
I have not looked at the links. This being said, if the input is at least 16 bytes, we already handle the tail efficiently. If the input is less than 16 bytes, then you will read beyond your buffer if you try to use SIMD. You can tell some sanitizers not to report the error, but to my knowledge there is no way to tell every single bounds checker (valgrind, the default Visual Studio bounds checker, etc.) one might use to ignore the issue. And it is unclear whether a SIMD algorithm over a string of less than 16 bytes is warranted: scalar code can be faster on small inputs.
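To illustrate the point about inputs of at least 16 bytes, here is a minimal sketch of the usual pattern (a hypothetical helper, not a copy of ada's code): the main loop consumes whole 16-byte chunks, and the tail is handled by one extra load that ends exactly at the end of the buffer, overlapping bytes already scanned instead of reading past the allocation.

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch for len >= 16: whole-chunk loop plus one overlapping final load.
static bool has_tab_cr_lf_simd(const uint8_t* src, size_t len /* >= 16 */) {
  const uint8x16_t cr = vmovq_n_u8('\r');
  const uint8x16_t lf = vmovq_n_u8('\n');
  const uint8x16_t tab = vmovq_n_u8('\t');
  uint8x16_t running = vmovq_n_u8(0);
  size_t i = 0;
  for (; i + 16 <= len; i += 16) {
    uint8x16_t word = vld1q_u8(src + i);
    running = vorrq_u8(running, vorrq_u8(vorrq_u8(vceqq_u8(word, cr),
                                                  vceqq_u8(word, lf)),
                                         vceqq_u8(word, tab)));
  }
  if (i < len) {  // overlapping final load, never reads past the end
    uint8x16_t word = vld1q_u8(src + len - 16);
    running = vorrq_u8(running, vorrq_u8(vorrq_u8(vceqq_u8(word, cr),
                                                  vceqq_u8(word, lf)),
                                         vceqq_u8(word, tab)));
  }
  return vmaxvq_u32(vreinterpretq_u32_u8(running)) != 0;
}
```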
src/unicode.cpp
Outdated
  const uint8x16_t mask1 = vmovq_n_u8('\r');
  const uint8x16_t mask2 = vmovq_n_u8('\n');
  const uint8x16_t mask3 = vmovq_n_u8('\t');
  static uint8_t rnt_array[16] = {1, 0, 0, 0, 0, 0, 0, 0,
Alternatively:
a) Can you just allocate the URL aligned to 64 bytes or something?
b) How big are URLs in general? I presume between 8 and 16 bytes is also common? Could maybe an 8-byte ARM version be helpful? (See the sketch below.)
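A hypothetical sketch of such an 8-byte variant (the function name and the 8-to-16-byte precondition are assumptions): two possibly overlapping 8-byte loads cover the whole string without reading past it, and the 64-bit NEON compares and reduction do the rest.

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch for 8 <= len <= 16: two 8-byte loads that may overlap in the middle.
static bool has_tab_cr_lf_short(const uint8_t* src, size_t len) {
  const uint8x8_t cr = vmov_n_u8('\r');
  const uint8x8_t lf = vmov_n_u8('\n');
  const uint8x8_t tab = vmov_n_u8('\t');
  uint8x8_t a = vld1_u8(src);
  uint8x8_t b = vld1_u8(src + len - 8);  // overlaps `a` when len < 16
  uint8x8_t hit_a =
      vorr_u8(vorr_u8(vceq_u8(a, cr), vceq_u8(a, lf)), vceq_u8(a, tab));
  uint8x8_t hit_b =
      vorr_u8(vorr_u8(vceq_u8(b, cr), vceq_u8(b, lf)), vceq_u8(b, tab));
  return vmaxv_u8(vorr_u8(hit_a, hit_b)) != 0;
}
```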
We have no control over how the string was allocated. It is passed to us by the user.
@anonrig I think we can merge at will.
@lemire fuzzing is failing with the proposed change.
The change only affects ARM, and the fuzzer runs on x64. I believe that oss-fuzz is busted: I have received many invalid bug reports and I have asked oss-fuzz to dismiss them. They are all of this type.
@anonrig I am not preoccupied by the fuzzer failing, but I still marked this PR as a draft because there might be another issue... unrelated to the PR. Investigating...
@anonrig Should be good now. |
@DenisYaroshevskiy suggested this optimization. He meant it for x64, but we do not currently support runtime dispatching, nor do we expect our users to compile for specific hardware, so we are limited to SSE2, which is not powerful enough for this trick.
On Apple M2, using benchdata, I repeat the benchmarks twice.
Before:
After:
It seems that this can reduce the number of instructions by maybe 1%. The speed gain is similarly small.
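For reference, a minimal sketch of the table-lookup trick on NEON (the full table contents are filled in here as an assumption consistent with the visible prefix above, not copied from the final patch, and the helper name is mine): a 16-entry table maps '\t', '\n' and '\r' to themselves and every other index to a value that cannot match, so one vqtbl1q_u8 plus one compare replaces the three separate compares per chunk.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch of the lookup trick: tbl[i] == i only for i in {'\t', '\n', '\r'};
// index 0 maps to 1 so NUL bytes do not match, and indices >= 16 return 0,
// which never equals the input byte.
static bool chunk_has_tab_cr_lf(uint8x16_t word) {
  static const uint8_t rnt_array[16] = {1, 0,    0,    0, 0, 0,    0, 0,
                                        0, '\t', '\n', 0, 0, '\r', 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  uint8x16_t hit = vceqq_u8(vqtbl1q_u8(rnt, word), word);
  return vmaxvq_u32(vreinterpretq_u32_u8(hit)) != 0;
}
```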