optimize has_tabs_or_newline for NEON #639

Merged: 6 commits into main from opti_has_tabs_or_newline on May 1, 2024

Conversation

lemire (Member, Author) commented on Apr 30, 2024:

@DenisYaroshevskiy suggested this optimization. He meant it for x64, but we do not currently support runtime dispatching, nor do we expect our users to compile for specific hardware, so on x64 we are limited to SSE2, which is not powerful enough for this trick. NEON, however, provides the needed table-lookup instruction, so this PR applies the idea to the ARM code path.

Benchmarks on Apple M2, using benchdata; I repeated the benchmarks twice.

Before:

BasicBench_AdaURL_href              24067654 ns     24065069 ns           29 GHz=3.50365 cycle/byte=9.4421 cycles/url=820.133 instructions/byte=40.1304 instructions/cycle=4.25016 instructions/ns=14.8911 instructions/url=3.4857k ns/url=234.079 speed=361.025M/s time/byte=2.76989ns time/url=240.591ns url/s=4.15644M/s
BasicBench_AdaURL_aggregator_href   16092015 ns     16089636 ns           44 GHz=3.50451 cycle/byte=6.29615 cycles/url=546.879 instructions/byte=28.065 instructions/cycle=4.45749 instructions/ns=15.6213 instructions/url=2.43771k ns/url=156.05 speed=539.981M/s time/byte=1.85192ns time/url=160.856ns url/s=6.21673M/s
BasicBench_AdaURL_href              23550776 ns     23549200 ns           30 GHz=3.40822 cycle/byte=9.29203 cycles/url=807.098 instructions/byte=40.2779 instructions/cycle=4.33467 instructions/ns=14.7735 instructions/url=3.49851k ns/url=236.81 speed=368.934M/s time/byte=2.71051ns time/url=235.433ns url/s=4.24749M/s
BasicBench_AdaURL_aggregator_href   16198858 ns     16197000 ns           43 GHz=3.49614 cycle/byte=6.38115 cycles/url=554.262 instructions/byte=28.249 instructions/cycle=4.42694 instructions/ns=15.4772 instructions/url=2.45368k ns/url=158.535 speed=536.401M/s time/byte=1.86428ns time/url=161.93ns url/s=6.17553M/s

After:

BasicBench_AdaURL_href              24435351 ns     24317655 ns           29 GHz=3.49896 cycle/byte=9.30687 cycles/url=808.387 instructions/byte=40.1032 instructions/cycle=4.30899 instructions/ns=15.077 instructions/url=3.48333k ns/url=231.036 speed=357.275M/s time/byte=2.79896ns time/url=243.116ns url/s=4.11327M/s
BasicBench_AdaURL_aggregator_href   16043040 ns     16042409 ns           44 GHz=3.47943 cycle/byte=6.27539 cycles/url=545.075 instructions/byte=27.8984 instructions/cycle=4.44569 instructions/ns=15.4685 instructions/url=2.42323k ns/url=156.656 speed=541.57M/s time/byte=1.84648ns time/url=160.384ns url/s=6.23504M/s
BasicBench_AdaURL_href              24023714 ns     23998931 ns           29 GHz=3.45169 cycle/byte=9.32467 cycles/url=809.934 instructions/byte=40.0756 instructions/cycle=4.2978 instructions/ns=14.8347 instructions/url=3.48093k ns/url=234.648 speed=362.02M/s time/byte=2.76228ns time/url=239.929ns url/s=4.16789M/s
BasicBench_AdaURL_aggregator_href   16024929 ns     16024568 ns           44 GHz=3.5045 cycle/byte=6.27804 cycles/url=545.306 instructions/byte=27.8983 instructions/cycle=4.4438 instructions/ns=15.5733 instructions/url=2.42323k ns/url=155.602 speed=542.173M/s time/byte=1.84443ns time/url=160.206ns url/s=6.24198M/s

It seems that this can reduce the number of instructions by maybe 1%. The speed gain is similarly small.
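For context, here is a minimal, self-contained sketch of the table-lookup trick. It is not the committed code: the tail of rnt_array beyond the eight values shown in the diff below, and the omission of short-input and tail handling, are my assumptions. The idea is that vqtbl1q_u8 maps each input byte through a 16-entry table (bytes >= 16 map to 0), so the lookup result equals the original byte only for '\t' (9), '\n' (10) and '\r' (13), replacing three separate compares with a single lookup and compare.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch only; table contents past index 7 are assumed, and the last
// (length % 16) bytes are not handled here.
inline bool has_tabs_or_newline_neon(const char *input, size_t length) {
  static const uint8_t rnt_array[16] = {1, 0,    0,    0, 0, 0,    0, 0,
                                        0, '\t', '\n', 0, 0, '\r', 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  const uint8_t *data = reinterpret_cast<const uint8_t *>(input);
  uint8x16_t running = vdupq_n_u8(0);
  for (size_t i = 0; i + 16 <= length; i += 16) {
    uint8x16_t word = vld1q_u8(data + i);
    // The lookup result equals the byte itself only for '\t', '\n', '\r'.
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return vmaxvq_u8(running) != 0;
}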

lemire requested a review from anonrig on April 30, 2024, 04:16
src/unicode.cpp (outdated diff):
    running = vorrq_u8(vorrq_u8(running, vorrq_u8(vceqq_u8(word, mask1),
                                                  vceqq_u8(word, mask2))),
                       vceqq_u8(word, mask3));
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return vmaxvq_u8(running) != 0;

A reviewer commented:

I don't know if that's important, but for an "any" test you don't need the u8 reduction; a max over u64 will do.

lemire (Member, Author) replied:

According to @dougallj, on Apple Silicon, this would make no difference, at least as far as the latency of the SIMD instruction is concerned. I assume that the same is true on other platforms.

A reviewer replied:

Unfortunately some other cores have much less consistent reductions, for example the Cortex-A78: https://developer.arm.com/documentation/102160/latest/

"ASIMD max/min, reduce, 16B" is listed as latency 4, throughput 1/2,
"ASIMD max/min, reduce, 4H/4S" is listed as latency 2, throughput 1, and
"ASIMD max/min, basic and pairwise" is listed as latency 2, throughput 2.

lemire (Member, Author) replied:

Same with Neoverse. Ok. I will change the code.
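One way to realize that change, sketched here under the assumption that a 32-bit reduction is acceptable (the committed code may differ), is to reinterpret the accumulated byte mask as four 32-bit lanes and use the cheaper 4S reduction:

#include <arm_neon.h>

// Sketch: any non-zero byte in `running` makes some 32-bit lane non-zero,
// so reducing over u32 lanes is equivalent to reducing over u8 lanes.
inline bool any_byte_set(uint8x16_t running) {
  return vmaxvq_u32(vreinterpretq_u32_u8(running)) != 0;
}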

src/unicode.cpp (outdated diff):
const uint8x16_t mask1 = vmovq_n_u8('\r');
const uint8x16_t mask2 = vmovq_n_u8('\n');
const uint8x16_t mask3 = vmovq_n_u8('\t');
static uint8_t rnt_array[16] = {1, 0, 0, 0, 0, 0, 0, 0,

A reviewer commented:

For read-only algorithms, I tend to align data accesses.
That also allows handling of tails for free (an aligned 16-byte load never crashes, and there is a way to convince ASAN to go away).

Here is some code in folly that does it; it's Apache-2.0. I don't know licenses well, but you can probably use it.

https://github.com/facebook/folly/blob/dcdbd4c8d5f6511086b219dacb3e7aeb7889c662/folly/detail/SimdForEach.h#L55

https://github.com/facebook/folly/blob/dcdbd4c8d5f6511086b219dacb3e7aeb7889c662/folly/detail/SimdAnyOf.h#L76

Here is the same code in eve; it's Boost-licensed. You can 100% use it, but it has eve's idiosyncrasies.
https://github.com/jfalcou/eve/blob/a2e2cf539e36e9a3326800194ad5206a8ef3f5b7/include/eve/module/algo/algo/for_each_iteration.hpp#L148
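To illustrate the aligned-load idea (my own sketch, not the folly or eve code behind those links): an aligned 16-byte load never crosses a page boundary, so the aligned block containing a short head or tail can be loaded without faulting and the out-of-range lanes masked off afterwards. Sanitizers will still flag the load unless it is annotated, which is part of the pushback below. The helper name and its precondition are assumptions.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch: tests a sub-range [ptr, ptr + len) that lies entirely within one
// aligned 16-byte chunk (offset within the chunk + len <= 16). `rnt` is the
// lookup table from the diff above.
inline bool aligned_chunk_has_match(const uint8_t *ptr, size_t len,
                                    uint8x16_t rnt) {
  const uint8_t *aligned = reinterpret_cast<const uint8_t *>(
      reinterpret_cast<uintptr_t>(ptr) & ~uintptr_t(15));
  uint8x16_t block = vld1q_u8(aligned);  // aligned load: cannot fault
  uint8x16_t matches = vceqq_u8(vqtbl1q_u8(rnt, block), block);
  // Keep only lanes whose index falls inside [start, start + len).
  static const uint8_t iota[16] = {0, 1, 2,  3,  4,  5,  6,  7,
                                   8, 9, 10, 11, 12, 13, 14, 15};
  const uint8x16_t lane = vld1q_u8(iota);
  const uint8_t start = static_cast<uint8_t>(ptr - aligned);
  const uint8x16_t in_range = vandq_u8(
      vcgeq_u8(lane, vdupq_n_u8(start)),
      vcltq_u8(lane, vdupq_n_u8(static_cast<uint8_t>(start + len))));
  return vmaxvq_u8(vandq_u8(matches, in_range)) != 0;
}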

lemire (Member, Author) replied:

I have not looked at the links. That being said, if the input is at least 16 bytes, we already handle the tail efficiently. If the input is less than 16 bytes, then you will read beyond your buffer if you try to use SIMD. You can tell some sanitizers not to report the error, but to my knowledge, there is no way to tell every single bounds checker (Valgrind, the default Visual Studio bounds checker, etc.) one might use to ignore the issue. And it is unclear whether a SIMD algorithm over a string of fewer than 16 bytes is warranted: scalar code can be faster on small inputs.
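To make the "we already handle the tail efficiently" point concrete, here is a sketch of the usual overlapping-load approach for inputs of at least 16 bytes; this is my assumption about the technique, not a quote of the actual code.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch: when length >= 16, the last 16 bytes are reloaded with an
// overlapping unaligned load, so no byte past the buffer is ever read and
// no sanitizer annotation is needed.
inline uint8x16_t accumulate_tail(const uint8_t *data, size_t length,
                                  uint8x16_t rnt, uint8x16_t running) {
  if (length % 16 != 0) {
    uint8x16_t word = vld1q_u8(data + length - 16);  // overlaps prior block
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return running;
}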

src/unicode.cpp: resolved review thread (collapsed)
src/unicode.cpp (outdated diff):
const uint8x16_t mask1 = vmovq_n_u8('\r');
const uint8x16_t mask2 = vmovq_n_u8('\n');
const uint8x16_t mask3 = vmovq_n_u8('\t');
static uint8_t rnt_array[16] = {1, 0, 0, 0, 0, 0, 0, 0,

A reviewer added:

Alternatively:
a) Can you just allocate the URL aligned to 64 bytes or something?
b) How big are URLs in general? I presume between 8 and 16 bytes is also common. Could an 8-byte ARM version maybe be helpful?

lemire (Member, Author) replied:

We have no control over how the string was allocated. It is passed to us by the user.
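As for suggestion (b) above, a hypothetical 8-byte NEON variant for strings of 8 to 15 bytes could use two overlapping 8-byte loads; this is a sketch of the idea only, not something proposed or measured in this PR, and the function name and table layout are assumptions.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Hypothetical: covers 8 <= length < 16 with two overlapping 8-byte loads.
inline bool has_tabs_or_newline_8(const char *input, size_t length) {
  static const uint8_t rnt_array[16] = {1, 0,    0,    0, 0, 0,    0, 0,
                                        0, '\t', '\n', 0, 0, '\r', 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  const uint8_t *data = reinterpret_cast<const uint8_t *>(input);
  const uint8x8_t first = vld1_u8(data);
  const uint8x8_t last = vld1_u8(data + length - 8);  // overlaps `first`
  const uint8x8_t matches = vorr_u8(vceq_u8(vqtbl1_u8(rnt, first), first),
                                    vceq_u8(vqtbl1_u8(rnt, last), last));
  return vmaxv_u8(matches) != 0;
}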

src/unicode.cpp: resolved review thread (collapsed)
lemire (Member, Author) commented on Apr 30, 2024:

@anonrig I think we can merge at will.

anonrig (Member) commented on May 1, 2024:

@lemire fuzzing is failing with the proposed change

lemire (Member, Author) commented on May 1, 2024:

> fuzzing is failing with the proposed change

The change only affects ARM, and the fuzzer runs on x64.

I believe that oss-fuzz is busted: I have received many invalid bug reports and have asked oss-fuzz to dismiss them. They are all of the type MemorySanitizer: use-of-uninitialized-value, and they have to do with the standard library.

lemire marked this pull request as draft on May 1, 2024, 18:05
lemire (Member, Author) commented on May 1, 2024:

@anonrig I am not worried about the fuzzer failing, but I still marked this PR as a draft because there might be another issue... unrelated to the PR. Investigating...

lemire marked this pull request as ready for review on May 1, 2024, 18:36
lemire (Member, Author) commented on May 1, 2024:

@anonrig Should be good now.

anonrig merged commit 08aad0b into main on May 1, 2024 (35 of 66 checks passed).
anonrig deleted the opti_has_tabs_or_newline branch on May 1, 2024, 18:53.