optimize has_tabs_or_newline for NEON #639

Merged: 6 commits into main from opti_has_tabs_or_newline on May 1, 2024

Conversation

lemire (Member, Author) commented on Apr 30, 2024:

@DenisYaroshevskiy suggested this optimization. He meant it for x64, but we do not currently support runtime dispatching, nor do we expect our users to compile for specific hardware, so on x64 we are limited to SSE2, which is not powerful enough for this trick. NEON, however, provides the needed table-lookup instruction, so this PR applies the idea to the ARM code path.

Benchmarks on Apple M2, using benchdata; I repeated the benchmarks twice.

Before:

BasicBench_AdaURL_href              24067654 ns     24065069 ns           29 GHz=3.50365 cycle/byte=9.4421 cycles/url=820.133 instructions/byte=40.1304 instructions/cycle=4.25016 instructions/ns=14.8911 instructions/url=3.4857k ns/url=234.079 speed=361.025M/s time/byte=2.76989ns time/url=240.591ns url/s=4.15644M/s
BasicBench_AdaURL_aggregator_href   16092015 ns     16089636 ns           44 GHz=3.50451 cycle/byte=6.29615 cycles/url=546.879 instructions/byte=28.065 instructions/cycle=4.45749 instructions/ns=15.6213 instructions/url=2.43771k ns/url=156.05 speed=539.981M/s time/byte=1.85192ns time/url=160.856ns url/s=6.21673M/s
BasicBench_AdaURL_href              23550776 ns     23549200 ns           30 GHz=3.40822 cycle/byte=9.29203 cycles/url=807.098 instructions/byte=40.2779 instructions/cycle=4.33467 instructions/ns=14.7735 instructions/url=3.49851k ns/url=236.81 speed=368.934M/s time/byte=2.71051ns time/url=235.433ns url/s=4.24749M/s
BasicBench_AdaURL_aggregator_href   16198858 ns     16197000 ns           43 GHz=3.49614 cycle/byte=6.38115 cycles/url=554.262 instructions/byte=28.249 instructions/cycle=4.42694 instructions/ns=15.4772 instructions/url=2.45368k ns/url=158.535 speed=536.401M/s time/byte=1.86428ns time/url=161.93ns url/s=6.17553M/s

After:

BasicBench_AdaURL_href              24435351 ns     24317655 ns           29 GHz=3.49896 cycle/byte=9.30687 cycles/url=808.387 instructions/byte=40.1032 instructions/cycle=4.30899 instructions/ns=15.077 instructions/url=3.48333k ns/url=231.036 speed=357.275M/s time/byte=2.79896ns time/url=243.116ns url/s=4.11327M/s
BasicBench_AdaURL_aggregator_href   16043040 ns     16042409 ns           44 GHz=3.47943 cycle/byte=6.27539 cycles/url=545.075 instructions/byte=27.8984 instructions/cycle=4.44569 instructions/ns=15.4685 instructions/url=2.42323k ns/url=156.656 speed=541.57M/s time/byte=1.84648ns time/url=160.384ns url/s=6.23504M/s
BasicBench_AdaURL_href              24023714 ns     23998931 ns           29 GHz=3.45169 cycle/byte=9.32467 cycles/url=809.934 instructions/byte=40.0756 instructions/cycle=4.2978 instructions/ns=14.8347 instructions/url=3.48093k ns/url=234.648 speed=362.02M/s time/byte=2.76228ns time/url=239.929ns url/s=4.16789M/s
BasicBench_AdaURL_aggregator_href   16024929 ns     16024568 ns           44 GHz=3.5045 cycle/byte=6.27804 cycles/url=545.306 instructions/byte=27.8983 instructions/cycle=4.4438 instructions/ns=15.5733 instructions/url=2.42323k ns/url=155.602 speed=542.173M/s time/byte=1.84443ns time/url=160.206ns url/s=6.24198M/s

It seems that this can reduce the number of instructions by maybe 1%. The speed gain is similarly small.
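For context, here is a minimal, self-contained sketch of the table-lookup trick. It is not the committed code: the tail of rnt_array beyond the eight values shown in the diff below, and the omission of short-input and tail handling, are my assumptions. The idea is that vqtbl1q_u8 maps each input byte through a 16-entry table (bytes >= 16 map to 0), so the lookup result equals the original byte only for '\t' (9), '\n' (10) and '\r' (13), replacing three separate compares with a single lookup and compare.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch only; table contents past index 7 are assumed, and the last
// (length % 16) bytes are not handled here.
inline bool has_tabs_or_newline_neon(const char *input, size_t length) {
  static const uint8_t rnt_array[16] = {1, 0,    0,    0, 0, 0,    0, 0,
                                        0, '\t', '\n', 0, 0, '\r', 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  const uint8_t *data = reinterpret_cast<const uint8_t *>(input);
  uint8x16_t running = vdupq_n_u8(0);
  for (size_t i = 0; i + 16 <= length; i += 16) {
    uint8x16_t word = vld1q_u8(data + i);
    // The lookup result equals the byte itself only for '\t', '\n', '\r'.
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return vmaxvq_u8(running) != 0;
}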

lemire requested a review from anonrig on April 30, 2024, 04:16
src/unicode.cpp (outdated diff):
    running = vorrq_u8(vorrq_u8(running, vorrq_u8(vceqq_u8(word, mask1),
                                                  vceqq_u8(word, mask2))),
                       vceqq_u8(word, mask3));
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return vmaxvq_u8(running) != 0;

A reviewer commented:

I don't know if that's important, but for an "any" test you don't need the u8 reduction; a max over u64 will do.

lemire (Member, Author) replied:

According to @dougallj, on Apple Silicon, this would make no difference, at least as far as the latency of the SIMD instruction is concerned. I assume that the same is true on other platforms.

A reviewer replied:

Unfortunately some other cores have much less consistent reductions, for example the Cortex-A78: https://developer.arm.com/documentation/102160/latest/

"ASIMD max/min, reduce, 16B" is listed as latency 4, throughput 1/2,
"ASIMD max/min, reduce, 4H/4S" is listed as latency 2, throughput 1, and
"ASIMD max/min, basic and pairwise" is listed as latency 2, throughput 2.

lemire (Member, Author) replied:

Same with Neoverse. Ok. I will change the code.
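One way to realize that change, sketched here under the assumption that a 32-bit reduction is acceptable (the committed code may differ), is to reinterpret the accumulated byte mask as four 32-bit lanes and use the cheaper 4S reduction:

#include <arm_neon.h>

// Sketch: any non-zero byte in `running` makes some 32-bit lane non-zero,
// so reducing over u32 lanes is equivalent to reducing over u8 lanes.
inline bool any_byte_set(uint8x16_t running) {
  return vmaxvq_u32(vreinterpretq_u32_u8(running)) != 0;
}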

src/unicode.cpp (outdated diff):
const uint8x16_t mask1 = vmovq_n_u8('\r');
const uint8x16_t mask2 = vmovq_n_u8('\n');
const uint8x16_t mask3 = vmovq_n_u8('\t');
static uint8_t rnt_array[16] = {1, 0, 0, 0, 0, 0, 0, 0,

A reviewer commented:

For read-only algorithms, I tend to align data accesses.
That also allows handling of tails for free (an aligned 16-byte load never crashes, and there is a way to convince ASAN to go away).

Here is some code in folly that does it; it's Apache-2.0. I don't know licenses well, but you can probably use it.

https://github.com/facebook/folly/blob/dcdbd4c8d5f6511086b219dacb3e7aeb7889c662/folly/detail/SimdForEach.h#L55

https://github.com/facebook/folly/blob/dcdbd4c8d5f6511086b219dacb3e7aeb7889c662/folly/detail/SimdAnyOf.h#L76

Here is the same code in eve; it's Boost-licensed. You can 100% use it, but it has eve's idiosyncrasies.
https://github.com/jfalcou/eve/blob/a2e2cf539e36e9a3326800194ad5206a8ef3f5b7/include/eve/module/algo/algo/for_each_iteration.hpp#L148
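To illustrate the aligned-load idea (my own sketch, not the folly or eve code behind those links): an aligned 16-byte load never crosses a page boundary, so the aligned block containing a short head or tail can be loaded without faulting and the out-of-range lanes masked off afterwards. Sanitizers will still flag the load unless it is annotated, which is part of the pushback below. The helper name and its precondition are assumptions.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch: tests a sub-range [ptr, ptr + len) that lies entirely within one
// aligned 16-byte chunk (offset within the chunk + len <= 16). `rnt` is the
// lookup table from the diff above.
inline bool aligned_chunk_has_match(const uint8_t *ptr, size_t len,
                                    uint8x16_t rnt) {
  const uint8_t *aligned = reinterpret_cast<const uint8_t *>(
      reinterpret_cast<uintptr_t>(ptr) & ~uintptr_t(15));
  uint8x16_t block = vld1q_u8(aligned);  // aligned load: cannot fault
  uint8x16_t matches = vceqq_u8(vqtbl1q_u8(rnt, block), block);
  // Keep only lanes whose index falls inside [start, start + len).
  static const uint8_t iota[16] = {0, 1, 2,  3,  4,  5,  6,  7,
                                   8, 9, 10, 11, 12, 13, 14, 15};
  const uint8x16_t lane = vld1q_u8(iota);
  const uint8_t start = static_cast<uint8_t>(ptr - aligned);
  const uint8x16_t in_range = vandq_u8(
      vcgeq_u8(lane, vdupq_n_u8(start)),
      vcltq_u8(lane, vdupq_n_u8(static_cast<uint8_t>(start + len))));
  return vmaxvq_u8(vandq_u8(matches, in_range)) != 0;
}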

lemire (Member, Author) replied:

I have not looked at the links. That being said, if the input is at least 16 bytes, we already handle the tail efficiently. If the input is less than 16 bytes, then you will read beyond your buffer if you try to use SIMD. You can tell some sanitizers not to report the error, but to my knowledge, there is no way to tell every single bounds checker (Valgrind, the default Visual Studio bounds checker, etc.) one might use to ignore the issue. And it is unclear whether a SIMD algorithm over a string of fewer than 16 bytes is warranted: scalar code can be faster on small inputs.
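To make the "we already handle the tail efficiently" point concrete, here is a sketch of the usual overlapping-load approach for inputs of at least 16 bytes; this is my assumption about the technique, not a quote of the actual code.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch: when length >= 16, the last 16 bytes are reloaded with an
// overlapping unaligned load, so no byte past the buffer is ever read and
// no sanitizer annotation is needed.
inline uint8x16_t accumulate_tail(const uint8_t *data, size_t length,
                                  uint8x16_t rnt, uint8x16_t running) {
  if (length % 16 != 0) {
    uint8x16_t word = vld1q_u8(data + length - 16);  // overlaps prior block
    running = vorrq_u8(running, vceqq_u8(vqtbl1q_u8(rnt, word), word));
  }
  return running;
}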

src/unicode.cpp: resolved review thread (collapsed)
src/unicode.cpp (outdated diff):
const uint8x16_t mask1 = vmovq_n_u8('\r');
const uint8x16_t mask2 = vmovq_n_u8('\n');
const uint8x16_t mask3 = vmovq_n_u8('\t');
static uint8_t rnt_array[16] = {1, 0, 0, 0, 0, 0, 0, 0,

A reviewer added:

Alternatively:
a) Can you just allocate the URL aligned to 64 bytes or something?
b) How big are URLs in general? I presume between 8 and 16 bytes is also common. Could an 8-byte ARM version maybe be helpful?

lemire (Member, Author) replied:

We have no control over how the string was allocated. It is passed to us by the user.
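As for suggestion (b) above, a hypothetical 8-byte NEON variant for strings of 8 to 15 bytes could use two overlapping 8-byte loads; this is a sketch of the idea only, not something proposed or measured in this PR, and the function name and table layout are assumptions.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Hypothetical: covers 8 <= length < 16 with two overlapping 8-byte loads.
inline bool has_tabs_or_newline_8(const char *input, size_t length) {
  static const uint8_t rnt_array[16] = {1, 0,    0,    0, 0, 0,    0, 0,
                                        0, '\t', '\n', 0, 0, '\r', 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  const uint8_t *data = reinterpret_cast<const uint8_t *>(input);
  const uint8x8_t first = vld1_u8(data);
  const uint8x8_t last = vld1_u8(data + length - 8);  // overlaps `first`
  const uint8x8_t matches = vorr_u8(vceq_u8(vqtbl1_u8(rnt, first), first),
                                    vceq_u8(vqtbl1_u8(rnt, last), last));
  return vmaxv_u8(matches) != 0;
}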

src/unicode.cpp: resolved review thread (collapsed)
lemire (Member, Author) commented on Apr 30, 2024:

@anonrig I think we can merge at will.

anonrig (Member) commented on May 1, 2024:

@lemire fuzzing is failing with the proposed change

lemire (Member, Author) commented on May 1, 2024:

> fuzzing is failing with the proposed change

The change only affects ARM, and the fuzzer runs on x64.

I believe that oss-fuzz is busted: I have received many invalid bug reports and have asked oss-fuzz to dismiss them. They are all of the type MemorySanitizer: use-of-uninitialized-value, and they have to do with the standard library.

lemire marked this pull request as draft on May 1, 2024, 18:05
lemire (Member, Author) commented on May 1, 2024:

@anonrig I am not worried about the fuzzer failing, but I still marked this PR as a draft because there might be another issue... unrelated to the PR. Investigating...

lemire marked this pull request as ready for review on May 1, 2024, 18:36
lemire (Member, Author) commented on May 1, 2024:

@anonrig Should be good now.

anonrig merged commit 08aad0b into main on May 1, 2024 (35 of 66 checks passed).
anonrig deleted the opti_has_tabs_or_newline branch on May 1, 2024, 18:53.