Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bable stream benchmark broken #2419

Open
psychocoderHPC opened this issue Nov 7, 2024 · 4 comments
Open

bable stream benchmark broken #2419

psychocoderHPC opened this issue Nov 7, 2024 · 4 comments
Labels
Milestone

Comments

@psychocoderHPC
Copy link
Member

Your current bablestream implementation in the last release 1.2.0 is broken.
With #2299 the behaviour of the benchmark was changed.

Upstream bablestream is commuting the following code:

init:    a, b, c         // set to init value a = 0.1 ; b = 0.2 ; c = 0.0
copy:    c = a
mul:     b = scalar * c  // with scalar = 0.4
add:     c = a + b
triad:   a = b + scalar * c
nstream: a += b + scalar * c
dot:     sum = a * b     // reduce 

verification: check errSUm = abs((sum - goldSum)/goldSum)

Our current implementation:

init:    a, b, c         // set to init value a = 1.0 ; b = 0.0 ; c = 0.0
copy:    b = a
mul:     b = scalar * a  // with scalar = 2.0
add:     c = a + b
triad:   c = a + scalar * b
nstream: missing
dot:     sum = a * b     // reduce 

verification: 
  check if c == initA + scalar * scalar * initA
  check if b == scalar * initA
  check if a == initA

Note that Nstream is in the upstream code optional but we implemented it in the version before #2299

The problem is that you can not use the current implementation to compare the performance against other implementation because we implement the corresponding stream function but with different access orders to arrays which can affect the performance because of caching effects. This should not be a big issue for large arrays.
The copy kernel can be removed or wrongly implemented and the verification is still reporting all results are correct. The reason for this behaviour is that copy set b and the next mul operation is overwriting the result.
The same issue happened with the add operation.
In our implementation array a is never changed except in the init.

The upstream code is designed so that operations depend on each other, changing the order or making a mistake in the implementation will result in an error in the validation.

WARNING please do not use the bablestream implementation of alpaka release 1.2.0 for comparison against other bablestream implementation!

@psychocoderHPC psychocoderHPC added this to the 2.0.0 milestone Nov 7, 2024
@SimeonEhrig
Copy link
Member

I agree with @psychocoderHPC. In theory it could be possible, that the changed kernel do the same calculations like before, but this a really huge work to prove it. Therefore it is much easier to change the kernels like it was before that we are sure, that our benchmarks a comparable with the reference implementation.

Also the init values could be problematic. Who guaranties use, that the out-of-order execution of a CPU dose not skip the addition with 0 or the multiplication with 1 because it necessary. It is possible, that our implementation of the multiplication kernel is actual the same like the copy kernel, because each value is multiplied with 1.

@mehmetyusufoglu
Copy link
Contributor

mehmetyusufoglu commented Nov 8, 2024

@psychocoderHPC 1. current alpaka sums errors for verification. Your definition of error above, describing current situation, is not correct.
2. Secondly, Init values of b and c are not important they are overwritten.
3. Third, the caching effect you are talking about valid but it is not observed in the results at different machines therefore i did not notice it. [Order is important as far as I saw; even AMD used the Univ Bristol repository so kernel run order is standard. ]
4. The kernel nstream is not included in babelstream benchmark. It is optional and not one of 5 babelstream standard benchmark kernels. Nobody knows why it is there, it is not run and displayed at the results.
Thanks i mostly fixed.

@mehmetyusufoglu
Copy link
Contributor

I agree with @psychocoderHPC. In theory it could be possible, that the changed kernel do the same calculations like before, but this a really huge work to prove it. Therefore it is much easier to change the kernels like it was before that we are sure, that our benchmarks a comparable with the reference implementation.

Also the init values could be problematic. Who guaranties use, that the out-of-order execution of a CPU dose not skip the addition with 0 or the multiplication with 1 because it necessary. It is possible, that our implementation of the multiplication kernel is actual the same like the copy kernel, because each value is multiplied with 1.

@SimeonEhrig Since scalar is 2 multiplication is not equal to copy but i understand your point. Thanks.

@mehmetyusufoglu
Copy link
Contributor

Fixed by #2420

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants