[EPIC] std::simd support in libcu++ #30
Comments
I clearly see the value over the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain it?
I'm not sure I follow. Isn't
I see two ways to trigger vectorized loads:
and what I consider the status quo:
The first case removes the reinterpret_cast, but it limits the API to multiples of 2 elements. The second doesn't limit the API, but requires ugly code. Does std::simd allow keeping a clean API without requiring ugly code?
Indeed. Instead of
We have
One of the other advantages of
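A host-side sketch of the load/store style under discussion, assuming `std::experimental::simd` from the Parallelism TS v2 (shipped with libstdc++ 11 and later) as a stand-in, since the libcu++ interface is not defined in this thread. The helper `scale4` and the alias `float4_simd` are hypothetical names chosen for illustration, and nothing here is device code.

```cpp
#include <cassert>
#include <experimental/simd>  // Parallelism TS v2, assumed available (libstdc++ 11+)

namespace stdx = std::experimental;

// Four floats handled as one value, comparable in spirit to CUDA's float4.
// (Hypothetical alias, not part of any library.)
using float4_simd = stdx::fixed_size_simd<float, 4>;

// Hypothetical helper: out[i] = in[i] * factor for i in [0, 4).
// The API takes plain float pointers; the vector-wide access is expressed
// through copy_from/copy_to rather than a reinterpret_cast to a vector type.
inline void scale4(const float* in, float* out, float factor)
{
    assert(in != nullptr && out != nullptr);
    float4_simd v;
    v.copy_from(in, stdx::element_aligned);  // load 4 consecutive floats
    v = v * float4_simd(factor);             // broadcast factor, multiply elementwise
    v.copy_to(out, stdx::element_aligned);   // store 4 consecutive floats
}
```

Whether such a load becomes a single vectorized instruction is up to the compiler and the alignment guarantee passed in; `stdx::element_aligned` is the weakest one, and `stdx::vector_aligned` promises more.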
Great. Does it help with the last few elements of a row whose length isn't a multiple of N?
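One common way to handle such a remainder, independent of what a libcu++ `std::simd` would eventually offer, is a full-width main loop followed by a scalar tail. The sketch below is host-only and again assumes `std::experimental::simd` from libstdc++; the function name `advance_row` and the width of 4 are illustrative choices, not anything proposed in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>  // Parallelism TS v2, assumed available (libstdc++ 11+)

namespace stdx = std::experimental;

// Hypothetical helper: pos[i] += vel[i] * dt for a row of n elements,
// where n does not have to be a multiple of the vector width.
inline void advance_row(float* pos, const float* vel, std::size_t n, float dt)
{
    assert(n == 0 || (pos != nullptr && vel != nullptr));
    using vec = stdx::fixed_size_simd<float, 4>;
    constexpr std::size_t width = vec::size();

    std::size_t i = 0;
    // Main loop: process full chunks of `width` elements.
    for (; i + width <= n; i += width) {
        vec p, v;
        p.copy_from(pos + i, stdx::element_aligned);
        v.copy_from(vel + i, stdx::element_aligned);
        p = p + v * vec(dt);
        p.copy_to(pos + i, stdx::element_aligned);
    }
    // Tail: the remaining n % width elements are handled one at a time.
    for (; i < n; ++i) {
        pos[i] += vel[i] * dt;
    }
}
```

The tail still needs separate code here; the simd type only keeps the main loop free of casts.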
I think the real game changer of

Here is a small portable kernel, using alpaka (I was a collaborator) for kernel abstraction and LLAMA (I am the author) for data layout abstraction, of an n-body simulation, updating particle positions based on their velocities:

```cpp
template<int ElementsPerThread>
struct MoveKernel
{
    template<typename Acc, typename View>
    ALPAKA_FN_HOST_ACC void operator()(const Acc& acc, View particles) const
    {
        // Each thread handles ElementsPerThread consecutive particles starting at index i.
        const auto ti = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        const auto i = ti * ElementsPerThread;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> pos;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> vel;
        // Load positions and velocities from the view into SIMD records.
        llama::loadSimd(particles(i)(tag::Pos{}), pos);
        llama::loadSimd(particles(i)(tag::Vel{}), vel);
        // Advance the positions by one timestep and write them back.
        llama::storeSimd(pos + vel * +timestep, particles(i)(tag::Pos{}));
    }
};
```

Source: https://github.com/alpaka-group/llama/blob/develop/examples/alpaka/nbody/nbody.cpp#L221-L230

The My example above does more, which is not in scope of
We should add a heterogeneous implementation of std::simd to libcu++.

High-level goals:
- int4/double2
- simd::copy_from/copy_to to standardize how vectorized load/stores should be done in device code (replace status quo)

Tasks
- <simd> (see p1928)