
[EPIC] std::simd support in libcu++ #30

Open
2 tasks
jrhemstad opened this issue Apr 20, 2023 · 7 comments
Assignees: miscco
Labels: 2024-2025 goal · feature request (New feature or request.) · libcu++ (For all items related to libcu++)

Comments

@jrhemstad
Collaborator

jrhemstad commented Apr 20, 2023

We should add a heterogeneous implementation of std::simd to libcu++.

High-level goals:

  • Works in host/device code
  • Replace the need for the CUDA vector types like int4/double2
  • Use simd::copy_from/copy_to to standardize how vectorized loads/stores are done in device code, replacing the status quo (see the sketch after this list)
  • Exposure for CUDA SIMD intrinsics
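
A rough sketch of what the copy_from/copy_to goal could look like in device code. This is hypothetical: the cuda::std spellings and the header are assumptions, with the interface borrowed from the Parallelism TS v2 (P1928); the final libcu++ API may differ.

#include <cuda/std/simd> // assumed future header

__global__ void scale(float* data, float factor) {
    // fixed_size_simd, vector_aligned, and copy_from/copy_to follow the
    // Parallelism TS v2 spelling; living in cuda::std:: is an assumption
    using vec = cuda::std::fixed_size_simd<float, 4>;
    const auto i = (blockIdx.x * blockDim.x + threadIdx.x) * vec::size();

    vec v;
    v.copy_from(data + i, cuda::std::vector_aligned); // one vectorized load (e.g. a 128-bit LDG)
    v *= factor;                                      // element-wise arithmetic for free
    v.copy_to(data + i, cuda::std::vector_aligned);   // one vectorized store
}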

Tasks

  • Review/discuss CUTLASS implementation of similar types
  • Participate in LEWG discussion on incorporating <simd> (see P1928)
@nouiz

nouiz commented Jul 5, 2023

I clearly see the value versus the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain?

@jrhemstad
Collaborator Author

I clearly see the value versus the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain?

I'm not sure I follow. Isn't floatN the status quo?

@nouiz

nouiz commented Jul 5, 2023

I see two ways to trigger vectorized loads:

__global__ void f(float2* in) {
    // use `in` directly; float2 loads are vectorized
}

and what I consider the status quo:

__global__ void f(float* in) {
    // ... reinterpret_cast<float2*>(in) ...
}

The first case removes the reinterpret_cast, but it limits the API to multiples of 2 elements. The second doesn't limit the API, but requires ugly code.

Does std::simd allow keeping a clean API without requiring ugly code?

@jrhemstad
Collaborator Author

Does std::simd allow keeping a clean API without requiring ugly code?

Indeed.

Instead of

__global__ void f(float* in) {
    float4 vec = *reinterpret_cast<float4*>(in);
}

We have

__global__ void f(float* in) {
    std::fixed_size_simd<float, 4> vec{in, std::vector_aligned};
}

One of the other advantages of std::simd over float4 is that simd types come with well-defined binary operators like operator+, whereas float4 does not. There is a whole host of machinery you get for free with a std::simd type that you would need to implement yourself with float4.
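
For illustration, a sketch: the hand-written float4 operator below is exactly the boilerplate one has to maintain today, while the simd spelling again borrows the Parallelism TS v2 interface and is an assumption for libcu++.

// float4 has no arithmetic: every operator must be written by hand
__host__ __device__ float4 operator+(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// a simd type ships the whole element-wise operator set
__global__ void axpy(const float* x, float* y, float a) {
    using vec = std::fixed_size_simd<float, 4>;
    const auto i = (blockIdx.x * blockDim.x + threadIdx.x) * vec::size();
    vec vx{x + i, std::vector_aligned};
    vec vy{y + i, std::vector_aligned};
    vy = a * vx + vy;                       // operator* and operator+ come with the type
    vy.copy_to(y + i, std::vector_aligned);
}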

@nouiz

nouiz commented Jul 6, 2023

Great. Does it help with the last few elements of a row whose length isn't a multiple of N?
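
For reference, the common tail-handling pattern with the TS v2 interface is a full-width main loop plus a scalar epilogue; whether a libcu++ std::simd would add masked loads/stores to avoid the epilogue is an open question. A minimal host-side sketch:

#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

void add_one(float* data, std::size_t count) {
    using vec = stdx::fixed_size_simd<float, 4>;
    std::size_t i = 0;
    for (; i + vec::size() <= count; i += vec::size()) {
        vec v{data + i, stdx::element_aligned};
        v += 1.0f;
        v.copy_to(data + i, stdx::element_aligned);
    }
    for (; i < count; ++i) // scalar epilogue for the last few elements
        data[i] += 1.0f;
}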

@miscco miscco added feature request New feature or request. libcu++ For all items related to libcu++ labels Jul 12, 2023
@miscco miscco self-assigned this Jul 12, 2023
@bernhardmgruber
Contributor

bernhardmgruber commented Apr 19, 2024

I think the real game changer of std::simd in libcu++ is that it allows generic single-source portable SIMD programming. I can write a kernel and it will explicitly (guaranteed, or compilation error) vectorize for a CPU target and collapse to scalar code on a GPU target. This is a huge improvement over relying on auto-vectorization of scalar code, which is brittle but at least compiles for both CUDA and CPU targets. Also, barely any SIMD library supports CUDA (Kokkos SIMD is a notable exception), so explicit SIMD code is often locked onto CPU targets. The result is that you again have to maintain two code paths when targeting both CPU and GPU, or write your own SIMD abstraction library, which will probably be worse than what we could provide here. I have written one myself:

Here is a small portable kernel from an n-body simulation, updating particle positions based on their velocities, using alpaka (where I was a collaborator) for kernel abstraction and LLAMA (of which I am the author) for data layout abstraction:

template<int ElementsPerThread>
struct MoveKernel
{
    template<typename Acc, typename View>
    ALPAKA_FN_HOST_ACC void operator()(const Acc& acc, View particles) const
    {
        const auto ti = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        const auto i = ti * ElementsPerThread;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> pos;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> vel;
        llama::loadSimd(particles(i)(tag::Pos{}), pos);
        llama::loadSimd(particles(i)(tag::Vel{}), vel);
        llama::storeSimd(pos + vel * +timestep, particles(i)(tag::Pos{}));
    }
};

Source: https://github.com/alpaka-group/llama/blob/develop/examples/alpaka/nbody/nbody.cpp#L221-L230

The ElementsPerThread parameter chooses the behavior of llama::SimdN. If it is 1, the kernel collapses into scalar code. If it is >1, SIMD types are used, and with the right compiler flags AVX2, AVX-512, NEON, etc. instructions are produced. MakeSizedBatch is essentially a wrapper around xsimd::make_sized_batch_t<T, N> from xsimd, the SIMD library I used. std::simd in libcu++ could entirely cover and standardize this use case.

My example above does more than is in scope for std::simd (yet), like creating SIMD-fied structs (Vec3 is a struct of 3 floats here) and abstracting loads/stores from data layouts (particles can just as well be a struct-of-arrays container here).
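
For comparison, the ElementsPerThread pattern above compressed to a plain CPU loop with a standard simd type — a sketch using std::experimental::simd from the Parallelism TS v2, since std::simd is not shipping yet:

#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

// N == 1 collapses to scalar code; N > 1 uses SIMD registers, and with the
// right compiler flags AVX2/AVX-512/NEON instructions are emitted.
template <int N>
void move(float* pos, const float* vel, std::size_t count, float timestep) {
    using vec = stdx::fixed_size_simd<float, N>;
    for (std::size_t i = 0; i < count; i += N) { // assumes count % N == 0
        vec p{pos + i, stdx::element_aligned};
        vec v{vel + i, stdx::element_aligned};
        p += v * timestep; // pos += vel * timestep, element-wise
        p.copy_to(pos + i, stdx::element_aligned);
    }
}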

@fbusato
Contributor

fbusato commented Dec 4, 2024

std::simd has been approved for C++26 🎉
