
[EPIC] std::simd support in libcu++ #30

Open
2 tasks
jrhemstad opened this issue Apr 20, 2023 · 7 comments
Assignees: miscco
Labels: 2024-2025 goal · feature request (New feature or request.) · libcu++ (For all items related to libcu++)

Comments

@jrhemstad
Collaborator

jrhemstad commented Apr 20, 2023

We should add a heterogeneous implementation of std::simd to libcu++.

High-level goals:

  • Works in host/device code
  • Replace the need for the CUDA vector types like int4/double2
  • Use simd::copy_from/copy_to to standardize how vectorized loads/stores are done in device code, replacing the status quo (see the sketch after this list)
  • Exposure for CUDA SIMD intrinsics
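
A rough sketch of what the copy_from/copy_to goal could look like in device code. This is hypothetical: the cuda::std spellings and the header are assumptions, with the interface borrowed from the Parallelism TS v2 (P1928); the final libcu++ API may differ.

#include <cuda/std/simd> // assumed future header

__global__ void scale(float* data, float factor) {
    // fixed_size_simd, vector_aligned, and copy_from/copy_to follow the
    // Parallelism TS v2 spelling; living in cuda::std:: is an assumption
    using vec = cuda::std::fixed_size_simd<float, 4>;
    const auto i = (blockIdx.x * blockDim.x + threadIdx.x) * vec::size();

    vec v;
    v.copy_from(data + i, cuda::std::vector_aligned); // one vectorized load (e.g. a 128-bit LDG)
    v *= factor;                                      // element-wise arithmetic for free
    v.copy_to(data + i, cuda::std::vector_aligned);   // one vectorized store
}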

Tasks

  • Review/discuss CUTLASS implementation of similar types
  • Participate in LEWG discussion on incorporating <simd> (see P1928)
@nouiz

nouiz commented Jul 5, 2023

I clearly see the value versus the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain?

@jrhemstad
Collaborator Author

I clearly see the value versus the status quo, but the advantage over CUDA's floatN types isn't clear to me. Can you explain?

I'm not sure I follow. Isn't floatN the status quo?

@nouiz

nouiz commented Jul 5, 2023

I see two ways to trigger vectorized loads:

__global__ void f(float2* in) {
    // use `in` directly; float2 loads are vectorized
}

and what I consider the status quo:

__global__ void f(float* in) {
    // ... reinterpret_cast<float2*>(in) ...
}

The first case removes the reinterpret_cast, but it limits the API to multiples of 2 elements. The second doesn't limit the API, but requires ugly code.

Does std::simd allow keeping a clean API without requiring ugly code?

@jrhemstad
Collaborator Author

Does std::simd allow keeping a clean API without requiring ugly code?

Indeed.

Instead of

__global__ void f(float* in) {
    float4 vec = *reinterpret_cast<float4*>(in);
}

We have

__global__ void f(float* in) {
    std::fixed_size_simd<float, 4> vec{in, std::vector_aligned};
}

One of the other advantages of std::simd over float4 is that simd types come with well-defined binary operators like operator+, whereas float4 does not. There is a whole host of machinery you get for free with a std::simd type that you would need to implement yourself with float4.
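
For illustration, a sketch: the hand-written float4 operator below is exactly the boilerplate one has to maintain today, while the simd spelling again borrows the Parallelism TS v2 interface and is an assumption for libcu++.

// float4 has no arithmetic: every operator must be written by hand
__host__ __device__ float4 operator+(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// a simd type ships the whole element-wise operator set
__global__ void axpy(const float* x, float* y, float a) {
    using vec = std::fixed_size_simd<float, 4>;
    const auto i = (blockIdx.x * blockDim.x + threadIdx.x) * vec::size();
    vec vx{x + i, std::vector_aligned};
    vec vy{y + i, std::vector_aligned};
    vy = a * vx + vy;                       // operator* and operator+ come with the type
    vy.copy_to(y + i, std::vector_aligned);
}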

@nouiz

nouiz commented Jul 6, 2023

Great. Does it help with the last few elements of a row whose length isn't a multiple of N?
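
For reference, the common tail-handling pattern with the TS v2 interface is a full-width main loop plus a scalar epilogue; whether a libcu++ std::simd would add masked loads/stores to avoid the epilogue is an open question. A minimal host-side sketch:

#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

void add_one(float* data, std::size_t count) {
    using vec = stdx::fixed_size_simd<float, 4>;
    std::size_t i = 0;
    for (; i + vec::size() <= count; i += vec::size()) {
        vec v{data + i, stdx::element_aligned};
        v += 1.0f;
        v.copy_to(data + i, stdx::element_aligned);
    }
    for (; i < count; ++i) // scalar epilogue for the last few elements
        data[i] += 1.0f;
}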

@miscco miscco added feature request New feature or request. libcu++ For all items related to libcu++ labels Jul 12, 2023
@miscco miscco self-assigned this Jul 12, 2023
@bernhardmgruber
Contributor

bernhardmgruber commented Apr 19, 2024

I think the real game changer of std::simd in libcu++ is that it allows generic single-source portable SIMD programming. I can write a kernel and it will explicitly (guaranteed, or compilation error) vectorize for a CPU target and collapse to scalar code on a GPU target. This is a huge improvement over relying on auto-vectorization of scalar code, which is brittle but at least compiles for both CUDA and CPU targets. Also, barely any SIMD library supports CUDA (Kokkos SIMD is a notable exception), so explicit SIMD code is often locked onto CPU targets. The result is that you again have to maintain two code paths when targeting both CPU and GPU, or write your own SIMD abstraction library, which will probably be worse than what we could provide here. I have written one myself:

Here is a small portable kernel from an n-body simulation, updating particle positions based on their velocities, using alpaka (where I was a collaborator) for kernel abstraction and LLAMA (of which I am the author) for data layout abstraction:

template<int ElementsPerThread>
struct MoveKernel
{
    template<typename Acc, typename View>
    ALPAKA_FN_HOST_ACC void operator()(const Acc& acc, View particles) const
    {
        const auto ti = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        const auto i = ti * ElementsPerThread;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> pos;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> vel;
        llama::loadSimd(particles(i)(tag::Pos{}), pos);
        llama::loadSimd(particles(i)(tag::Vel{}), vel);
        llama::storeSimd(pos + vel * +timestep, particles(i)(tag::Pos{}));
    }
};

Source: https://github.com/alpaka-group/llama/blob/develop/examples/alpaka/nbody/nbody.cpp#L221-L230

The ElementsPerThread parameter chooses the behavior of llama::SimdN. If it is 1, the kernel collapses into scalar code. If it is >1, SIMD types are used, and with the right compiler flags AVX2, AVX-512, NEON, etc. instructions are produced. MakeSizedBatch is essentially a wrapper around xsimd::make_sized_batch_t<T, N> from xsimd, the SIMD library I used. std::simd in libcu++ could entirely cover and standardize this use case.

My example above does more than is in scope for std::simd (yet), like creating SIMD-fied structs (Vec3 is a struct of 3 floats here) and abstracting loads/stores from data layouts (particles can just as well be a struct-of-arrays container here).
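
For comparison, the ElementsPerThread pattern above compressed to a plain CPU loop with a standard simd type — a sketch using std::experimental::simd from the Parallelism TS v2, since std::simd is not shipping yet:

#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

// N == 1 collapses to scalar code; N > 1 uses SIMD registers, and with the
// right compiler flags AVX2/AVX-512/NEON instructions are emitted.
template <int N>
void move(float* pos, const float* vel, std::size_t count, float timestep) {
    using vec = stdx::fixed_size_simd<float, N>;
    for (std::size_t i = 0; i < count; i += N) { // assumes count % N == 0
        vec p{pos + i, stdx::element_aligned};
        vec v{vel + i, stdx::element_aligned};
        p += v * timestep; // pos += vel * timestep, element-wise
        p.copy_to(pos + i, stdx::element_aligned);
    }
}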

@fbusato
Contributor

fbusato commented Dec 4, 2024

std::simd has been approved for C++26 🎉
