[FEA]: Add a copy routine to support data copy between two mdspans #2306
Comments
(Tentatively assigned to Federico as per our offline discussion 🙂)
cc: @kmaehashi for vis
Can you elaborate on how you envision this would work? This is necessarily a host API, and NVRTC can't compile host code.
We have a C library now, don't we? 🙂 @jrhemstad @gevtushenko Correct me if I am wrong since I am not fluent enough in
Another reason for NVRTC compatibility: I think to unblock nvmath-python, we should just focus on the D2D copies (between potentially two different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use
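For illustration, a minimal sketch of that division of labor, assuming a hypothetical `cccl_copy_d2d()` for the device-side relayout; only `cudaMemcpyAsync` is a real API here:

```cpp
// Sketch: nvmath-python handles the H2D leg with a plain contiguous
// cudaMemcpyAsync, then defers the layout conversion to the proposed
// D2D routine. cccl_copy_d2d is a hypothetical placeholder name.
#include <cuda_runtime.h>

void h2d_then_relayout(void* d_dst, const void* h_src, size_t nbytes,
                       void* d_staging, cudaStream_t stream) {
  // H2D: contiguous host buffer -> contiguous device staging buffer.
  cudaMemcpyAsync(d_staging, h_src, nbytes, cudaMemcpyHostToDevice, stream);
  // D2D: relayout from staging into the strided destination via the
  // proposed routine (hypothetical; would take mdspan views of each side).
  // cccl_copy_d2d(dst_mdspan, staging_mdspan, stream);
}
```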
So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations". By "NVRTC compatible" I understood you wanted it so someone could take
I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm. What I originally had in mind is really just a kernel and I wanted to do pre-/post- processing as well as kernel compilation/launch myself, but I had forgotten that this does not fit in the compute paradigm anywhere in CCCL. Thanks for the clarifying questions.
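If it were surfaced like a CUB device-wide algorithm, the interface might follow CUB's usual two-phase temp-storage pattern; the sketch below is purely illustrative and not an actual CCCL signature:

```cpp
// Hypothetical CUB-style two-phase interface (names illustrative):
// first call with d_temp_storage == nullptr to query temp_storage_bytes,
// then call again with the allocated buffer to launch the copy.
template <class DstMdspan, class SrcMdspan>
cudaError_t DeviceCopyMdspan(void* d_temp_storage, size_t& temp_storage_bytes,
                             DstMdspan dst, SrcMdspan src,
                             cudaStream_t stream = 0);
```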
FYI, Apple MLX counterpart: ml-explore/mlx#1421
Is this a duplicate?
Area
libcu++
Is your feature request related to a problem? Please describe.
CUDA Python & nvmath-python need to have a copy routine added to CCCL for copying from one `mdspan` to another. The requirements for this copy routine include:

- `mdspan`s covering either host or device tensors, so that H2D/D2H copies can be abstracted out by the same API.

This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).
We believe that if src and dst do not overlap, and both reside on the device, there might be existing implementations in cuTENSOR (e.g., cutensorPermute) on which we can base a prototype. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.
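As a functionality-first starting point (not the proposed API), a naive D2D relayout copy can assign one thread per element and let each side apply its own strides; this rank-2, raw-pointer sketch mirrors what an `mdspan`-based kernel would do through its layout mapping:

```cpp
// Naive rank-2 relayout copy: one thread per element. Each side indexes
// with its own strides, so any two strided layouts (e.g., row- vs
// column-major) can be converted. Assumes dst and src do not overlap.
template <class T>
__global__ void relayout_copy(T* dst, const T* src,
                              size_t rows, size_t cols,
                              size_t dst_rs, size_t dst_cs,  // dst strides
                              size_t src_rs, size_t src_cs)  // src strides
{
  size_t idx = blockIdx.x * size_t(blockDim.x) + threadIdx.x;
  if (idx < rows * cols) {
    size_t i = idx / cols;
    size_t j = idx % cols;
    dst[i * dst_rs + j * dst_cs] = src[i * src_rs + j * src_cs];
  }
}
```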
Describe the solution you'd like
Not sure what the best solution is, so just a thought: perhaps offering an overload of `cuda::std::copy` that is specialized for `mdspan`? A rough shape is sketched below.
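A possible shape for such an overload, purely as a sketch (no such overload exists in libcu++ today):

```cpp
#include <cuda/std/mdspan>

// Hypothetical overload: copies element-wise from src to dst, which may
// have different layout policies; extents are expected to match.
template <class T, class Extents, class SrcLayout, class DstLayout,
          class SrcAccessor, class DstAccessor>
void copy(cuda::std::mdspan<const T, Extents, SrcLayout, SrcAccessor> src,
          cuda::std::mdspan<T, Extents, DstLayout, DstAccessor> dst);
```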
Describe alternatives you've considered
No response
Additional context
Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.