
[FEA]: Add a copy routine to support data copy between two mdspans #2306

Open · leofang opened this issue Aug 28, 2024 · 10 comments

Labels: feature request (New feature or request.)
leofang (Member) commented Aug 28, 2024

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

CUDA Python & nvmath-python need to have a copy routine added to CCCL for copying from one mdspan to another. The requirements for this copy routine include:

  1. This routine would copy the contents of an ndarray (an N-D tensor) A with a given data type, shape & strides to an ndarray B with the same dtype & shape but not necessarily the same strides.
    • Since the underlying ndarrays are strided, they are not necessarily contiguous in memory and need not share the same memory layout, so a dedicated copy kernel is needed (see the sketch after this list).
  2. This routine can handle mdspans covering either host or device tensors, so that H2D/D2H copies can be abstracted away behind the same API.
    • In the case of D2H copies, synchronous copies are fine.
  3. This routine should be JIT-compilable (by NVRTC) to serve Python users better.
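
To make requirement 1 concrete, here is a minimal sketch of the kind of kernel such a routine needs under the hood. This is an illustration only, not an existing CCCL API, and the name copy_strided is made up: each element's flat index is unravelled into coordinates via the shared shape, and each tensor's own strides are then applied.

```cpp
// Hypothetical sketch only -- not an existing CCCL API. Shows why a dedicated
// kernel is needed when src and dst share shape and dtype but not layout.
#include <cuda/std/array>
#include <cstddef>

template <typename T, int Rank>
__global__ void copy_strided(const T* src, T* dst,
                             cuda::std::array<std::size_t, Rank> shape,
                             cuda::std::array<std::ptrdiff_t, Rank> src_strides,  // in elements
                             cuda::std::array<std::ptrdiff_t, Rank> dst_strides,  // in elements
                             std::size_t total)
{
  for (std::size_t i = blockIdx.x * std::size_t{blockDim.x} + threadIdx.x; i < total;
       i += std::size_t{gridDim.x} * blockDim.x)
  {
    // Unravel the flat index into coordinates, then re-apply each layout's strides.
    std::size_t rem = i;
    std::ptrdiff_t src_off = 0;
    std::ptrdiff_t dst_off = 0;
    for (int d = Rank - 1; d >= 0; --d)
    {
      const std::size_t coord = rem % shape[d];
      rem /= shape[d];
      src_off += static_cast<std::ptrdiff_t>(coord) * src_strides[d];
      dst_off += static_cast<std::ptrdiff_t>(coord) * dst_strides[d];
    }
    dst[dst_off] = src[src_off];
  }
}
```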

This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).

We believe that if src and dst do not overlap, and if both reside on the device, there may be existing implementations in cuTENSOR (e.g. cutensorPermute) on which a prototype could be based. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.

Describe the solution you'd like

Not sure what the best solution is, so just a thought: perhaps offer an overload of cuda::std::copy that is specialized for mdspan?
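
Purely as a strawman to anchor the discussion (nothing here is an existing or proposed CCCL signature), such an overload could look roughly like this:

```cpp
// Strawman signature only; neither the name nor its home is settled.
#include <cuda/std/mdspan>

namespace cuda::std {

// Copies every element of src into dst. Assumed preconditions:
// matching extents, convertible element types, no overlap between src and dst.
template <class SrcElement, class SrcExtents, class SrcLayout, class SrcAccessor,
          class DstElement, class DstExtents, class DstLayout, class DstAccessor>
void copy(mdspan<SrcElement, SrcExtents, SrcLayout, SrcAccessor> src,
          mdspan<DstElement, DstExtents, DstLayout, DstAccessor> dst);

} // namespace cuda::std
```

Whether it should instead sit next to the CUB-style device-wide algorithms and take a stream argument is an open question (see the discussion below).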

Describe alternatives you've considered

No response

Additional context

Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.

@leofang leofang added the feature request New feature or request. label Aug 28, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Aug 28, 2024
leofang (Member, Author) commented Aug 28, 2024

(Tentatively assigned to Federico as per our offline discussion 🙂)

leofang (Member, Author) commented Aug 28, 2024

> This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).

cc: @kmaehashi for vis

jrhemstad (Collaborator) commented

> This routine should be JIT-compilable (by NVRTC)

Can you elaborate on how you envision this would work? This is necessarily a host API and NVRTC can't compile host code.

leofang (Member, Author) commented Sep 6, 2024

> This is necessarily a host API and NVRTC can't compile host code.

We have a C library now, don't we? 🙂

@jrhemstad @gevtushenko Correct me if I am wrong, since I am not fluent enough in mdspan: given that shape, strides, and dtype are all run-time properties in Python, if this were a host API we would have to pre-instantiate a very large number of copy kernel instances, and even then it would not cover all possibilities. Therefore, I feel NVRTC compatibility (which is a requirement of the C library anyway) is necessary.
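
To make the run-time-properties point concrete, here is a rough host-side sketch of the workflow in question. The code generation and caching are assumptions for illustration, not an existing CCCL or nvmath-python facility: dtype and rank only become known at run time, so they are baked into the kernel source and compiled with NVRTC on demand rather than pre-instantiated for every combination.

```cpp
// Hypothetical workflow sketch, not a CCCL facility: generate the kernel source
// once dtype and rank are known, JIT it with NVRTC, and (in practice) cache the
// result keyed on (dtype, rank).
#include <nvrtc.h>
#include <cstddef>
#include <stdexcept>
#include <string>

std::string jit_strided_copy_to_ptx(const std::string& dtype, int rank)
{
  // Bake the run-time properties into the source as compile-time constants.
  // shape/strides are assumed to be passed as device-accessible arrays.
  const std::string src =
      "using T = " + dtype + ";\n"
      "constexpr int RANK = " + std::to_string(rank) + ";\n"
      "extern \"C\" __global__ void copy_strided(\n"
      "    const T* src, T* dst,\n"
      "    const unsigned long long* shape,\n"
      "    const long long* src_strides, const long long* dst_strides,\n"
      "    unsigned long long total)\n"
      "{\n"
      "  for (unsigned long long i = blockIdx.x * 1ull * blockDim.x + threadIdx.x;\n"
      "       i < total; i += 1ull * gridDim.x * blockDim.x) {\n"
      "    unsigned long long rem = i;\n"
      "    long long src_off = 0, dst_off = 0;\n"
      "    for (int d = RANK - 1; d >= 0; --d) {\n"
      "      unsigned long long c = rem % shape[d]; rem /= shape[d];\n"
      "      src_off += (long long)c * src_strides[d];\n"
      "      dst_off += (long long)c * dst_strides[d];\n"
      "    }\n"
      "    dst[dst_off] = src[src_off];\n"
      "  }\n"
      "}\n";

  nvrtcProgram prog{};
  if (nvrtcCreateProgram(&prog, src.c_str(), "copy_strided.cu", 0, nullptr, nullptr) != NVRTC_SUCCESS)
    throw std::runtime_error("nvrtcCreateProgram failed");

  const char* opts[] = {"--std=c++17"};
  if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS)
  {
    nvrtcDestroyProgram(&prog);
    throw std::runtime_error("NVRTC compilation failed");
  }

  std::size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);
  nvrtcDestroyProgram(&prog);
  return ptx;  // load via cuModuleLoadData, launch via cuLaunchKernel
}
```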

leofang (Member, Author) commented Sep 6, 2024

Another reason for NVRTC compatibility: I think to unblock nvmath-python, we should just focus on the D2D copies (between potentially two different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use cudaMemcpyAsync with a staging buffer) and is already what CuPy does for us today. And I presume a D2D copy can be achieved by a single kernel compiled by NVRTC.
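
As an illustration of the staging-buffer idea for the H2D direction (the helper names here are made up, and the D2D launcher is assumed to wrap the strided-copy kernel sketched above):

```cpp
// Hypothetical sketch of the H2D path: stage the contiguous host data on the
// device with cudaMemcpyAsync, then let the D2D strided-copy kernel perform
// the layout change.
#include <cuda_runtime.h>
#include <cstddef>

// Assumed to exist (e.g. wrapping the NVRTC-compiled kernel above): copies a
// contiguous device buffer into a strided device layout.
template <typename T, int Rank>
void launch_copy_strided(const T* d_src_contiguous, T* d_dst,
                         const std::size_t (&shape)[Rank],
                         const std::ptrdiff_t (&dst_strides)[Rank],
                         cudaStream_t stream);

template <typename T, int Rank>
void copy_h2d_strided(const T* h_src_contiguous, T* d_dst,
                      const std::size_t (&shape)[Rank],
                      const std::ptrdiff_t (&dst_strides)[Rank],
                      cudaStream_t stream)
{
  std::size_t n = 1;
  for (std::size_t extent : shape)
    n *= extent;

  // Staging buffer on the device; a real implementation would pool/reuse it
  // and would want the host source in pinned memory for a truly async copy.
  T* d_staging = nullptr;
  cudaMallocAsync(&d_staging, n * sizeof(T), stream);
  cudaMemcpyAsync(d_staging, h_src_contiguous, n * sizeof(T),
                  cudaMemcpyHostToDevice, stream);

  // The layout change happens entirely on the device.
  launch_copy_strided<T, Rank>(d_staging, d_dst, shape, dst_strides, stream);
  cudaFreeAsync(d_staging, stream);
}
```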

jrhemstad (Collaborator) commented Sep 6, 2024

> We have a C library now, don't we?

So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations".

By "NVRTC compatible" I understood you wanted it so someone could take cuda::copy(mdspan, mdspan) and compile it directly with NVRTC on their own. This wouldn't be feasible anymore than it is for someone to try and compile cub::DeviceReduce with NVRTC on their own.

leofang (Member, Author) commented Sep 6, 2024

I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm.

What I originally had in mind was really just a kernel: I wanted to do the pre-/post-processing as well as kernel compilation/launch myself, but I had forgotten that this does not fit the compute paradigm anywhere in CCCL. Thanks for the clarifying questions.

leofang (Member, Author) commented Sep 18, 2024

FYI, Apple MLX counterpart: ml-explore/mlx#1421


wphicks commented Oct 2, 2024

For what it's worth, this was implemented here in RAFT (actual implementation here). It could be adapted for use outside of RAFT by switching from RAFT's resources object to just using an ordinary CUDA stream and a cuBLAS handle.


wphicks commented Oct 2, 2024

Looking more closely, I remember now that we used mdarray for some paths of the implementation, so we would need to resolve #2474 in order to adapt the code directly. We also make use of the vocabulary types mentioned in #2476, but that is much easier to work around.
