
[FEA]: Add a copy routine to support data copy between two mdspans #2306

Open · leofang opened this issue Aug 28, 2024 · 10 comments

Labels: feature request (New feature or request.)
leofang (Member) commented Aug 28, 2024

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

CUDA Python & nvmath-python need to have a copy routine added to CCCL for copying from one mdspan to another. The requirements for this copy routine include:

  1. This routine would copy the contents of an ndarray (an N-D tensor) A with a given data type, shape & strides to an ndarray B with the same dtype & shape but not necessarily the same strides.
    • Since the underlying ndarrays are strided, they are not necessarily contiguous in memory and need not share the same memory layout, so a dedicated copy kernel is needed (see the sketch after this list).
  2. This routine can handle mdspans covering either host or device tensors, so that H2D/D2H copies can be abstracted away behind the same API.
    • In the case of D2H copies, synchronous copies are fine.
  3. This routine should be JIT-compilable (by NVRTC) to serve Python users better.
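
To make requirement 1 concrete, here is a minimal sketch of the kind of kernel such a routine needs under the hood. This is an illustration only, not an existing CCCL API, and the name copy_strided is made up: each element's flat index is unravelled into coordinates via the shared shape, and each tensor's own strides are then applied.

```cpp
// Hypothetical sketch only -- not an existing CCCL API. Shows why a dedicated
// kernel is needed when src and dst share shape and dtype but not layout.
#include <cuda/std/array>
#include <cstddef>

template <typename T, int Rank>
__global__ void copy_strided(const T* src, T* dst,
                             cuda::std::array<std::size_t, Rank> shape,
                             cuda::std::array<std::ptrdiff_t, Rank> src_strides,  // in elements
                             cuda::std::array<std::ptrdiff_t, Rank> dst_strides,  // in elements
                             std::size_t total)
{
  for (std::size_t i = blockIdx.x * std::size_t{blockDim.x} + threadIdx.x; i < total;
       i += std::size_t{gridDim.x} * blockDim.x)
  {
    // Unravel the flat index into coordinates, then re-apply each layout's strides.
    std::size_t rem = i;
    std::ptrdiff_t src_off = 0;
    std::ptrdiff_t dst_off = 0;
    for (int d = Rank - 1; d >= 0; --d)
    {
      const std::size_t coord = rem % shape[d];
      rem /= shape[d];
      src_off += static_cast<std::ptrdiff_t>(coord) * src_strides[d];
      dst_off += static_cast<std::ptrdiff_t>(coord) * dst_strides[d];
    }
    dst[dst_off] = src[src_off];
  }
}
```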

This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).

We believe that if src and dst do not overlap, and if both reside on the device, there may be existing implementations in cuTENSOR (e.g. cutensorPermute) on which a prototype could be based. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.

Describe the solution you'd like

Not sure what the best solution is, so just a thought: perhaps offer an overload of cuda::std::copy that is specialized for mdspan?
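
Purely as a strawman to anchor the discussion (nothing here is an existing or proposed CCCL signature), such an overload could look roughly like this:

```cpp
// Strawman signature only; neither the name nor its home is settled.
#include <cuda/std/mdspan>

namespace cuda::std {

// Copies every element of src into dst. Assumed preconditions:
// matching extents, convertible element types, no overlap between src and dst.
template <class SrcElement, class SrcExtents, class SrcLayout, class SrcAccessor,
          class DstElement, class DstExtents, class DstLayout, class DstAccessor>
void copy(mdspan<SrcElement, SrcExtents, SrcLayout, SrcAccessor> src,
          mdspan<DstElement, DstExtents, DstLayout, DstAccessor> dst);

} // namespace cuda::std
```

Whether it should instead sit next to the CUB-style device-wide algorithms and take a stream argument is an open question (see the discussion below).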

Describe alternatives you've considered

No response

Additional context

Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.

@leofang leofang added the feature request New feature or request. label Aug 28, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Aug 28, 2024
leofang (Member, Author) commented Aug 28, 2024

(Tentatively assigned to Federico as per our offline discussion 🙂)

leofang (Member, Author) commented Aug 28, 2024

> This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).

cc: @kmaehashi for vis

jrhemstad (Collaborator) commented

> This routine should be JIT-compilable (by NVRTC)

Can you elaborate on how you envision this would work? This is necessarily a host API and NVRTC can't compile host code.

leofang (Member, Author) commented Sep 6, 2024

> This is necessarily a host API and NVRTC can't compile host code.

We have a C library now, don't we? 🙂

@jrhemstad @gevtushenko Correct me if I am wrong, since I am not fluent enough in mdspan: given that shape, strides, and dtype are all run-time properties in Python, if this were a host API we would have to pre-instantiate a very large number of copy kernel instances, and even then it would not cover all possibilities. Therefore, I feel NVRTC compatibility (which is a requirement of the C library anyway) is necessary.
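
To make the run-time-properties point concrete, here is a rough host-side sketch of the workflow in question. The code generation and caching are assumptions for illustration, not an existing CCCL or nvmath-python facility: dtype and rank only become known at run time, so they are baked into the kernel source and compiled with NVRTC on demand rather than pre-instantiated for every combination.

```cpp
// Hypothetical workflow sketch, not a CCCL facility: generate the kernel source
// once dtype and rank are known, JIT it with NVRTC, and (in practice) cache the
// result keyed on (dtype, rank).
#include <nvrtc.h>
#include <cstddef>
#include <stdexcept>
#include <string>

std::string jit_strided_copy_to_ptx(const std::string& dtype, int rank)
{
  // Bake the run-time properties into the source as compile-time constants.
  // shape/strides are assumed to be passed as device-accessible arrays.
  const std::string src =
      "using T = " + dtype + ";\n"
      "constexpr int RANK = " + std::to_string(rank) + ";\n"
      "extern \"C\" __global__ void copy_strided(\n"
      "    const T* src, T* dst,\n"
      "    const unsigned long long* shape,\n"
      "    const long long* src_strides, const long long* dst_strides,\n"
      "    unsigned long long total)\n"
      "{\n"
      "  for (unsigned long long i = blockIdx.x * 1ull * blockDim.x + threadIdx.x;\n"
      "       i < total; i += 1ull * gridDim.x * blockDim.x) {\n"
      "    unsigned long long rem = i;\n"
      "    long long src_off = 0, dst_off = 0;\n"
      "    for (int d = RANK - 1; d >= 0; --d) {\n"
      "      unsigned long long c = rem % shape[d]; rem /= shape[d];\n"
      "      src_off += (long long)c * src_strides[d];\n"
      "      dst_off += (long long)c * dst_strides[d];\n"
      "    }\n"
      "    dst[dst_off] = src[src_off];\n"
      "  }\n"
      "}\n";

  nvrtcProgram prog{};
  if (nvrtcCreateProgram(&prog, src.c_str(), "copy_strided.cu", 0, nullptr, nullptr) != NVRTC_SUCCESS)
    throw std::runtime_error("nvrtcCreateProgram failed");

  const char* opts[] = {"--std=c++17"};
  if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS)
  {
    nvrtcDestroyProgram(&prog);
    throw std::runtime_error("NVRTC compilation failed");
  }

  std::size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);
  nvrtcDestroyProgram(&prog);
  return ptx;  // load via cuModuleLoadData, launch via cuLaunchKernel
}
```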

leofang (Member, Author) commented Sep 6, 2024

Another reason for NVRTC compatibility: I think to unblock nvmath-python, we should just focus on the D2D copies (between potentially two different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use cudaMemcpyAsync with a staging buffer) and is already what CuPy does for us today. And I presume a D2D copy can be achieved by a single kernel compiled by NVRTC.
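
As an illustration of the staging-buffer idea for the H2D direction (the helper names here are made up, and the D2D launcher is assumed to wrap the strided-copy kernel sketched above):

```cpp
// Hypothetical sketch of the H2D path: stage the contiguous host data on the
// device with cudaMemcpyAsync, then let the D2D strided-copy kernel perform
// the layout change.
#include <cuda_runtime.h>
#include <cstddef>

// Assumed to exist (e.g. wrapping the NVRTC-compiled kernel above): copies a
// contiguous device buffer into a strided device layout.
template <typename T, int Rank>
void launch_copy_strided(const T* d_src_contiguous, T* d_dst,
                         const std::size_t (&shape)[Rank],
                         const std::ptrdiff_t (&dst_strides)[Rank],
                         cudaStream_t stream);

template <typename T, int Rank>
void copy_h2d_strided(const T* h_src_contiguous, T* d_dst,
                      const std::size_t (&shape)[Rank],
                      const std::ptrdiff_t (&dst_strides)[Rank],
                      cudaStream_t stream)
{
  std::size_t n = 1;
  for (std::size_t extent : shape)
    n *= extent;

  // Staging buffer on the device; a real implementation would pool/reuse it
  // and would want the host source in pinned memory for a truly async copy.
  T* d_staging = nullptr;
  cudaMallocAsync(&d_staging, n * sizeof(T), stream);
  cudaMemcpyAsync(d_staging, h_src_contiguous, n * sizeof(T),
                  cudaMemcpyHostToDevice, stream);

  // The layout change happens entirely on the device.
  launch_copy_strided<T, Rank>(d_staging, d_dst, shape, dst_strides, stream);
  cudaFreeAsync(d_staging, stream);
}
```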

jrhemstad (Collaborator) commented Sep 6, 2024

> We have a C library now, don't we?

So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations".

By "NVRTC compatible" I understood you wanted it so someone could take cuda::copy(mdspan, mdspan) and compile it directly with NVRTC on their own. This wouldn't be feasible anymore than it is for someone to try and compile cub::DeviceReduce with NVRTC on their own.

leofang (Member, Author) commented Sep 6, 2024

I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm.

What I originally had in mind was really just a kernel: I wanted to do the pre-/post-processing as well as kernel compilation/launch myself, but I had forgotten that this does not fit the compute paradigm anywhere in CCCL. Thanks for the clarifying questions.

leofang (Member, Author) commented Sep 18, 2024

FYI, Apple MLX counterpart: ml-explore/mlx#1421


wphicks commented Oct 2, 2024

For what it's worth, this was implemented here in RAFT (actual implementation here). It could be adapted for use outside of RAFT by switching from RAFT's resources object to just using an ordinary CUDA stream and a cuBLAS handle.


wphicks commented Oct 2, 2024

Looking more closely, I remember now that we used mdarray for some paths of the implementation, so we would need to resolve #2474 in order to adapt the code directly. We also make use of the vocabulary types mentioned in #2476, but that is much easier to work around.
