For small (1D) transforms that are part of a more complex workflow, it would be interesting to be able to call VkFFT from inside a kernel, on data already residing in shared memory.
Would that be possible to implement? If so, it is a feature that could interest me in the future.
For instance, within my CUDA kernel that works on some data, I would call
I assume the block size and the data size must both be known at the planning stage for VkFFT to generate the function.
Alternatively, VkFFT could also return a string with the code of the device function, which could be included in another CUDA program compiled at runtime with NVRTC.
[For instance, this would allow doing some filtering within a kernel:
generate some data -> FFT -> set some frequencies to zero -> IFFT -> further process data -> write to memory.]
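The imagined usage could look something like the sketch below. Note that `vkfft_fft_shared` and `vkfft_ifft_shared` are invented names for illustration, not an existing VkFFT API; the codelet would presumably be generated at plan creation for a fixed FFT length and block size.

```cuda
// Hypothetical sketch: device-callable FFT codelets operating on shared
// memory, inlined into the user's kernel (function names are invented).
__global__ void my_kernel(double2* out, int nfft)
{
    extern __shared__ double2 buf[];

    // ... generate some data into buf[] ...

    vkfft_fft_shared(buf, nfft);        // forward FFT on shared memory

    // set some frequencies to zero (filtering)
    for (int i = threadIdx.x; i < nfft; i += blockDim.x)
        if (i > nfft / 4 && i < 3 * nfft / 4)
            buf[i] = make_double2(0.0, 0.0);
    __syncthreads();

    vkfft_ifft_shared(buf, nfft);       // inverse FFT, still in shared memory

    // ... further process data, then write to global memory ...
    for (int i = threadIdx.x; i < nfft; i += blockDim.x)
        out[blockIdx.x * nfft + i] = buf[i];
}
```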
I have been thinking for some time about how kernel inlining (and callbacks) can be handled in VkFFT, as the runtime-generation nature of VkFFT should allow it to be implemented smoothly. Mostly, what I need for this is to learn how separate linking is done in the different APIs and to decide how the interface will look. Speaking of the latter, I think VkFFT can be modified to generate codelets for any sequence for an arbitrary fixed number of threads, so this won't be a constraint, but handling the input/output format (from/to registers, from/to shared memory) can be trickier. I will try to come back to this soon, after finishing the separate complex data layout for the next release.
The user could provide the callback function as a string of CUDA/HIP/Vulkan/... code with a specific name.
For instance (in CUDA/HIP): mycallback = "__device__ double2 vkfft_user_callback_read(int ix, int iy, i...) { blabla; return make_double2(xr, xi); }";
Then pass it to VkFFT at plan creation.
This would allow it to be inlined by VkFFT's runtime compilation. It leaves the user responsible for the callback function, so its constraints and prototype should be well documented, but this would allow a lot of flexibility and does not seem to be an issue to me.
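On the host side, this could look something like the sketch below. The `userCallback` and `userCallbackName` fields are invented for illustration; VkFFT's actual `VkFFTConfiguration` struct has no such members today.

```cuda
// Hypothetical: attach a user callback string to the plan so it gets
// inlined when VkFFT runtime-compiles its kernels (fields are invented).
const char* mycallback =
    "__device__ double2 vkfft_user_callback_read(int ix, int iy /*, ... */)"
    "{"
    "    /* compute xr, xi from the indices */"
    "    return make_double2(xr, xi);"
    "}";

VkFFTConfiguration configuration = {};
configuration.size[0] = 256;                                  // 1D FFT length
configuration.userCallback = mycallback;                      // invented field
configuration.userCallbackName = "vkfft_user_callback_read";  // invented field
```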
One could also imagine callbacks run after the read or before the write, working on the shared memory data:
post_process_callback = "__device__ void vkfft_pp_callback(double2* x, int nfft, int tid, int nthreads) { /* do anything with x[], use __syncthreads() if needed */ }";
with the advantage that the coalesced global memory reads and writes optimized by VkFFT would be preserved.
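As a concrete illustration of that proposed prototype (hypothetical; the signature is the one suggested above, not an existing VkFFT interface), a post-processing callback could normalize the transform in place:

```cuda
// Hypothetical post-read callback following the proposed prototype:
// normalize the spectrum in place while it sits in shared memory, using
// the (tid, nthreads) convention to split the work across the block.
__device__ void vkfft_pp_callback(double2* x, int nfft, int tid, int nthreads)
{
    double s = 1.0 / (double)nfft;
    for (int i = tid; i < nfft; i += nthreads) {
        x[i].x *= s;
        x[i].y *= s;
    }
    __syncthreads(); // make the normalized data visible to all threads
}
```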
FYI, NVIDIA also provides cuFFTDx, which would allow doing upsampling/downsampling in one upload, i.e. ifft( zero_padding( fft(x) ) ) in a single kernel. I've never tried it, but I might at some point. https://docs.nvidia.com/cuda/cufftdx/index.html
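For reference, the cuFFTDx pattern looks roughly like the sketch below (following the cuFFTDx documentation; operator names and details should be checked against the docs, and `SM<800>` depends on the target architecture):

```cuda
#include <cufftdx.hpp>
using namespace cufftdx;

// FFT description assembled at compile time from operators (see cuFFTDx docs).
using FFT  = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                      + Direction<fft_direction::forward>() + SM<800>() + Block());
using IFFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                      + Direction<fft_direction::inverse>() + SM<800>() + Block());

__global__ void filter_kernel(typename FFT::value_type* data)
{
    using complex_type = typename FFT::value_type;
    extern __shared__ complex_type shared_mem[];
    complex_type thread_data[FFT::storage_size];

    // load from global memory into registers, then FFT -> filter -> IFFT
    // entirely inside one kernel launch ("one upload")
    // ... load thread_data from data ...
    FFT().execute(thread_data, shared_mem);
    // ... zero some frequencies in thread_data ...
    IFFT().execute(thread_data, shared_mem);
    // ... store thread_data back to data ...
}
```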