For small (1D) transforms that are part of a more complex workflow, it would be interesting to be able to call VkFFT from inside a kernel, on data already residing in shared memory.
Would that be possible to implement? If so, it is a feature that could interest me in the future.
For instance, within my CUDA kernel that works on some data, I would call
I assume the block size and the data size must both be known at the planning stage for VkFFT to generate the function.
Alternatively, VkFFT could also return a string with the code of the device function, which could be included in another CUDA program compiled at runtime with NVRTC.
[For instance, this would allow doing some filtering within a kernel:
generate some data -> FFT -> set some frequencies to zero -> IFFT -> further process data -> write to memory.]
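The imagined usage could look something like the sketch below. Note that `vkfft_fft_shared` and `vkfft_ifft_shared` are invented names for illustration, not an existing VkFFT API; the codelet would presumably be generated at plan creation for a fixed FFT length and block size.

```cuda
// Hypothetical sketch: device-callable FFT codelets operating on shared
// memory, inlined into the user's kernel (function names are invented).
__global__ void my_kernel(double2* out, int nfft)
{
    extern __shared__ double2 buf[];

    // ... generate some data into buf[] ...

    vkfft_fft_shared(buf, nfft);        // forward FFT on shared memory

    // set some frequencies to zero (filtering)
    for (int i = threadIdx.x; i < nfft; i += blockDim.x)
        if (i > nfft / 4 && i < 3 * nfft / 4)
            buf[i] = make_double2(0.0, 0.0);
    __syncthreads();

    vkfft_ifft_shared(buf, nfft);       // inverse FFT, still in shared memory

    // ... further process data, then write to global memory ...
    for (int i = threadIdx.x; i < nfft; i += blockDim.x)
        out[blockIdx.x * nfft + i] = buf[i];
}
```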
I have been thinking for some time about how kernel inlining (and callbacks) can be handled in VkFFT, as the runtime-generation nature of VkFFT should allow it to be implemented smoothly. Mostly, what I need for this is to learn how separate linking is done in the different APIs and to decide how the interface will look. Speaking of the latter, I think VkFFT can be modified to generate codelets for any sequence for an arbitrary fixed number of threads, so this won't be a constraint, but handling the input/output format (from/to registers, from/to shared memory) can be trickier. I will try to come back to this soon, after finishing the separate complex data layout for the next release.
The user could provide the callback function as a string of CUDA/HIP/Vulkan/... code with a specific name.
For instance (in CUDA/HIP): mycallback = "__device__ double2 vkfft_user_callback_read(int ix, int iy, i...) { blabla; return make_double2(xr, xi); }";
Then pass it to VkFFT at plan creation.
This would allow it to be inlined by VkFFT's runtime compilation. It leaves the user responsible for the callback function, so its constraints and prototype should be well documented, but this would allow a lot of flexibility and does not seem to be an issue to me.
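On the host side, this could look something like the sketch below. The `userCallback` and `userCallbackName` fields are invented for illustration; VkFFT's actual `VkFFTConfiguration` struct has no such members today.

```cuda
// Hypothetical: attach a user callback string to the plan so it gets
// inlined when VkFFT runtime-compiles its kernels (fields are invented).
const char* mycallback =
    "__device__ double2 vkfft_user_callback_read(int ix, int iy /*, ... */)"
    "{"
    "    /* compute xr, xi from the indices */"
    "    return make_double2(xr, xi);"
    "}";

VkFFTConfiguration configuration = {};
configuration.size[0] = 256;                                  // 1D FFT length
configuration.userCallback = mycallback;                      // invented field
configuration.userCallbackName = "vkfft_user_callback_read";  // invented field
```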
One could also imagine callbacks run after the read or before the write, working on the shared memory data:
post_process_callback = "__device__ void vkfft_pp_callback(double2* x, int nfft, int tid, int nthreads) { /* do anything with x[], use __syncthreads() if needed */ }";
with the advantage that the coalesced global memory reads and writes optimized by VkFFT would be preserved.
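As a concrete illustration of that proposed prototype (hypothetical; the signature is the one suggested above, not an existing VkFFT interface), a post-processing callback could normalize the transform in place:

```cuda
// Hypothetical post-read callback following the proposed prototype:
// normalize the spectrum in place while it sits in shared memory, using
// the (tid, nthreads) convention to split the work across the block.
__device__ void vkfft_pp_callback(double2* x, int nfft, int tid, int nthreads)
{
    double s = 1.0 / (double)nfft;
    for (int i = tid; i < nfft; i += nthreads) {
        x[i].x *= s;
        x[i].y *= s;
    }
    __syncthreads(); // make the normalized data visible to all threads
}
```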
FYI, NVIDIA also provides cuFFTDx, which would allow doing upsampling/downsampling in one upload, i.e. ifft( zero_padding( fft(x) ) ) in a single kernel. I've never tried it, but I might at some point. https://docs.nvidia.com/cuda/cufftdx/index.html
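For reference, the cuFFTDx pattern looks roughly like the sketch below (following the cuFFTDx documentation; operator names and details should be checked against the docs, and `SM<800>` depends on the target architecture):

```cuda
#include <cufftdx.hpp>
using namespace cufftdx;

// FFT description assembled at compile time from operators (see cuFFTDx docs).
using FFT  = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                      + Direction<fft_direction::forward>() + SM<800>() + Block());
using IFFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                      + Direction<fft_direction::inverse>() + SM<800>() + Block());

__global__ void filter_kernel(typename FFT::value_type* data)
{
    using complex_type = typename FFT::value_type;
    extern __shared__ complex_type shared_mem[];
    complex_type thread_data[FFT::storage_size];

    // load from global memory into registers, then FFT -> filter -> IFFT
    // entirely inside one kernel launch ("one upload")
    // ... load thread_data from data ...
    FFT().execute(thread_data, shared_mem);
    // ... zero some frequencies in thread_data ...
    IFFT().execute(thread_data, shared_mem);
    // ... store thread_data back to data ...
}
```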