Failed to generate large CUDA kernel in GPU Coder with FFT function inside

I am trying to make my code run in parallel on the GPU.
I converted the code with the attached "main.m" script, but the generated MEX code on the GPU is much slower than the M-code on the CPU. I understand that the GPU is not well suited to such a small data size, but the GPU takes much longer even when a bigger data size is used.
I then checked the profiling timeline and found that many small CUDA kernels are created and the overall GPU utilization is low. After some debugging, I found that when the fft command is used, GPU Coder fails to generate one large CUDA kernel.
I think the performance could be improved significantly if the fft could be incorporated into a single CUDA kernel, as happens in the case without fft. The FFT is needed. I have searched on Google, but found nothing relevant. Can you provide any information about this, or any solution? The output of gpuDevice is also provided in the attachment.
Here is the profiling timeline without fft.
Here is the profiling timeline with fft.

Answers (1)

Justin Hontz on 18 Sep 2024
Hi He,
In your M-code for RandCopy, the for loop cannot be executed as a GPU kernel (even with the coder.gpu.kernel pragma) because of the fft / ifft calls inside the loop. This is because fft is implemented using its own specialized GPU kernel, and GPU Coder does not support nested kernel execution. Consequently, the for loop runs sequentially, which explains why you see thousands of small kernel instances in the performance analyzer timeline graph.
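For readers without the attachment, the loop pattern being described presumably looks something like the sketch below (a hedged reconstruction, since RandCopy.m is only available as an attachment; the variable names and loop bound are assumptions):
coder.gpu.kernel;                  % pragma has no effect here because of fft/ifft
for k = 1:numSlices
    Tmp = fft(Data(k,:));          % each fft call launches its own small kernel
    Tmp = Tmp + (1 + 1i);
    Tmp = Tmp * (1564 + 798i);
    Data(k,:) = ifft(Tmp);         % so the loop executes sequentially on the GPU
end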
To improve the performance of your code, you will want to perform your computation using a single fft / ifft call that operates on the entire input array instead of on individual slices. Something like this should work:
Tmp = fft(Data,[],2);      % one batched FFT along the slice dimension
Tmp = Tmp + (1 + 1i);
Tmp = Tmp * (1564 + 798i);
Data = ifft(Tmp,[],2);     % one batched inverse FFT
After making the change on my end, the performance analyzer report shows a significant performance improvement, with the timeline graph looking similar to the original one without fft.
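For completeness, a minimal sketch of how the vectorized version could be compiled to a GPU MEX (the entry-point name comes from the question; the input size used in -args is an assumption to be adjusted to the real data):
cfg = coder.gpuConfig('mex');                      % GPU MEX build configuration
% Assumed input type: many short complex slices, e.g. 100000 x 200 doubles
inType = coder.typeof(complex(0), [100000 200]);
codegen -config cfg RandCopy -args {inType}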

4 Comments

This is a minimal reproducible case of my problem. In the actual code, there are other complex operations before the fft, between the fft and ifft, and after the ifft, which cannot all be performed on the entire array as suggested. Moreover, these complex operations can be generated as one CUDA kernel.
The number of slices is extremely large, while each slice is small (fewer than 200 elements), so the GPU is preferred over the parfor command on the CPU.
Is there any other solution that can parallelize massively over the slices?
I note that you said "GPU Coder does not support nested kernel execution." Is it possible to generate the CUDA code twice (with and without fft) and combine the two into one kernel manually?
Based on your description of the computation, the best approach would still likely be to rewrite your code similar to the way I described above to achieve optimal performance. That is, instead of performing a sequence of operations on each slice of the input array, perform a sequence of larger operations on the entire input array, as illustrated below. With this approach, you would still be able to parallelize the computation over the slices. You may end up with multiple kernels being generated instead of just one, but this is unlikely to be a significant performance bottleneck as long as the entire computation can be performed on the GPU.
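As a hedged illustration of that restructuring (preStage, midStage, and postStage are placeholders for your own vectorized operations, not real functions):
Tmp = preStage(Data);      % hypothetical whole-array ops before the FFT
Tmp = fft(Tmp, [], 2);     % single batched FFT over all slices
Tmp = midStage(Tmp);       % hypothetical whole-array ops between fft and ifft
Tmp = ifft(Tmp, [], 2);    % single batched inverse FFT
Data = postStage(Tmp);     % hypothetical whole-array ops after the ifft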
Without seeing your full code, I cannot give any specific advice, though for certain individual operations you may want to implement them using a for loop with the coder.gpu.kernel pragma if they cannot be implemented efficiently using vectorized MATLAB toolbox functions.
Regarding nested kernel execution, combining the code manually is unlikely to work. This is because GPU Coder by default implements fft using the cuFFT API, which is likely not callable from device code. If you still wish to keep your code in its current form, you can also try disabling the use of cuFFT from the coder config (see the EnableCUFFT property of coder.GpuConfig) and see whether that improves the situation.
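For reference, disabling cuFFT looks something like this (a sketch; EnableCUFFT is the property named above, while the -args type is an assumption):
cfg = coder.gpuConfig('mex');
cfg.GpuConfig.EnableCUFFT = false;     % stop mapping fft/ifft calls to cuFFT
codegen -config cfg RandCopy -args {coder.typeof(complex(0), [100000 200])}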
I fully understand the benefit of calculating on the entire array, which is the way I have been working for years. However, it is inherently not suitable here. I have tried disabling cuFFT in the coder config, which results in thousands of memory copies between the host and device. Maybe it requires other optimization.
The NVIDIA documentation says: "NVIDIA cuFFT introduces cuFFTDx APIs, device side API extensions for performing FFT calculations inside your CUDA kernel. Fusing numerical operations can decrease the latency and improve the performance of your application."
It seems that an FFT can be called from device code after all. Hopefully you can show me how to use cuFFTDx in RandCopy.m. Perhaps that may be overly demanding.
GPU Coder currently does not support generating direct calls to the cuFFTDx API. That said, you may still be able to call into the API indirectly from the generated code if you are willing to write your own CUDA wrapper function that uses the API directly. This can possibly be achieved by invoking the wrapper function inside the for loop of your M-code via coder.ceval. The call would look something like this:
coder.ceval('-gpudevicefcn', 'myFFTWrapper', coder.ref(data), ...);
The -gpudevicefcn flag indicates that the wrapper function is meant to be executed by a GPU thread rather than by the CPU.
Note that I have not tried using this approach on my end, so I cannot guarantee that such an approach would work correctly without issue.
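To make the shape of that concrete, here is an untested sketch of how the call might sit inside the loop (myFFTWrapper, its header file, and its argument list are all hypothetical):
coder.cinclude('myFFTWrapper.h');   % header declaring the hypothetical device-side wrapper
coder.gpu.kernel;
for k = 1:numSlices
    % Each GPU thread hands its own slice to the cuFFTDx-based device function.
    coder.ceval('-gpudevicefcn', 'myFFTWrapper', ...
        coder.ref(Data(k,1)), int32(size(Data,2)));
end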



Version: R2024b

Asked: He, 18 Sep 2024 · Last commented: 19 Sep 2024
