Pragma that maps function to GPU kernels



coder.gpu.kernelfun() is a global-level pragma that attempts to map all the computation within the function it resides in onto the GPU. Loops within this function are parallelized into GPU kernels only if they pass the parallel-loop analysis check, which tries to prove that each loop iteration is independent of the others. In addition, the code generator does not create GPU kernels for simple loops, such as those with few iterations or those that perform minimal computation. Parallelizing such loops can decrease performance because of the overhead of kernel creation and of memory transfer between the GPU and the CPU.
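
As a rough illustration of the independence requirement, the sketch below contrasts a loop whose iterations each touch a different element with a loop that reads the result of the previous iteration. The function name loopKinds and its variables are hypothetical and are not part of GPU Coder. Whether a kernel is actually created for the first loop also depends on the loop-size heuristics described above, so treat the comments as expectations rather than guarantees.

```matlab
function [a, b] = loopKinds(x)
    % Hypothetical example; x is assumed to be a nonempty 1-by-N row vector.
    coder.gpu.kernelfun;
    n = numel(x);
    a = zeros(1,n);
    b = zeros(1,n);

    % Independent iterations: a(i) depends only on x(i), so each
    % iteration could run in its own GPU thread.
    for i = 1:n
        a(i) = 2*x(i) + 1;
    end

    % Loop-carried dependency: b(i) needs b(i-1) from the previous
    % iteration, so the iterations are not independent of each other.
    b(1) = x(1);
    for i = 2:n
        b(i) = b(i-1) + x(i);
    end
end
```

Calling loopKinds([1 2 3]) in MATLAB returns a = [3 5 7] and b = [1 3 6]. The generated code report is the place to confirm which loops, if any, became kernels.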

The kernelfun pragma does not require any input parameters. It generates kernels whose dimensions are computed automatically based on loop parameters.

This function is a code generation function. It has no effect in MATLAB®.
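
A minimal sketch of what "no effect in MATLAB" means in practice: a function that contains the pragma can still be called directly at the MATLAB command line and behaves like ordinary MATLAB code. The function name addOne is hypothetical, and the sketch assumes GPU Coder is installed so that coder.gpu.kernelfun is on the MATLAB path.

```matlab
function out = addOne(x)
    % Hypothetical example. The pragma below matters only during code
    % generation; when the function runs in MATLAB it does nothing.
    coder.gpu.kernelfun;
    out = zeros(size(x));
    for i = 1:numel(x)
        out(i) = x(i) + 1;
    end
end
```

For example, addOne([1 2 3]) returns [2 3 4] when run in MATLAB, with no GPU involved.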



This example shows how to use the kernelfun pragma in a function and generate CUDA® code.

In one file, write the entry-point function scalars that accepts two vector inputs, x and y, each of size 1x4096, and one scalar input, scale. The function has two for-loops with different iteration counts: one for vector addition over the first 1024 elements and one for finding the cumulative sum of the scaled elements of x. Place the coder.gpu.kernelfun() pragma within the scalars function.

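function [vout, sout1] = scalars(x,y,scale)
    coder.gpu.kernelfun;   % map the loops in this function to GPU kernels
    sout1 = 0;
    vout = coder.nullcopy(zeros(1,1024));

    % Vector addition over the first 1024 elements
    for i=1:1024
        vout(i) = x(i) + y(i);
    end

    % Accumulate the scaled elements of x
    for i=1:4096
        sout1 = (x(i)*scale) + sout1;
    end
end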

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex')...
 -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)}...
 -report scalars

GPU Coder creates kernels for vector addition and cumulative summation.

  scalars_kernel1<<<dim3(8U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x,
  scalars_kernel2<<<dim3(4U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(scale, *gpu_x,

The kernel for vector addition has a total of 1024 threads, one for adding each element. Similarly, the kernel for cumulative summation has a total of 4096 threads.
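
The thread totals above follow from multiplying the number of blocks in the grid by the number of threads per block in each kernel launch. The short script below only reproduces that arithmetic; the variable names are hypothetical, and the dimensions are copied from the two launch lines shown above.

```matlab
% Launch dimensions taken from the generated kernel calls:
% <<<dim3(grid), dim3(block)>>>
grid1 = [8 1 1];   block1 = [128 1 1];    % scalars_kernel1
grid2 = [4 1 1];   block2 = [1024 1 1];   % scalars_kernel2

% Total threads = (blocks in grid) * (threads per block)
threads1 = prod(grid1) * prod(block1);    % 8 * 128
threads2 = prod(grid2) * prod(block2);    % 4 * 1024

fprintf('scalars_kernel1: %d threads\n', threads1);
fprintf('scalars_kernel2: %d threads\n', threads2);
```

The totals, 1024 and 4096, match the iteration counts of the two loops in scalars, which is consistent with the kernel dimensions being derived from the loop parameters.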

Version History

Introduced in R2017b