Pass GPU Inputs to Entry-Point Functions
This example shows how to configure GPU Coder™ to pass GPU inputs to entry-point functions and produce GPU outputs. When you create inputs on the GPU in the caller of the entry-point function and access them on the GPU in the entry-point function, you avoid unnecessary copies of inputs and outputs between CPU and GPU memory. This approach can improve the performance of generated code when you integrate it with code that produces and consumes data on a GPU. Additionally, this example demonstrates how to generate code for functions that accept GPU inputs of unknown size by using the emxArray data type.
Third-Party Prerequisites
CUDA-enabled NVIDIA® GPU and compatible driver.
Verify GPU Environment
To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
The sobelEdgeDetection Entry-Point Function
The sobelEdgeDetection entry-point function is a Sobel edge detection algorithm that takes an image input and produces image output that shows the edges.
type sobelEdgeDetection.m
function outputImg = sobelEdgeDetection(inputImg)
%
% Copyright 2023 The MathWorks, Inc.

coder.gpu.kernelfun();
inputSize = size(inputImg);
outputSize = inputSize - 2;
outputImg = zeros(outputSize, 'like', inputImg);
inputImg = double(inputImg);

for colIdx = 1:outputSize(2)
    for rowIdx = 1:outputSize(1)
        hDiff = inputImg(rowIdx, colIdx) + 2 * inputImg(rowIdx, colIdx + 1) + inputImg(rowIdx, colIdx + 2) - ...
            inputImg(rowIdx + 2, colIdx) - 2 * inputImg(rowIdx + 2, colIdx + 1) - inputImg(rowIdx + 2, colIdx + 2);
        vDiff = inputImg(rowIdx, colIdx) + 2 * inputImg(rowIdx + 1, colIdx) + inputImg(rowIdx + 2, colIdx) - ...
            inputImg(rowIdx, colIdx + 2) - 2 * inputImg(rowIdx + 1, colIdx + 2) - inputImg(rowIdx + 2, colIdx + 2);
        diff = hDiff * hDiff + vDiff * vDiff;
        if diff > 3600
            outputImg(rowIdx, colIdx) = 255;
        else
            outputImg(rowIdx, colIdx) = 0;
        end
    end
end
end
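To check the kernel logic without a GPU, the loop above can be mirrored in plain host C++. This is a minimal sketch, not the code that GPU Coder generates. The function name sobelEdgeReference and the column-major std::vector layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference for the MATLAB loop above.
// The image is stored column-major (MATLAB layout): element (r, c) is at r + rows * c.
std::vector<std::uint8_t> sobelEdgeReference(const std::vector<std::uint8_t>& in,
                                             std::size_t rows, std::size_t cols) {
    if (rows < 3 || cols < 3) return {};  // no complete 3x3 window fits
    const std::size_t outRows = rows - 2;
    const std::size_t outCols = cols - 2;
    std::vector<std::uint8_t> out(outRows * outCols, 0);

    // Read a pixel as double, matching inputImg = double(inputImg).
    auto px = [&](std::size_t r, std::size_t c) {
        return static_cast<double>(in[r + rows * c]);
    };

    for (std::size_t c = 0; c < outCols; ++c) {
        for (std::size_t r = 0; r < outRows; ++r) {
            // Weighted top row minus weighted bottom row of the 3x3 window.
            const double hDiff = px(r, c) + 2 * px(r, c + 1) + px(r, c + 2)
                               - px(r + 2, c) - 2 * px(r + 2, c + 1) - px(r + 2, c + 2);
            // Weighted left column minus weighted right column.
            const double vDiff = px(r, c) + 2 * px(r + 1, c) + px(r + 2, c)
                               - px(r, c + 2) - 2 * px(r + 1, c + 2) - px(r + 2, c + 2);
            const double diff = hDiff * hDiff + vDiff * vDiff;
            out[r + outRows * c] = (diff > 3600.0) ? 255 : 0;
        }
    }
    return out;
}
```

Because diff is the squared gradient magnitude, the threshold of 3600 corresponds to a magnitude of 60. Each output pixel depends only on its own 3-by-3 input window, which is why the nested loops map naturally onto independent GPU threads.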
Generate GPU Code and Run gpuPerformanceAnalyzer on CPU
Use coder.gpuConfig to create a GPU code configuration object and use the codegen command to generate a MEX function.
cfg = coder.gpuConfig("mex");
imRGB = imread("peppers.png");
imGray = rgb2gray(imRGB);
codegen -config cfg -args {imGray} sobelEdgeDetection
Code generation successful.
gpuPerformanceAnalyzer("sobelEdgeDetection",{imGray},Config=cfg,OutFolder="sobleEdgeWithCPUIO");
### Starting GPU code generation
Code generation successful: View report
### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data
By default, GPU Coder expects the inputs to come from the CPU and produces the outputs on the CPU. It copies the data from the CPU to the GPU before running the computation on the GPU, and then copies the results back to the CPU.
The GPU Performance Analyzer report shows that copying memory takes most of the time.
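To get a feel for how much data the default CPU-based interface moves per call, you can work out the byte counts from the image size. The sketch below is illustrative only. The helper name transferBytes is hypothetical, and the 384-by-512 size used in the note after the code is an assumption about the grayscale peppers.png image. The sketch counts bytes, not time. The actual cost depends on the hardware and is what the GPU Performance Analyzer report measures.

```cpp
#include <cstddef>

// Bytes moved across the CPU-GPU boundary for one call when inputs and
// outputs live on the CPU: one host-to-device copy of the input image and
// one device-to-host copy of the (rows - 2)-by-(cols - 2) output.
struct TransferBytes {
    std::size_t hostToDevice;
    std::size_t deviceToHost;
};

// Hypothetical helper; assumes rows >= 2 and cols >= 2.
constexpr TransferBytes transferBytes(std::size_t rows, std::size_t cols,
                                      std::size_t bytesPerPixel) {
    return TransferBytes{rows * cols * bytesPerPixel,
                         (rows - 2) * (cols - 2) * bytesPerPixel};
}
```

For a 384-by-512 uint8 image, transferBytes(384, 512, 1) gives 196608 bytes in and 194820 bytes out. The computation itself is only a few arithmetic operations per pixel, so both copies are pure overhead that disappears when the data already resides on the GPU.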
Generate GPU Code and Run gpuPerformanceAnalyzer on GPU
The Sobel edge detection algorithm passes the input immediately to the GPU to compute the edges and produces the final results on the GPU. If an algorithm passes the inputs to and takes the outputs from the GPU, it does not need to copy memory from or to the CPU. GPU Coder can only produce outputs on the GPU when the GPU output types are supported.
Pass the inputs to the GPU by using the gpuArray function.
imGrayGpu = gpuArray(imGray);
codegen -config cfg -args {imGrayGpu} sobelEdgeDetection
Code generation successful.
You can also use coder.typeof to represent the inputs on the GPU.
inputImg = coder.typeof(imGray, "Gpu", true);
codegen -config cfg -args {inputImg} sobelEdgeDetection
Code generation successful.
Run gpuPerformanceAnalyzer with the inputs and outputs on the GPU.
gpuPerformanceAnalyzer("sobelEdgeDetection",{imGrayGpu},Config=cfg,OutFolder="sobleEdgeWithGPUIO");
### Starting GPU code generation
Code generation successful: View report
### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data
With the inputs and outputs on the GPU, the entry-point function does not contain memory copy events between the CPU and GPU.
Use emxArray Data Types as Unbounded GPU Inputs in Generated Function Interfaces
emxArray is a flexible data type used in generated code to handle arrays of varying sizes. You can specify unbounded GPU inputs as the input type in GPU Coder, and then construct emxArray data to pass to, or receive from, the generated code.
Examine MATLAB Function
The doubleElements entry-point function doubles the elements of the input array.
type doubleElements.m
function Y = doubleElements(X)
%
% Copyright 2024 The MathWorks, Inc.

coder.gpu.kernelfun();
Y = 2 * X;
end
Generate Initial Source Code for Entry-Point Function Using Unknown Size GPU Inputs
Use coder.typeof to create a GPU input type of unknown size.
cfg = coder.gpuConfig('lib');
inputs = {coder.typeof(int32(0), [inf inf], 'Gpu', true)};
codegen -config cfg -args inputs doubleElements.m -report
Code generation successful: View report
The function prototype for doubleElements in the generated code is:
extern void doubleElements(const emxArray_int32_T *gpu_X,
emxArray_int32_T *gpu_Y);
Write a Customized Main File to Initialize the emxArray Data
The file doubleElements_main.cu demonstrates how to create emxArray types on the GPU. To initialize an emxArray type on the GPU, allocate and initialize the GPU memory for the input, then construct the emxArray_int32_T structure with the data field pointing to the GPU buffer. Allocating memory for the GPU output is optional, especially when the output size is unknown at compile time, because the entry-point function handles the memory allocation.
This example also shows how you can pass differently sized inputs to the same entry-point function.
type doubleElements_main.cu
#include "doubleElements.h"
#include "doubleElements_initialize.h"
#include "MWCudaDimUtility.hpp"
#include "MWCudaMemoryFunctions.hpp"
#include <iostream>
#include <iomanip>

__global__ void initializeArray(int *array, int size) {
    uint64_T gThreadId = mwGetGlobalThreadIndex();
    if (gThreadId < size) {
        array[gThreadId] = gThreadId;
    }
}

void processAndPrintArray(int32_T* size_in) {
    int totalSize = size_in[0] * size_in[1];
    int *deviceBuffer;
    mwCudaMalloc(&deviceBuffer, totalSize * sizeof(int32_T));

    int32_T threadsPerBlock = 256;
    int32_t blocksPerGrid = (totalSize + threadsPerBlock - 1) / threadsPerBlock;
    initializeArray<<<blocksPerGrid, threadsPerBlock>>>(deviceBuffer, totalSize);

    emxArray_int32_T gpu_in;
    emxArray_int32_T gpu_out;

    gpu_in.data = deviceBuffer;
    gpu_in.size = size_in;
    gpu_in.allocatedSize = totalSize * sizeof(int32_T);
    gpu_in.numDimensions = 2;
    gpu_in.canFreeData = false;

    int32_T size_out[2] = {0, 0};
    gpu_out.data = static_cast<int32_T *>(nullptr);
    gpu_out.size = size_out;
    gpu_out.allocatedSize = 0;
    gpu_out.numDimensions = 2;
    gpu_out.canFreeData = false;

    doubleElements(&gpu_in, &gpu_out);

    int32_T* hostBuffer = new int32_T[totalSize];
    cudaMemcpy(hostBuffer, gpu_out.data, totalSize * sizeof(int32_T), cudaMemcpyDeviceToHost);

    int dim1 = size_in[0];
    int dim2 = size_in[1];
    std::cout << "Doubled each element in a " << dim1 << " x " << dim2 << " matrix:" << std::endl;
    for (int i = 0; i < dim1; i++) {
        for (int j = 0; j < dim2; j++) {
            std::cout << std::setw(4) << hostBuffer[j + dim2 * i];
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;

    mwCudaFree(gpu_in.data);
    mwCudaFree(gpu_out.data);
    delete[] hostBuffer;
}

int main(int argc, char *argv[]) {
    doubleElements_initialize();

    int32_T size_in_1[2] = {5, 5};
    processAndPrintArray(size_in_1);

    int32_T size_in_2[2] = {10, 10};
    processAndPrintArray(size_in_2);

    return 0;
}
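The descriptor setup in the main file can be examined in isolation. The following host-only sketch uses a stand-in struct, EmxArrayInt32Sketch, whose fields mirror the ones that the main file assigns (data, size, allocatedSize, numDimensions, canFreeData). The struct name, the helper functions, and the use of a host buffer in place of device memory are assumptions for illustration. In a real build, the type comes from the generated headers and data points to memory obtained from mwCudaMalloc.

```cpp
#include <cstdint>

// Stand-in for the generated emxArray_int32_T (hypothetical; the real
// definition comes from the generated headers).
struct EmxArrayInt32Sketch {
    std::int32_t* data;          // element buffer (device memory in the real main file)
    std::int32_t* size;          // one extent per dimension
    std::int32_t allocatedSize;  // size recorded for the buffer
    std::int32_t numDimensions;  // number of entries in size
    bool canFreeData;            // false: the caller owns and releases data
};

// Number of elements described by the size vector.
std::int32_t numElements(const EmxArrayInt32Sketch& a) {
    std::int32_t n = 1;
    for (std::int32_t d = 0; d < a.numDimensions; ++d) {
        n *= a.size[d];
    }
    return n;
}

// Wrap caller-owned 2-D storage without copying it.
EmxArrayInt32Sketch wrap2D(std::int32_t* buffer, std::int32_t* dims) {
    EmxArrayInt32Sketch a{};
    a.data = buffer;
    a.size = dims;
    a.numDimensions = 2;
    // Mirrors the main file above, which records the buffer size in bytes.
    a.allocatedSize =
        static_cast<std::int32_t>(dims[0] * dims[1] * sizeof(std::int32_t));
    a.canFreeData = false;
    return a;
}
```

Because the extents travel with the data pointer, the same entry-point signature serves both the 5-by-5 and the 10-by-10 call. Only the contents of the size vector and the buffer change between calls.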
Generate an Executable
Configure code generation to compile the customized main file with the generated code. Generate the executable file.
cfg = coder.gpuConfig("exe");
cfg.CustomSource = "doubleElements_main.cu";
cfg.CustomInclude = ".";
codegen -config cfg -args inputs doubleElements_main.cu doubleElements.m -report
Code generation successful: View report
The code generator produces an executable file named doubleElements in the current working folder. Run the executable using these commands:
if ispc
    !doubleElements.exe
else
    !./doubleElements
end
Doubled each element in a 5 x 5 matrix:
   0   2   4   6   8
  10  12  14  16  18
  20  22  24  26  28
  30  32  34  36  38
  40  42  44  46  48

Doubled each element in a 10 x 10 matrix:
   0   2   4   6   8  10  12  14  16  18
  20  22  24  26  28  30  32  34  36  38
  40  42  44  46  48  50  52  54  56  58
  60  62  64  66  68  70  72  74  76  78
  80  82  84  86  88  90  92  94  96  98
 100 102 104 106 108 110 112 114 116 118
 120 122 124 126 128 130 132 134 136 138
 140 142 144 146 148 150 152 154 156 158
 160 162 164 166 168 170 172 174 176 178
 180 182 184 186 188 190 192 194 196 198
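One detail to keep in mind when adapting the print loop: generated code stores arrays in column-major order by default, so element (i, j) of a dim1-by-dim2 matrix sits at linear index i + dim1 * j. The main file reads hostBuffer[j + dim2 * i], which walks the buffer in linear order. That gives a tidy listing here because the kernel fills the input with its own linear indices, but the printed table is the transpose of Y as MATLAB would display it. A minimal sketch of the column-major lookup, using a hypothetical helper named atColumnMajor:

```cpp
#include <cstdint>

// Element (row, col) of a column-major matrix that has `rows` rows.
// Hypothetical helper for illustration; not part of the generated code.
inline std::int32_t atColumnMajor(const std::int32_t* buffer, int rows,
                                  int row, int col) {
    return buffer[row + rows * col];
}
```

For the 5-by-5 result, where the buffer holds 0, 2, 4, and so on, atColumnMajor(hostBuffer, 5, 1, 0) is 2 and atColumnMajor(hostBuffer, 5, 0, 1) is 10. For non-square inputs or data that is not a simple ramp, indexing this way keeps the printed rows and columns aligned with the MATLAB view of the array.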