gpucoder.matrixMatrixKernel

Optimized GPU implementation of functions containing matrix-matrix operations



C = gpucoder.matrixMatrixKernel(fun,A,B) generates kernels from functions that contain GEMM-like operations. For example, matching feature points between two images by using:

  • The sum of absolute differences (SAD) — F() = @(a,b)abs(a-b)

  • The sum of squared differences (SSD) — F() = @(a,b)(a-b).*(a-b)

C = gpucoder.matrixMatrixKernel(___,orientation) has the optional argument orientation that specifies the orientation of A and B matrices.

C = gpucoder.matrixMatrixKernel(___,vectorizedSim) has the optional argument vectorizedSim that specifies the use of vectorized operations during MATLAB® simulation and CPU code generation. The function handle fun must support vector inputs: it takes one row or column from A and one column or row from B and outputs a vector, equivalent to arrayfun(fun, A, B).
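
The behavior that these kernels implement can be modeled outside MATLAB. The following NumPy sketch (an illustration of the semantics described above using hypothetical names, not MathWorks code) computes each element C(i,j) by applying fun to row i of A and column j of B and summing the result:

```python
import numpy as np

def matrix_matrix_kernel(fun, A, B):
    # Reference model of the default ('nn') behavior:
    # C[i, j] = sum(fun(A[i, :], B[:, j]))
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must agree"
    C = np.empty((M, N))
    for i in range(M):
        for j in range(N):
            C[i, j] = np.sum(fun(A[i, :], B[:, j]))
    return C

# Similarity measures from the description above
sad = lambda a, b: np.abs(a - b)       # sum of absolute differences
ssd = lambda a, b: (a - b) * (a - b)   # sum of squared differences
```

With an elementwise multiply as fun, this reduces to the ordinary matrix product; with sad or ssd, each C(i,j) becomes a matching score between a feature row of A and a feature column of B.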


Examples

This example performs a simple matrix-matrix multiplication and uses the matrixMatrixKernel design pattern to generate CUDA® code.

In one file, write an entry-point function matMul_nn that accepts two matrix inputs f1 and f2. Use the MATLAB function times to multiply f1 and f2 element by element. The @ symbol creates a handle to the times function. Pass this handle to gpucoder.matrixMatrixKernel. Because the input matrices are not transposed, use the 'nn' option.

function scores = matMul_nn(f1, f2)
    scores = gpucoder.matrixMatrixKernel(@times, f1, f2, 'nn', true);
end

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') ...
    -args {ones(1024,1024,'double'),ones(1024,1024,'double')} ...
    -report matMul_nn

The generated CUDA code contains two kernels: matMul_nn_kernel1, which initializes the output matrix scores, and matrixMatrixKernel, which performs the times operation and the summation. The following is a snippet of the generated code.

  matMul_nn_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_scores);
  cudaMemcpy(*gpu_f2, cpu_f2, 8388608UL, cudaMemcpyHostToDevice);
  cudaMemcpy(*gpu_f1, cpu_f1, 8388608UL, cudaMemcpyHostToDevice);
  matrixMatrixKernel<<<1024U, 64U>>>(*gpu_f1, *gpu_f2, *gpu_scores);
  cudaMemcpy(cpu_scores, *gpu_scores, 8388608UL, cudaMemcpyDeviceToHost);

Input Arguments


Function to apply to the elements of the input arrays, specified as a function handle. fun is a handle to a user-defined function that takes one row or column from matrix A and one column or row from matrix B and outputs a vector with the same data type as the inputs. The output vector is then summed to compute a single scalar value in C.

Data Types: function_handle

Numeric inputs A and B must have sizes that are compatible for the operation. For example, if A is an M-by-K matrix and B is a K-by-N matrix, then C is an M-by-N matrix.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | half

Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Each character can be either normal ('n') or transposed ('t').

Possible values are:

  • 'nn' - Matrices A and B are normal.

  • 'nt' - Matrix B is transposed.

  • 'tn' - Matrix A is transposed.

  • 'tt' - Both matrices A and B are transposed.
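
As an illustration of these options, the orientation string can be modeled as transposing the corresponding input before the row/column sweep. The following NumPy sketch uses hypothetical names and is only a model of the described behavior, not the actual implementation:

```python
import numpy as np

def matmat(fun, A, B, orientation='nn'):
    # Model of the orientation argument: a 't' in the first or second
    # position transposes A or B, respectively, before the GEMM-like sweep.
    if orientation[0] == 't':
        A = A.T
    if orientation[1] == 't':
        B = B.T
    M = A.shape[0]
    N = B.shape[1]
    return np.array([[np.sum(fun(A[i, :], B[:, j])) for j in range(N)]
                     for i in range(M)])
```

For example, passing a K-by-M matrix with 'tn' gives the same result as passing its M-by-K transpose with 'nn'.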

Whether to use vectorized operations during MATLAB simulation and CPU code generation, specified as true or false.
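
Conceptually, this flag changes only how fun is invoked, not the result: fun is applied either once per vector pair or element by element, as with arrayfun. A NumPy sketch of that equivalence (hypothetical names, illustrative only):

```python
import numpy as np

# Model of the vectorizedSim flag: fun must produce the same sum whether it
# is applied to whole vectors at once or element by element (the analogue
# of arrayfun(fun, A, B)).

def reduce_vectorized(fun, a, b):
    return np.sum(fun(a, b))                     # one call on whole vectors

def reduce_elementwise(fun, a, b):
    return sum(fun(x, y) for x, y in zip(a, b))  # one call per element pair

sad = lambda a, b: np.abs(a - b)                 # works in both modes
```

A fun that satisfies this property, such as sad above, is safe to use with vectorizedSim set to true.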

Output Arguments


Product, returned as a scalar, vector, or matrix. Array C has the same number of rows as input A and the same number of columns as input B.

Version History

Introduced in R2017b