gpucoder.batchedMatrixMultiplyAdd

Optimized GPU implementation of batched matrix multiply with add operation

Syntax

[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2)

[D1,...,DN] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,...,AN,BN,CN)

___ = gpucoder.batchedMatrixMultiplyAdd(___,Name,Value)

Description

[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2) performs matrix-matrix multiplication and addition on a batch of matrices A1, B1, C1 and A2, B2, C2. For each set of matrices in the batch, the function computes an operation of the form:

$D = α A B + β C$

where $α$ and $β$ are scalar multiplication factors and A, B, C, and D are matrices with dimensions m-by-k, k-by-n, m-by-n, and m-by-n, respectively. A and B can optionally be transposed or Hermitian-conjugated. By default, $α$ and $β$ are set to one and the matrices are not transposed. To specify different scalar multiplication factors or to perform transpose operations on the input matrices, use the Name,Value pair arguments.

All the batches passed to the gpucoder.batchedMatrixMultiplyAdd function must be uniform. That is, all instances must have the same dimensions m, n, and k.

[D1,...,DN] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,...,AN,BN,CN) performs matrix-matrix multiplication and addition on N sets of A, B, and C matrices, computing:

$D_{i} = α A_{i} B_{i} + β C_{i}, \quad i = 1, \dots, N$
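The batched operation can be checked against a plain MATLAB reference computation. The following is a minimal sketch, not the GPU implementation: the cell arrays A, B, C, and Dref, and the sizes used, are illustrative choices, and the loop evaluates the formula one batch instance at a time for the default non-transposed case.

```matlab
% Reference computation in plain MATLAB (no GPU Coder required).
% All variable names and sizes here are illustrative assumptions.
alpha = 0.3;
beta  = 0.6;
N = 2;                      % number of batch instances
m = 12; k = 14; n = 16;     % uniform dimensions across the batch

A = cell(1,N); B = cell(1,N); C = cell(1,N); Dref = cell(1,N);
for i = 1:N
    A{i} = rand(m,k);       % m-by-k
    B{i} = rand(k,n);       % k-by-n
    C{i} = rand(m,n);       % m-by-n
    % D_i = alpha*A_i*B_i + beta*C_i
    Dref{i} = alpha*A{i}*B{i} + beta*C{i};
end

disp(size(Dref{1}))         % each result is m-by-n
```

Comparing results against a reference like this within a tolerance, rather than for exact equality, is usually the safer choice, because a GPU library may accumulate the products in a different order.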

___ = gpucoder.batchedMatrixMultiplyAdd(___,Name,Value) performs the batched matrix multiply and add operation by using the options specified by one or more Name,Value pair arguments.

Examples

Batched Matrix-Matrix Multiplication with Add

Perform a simple batched matrix-matrix multiplication with add and use the gpucoder.batchedMatrixMultiplyAdd function to generate CUDA® code that calls the corresponding cublas<t>gemmBatched APIs.

In one file, write an entry-point function myBatchMatMulAdd that accepts the matrix inputs A1, B1, C1, A2, B2, and C2 and the scalar inputs alpha and beta. Because the input matrices are not transposed, use the 'nn' option.

function [D1,D2] = myBatchMatMulAdd(A1,B1,C1,A2,B2,C2,alpha,beta)

[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2, ...
    'alpha',alpha,'beta',beta,'transpose','nn');

end

To create a type for a matrix of doubles for use in code generation, use the coder.newtype function.

A1 = coder.newtype('double',[12,14],[0 0]);
A2 = coder.newtype('double',[12,14],[0 0]);
B1 = coder.newtype('double',[14,16],[0 0]);
B2 = coder.newtype('double',[14,16],[0 0]);
C1 = coder.newtype('double',[12,16],[0 0]);
C2 = coder.newtype('double',[12,16],[0 0]);
alpha = 0.3;
beta = 0.6;
inputs = {A1,B1,C1,A2,B2,C2,alpha,beta};

To generate a CUDA library, use the codegen function.

cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myBatchMatMulAdd

The generated CUDA code contains kernels myBatchMatMulAdd_kernelNN for initializing the input and output matrices. The code also contains the cublasDgemmBatched API calls to the cuBLAS library. The following code is a snippet of the generated code.

//
// File: myBatchMatMulAdd.cu
//
...
void myBatchMatMulAdd(const double A1[168], const double B1[224], const double
                      C1[192], const double A2[168], const double B2[224], const
                      double C2[192], double alpha, double beta, double D1[192],
                      double D2[192])
{
  double alpha1;
...

  myBatchMatMulAdd_kernel2<<<dim3(1U, 1U, 1U), dim3(224U, 1U, 1U)>>>(*gpu_B2,
    *gpu_B1, *gpu_input_cell_f4, *gpu_input_cell_f3);
  cudaMemcpy(gpu_C2, (void *)&C2[0], 1536UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_C1, (void *)&C1[0], 1536UL, cudaMemcpyHostToDevice);
  myBatchMatMulAdd_kernel3<<<dim3(1U, 1U, 1U), dim3(192U, 1U, 1U)>>>(*gpu_C2,
    *gpu_C1, gpu_r3, gpu_r2);
  myBatchMatMulAdd_kernel4<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r2,
    *gpu_out_cell);
  myBatchMatMulAdd_kernel5<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r3,
    *gpu_out_cell);
...

  cublasDgemmBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 12, 16,
                     14, (double *)gpu_alpha1, (double **)gpu_Aarray, 12,
                     (double **)gpu_Barray, 14, (double *)gpu_beta1, (double **)
                     gpu_Carray, 12, 2);
  myBatchMatMulAdd_kernel6<<<dim3(1U, 1U, 1U), dim3(192U, 1U, 1U)>>>(*gpu_D2,
...

}

Input Arguments

`A`, `B`, `C` — Operands
vectors | matrices

Operands, specified as vectors or matrices. A, B, and C must be 2-D arrays. The number of columns in A must be equal to the number of rows in B. The number of rows in A must be equal to the number of rows in C. The number of columns in B must be equal to the number of columns in C.
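The dimension requirements can be checked in plain MATLAB before calling the function. This is a minimal sketch for the default non-transposed case. The sizes match the example on this page, and the assertions and their messages are illustrative, not checks performed by gpucoder.batchedMatrixMultiplyAdd itself.

```matlab
% Check operand dimensions for the default (non-transposed) case.
% Sizes and error messages are illustrative assumptions.
A = zeros(12,14);   % m-by-k
B = zeros(14,16);   % k-by-n
C = zeros(12,16);   % m-by-n

[m,kA] = size(A);
[kB,n] = size(B);

assert(ismatrix(A) && ismatrix(B) && ismatrix(C), ...
    'Operands must be 2-D arrays.');
assert(kA == kB, 'Columns of A must equal rows of B.');
assert(size(C,1) == m, 'Rows of A must equal rows of C.');
assert(size(C,2) == n, 'Columns of B must equal columns of C.');
```

Because all batch instances must be uniform, the same check applies unchanged to every set of A, B, and C in the batch.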

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: [D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2,'alpha',0.3,'beta',0.6,'transpose','CC');

`alpha` — Scalar multiplication factor
1.0 (default) | scalar

Value of the scalar used for multiplication with A. The default value is one.

`beta` — Scalar multiplication factor
1.0 (default) | scalar

Value of the scalar used for multiplication with C. The default value is one.

`transpose` — Operation performed on input matrices
'NN' (default) | character vector | string

Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values are normal ('N'), transposed ('T'), or complex conjugate transpose ('C').
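The three operations correspond to standard MATLAB operators, which the following sketch illustrates on small complex matrices. It assumes that the first character of the option applies to A and the second to B, which this page does not state explicitly. The variable names are illustrative.

```matlab
% Plain-MATLAB equivalents of the three operations (illustrative only).
% Assumption: the first character applies to A, the second to B.
A = [1+2i, 3; 4, 5-1i];
B = [2, 1i; 1, 3];

opN = A;      % 'N': matrix used as is
opT = A.';    % 'T': transpose without conjugation
opC = A';     % 'C': complex conjugate transpose

% Reference product for 'CT' under the assumption above,
% with alpha = 1 and the beta*C term omitted.
D_CT = A' * B.';
```

For real-valued inputs, 'T' and 'C' give the same result, because conjugation has no effect on real numbers.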

Output Arguments

`D` — Product
scalar | vector | matrix

Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.

Version History

Introduced in R2020a
