Optimized GPU implementation of batched matrix multiply with add operation

## Syntax

``[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2)``
``[D1,...,DN] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,...,AN,BN,CN)``
``___ = gpucoder.batchedMatrixMultiplyAdd(___,Name,Value)``

## Description

`[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2)` performs matrix-matrix multiplication with addition on a batch of two matrix triples, `A1,B1,C1` and `A2,B2,C2`. For each triple, the function computes $D=\alpha AB+\beta C$, where $\alpha$ and $\beta$ are scalar multiplication factors and `A`, `B`, `C`, and `D` are matrices with dimensions `m`-by-`k`, `k`-by-`n`, `m`-by-`n`, and `m`-by-`n`, respectively. `A` and `B` can optionally be transposed or Hermitian-conjugated. By default, $\alpha$ and $\beta$ are set to one and the matrices are not transposed. To specify different scalar multiplication factors or to transpose the input matrices, use the `Name,Value` pair arguments.

All the batches passed to the `gpucoder.batchedMatrixMultiplyAdd` function must be uniform. That is, all instances must have the same dimensions `m`, `n`, and `k`.
`[D1,...,DN] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,...,AN,BN,CN)` performs matrix-matrix multiplication and add of multiple `A`, `B`, `C` matrices of the form: ${D}_{i}=\alpha {A}_{i}{B}_{i}+\beta {C}_{i}, \quad i=1\dots N$
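The arithmetic described above can be reproduced in plain MATLAB as a reference for checking results. This is a minimal sketch, not the GPU implementation: the cell arrays `A`, `B`, `C`, and `D`, the sizes, and the scale factors are illustrative assumptions.

```matlab
% Reference computation of D_i = alpha*A_i*B_i + beta*C_i in plain MATLAB.
% This sketch does not call GPU Coder; all names and values are illustrative.
m = 3; k = 4; n = 2;            % every batch instance shares m, n, and k
alpha = 0.3; beta = 0.6;        % illustrative scale factors

% Two batch instances with identical dimensions (a uniform batch).
A = {ones(m,k), 2*ones(m,k)};   % each A_i is m-by-k
B = {ones(k,n), ones(k,n)};     % each B_i is k-by-n
C = {eye(m,n),  zeros(m,n)};    % each C_i is m-by-n

D = cell(size(A));
for i = 1:numel(A)
    D{i} = alpha*A{i}*B{i} + beta*C{i};   % each D_i is m-by-n
end
```

Each `D{i}` has the same size as the corresponding `C{i}`, which is the `m`-by-`n` shape stated in the description.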

`___ = gpucoder.batchedMatrixMultiplyAdd(___,Name,Value)` performs the batched matrix multiply and add operation by using the options specified by one or more `Name,Value` pair arguments.

## Examples

Perform a simple batched matrix-matrix multiplication with add and use the `gpucoder.batchedMatrixMultiplyAdd` function to generate CUDA® code that calls appropriate `cublas<t>gemmBatched` APIs.

In one file, write an entry-point function `myBatchMatMulAdd` that accepts matrix inputs `A1`, `B1`, `C1`, `A2`, `B2`, and `C2`. Because the input matrices are not transposed, use the `'nn'` option.

```
function [D1,D2] = myBatchMatMulAdd(A1,B1,C1,A2,B2,C2,alpha,beta)
% Entry-point function: compute D = alpha*A*B + beta*C for both triples.
[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2, ...
    'alpha',alpha,'beta',beta,'transpose','nn');
end
```

To create a type for a matrix of doubles for use in code generation, use the `coder.newtype` function.

```
A1 = coder.newtype('double',[12,14],[0 0]);
A2 = coder.newtype('double',[12,14],[0 0]);
B1 = coder.newtype('double',[14,16],[0 0]);
B2 = coder.newtype('double',[14,16],[0 0]);
C1 = coder.newtype('double',[12,16],[0 0]);
C2 = coder.newtype('double',[12,16],[0 0]);
alpha = 0.3;
beta = 0.6;
inputs = {A1,B1,C1,A2,B2,C2,alpha,beta};
```

To generate a CUDA library, use the `codegen` function.

```
cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myBatchMatMulAdd
```

The generated CUDA code contains kernels `myBatchMatMulAdd_kernelNN` for initializing the input and output matrices. The code also contains the `cublasDgemmBatched` API calls to the cuBLAS library. The following code is a snippet of the generated code.

```
//
// File: myBatchMatMulAdd.cu
//
...
void myBatchMatMulAdd(const double A1, const double B1, const double C1,
                      const double A2, const double B2, const double C2,
                      double alpha, double beta, double D1, double D2)
{
  double alpha1;
  ...
  myBatchMatMulAdd_kernel2<<<dim3(1U, 1U, 1U), dim3(224U, 1U, 1U)>>>(*gpu_B2,
    *gpu_B1, *gpu_input_cell_f4, *gpu_input_cell_f3);
  cudaMemcpy(gpu_C2, (void *)&C2, 1536UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_C1, (void *)&C1, 1536UL, cudaMemcpyHostToDevice);
  myBatchMatMulAdd_kernel3<<<dim3(1U, 1U, 1U), dim3(192U, 1U, 1U)>>>(*gpu_C2,
    *gpu_C1, gpu_r3, gpu_r2);
  myBatchMatMulAdd_kernel4<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r2,
    *gpu_out_cell);
  myBatchMatMulAdd_kernel5<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r3,
    *gpu_out_cell);
  ...
  cublasDgemmBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 12, 16,
                     14, (double *)gpu_alpha1, (double **)gpu_Aarray, 12,
                     (double **)gpu_Barray, 14, (double *)gpu_beta1,
                     (double **)gpu_Carray, 12, 2);
  myBatchMatMulAdd_kernel6<<<dim3(1U, 1U, 1U), dim3(192U, 1U, 1U)>>>(*gpu_D2,
  ...
}
```
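The numeric constants in the snippet appear to follow from the input sizes declared earlier (`A` is 12-by-14, `B` is 14-by-16, `C` is 12-by-16). The following arithmetic is one reading of the snippet, not a statement about how the code generator chooses launch parameters. The variable names are illustrative.

```matlab
% Relate the constants in the generated code to the declared input sizes.
% This is an interpretation of the snippet; variable names are illustrative.
m = 12; k = 14; n = 16;           % A is m-by-k, B is k-by-n, C is m-by-n
bytesPerDouble = 8;

numelB = k*n;                     % 224, same as the 224U block size of kernel2
numelC = m*n;                     % 192, same as the 192U block size of kernel3
bytesC = numelC*bytesPerDouble;   % 1536, same as the 1536UL cudaMemcpy size

% cublasDgemmBatched takes m, n, k, then leading dimensions and batch count.
% For column-major, non-transposed inputs the leading dimension is the
% number of rows of each matrix.
lda = m;                          % 12
ldb = k;                          % 14
ldc = m;                          % 12
batchCount = 2;                   % two triples: (A1,B1,C1) and (A2,B2,C2)
```

The values `12, 16, 14` passed after the two `CUBLAS_OP_N` flags match `m`, `n`, and `k`, and the final argument `2` matches the number of matrix triples in the batch.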

## Input Arguments

`A`, `B`, `C`: Operands, specified as vectors or matrices. `A`, `B`, and `C` must be 2-D arrays. The number of columns in `A` must be equal to the number of rows in `B`. The number of rows in `A` must be equal to the number of rows in `C`. The number of columns in `B` must be equal to the number of columns in `C`.

Data Types: `double` | `single` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
Complex Number Support: Yes
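The dimension requirements above can be checked before calling the function. This is a minimal sketch in plain MATLAB for the non-transposed case; the variable names and sizes are illustrative assumptions.

```matlab
% Check the operand dimension rules for one (A,B,C) triple, no transposes.
% Sizes and names are illustrative.
A = zeros(12,14);
B = zeros(14,16);
C = zeros(12,16);

is2D    = ismatrix(A) && ismatrix(B) && ismatrix(C);   % 2-D arrays only
okInner = size(A,2) == size(B,1);   % columns of A equal rows of B
okRows  = size(A,1) == size(C,1);   % rows of A equal rows of C
okCols  = size(B,2) == size(C,2);   % columns of B equal columns of C

assert(is2D && okInner && okRows && okCols, ...
    'Operand dimensions are inconsistent.');
```

For a batch, the same check applies to every triple, and all triples must share the same `m`, `n`, and `k`.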

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `[D1,D2] = gpucoder.batchedMatrixMultiplyAdd(A1,B1,C1,A2,B2,C2,'alpha',0.3,'beta',0.6,'transpose','CC');`

`'alpha'`: Value of the scalar $\alpha$ that multiplies the product of `A` and `B`. The default value is one.

`'beta'`: Value of the scalar $\beta$ that multiplies `C`. The default value is one.

`'transpose'`: Character vector or string composed of two characters, indicating the operations performed on the matrices `A` and `B` prior to matrix multiplication. Possible values for each character are normal (`'N'`), transposed (`'T'`), or complex conjugate transpose (`'C'`). By default, the matrices are not transposed.
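The three operations can be mimicked in plain MATLAB to see how they change the operand shapes. This is a sketch under stated assumptions: the `ops` lookup structure is hypothetical, and the code assumes that the first character applies to `A` and the second to `B`.

```matlab
% Plain-MATLAB stand-in for the two-character transpose option.
% The ops structure is hypothetical and is not part of GPU Coder.
ops = struct('N', @(X) X, ...     % normal: use the matrix as is
             'T', @(X) X.', ...   % transpose without conjugation
             'C', @(X) X');       % complex conjugate transpose

A = [1+2i, 3; 4, 5-1i; 0, 2];     % 3-by-2 complex matrix
B = ones(3,4);                    % 3-by-4 real matrix

opt = upper('tn');                % assumption: first char is for A, second for B
opA = ops.(opt(1));
opB = ops.(opt(2));
P = opA(A)*opB(B);                % (2-by-3)*(3-by-4) gives a 2-by-4 product

Pc = ops.C(A)*B;                  % same shapes, but entries of A conjugated
```

With `'TN'`, the 3-by-2 matrix `A` becomes 2-by-3 before the multiplication, so a matching `C` operand would need to be 2-by-4.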

## Output Arguments

`D`: Product, returned as a scalar, vector, or matrix. Array `D` has the same number of rows as input `A` and the same number of columns as input `B`.

Introduced in R2020a