Analyze and Model Data on GPU

This example shows how to improve code performance by executing on a graphics processing unit (GPU). Execution on a GPU can improve performance if:

  • Your code is computationally expensive, so that computing time significantly exceeds the time spent transferring data to and from GPU memory.

  • Your workflow uses functions with gpuArray (Parallel Computing Toolbox) support and large array inputs.

When writing code for a GPU, start with code that already performs well on a CPU. Vectorization is usually critical for achieving high performance on a GPU. Convert code to use functions that support GPU array arguments and transfer the input data to the GPU. For more information about MATLAB functions with GPU array inputs, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
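
As a minimal sketch of this advice, the following code compares a loop with its vectorized equivalent, and then runs the vectorized expression on the GPU when one is available. The variable names and the array size are illustrative assumptions and are not part of this example. The canUseGPU check is included only so that the sketch also runs on a machine without a supported GPU.

```matlab
% Illustrative sketch: vectorize first, then move the data to the GPU.
n = 1e4;                          % hypothetical problem size
x = rand(n,1);

% Loop version: one scalar operation per iteration.
yLoop = zeros(n,1);
for k = 1:n
    yLoop(k) = x(k)^2 + 2*x(k);
end

% Vectorized version: one element-wise expression over the whole array.
yVec = x.^2 + 2*x;
maxDiffLoop = max(abs(yLoop - yVec));

% The same vectorized expression runs on the GPU when the input is a GPU array.
if canUseGPU
    xGPU = gpuArray(x);           % transfer the input to GPU memory
    yGPU = xGPU.^2 + 2*xGPU;      % executes on the GPU
    maxDiffGPU = max(abs(gather(yGPU) - yVec));
    fprintf("Max CPU/GPU difference: %g\n",maxDiffGPU)
end
```

The vectorized form is the one worth converting, because indexing individual elements of a GPU array inside a loop typically costs far more than a single array-wide operation.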

Many functions in Statistics and Machine Learning Toolbox™ automatically execute on a GPU when you use GPU array input data. For example, you can create a probability distribution object on a GPU, where the output is a GPU array.

pd = fitdist(gpuArray(x),"Normal")

Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information about supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). For the complete list of Statistics and Machine Learning Toolbox™ functions that accept GPU arrays, see Functions and then, in the left navigation bar, scroll to the Extended Capability section and select GPU Arrays.

Examine Properties of GPU

You can query and select your GPU device using the gpuDevice function. If you have multiple GPUs, you can examine the properties of all GPUs detected in your system by using the gpuDeviceTable function. Then, you can select a specific GPU for single-GPU execution by using its index (gpuDevice(index)).

D = gpuDevice
D = 
  CUDADevice with properties:

                      Name: 'TITAN V'
                     Index: 1
         ComputeCapability: '7.0'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 12652838912 (12.65 GB)
           AvailableMemory: 12096045056 (12.10 GB)
       MultiprocessorCount: 80
              ClockRateKHz: 1455000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
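
The multiple-GPU workflow described above can be sketched as follows. This is an illustrative sketch, not a required step: the choice of the first available device is an assumption, and the code only reads the default columns of the table returned by gpuDeviceTable.

```matlab
% Illustrative sketch: list detected GPUs and select one by index.
nAvailable = gpuDeviceCount("available");
if nAvailable > 0
    tbl = gpuDeviceTable;         % default columns include Index and DeviceAvailable
    disp(tbl)

    % Assumption for illustration: pick the first device marked as available.
    idx = tbl.Index(find(tbl.DeviceAvailable,1));

    % Caution: selecting a device by index resets it, which invalidates
    % any existing gpuArray variables stored on that device.
    D = gpuDevice(idx);
    fprintf("Selected GPU %d: %s\n",D.Index,D.Name)
else
    disp("No available GPU device detected.")
end
```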

Execute Function on GPU

Explore a data distribution on a GPU using descriptive statistics.

Generate a data set of normally distributed random numbers on a GPU.

dist = randn(6e4,6e3,"gpuArray");

Determine whether dist is a GPU array.

TF = isgpuarray(dist)
TF = logical
   1

Execute a function with a GPU array input argument. For example, calculate the sample skewness for each column in dist. Because dist is a GPU array, the skewness function executes on the GPU and returns the result as a GPU array.

skew = skewness(dist);

Verify that the output skew is a GPU array.

TF = isgpuarray(skew)
TF = logical
   1

Evaluate Speedup of GPU Execution

Evaluate function execution time on the GPU and compare performance with execution on a CPU.

Comparing the time taken to execute code on a CPU and on a GPU can help you choose the appropriate execution environment. For example, if you want to compute descriptive statistics from sample data, consider both the execution time and the data transfer time when you evaluate overall performance. If a function has GPU array support, the performance of the GPU relative to the CPU generally improves as the number of observations increases.

Measure the function run time in seconds by using the gputimeit (Parallel Computing Toolbox) function. gputimeit is preferable to timeit for functions that use a GPU, because it ensures operation completion and compensates for overhead.

skew = @() skewness(dist);
t = gputimeit(skew)
t = 0.2458
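
To illustrate why gputimeit is preferred, here is a hedged sketch of timing a GPU operation manually with tic and toc. Because GPU operations can run asynchronously, the manual version must call wait on the device before it stops the timer. The array size and variable names are assumptions chosen for illustration.

```matlab
% Illustrative sketch: manual GPU timing compared with gputimeit.
tManual = NaN;
tAuto = NaN;
if canUseGPU
    D = gpuDevice;
    A = randn(2e3,2e3,"gpuArray");    % hypothetical test data
    f = @() skewness(A);

    s = f();                          % warm-up call
    wait(D)                           % block until the warm-up finishes

    tic
    s = f();
    wait(D)                           % without this, toc can fire too early
    tManual = toc;

    tAuto = gputimeit(f);             % repeats the call and handles synchronization
    fprintf("Manual: %.4f s, gputimeit: %.4f s\n",tManual,tAuto)
end
```

A single manual measurement is noisier than the value returned by gputimeit, so treat the manual number as a rough check only.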

Evaluate the performance difference between the GPU and CPU by independently measuring the CPU execution time. In this case, execution of the code is faster on the GPU than on the CPU.
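
The CPU measurement is not shown in this example. One possible way to make the comparison, sketched under the assumption of a smaller data set than dist so that it fits comfortably in memory, is to time the CPU with timeit and the GPU with gputimeit, and to time the host-to-GPU transfer separately. All variable names here are hypothetical.

```matlab
% Illustrative sketch: compare CPU and GPU run times for skewness.
rows = 1e4;                            % hypothetical size, smaller than dist
cols = 1e3;
distCPU = randn(rows,cols);

tCPU = timeit(@() skewness(distCPU));  % CPU execution time in seconds
fprintf("CPU: %.4f s\n",tCPU)

if canUseGPU
    distGPU = gpuArray(distCPU);
    tGPU = gputimeit(@() skewness(distGPU));       % GPU execution time
    tTransfer = gputimeit(@() gpuArray(distCPU));  % host-to-GPU transfer time
    fprintf("GPU: %.4f s (transfer: %.4f s)\n",tGPU,tTransfer)
    fprintf("Speedup excluding transfer: %.1fx\n",tCPU/tGPU)
end
```

Reporting the transfer time alongside the execution time shows whether the speedup survives once the cost of moving data to the GPU is included.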

The performance of code on a GPU is heavily dependent on the GPU used. For additional information about measuring and improving GPU performance, see Measure and Improve GPU Performance (Parallel Computing Toolbox).

Single Precision on GPU

You can often improve the performance of your code by calculating in single precision instead of double precision, at the cost of reduced numerical precision.

Determine the execution time of the skewness function using an input argument of the dist data set in single precision.

dist_single = single(dist);
skew_single = @() skewness(dist_single);
t_single = gputimeit(skew_single)
t_single = 0.0503

In this case, execution of the code with single precision data is faster than execution with double precision data.

The performance improvement depends on the GPU card and its total number of cores. For more information about using single precision with a GPU, see Measure and Improve GPU Performance (Parallel Computing Toolbox).
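
As a hedged sketch, you can also create single-precision data directly on the GPU instead of converting a double-precision array, and then check how much the single-precision result differs from the double-precision result. The array size and variable names are assumptions for illustration. The CPU branch exists only so that the sketch runs without a GPU.

```matlab
% Illustrative sketch: create single-precision data and check precision loss.
m = 1e4;                                   % hypothetical size
n = 1e2;
if canUseGPU
    A = randn(m,n,"single","gpuArray");    % created directly on the GPU
else
    A = randn(m,n,"single");               % CPU fallback
end

skewSingle = skewness(A);                  % computed in single precision
skewDouble = skewness(double(A));          % same data, double precision

% Largest absolute difference between the two results, gathered to the host.
maxDiff = gather(max(abs(double(skewSingle) - skewDouble)));
fprintf("Underlying type: %s, max difference: %g\n",underlyingType(A),maxDiff)
```

Whether a difference of this size is acceptable depends on the analysis, so check it against your own tolerance before you switch to single precision.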

Dimensionality Reduction and Model Fitting on GPU

Implement dimensionality reduction and classification workflows on a GPU.

You can use functions such as pca and fitcensemble together to train a machine learning model.

  • The pca (principal component analysis) function reduces data dimensionality by replacing several correlated variables with a new set of variables that are linear combinations of the original variables.

  • The fitcensemble function fits many classification learners to form an ensemble model that can make better predictions than a single learner.

Both functions are computationally intensive and can be significantly accelerated using a GPU.

For example, consider the humanactivity data set. The data set contains 24,075 observations of five physical human activities: sitting, standing, walking, running, and dancing. Each observation has 60 features extracted from acceleration data measured by smartphone accelerometer sensors. The data set contains the following variables:

  • actid — Response vector containing the activity IDs in integers: 1, 2, 3, 4, and 5 representing sitting, standing, walking, running, and dancing, respectively

  • actnames — Activity names corresponding to the integer activity IDs

  • feat — Feature matrix of 60 features for 24,075 observations

  • featlabels — Labels of the 60 features

load humanactivity

Use 90% of the observations to train a model that classifies the five types of human activities, and use 10% of the observations to validate the trained model. Specify a 10% holdout for the test set by using cvpartition.

Partition = cvpartition(actid,"Holdout",0.10);
trainingInds = training(Partition); % Indices for the training set
testInds = test(Partition); % Indices for the test set

Transfer the training and test data to the GPU.

XTrain = gpuArray(feat(trainingInds,:));
YTrain = gpuArray(actid(trainingInds));
XTest = gpuArray(feat(testInds,:));
YTest = gpuArray(actid(testInds));

Find the principal components for the training data set XTrain.

[coeff,score,~,~,explained,mu] = pca(XTrain);

Find the smallest number of components whose cumulative explained variance exceeds 99% of the total variability.

idx = find(cumsum(explained)>99,1);
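
To see how this selection works, consider a small sketch with made-up values. The explainedDemo vector below is hypothetical and is not output from the humanactivity data. The names ending in Demo are used so that the sketch does not overwrite variables from the example.

```matlab
% Illustrative sketch with hypothetical percentages of explained variance.
explainedDemo = [60; 25; 10; 4.5; 0.5];   % sums to 100

cumulativeDemo = cumsum(explainedDemo);   % [60; 85; 95; 99.5; 100]

% Index of the first component at which the cumulative total exceeds 99.
idxDemo = find(cumulativeDemo > 99,1);    % returns 4
```

In this made-up case, the first four components are retained, because the first three explain only 95% of the variability.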

Determine the principal component scores that represent XTrain in the principal component space.

XTrainPCA = score(:,1:idx);

Fit an ensemble of learners for classification.

template = templateTree("MaxNumSplits",20,"Reproducible",true);
classificationEnsemble = fitcensemble(XTrainPCA,YTrain, ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",template, ...
    "LearnRate",0.1, ...
    "ClassNames",[1; 2; 3; 4; 5]);

To use the trained model for the test set, you need to transform the test data set by applying the PCA transformation obtained from the training data set, that is, the estimated mean mu and the coefficients coeff.

XTestPCA = (XTest-mu)*coeff(:,1:idx);
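
The following sketch checks the reasoning behind this transformation on a small random matrix: with the default pca options, centering the data with mu and multiplying by coeff reproduces the score output. The data and the names ending in Demo are assumptions for illustration. The sketch runs on the CPU.

```matlab
% Illustrative sketch: (X - mu)*coeff reproduces the PCA scores.
rng(0)                                   % for reproducibility
XDemo = randn(100,5);                    % hypothetical data

[coeffDemo,scoreDemo,~,~,~,muDemo] = pca(XDemo);

% Project the same data manually, as is done for the test set.
projDemo = (XDemo - muDemo)*coeffDemo;

maxErr = max(abs(projDemo - scoreDemo),[],"all");
fprintf("Largest difference from score: %g\n",maxErr)
```

Because the test set must be expressed in the same coordinates as the training set, it reuses mu and coeff from the training data rather than calling pca again.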

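Evaluate the performance of the trained classifier with the test data. By default, the loss function returns the classification error, so smaller values indicate a more accurate model.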

classificationError = loss(classificationEnsemble,XTestPCA,YTest);

Transfer to Local Workspace

Transfer data or model properties from a GPU to the local workspace for use with a function that does not support GPU arrays.

Transferring GPU arrays can be costly and is generally not necessary unless you need to use the results with functions that do not support GPU arrays, or use the results in another workspace where a GPU is unavailable.

The gather (Parallel Computing Toolbox) function transfers data from the GPU into the local workspace. Gather the dist data, and then confirm that the data is no longer a GPU array.

dist = gather(dist);
TF = isgpuarray(dist)
TF = logical
   0
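
As a hedged sketch of one way to use gather in portable code, the following uses the GPU when one is available and gathers the result at the end. Calling gather on an array that is already in the local workspace returns it unchanged, so the final line works in both cases. The variable names and the data size are assumptions for illustration.

```matlab
% Illustrative sketch: code that runs with or without a GPU.
xDemo = randn(1e5,1);                 % hypothetical data

if canUseGPU
    xDemo = gpuArray(xDemo);          % use the GPU when one is available
end

sDemo = skewness(xDemo);              % GPU array result if xDemo is a GPU array
sDemo = gather(sDemo);                % result is now in the local workspace

fprintf("Skewness: %.4f (GPU array: %d)\n",sDemo,isgpuarray(sDemo))
```

Gathering only the small scalar result, and not the full data array, keeps the transfer cost low.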

The gather function also transfers the properties of a machine learning model from a GPU into the local workspace. Gather the classificationEnsemble model, and then confirm that the model properties that were previously GPU arrays, such as X, are no longer GPU arrays.

classificationEnsemble = gather(classificationEnsemble);
TF = isgpuarray(classificationEnsemble.X)
TF = logical
   0
