Optimization of loops with deep learning functions

I am trying to optimize the following code, which performs deep learning convolutions on input arrays:
parfor k = 1:500  % number of images
    for j = 1:8   % number of channels of the relevant filters
        dldf_O_dlconv3_temp_1(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,1,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 1st filter
        dldf_O_dlconv3_temp_2(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,2,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 2nd filter
        dldf_O_dlconv3_temp_3(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,3,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 3rd filter
        dldf_O_dlconv3_temp_4(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,4,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 4th filter
        dldf_O_dlconv3_temp_5(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,5,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 5th filter
        dldf_O_dlconv3_temp_6(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,6,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 6th filter
        dldf_O_dlconv3_temp_7(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,7,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 7th filter
        dldf_O_dlconv3_temp_8(:,:,j,k) = dlconv(dlarray(reshape(O_maxpool2(:,:,j,k),[8 8 1]),'SSC'), dlarray(reshape(DLDO_O_dlconv3(:,:,8,k),[8 8 1]),'SSC'), 0, 'Padding', padding); % 8th filter
    end
end
As you can see, I am already using a parfor loop. I also tried using gpuArrays for O_maxpool2 and DLDO_O_dlconv3, but instead of speeding things up it became, if anything, slightly slower. My GPU device details are as follows:
CUDADevice with properties:
Name: 'GeForce RTX 2080 Ti'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 11.2000
ToolkitVersion: 11
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.1811e+10
AvailableMemory: 9.4411e+09
MultiprocessorCount: 68
ClockRateKHz: 1545000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
Please let me know if there is anything else I could do to speed up this code, and also why using gpuArrays has not sped it up.
Many thanks.

Accepted Answer

Joss Knight on 24 Jun 2021
Edited: Joss Knight on 24 Jun 2021
dlconv is designed to work in batch, with multiple input channels, multiple filters, and multiple input observations in a single call. Read the documentation for dlconv.
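For reference, a minimal sketch of the batched form the answer describes (the sizes here are illustrative, not the poster's exact gradient computation):

```matlab
% Minimal sketch of a single batched dlconv call: one call covers all
% input channels, all filters, and all observations at once.
X = dlarray(rand(8, 8, 8, 500, 'single'), 'SSCB');  % height x width x channels x batch
W = rand(3, 3, 8, 16, 'single');                    % filterH x filterW x inChannels x numFilters
b = zeros(1, 16, 'single');                         % one bias per filter
Y = dlconv(X, W, b, 'Padding', 'same');             % Y: 8 x 8 x 16 x 500 in 'SSCB' format
```

On the GPU question: a single large call like this gives the GPU enough work per kernel launch to pay off, whereas thousands of tiny 8x8 convolutions are dominated by launch and transfer overhead.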
  1 Comment
Radians on 28 Jun 2021
Thanks a lot. Even though what I was trying to do was not straightforward (which is why I had to use for-loops), your comment made me think about how I could use the built-in functionality of dlconv to remove the loops, and after reshaping my data a bit I was able to do it. Thanks again.
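The poster's actual reshaping is not shown; one hypothetical way to remove the loops (using the variable names and 8x8x8x500 sizes from the question) is to fold the batch dimension into convolution groups, so that each image gets its own set of filters in a single dlconv call:

```matlab
% Hypothetical sketch: per-image filters via grouped convolution.
% Each (channel, image) pair becomes its own group with 1 input channel.
[h, w, C, B] = size(O_maxpool2);        % e.g. 8 x 8 x 8 x 500
F = size(DLDO_O_dlconv3, 3);            % 8 filters per image

X = dlarray(reshape(O_maxpool2, h, w, C*B), 'SSC');  % channels = C*B, channel varies fastest

% Grouped weights are filterH x filterW x channelsPerGroup x filtersPerGroup x numGroups;
% replicate the per-image filters across the C input channels.
Wk = reshape(DLDO_O_dlconv3, h, w, 1, F, 1, B);  % 8 x 8 x 1 x 8 x 1 x 500
Wk = repmat(Wk, 1, 1, 1, 1, C, 1);               % copy across channels
Wk = reshape(Wk, h, w, 1, F, C*B);               % numGroups = C*B, matching X's channel order

Y = dlconv(X, Wk, 0, 'Padding', padding);        % all F*C*B convolutions in one call
% Y holds the F*C*B outputs as channels; reshape/permute it back into
% the per-filter arrays (dldf_O_dlconv3_temp_1 ... _8) as needed.
```

This trades the 500x8x8 scalar-loop iterations for one large call, which is also the shape of work that actually benefits from a gpuArray.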
