## Speed of looped operation on a GPU depending on number of iterations in loop?

Asked by D. Plotnick

on 16 Oct 2017
Accepted Answer by Joss Knight

This is a question that I think will get a bit into the weeds of MATLAB's JIT and GPU toolbox. I include MWE sample code below; I am using R2017a and have a 12 GB Titan X (Pascal) GPU.
The basic issue is this: I am performing a looped operation (e.g. an interpolation) on the GPU, and if the number of iterations in the loop is small, the operation is very fast. However, once the number of iterations passes some threshold, each operation slows way down (a factor of >100 in my case).
To illustrate this, I used my minimum working example (MWE) below. It produced on my machine these two figures.
The first shows the average time per numerical operation versus the number of iterations in the loop. At values n<200 the operations take on the order of 1E-4 s/op. After that threshold is passed, they take around 2E-2 s/op, a massive slowdown.
The second shows the total time for the loop. Again, we see a change in behavior: the number of iterations doesn't affect the total time (this is why I think it's a JIT thing) until the threshold around n = 200, after which the total time increases linearly as expected.
Finally, for each loop I output the time spent on each individual operation. For 150 iterations, we see that the time/operation is fairly constant in the 1E-4 s range, but for 200 iterations there is a sudden massive change in the time partway through the loop.
The questions are:
• (A) Why is this sudden change in speed occurring?
• (B) Is there a way to code this so that it does not occur? (Pre-allocation and clearing variables did not seem to help.)
• (C) If I cannot avoid it, can I predict it? In many cases I have the flexibility of changing the number of iterations in a loop through other means, so if keeping that number of iterations below some magic number will make my processing 400x faster, I will work on it.
My MWE code is below; it should be noted that this code shows this behavior on my machine, but it may not on yours. Also, the numerical operation being used here is a stand-in for an actual looped process and is just being used to illustrate the speed issue.
% =========================================================================
% MWE
% =========================================================================
% Clean up
clear all
close all
clc
% Set up some demo data and interpolating spaces
times2 = cell(10,1);
times1 = zeros(10,1);
x = (1:4000).';
y = (1:240);
v = rand(240,4000);
xi = 4000*rand(500);
xi = repmat(xi,1,1,240);
[Mf,Nf,~] = size(xi);
yi = repmat(y,Mf,1,Nf);
yi = permute(yi,[1,3,2]);
% Put it all on the GPU
x = gpuArray(x);
y = gpuArray(y);
v = gpuArray(v);
xi = gpuArray(xi);
yi = gpuArray(yi);
% Outer loop - changes the number of iterations used in the inner loop
for nn = 1:10
    t1 = tic;
    nn
    timesIn = zeros(50*nn,1);
    % Inner loop, perform our interpolation n times
    for ii = 1:50*nn
        tI = tic;
        vi = interp2(x,y,v,xi,yi);
        vi = sum(vi,3);
        timesIn(ii) = toc(tI);
    end
    % Plot the current time/op and save times
    figure(1)
    plot(timesIn); title(nn); drawnow;
    times1(nn) = toc(t1);
    times2{nn} = timesIn;
    toc(t1)
end
% Make figures
mTimes = zeros(10,1);
for nn = 1:10
    mTimes(nn) = mean(times2{nn});
end
figure; plot((1:10)*50,mTimes); title('Mean time/operation'); ylabel('Time'); xlabel('n-Iterations');
figure; plot((1:10)*50,times1); title('Total Loop Time'); ylabel('Time'); xlabel('n-Iterations');

Answer by Joss Knight

### Joss Knight

on 16 Oct 2017

You're just doing the timing in an invalid way. Most GPU operations run asynchronously, so for the first 100 or so iterations all you were timing was the kernel launch time. Eventually the queue filled, and no more kernels could be launched until running kernels had finished; from that point on you were timing the true cost. Use wait(gpuDevice) to synchronize the device before each call to tic or toc to ensure that the timing values make sense. Even better, use gputimeit to get more accurate timings for functional code.
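As a rough illustration of the synchronization described above, the inner loop of the MWE could be timed as follows. This is only a sketch: the array sizes and the names dev and nIter are assumptions chosen to keep the example small, not values from the original post.

```matlab
% Stand-in data, smaller than the original MWE (sizes are an assumption)
x  = gpuArray((1:4000).');
y  = gpuArray(1:240);
v  = gpuArray(rand(240,4000));
xi = gpuArray(1 + 3999*rand(100,100,24));  % query points inside [1,4000]
yi = gpuArray(1 + 239*rand(100,100,24));   % query points inside [1,240]

dev     = gpuDevice;   % handle to the currently selected GPU
nIter   = 50;          % hypothetical iteration count
timesIn = zeros(nIter,1);

for ii = 1:nIter
    wait(dev);                     % drain queued work before starting the clock
    tI = tic;
    vi = interp2(x,y,v,xi,yi);
    vi = sum(vi,3);
    wait(dev);                     % block until the GPU has actually finished
    timesIn(ii) = toc(tI);
end

fprintf('Mean synchronized time per operation: %.3g s\n', mean(timesIn));
```

With both wait calls in place, each entry of timesIn should reflect the execution time of the operation rather than just its launch time, so the per-operation time should no longer appear to jump partway through the loop.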

### D. Plotnick

on 17 Oct 2017
Thanks as always, Joss. Unfortunately, this means I made an error in how I built my MWE: in my actual code, something odd is happening with performance that is not related to how the timing is measured. I'll have to come up with another, more appropriate MWE.
### D. Plotnick

on 19 Oct 2017
Joss, I have posted a revised question here, if you have a chance to look at it. I did not end up using gputimeit in that MWE, since I couldn't figure out a way to code it using anonymous functions that take no inputs.
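For what it's worth, one possible way to wrap the original interpolation for gputimeit is an anonymous function with an empty argument list, which captures the workspace variables at the moment the handle is created. The sketch below assumes small stand-in arrays; the sizes and the names f and t are illustrative and not taken from the thread.

```matlab
% Stand-in data (sizes are an assumption, smaller than the original MWE)
x  = gpuArray((1:4000).');
y  = gpuArray(1:240);
v  = gpuArray(rand(240,4000));
xi = gpuArray(1 + 3999*rand(100,100,24));
yi = gpuArray(1 + 239*rand(100,100,24));

% Zero-input anonymous function: x, y, v, xi and yi are captured by value
% when the handle is created, so gputimeit can call f() with no arguments.
f = @() sum(interp2(x,y,v,xi,yi),3);

% gputimeit runs f several times, synchronizes the device itself, and
% returns a representative time in seconds.
t = gputimeit(f);
fprintf('gputimeit estimate: %.3g s per call\n', t);
```

Because the variables are captured when f is defined, any later change to x, v, xi and so on requires recreating the handle before timing again.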