Confusing results when measuring time of function on CPU/GPU with tic-toc/(gpu)timeit

7 Ansichten (letzte 30 Tage)
I am currently working on implementing an algorithm on GPU, which is already working. When measuring the time of the algorithm - and especially its subroutines - I encountered very strange results. In order to understand these results better I decided to evaluate the times again for a very simple function. Once again I was surprised by the results and am therefore asking here for an explanation of the results. I post the code as well as the resulting output and hope that someone can explain me, why I am measuring these times.
The code is given by:
%%
% Goal of this script is to better understand the disagreement of the
% time measurements with tic toc and gpu time it.
% fix random numbers
clear;
rng('default');
s = rng(5);
% clear workspace
%% Part 1 measure whether there is a time difference between b and b_gpu
disp('##################################################################')
disp('##################################################################')
disp(' apply b(_gpu) once and measure with tic toc and timeit; ')
disp('##################################################################')
disp('##################################################################')
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with tic and toc')
disp('-----------------------------------------------------')
% 1.1: Timeit and GPU timeit
% 1.1.1: CPU
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
tic
test=b(input);
time_111=toc;
disp(['Time for operation b on CPU is ',num2str(time_111)])
% 1.1.2 GPU with original b
input_gpu_array = gpuArray(input);
tic
test=b(input_gpu_array);
time_112 = toc;
disp(['Time for operation b on GPU is ',num2str(time_112)])
% 1.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
tic
test=b_gpu(input_gpu_array);
time_113 = toc;
disp(['Time for operation b_gpu on GPU is ',num2str(time_113)])
% 1.2: Timeit and GPU timeit
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with timeit')
disp('-----------------------------------------------------')
clear;
rng('default');
s = rng(5);
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
test=b(input);
time_121=timeit(@()b(input));
disp(['Time for operation b on CPU is ',num2str(time_121)])
% 1.1.2 GPU with original b
input_gpu_array = gpuArray(input);
test=b(input_gpu_array);
time_122 = gputimeit(@()b(input_gpu_array));
disp(['Time for operation b on GPU is ',num2str(time_122)])
% 1.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
test=b_gpu(input_gpu_array);
time_123 = gputimeit(@()b_gpu(input_gpu_array));
disp(['Time for operation b_gpu on GPU is ',num2str(time_123)])
disp(' ')
disp(' ')
disp('Q1.1: Why is the time measured with gputimeit smaller than wit tic toc')
disp('Q1.2: Why is the time on the other hand longer for the CPU when measured')
disp(' with timeit')
disp('Q1.3: Why is b_gpu faster than b on the gpu when measured with tic toc')
disp('Q1.4: Why are b_gpu and b equally fast when measured with gputimeit')
%% Part 2: measure the time it takes to do the operation iteratively
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp('##################################################################')
disp('##################################################################')
disp(' apply b(_gpu) iteratively and measure with tic toc; ')
disp(' important note: the input is overwritten as input = b(input) ')
disp('##################################################################')
disp('##################################################################')
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with tic and toc')
disp('-----------------------------------------------------')
clear;
rng('default');
s = rng(5);
% 1.1:
% 2.1.1: CPU
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
tic
for counter = 1:10
input=b(input);
end
time_211=toc;
disp(['Time for operation 10 times b on CPU is ',num2str(time_211)])
disp(['Time on average for operation b on CPU is ',num2str(time_211/10)])
disp(' ')
% 2.1.2 GPU with original b
input_gpu_array = gpuArray(input);
for counter = 1:10
input_gpu_array=b(input_gpu_array);
end
time_212=toc;
disp(['Time for operation 10 times b on GPU is ',num2str(time_212)])
disp(['Time on average for operation b on GPU is ',num2str(time_212/10)])
disp(' ')
% 2.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
for counter = 1:10
input_gpu_array=b_gpu(input_gpu_array);
end
time_213=toc;
disp(['Time for operation 10 times b_gpu on GPU is ',num2str(time_213)])
disp(['Time on average for operation b_gpu on GPU is ',num2str(time_213/10)])
disp(' ')
disp(' ')
disp(' ')
disp('Q2.1: Why is now the CPU faster than the GPU with tic toc?')
disp('Q2.2: Why are now the b and b_gpu equally fast when measured with tic toc?')
%% Part 3: measure the time it takes to do the operation iteratively but
% without overwriting stuff
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp(' ')
disp('##################################################################')
disp('##################################################################')
disp(' apply b(_gpu) iteratively and measure with tic toc; ')
disp('important note: the input is not! overwritten as result = b(input)')
disp('##################################################################')
disp('##################################################################')
disp('-----------------------------------------------------')
disp('measuring the times for one application of b(_gpu) with tic and toc')
disp('-----------------------------------------------------')
clear;
rng('default');
s = rng(5);
% 3.1.1: CPU
size=200000000;
input=rand(size,1);
b_st=2;
amplitude=4;
periode=5;
b_st_gpu=gpuArray(b_st);
amplitude_gpu=gpuArray(amplitude);
periode_gpu=gpuArray(periode);
b = @(u) b_st+amplitude*sin(u*2*pi/periode);
b_gpu = @(u) b_st_gpu+amplitude_gpu*sin(u*gpuArray(2)*gpuArray(pi)/periode_gpu);
tic
for counter = 1:10
result=b(input);
end
time_311=toc;
disp(['Time for operation 10 times b on CPU is ',num2str(time_311)])
disp(['Time on average for operation b on CPU is ',num2str(time_311/10)])
disp(' ')
% 3.1.2 GPU with original b
input_gpu_array = gpuArray(input);
for counter = 1:10
result=b(input_gpu_array);
end
time_312=toc;
disp(['Time for operation 10 times b on GPU is ',num2str(time_312)])
disp(['Time on average for operation b on GPU is ',num2str(time_312/10)])
disp(' ')
% 3.1.3 GPU with b_gpu
input_gpu_array = gpuArray(input);
for counter = 1:10
result=b_gpu(input_gpu_array);
end
time_313=toc;
disp(['Time for operation 10 times b_gpu on GPU is ',num2str(time_313)])
disp(['Time on average for operation b_gpu on GPU is ',num2str(time_313/10)])
disp(' ')
disp(' ')
disp(' ')
disp('Q3.1: Why are all three times now shorter than in 2?')
and the corresponding output by:
##################################################################
##################################################################
apply b(_gpu) once and measure with tic toc and timeit;
##################################################################
##################################################################
-----------------------------------------------------
measuring the times for one application of b(_gpu) with tic and toc
-----------------------------------------------------
Time for operation b on CPU is 0.34433
Time for operation b on GPU is 0.14935
Time for operation b_gpu on GPU is 0.093943
-----------------------------------------------------
measuring the times for one application of b(_gpu) with timeit
-----------------------------------------------------
Time for operation b on CPU is 0.42076
Time for operation b on GPU is 0.036677
Time for operation b_gpu on GPU is 0.036649
Q1.1: Why is the time measured with gputimeit smaller than wit tic toc
Q1.2: Why is the time on the other hand longer for the CPU when measured
with timeit
Q1.3: Why is b_gpu faster than b on the gpu when measured with tic toc
Q1.4: Why are b_gpu and b equally fast when measured with gputimeit
##################################################################
##################################################################
apply b(_gpu) iteratively and measure with tic toc;
important note: the input is overwritten as input = b(input)
##################################################################
##################################################################
-----------------------------------------------------
measuring the times for one application of b(_gpu) with tic and toc
-----------------------------------------------------
Time for operation 10 times b on CPU is 5.5627
Time on average for operation b on CPU is 0.55627
Time for operation 10 times b on GPU is 6.3557
Time on average for operation b on GPU is 0.63557
Time for operation 10 times b_gpu on GPU is 7.091
Time on average for operation b_gpu on GPU is 0.7091
Q2.1: Why is now the CPU faster than the GPU with tic toc?
Q2.2: Why are now the b and b_gpu equally fast when measured with tic toc?
##################################################################
##################################################################
apply b(_gpu) iteratively and measure with tic toc;
important note: the input is not! overwritten as result = b(input)
##################################################################
##################################################################
-----------------------------------------------------
measuring the times for one application of b(_gpu) with tic and toc
-----------------------------------------------------
Time for operation 10 times b on CPU is 4.2037
Time on average for operation b on CPU is 0.42037
Time for operation 10 times b on GPU is 4.6343
Time on average for operation b on GPU is 0.46343
Time for operation 10 times b_gpu on GPU is 5.008
Time on average for operation b_gpu on GPU is 0.5008
Q3.1: Why are all three times now shorter than in 2?

Antworten (1)

Edric Ellis
Edric Ellis am 1 Jun. 2020
Ok, there's a lot going on here. A few observations:
  1. Putting scalar values on the GPU is often counter-productive, since the implementation often needs to implicitly gather them to the CPU before proceeding.
  2. Timing GPU operations with tic and toc is not reliable because the GPU can perform some (but not all) operations asynchronously. This is why in your first section the timings look odd - some time that should be accounted for in the first operation actually ends up being timed in the second operation
  3. You're missing some tic calls in sections 2.1.2, 2.1.3, 3.1.2, and 3.1.3, so the timings there are completely messed up. Fixing this as per my suggestion below should give you sensible timings.
To get accurate timings of iterative GPU operations, you should do something like this:
dev = gpuDevice();
timer = tic();
for i = 1:N
myGpuAlgorithm();
end
wait(dev);
t = toc(timer);
Calling wait on the gpuDevice ensures all asynchronous work has completed.
  3 Kommentare
Edric Ellis
Edric Ellis am 1 Jun. 2020
  1. Usually, I'd expect better performance when scalars are kept as CPU data, but it shouldn't make a huge difference either way if the whole operation is non-trivial.
  2. Hm, I'm not quite sure what's going on there. Perhaps you could narrow that down to a simple reproduction showing exactly what you're trying.
Florian
Florian am 1 Jun. 2020
I think the script, which I upload with this message, should show what I am trying to do. In principle I just (try to) compare the performance on CPU and GPU for b and b_gpu. For this I
  1. apply the function once and measure this with tic toc and timeit. There, I already encounter the difference between b_gpu and gpu
  2. apply the function in a for loop, where I (1) overwrite each time the input like input=b(input) and (2) just return the output each time output=b(input). While for (1) the results of b_gpu and b are the same, for (2) they are different.

Melden Sie sich an, um zu kommentieren.

Kategorien

Mehr zu Programming finden Sie in Help Center und File Exchange

Tags

Produkte


Version

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by