# Why does the system need to wait for the GPU device to complete?

5 views (last 30 days)
Bill Wen on 26 Jan 2019
Commented: Joss Knight on 28 Jan 2019
Hi all,
I have a problem using the GPU to accelerate Liu's NMF implementation in MATLAB.
Here is the relevant part of the code:
XU = X'*U; % mnk or pk (p<<mn)
UU = U'*U; % mk^2
VUU = V*UU; % nk^2
V = V.*(XU./max(VUU,1e-10));
XV = X*V; % mnk or pk (p<<mn)
VV = V'*V; % nk^2
UVV = U*VV; % mk^2
U = U.*(XV./max(UVV,1e-10)); % 3mk
......
newobj = CalculateObj(X, U, V);
The original variables X, U and V are ordinary matrices, computed on the CPU. I convert X to a gpuArray with gpuArray, so all subsequent calculations run on the GPU. Everything goes well until the last line, which calculates the NMF objective function. The code of CalculateObj is:
function [obj, dV] = CalculateObj(X, U, V, deltaVU, dVordU)
if ~exist('deltaVU','var')
    deltaVU = 0;
end
if ~exist('dVordU','var')
    dVordU = 1;
end
dV = [];
maxM = 62500000;                 % element budget before switching to block processing
[mFea, nSmp] = size(X);
mn = numel(X);
nBlock = floor(mn*3/maxM);
if mn < maxM
    dX = U*V' - X;
    obj_NMF = sum(sum(dX.^2));
    if deltaVU
        if dVordU
            dV = dX'*U;
        else
            dV = dX*V;
        end
    end
else                             % X is too large: process it in column blocks
    obj_NMF = 0;
    if deltaVU
        if dVordU
            dV = zeros(size(V));
        else
            dV = zeros(size(U));
        end
    end
    for i = 1:ceil(nSmp/nBlock)
        if i == ceil(nSmp/nBlock)
            smpIdx = (i-1)*nBlock+1:nSmp;
        else
            smpIdx = (i-1)*nBlock+1:i*nBlock;
        end
        dX = U*V(smpIdx,:)' - X(:,smpIdx);
        obj_NMF = obj_NMF + sum(sum(dX.^2));
        if deltaVU
            if dVordU
                dV(smpIdx,:) = dX'*U;
            else
                dV = dV + dX*V(smpIdx,:);   % was "dU + ...": dU is undefined here
            end
        end
    end
end
%obj_Lap = alpha*sum(sum((L*V).*V));
obj = obj_NMF;
I find that it takes a long time to execute line 40, which is obj_NMF = obj_NMF + sum(sum(dX.^2)). Because dX is a gpuArray but obj_NMF is an ordinary variable, it seems the system must wait for the GPU execution to complete before the addition, which takes a long time. Moreover, even if I make obj_NMF a gpuArray object, it still waits for the GPU to complete. I want to know:
1. Why does the system need to wait for the GPU to complete?
2. Why doesn't the GPU complete after executing a line?
3. Is there any solution to accelerate the process?
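For context, the setup described above is roughly the following (a sketch only; k, mFea and nSmp are assumed values, not taken from the original code):

```matlab
mFea = 1000; nSmp = 500; k = 20;   % assumed sizes and factorization rank
X = gpuArray(rand(mFea, nSmp));    % move X to the GPU
U = rand(mFea, k);                 % U and V remain ordinary CPU matrices
V = rand(nSmp, k);

XU = X'*U;   % mixing a gpuArray with a double: the result is a gpuArray
```

Because gpuArray-ness propagates through arithmetic, converting X alone is enough to push all the update-rule computations onto the GPU.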

Joss Knight on 26 Jan 2019
The fundamental problem is that GPU execution is asynchronous so the point where you think all the time is being spent isn't actually where the time is being spent. You should read up on GPU and performance, and perhaps look at vectorizing your code better to eliminate as many loops as possible.
Try profiling your code with the MATLAB profiler, which puts MATLAB into a synchronous state so you can see where the real cost is. I can't see an obvious place in your code where the GPU might need to synchronize: the first time you hit line 40, obj_NMF may indeed be a double array, but it is immediately converted into a gpuArray from then on, so this is unlikely to be much of an issue. Perhaps it is simply that your reduction ( sum(sum()) ) (which you should convert to sum(.., 'all'), by the way) is the most costly part of your algorithm. Or it may be that dX has to be evaluated at this point because it depends on an indexing operation, so its size isn't known (and therefore the reduction cannot be queued) until it has been evaluated; that makes that line a synchronization point. However, someone else here might be willing to put more effort into working out precisely what's happening.

Bill Wen on 28 Jan 2019
Thanks so much for your answer. I finally found that the problem is that the gpuArray was double instead of single, which takes longer for matrix calculations than single precision.
Joss Knight on 28 Jan 2019
Well, that isn't true. A gpuArray can be double, single, int or logical. All arrays are double by default. On some GPUs, double precision computation is slow.
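For completeness, the switch to single precision discussed above can be sketched as follows (assuming X, U and V as in the question):

```matlab
% Convert once, up front; single-ness then propagates through all
% downstream arithmetic, just as gpuArray-ness does.
X = gpuArray(single(X));
U = gpuArray(single(U));
V = gpuArray(single(V));

% Verify the element type stored on the GPU:
disp(classUnderlying(X));   % 'single'
```

On consumer GPUs with few double-precision units, this conversion alone can make a large difference; the trade-off is reduced numerical precision in the objective value.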