Asked by Bill Wen
on 26 Jan 2019

Hi all,

Here is part of the code:

    XU = X'*U; % mnk or pk (p<<mn)
    UU = U'*U; % mk^2
    VUU = V*UU; % nk^2
    V = V.*(XU./max(VUU,1e-10));

    XV = X*V; % mnk or pk (p<<mn)
    VV = V'*V; % nk^2
    UVV = U*VV; % mk^2
    U = U.*(XV./max(UVV,1e-10)); % 3mk

    ......

    newobj = CalculateObj(X, U, V);

The original variables X, U and V are ordinary matrices, so the calculation runs on the CPU. I converted X to a gpuArray with gpuArray, which makes all of the calculations run on the GPU. Everything goes well until the last line, which calculates the objective function of NMF. The code of CalculateObj is:

    function [obj, dV] = CalculateObj(X, U, V, deltaVU, dVordU)
    if ~exist('deltaVU','var')
        deltaVU = 0;
    end
    if ~exist('dVordU','var')
        dVordU = 1;
    end
    dV = [];
    maxM = 62500000;
    [mFea, nSmp] = size(X);
    mn = numel(X);
    nBlock = floor(mn*3/maxM);
    if mn < maxM
        dX = U*V'-X;
        obj_NMF = sum(sum(dX.^2));
        if deltaVU
            if dVordU
                dV = dX'*U;
            else
                dV = dX*V;
            end
        end
    else
        obj_NMF = 0;
        if deltaVU
            if dVordU
                dV = zeros(size(V));
            else
                dV = zeros(size(U));
            end
        end
        for i = 1:ceil(nSmp/nBlock)
            if i == ceil(nSmp/nBlock)
                smpIdx = (i-1)*nBlock+1:nSmp;
            else
                smpIdx = (i-1)*nBlock+1:i*nBlock;
            end
            dX = U*V(smpIdx,:)'-X(:,smpIdx);
            obj_NMF = obj_NMF + sum(sum(dX.^2));
            if deltaVU
                if dVordU
                    dV(smpIdx,:) = dX'*U;
                else
                    dV = dV+dX*V(smpIdx,:);
                end
            end
        end
        if deltaVU
            if dVordU
                dV = dV ;
            end
        end
    end
    %obj_Lap = alpha*sum(sum((L*V).*V));
    obj = obj_NMF;

I find that line 40, which is obj_NMF = obj_NMF + sum(sum(dX.^2)) (the accumulation inside the for loop), takes a long time to execute. Because dX is a gpuArray but obj_NMF is an ordinary variable, it seems that the system has to wait for the GPU to finish executing before it can do the addition, and that wait takes a long time. Moreover, even if I make obj_NMF a gpuArray as well, it still has to wait for the GPU to finish. I want to know:

- Why does the system need to wait for the GPU to finish?
- Why hasn't the GPU finished by the time a line has been executed?
- Is there any way to accelerate the process?

Answer by Joss Knight
on 26 Jan 2019

Accepted Answer

The fundamental problem is that GPU execution is asynchronous so the point where you think all the time is being spent isn't actually where the time is being spent. You should read up on GPU and performance, and perhaps look at vectorizing your code better to eliminate as many loops as possible.
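To make the point about asynchronous execution concrete, here is a minimal timing sketch. It assumes Parallel Computing Toolbox and a supported GPU; the matrix size and the variable names are placeholders chosen for illustration, not taken from the code in the question.

```matlab
% Sketch: timing GPU work with and without synchronization.
% Assumes Parallel Computing Toolbox and a supported GPU; sizes are arbitrary.
gpu = gpuDevice;                  % handle to the currently selected GPU
A = rand(4000, 'gpuArray');
B = rand(4000, 'gpuArray');

% Naive timing: toc can return before the multiplication has finished,
% because the operation may only have been queued on the device.
tic;
C = A*B;
tNaive = toc;

% Synchronized timing: wait(gpu) blocks until all queued work is complete.
tic;
C = A*B;
wait(gpu);
tSync = toc;

% gputimeit synchronizes internally and repeats the call for a stable figure.
tBench = gputimeit(@() A*B);

fprintf('naive %.4f s, synchronized %.4f s, gputimeit %.4f s\n', ...
    tNaive, tSync, tBench);
```

If the naive figure is much smaller than the other two, the cost of the multiplication is being paid at whichever later line first needs the result, which is the effect described above.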

Try profiling your code with the MATLAB profiler, which puts MATLAB into a synchronous state so you can see where the real cost is. I can't see an obvious place in your code where the GPU might need to synchronize: the first time you hit line 40 obj_NMF may indeed be a double array, but it is immediately converted into a gpuArray from then on, so this is unlikely to be much of an issue. Perhaps it is simply that your reduction ( sum(sum()) ), which you should convert to sum(.., 'all') by the way, is the most costly part of your algorithm. Or it may be that dX has to be evaluated at this point because it depends on an indexing operation, and its size isn't known (so the reduction operation cannot be queued) until it has been evaluated; that would make that line a synchronization point. However, someone else here might be willing to put more effort into working out precisely what's happening.
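As a rough illustration of the two suggestions above, the sketch below compares the nested reduction with the 'all' form (available from R2018b) and wraps the objective calculation in the profiler. The sizes are placeholders, and the data is created on the CPU so the snippet runs anywhere; the commented-out line shows where the arrays would be moved to the GPU.

```matlab
% Sketch with placeholder sizes; runs on the CPU as written.
m = 200; n = 300; k = 10;
X = rand(m, n);
U = rand(m, k);
V = rand(n, k);
% X = gpuArray(X); U = gpuArray(U); V = gpuArray(V);  % uncomment to use the GPU

dX = U*V' - X;
objNested = sum(sum(dX.^2));      % form used in the question
objAll    = sum(dX.^2, 'all');    % single reduction, R2018b or later

% Profile the objective calculation to see where the time really goes.
profile on
obj = sum((U*V' - X).^2, 'all');
profile off
% profile viewer                  % opens the report interactively

fprintf('nested %.6f, all %.6f\n', objNested, objAll);
```

The two reductions should agree up to floating-point rounding, so switching to 'all' changes only how the sum is computed, not the objective value.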

Bill Wen
on 28 Jan 2019

Joss Knight
on 28 Jan 2019
