Why does the system need to wait for the gpuDevice to complete?

Question

Bill Wen on 26 Jan 2019

Direct link to this question

https://de.mathworks.com/matlabcentral/answers/441631-why-system-need-to-wait-gpudevice-complete

Commented: Joss Knight on 28 Jan 2019

Accepted Answer: Joss Knight

Hi all,

I have a big problem using the GPU to speed up the NMF code from Liu in MATLAB.

Here is part of the code:

XU = X'*U;  % mnk or pk (p<<mn)
UU = U'*U;  % mk^2
VUU = V*UU; % nk^2
V = V.*(XU./max(VUU,1e-10));
XV = X*V;   % mnk or pk (p<<mn)
VV = V'*V;  % nk^2
UVV = U*VV; % mk^2   
U = U.*(XV./max(UVV,1e-10)); % 3mk
......
newobj = CalculateObj(X, U, V);

The original variables X, U and V are ordinary matrices, so the calculation runs on the CPU. I convert X to a gpuArray with gpuArray, which makes all the calculations run on the GPU. Everything goes well until the last line, which calculates the objective function of NMF. The code of CalculateObj is:

function [obj, dV] = CalculateObj(X, U, V, deltaVU, dVordU)
    if ~exist('deltaVU','var')
        deltaVU = 0;
    end
    if ~exist('dVordU','var')
        dVordU = 1;
    end
    dV = [];
    maxM = 62500000;
    [mFea, nSmp] = size(X);
    mn = numel(X);
    nBlock = floor(mn*3/maxM);
    if mn < maxM
        dX = U*V'-X;
        obj_NMF = sum(sum(dX.^2));
        if deltaVU
            if dVordU
                dV = dX'*U;
            else
                dV = dX*V;
            end
        end
    else
        obj_NMF = 0;
        if deltaVU
            if dVordU
                dV = zeros(size(V));
            else
                dV = zeros(size(U));
            end
        end
        for i = 1:ceil(nSmp/nBlock)
            if i == ceil(nSmp/nBlock)
                smpIdx = (i-1)*nBlock+1:nSmp;
            else
                smpIdx = (i-1)*nBlock+1:i*nBlock;
            end
            dX = U*V(smpIdx,:)'-X(:,smpIdx);
            obj_NMF = obj_NMF + sum(sum(dX.^2));
            if deltaVU
                if dVordU
                    dV(smpIdx,:) = dX'*U;
                else
                    dV = dV + dX*V(smpIdx,:); % accumulate into dV (dU was undefined)
                end
            end
        end
        if deltaVU
            if dVordU
                dV = dV ;
            end
        end
    end
   %obj_Lap = alpha*sum(sum((L*V).*V));
   
    obj = obj_NMF;
        

I find that line 40, which is obj_NMF = obj_NMF + sum(sum(dX.^2)), takes a long time to execute. Because dX is a gpuArray but obj_NMF is an ordinary variable, it seems that the system has to wait for the GPU to finish before doing the addition, which takes a long time. Moreover, even if I make obj_NMF a gpuArray, it still has to wait for the GPU to finish. I want to know:

Why does the system need to wait for the GPU to finish?
Why doesn't the GPU finish after executing each line?
Is there any way to accelerate the process?

Answer 1

Joss Knight on 26 Jan 2019

Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/441631-why-system-need-to-wait-gpudevice-complete#answer_358160

The fundamental problem is that GPU execution is asynchronous so the point where you think all the time is being spent isn't actually where the time is being spent. You should read up on GPU and performance, and perhaps look at vectorizing your code better to eliminate as many loops as possible.
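To illustrate the point about asynchronous execution, here is a minimal sketch (not taken from the original code) that compares a naive tic/toc measurement with a synchronized one. It assumes Parallel Computing Toolbox and a supported GPU; the matrix size and variable names are arbitrary.

```matlab
% Sketch: why tic/toc can blame the wrong line when working with gpuArray.
% Assumption: a supported GPU is available; 4000 is an arbitrary size.
A = rand(4000, 'gpuArray');

tic;
B = A * A;           % may return before the GPU has finished the multiply
tQueued = toc;

tic;
wait(gpuDevice);     % block until all queued GPU work has completed
tWait = toc;

% gputimeit synchronizes internally and repeats the measurement
tReal = gputimeit(@() A * A);

fprintf('queued: %.4f s, wait: %.4f s, gputimeit: %.4f s\n', ...
        tQueued, tWait, tReal);
```

If the time reported for the wait is much larger than the time reported for the multiply itself, the cost was only paid at the point where MATLAB had to synchronize, which is the effect described above.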

Try profiling your code with the MATLAB profiler, which puts MATLAB into a synchronous state so you can see where the real cost is. I can't exactly see an obvious place in your code where the GPU might need to synchronize: the first time you hit line 40 obj_NMF may indeed be a double array, but it is immediately converted into a gpuArray from then on so this is unlikely to be much of an issue. Perhaps it is simply that your reduction ( sum(sum()) ) (which you should convert to sum(.., 'all') by the way) is the most costly part of your algorithm, or it may be that dX has to be evaluated at this point because it depends on an indexing operation and its size isn't known (therefore the reduction operation cannot be queued) until it's been evaluated; that makes that line a synchronization point. However, someone else here might be willing to put more effort into working out precisely what's happening.
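As a rough sketch of both suggestions, the snippet below replaces the nested reduction with sum(..., 'all') (available from R2018b) and wraps it in the profiler. The array dX is random stand-in data with arbitrary dimensions, not the dX from the question.

```matlab
% Sketch: single-pass reduction plus profiling.
% Assumption: dX is stand-in data; sizes are arbitrary.
dX = rand(2000, 500, 'gpuArray');

objNested = sum(sum(dX.^2));      % form used in CalculateObj
objAll    = sum(dX.^2, 'all');    % one reduction over all elements

% gather copies the scalar to the CPU, which forces a synchronization
relErr = gather(abs(objNested - objAll) / objAll);
fprintf('relative difference: %g\n', relErr);

% Profile the reduction; in real use, wrap the whole NMF iteration instead
profile on
obj = sum(dX.^2, 'all');
profile off
info = profile('info');           % or call "profile viewer" interactively
```

The two forms should agree up to floating-point rounding, since only the order of the additions differs.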

Bill Wen on 28 Jan 2019

Thanks so much for your answer. I finally found that the problem is that the gpuArray is double instead of single, which makes the matrix calculations take longer than with ordinary matrices.

Joss Knight on 28 Jan 2019

Well, that isn't true. A gpuArray can be double, single, int or logical. All arrays are double by default. On some GPUs, double precision computation is slow.
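A small sketch of how one might check this on a particular GPU: create the same data in double and single precision and time a matrix multiply with gputimeit. The size is arbitrary, and the ratio between the two timings depends entirely on the hardware, so no particular result is implied.

```matlab
% Sketch: compare double and single precision matrix multiply on the GPU.
% Assumption: a supported GPU is available; 3000 is an arbitrary size.
Xd = rand(3000, 'gpuArray');      % double by default
Xs = single(Xd);                  % converted on the GPU, stays a gpuArray

% classUnderlying reports the element type stored on the device
% (newer releases also offer underlyingType)
fprintf('classes: %s and %s\n', classUnderlying(Xd), classUnderlying(Xs));

tDouble = gputimeit(@() Xd * Xd);
tSingle = gputimeit(@() Xs * Xs);
fprintf('double: %.4f s, single: %.4f s\n', tDouble, tSingle);
```

If single precision is accurate enough for the NMF updates, converting X, U and V with single before calling gpuArray is one way to apply this.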
