PTX kernel time to run

Question

Gaszton on 16 May 2011


Direct link to this question

https://de.mathworks.com/matlabcentral/answers/7511-ptx-kernel-time-to-run

Hello, I am using R2010b and CUDA toolkit 3.1 with a GeForce GT 425M. While optimizing my CUDA code, I observed that calling the kernel with feval in MATLAB has a constant overhead of about 2 ms, measured with

tic
feval(k,...)
toc

the kernel code:

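    #define C_WIDTH 1024
    #define C_HEIGHT 768
    // One thread per element of the C_WIDTH x C_HEIGHT buffer.
    __global__ void timetest1(float* holo) {
        int mindex = blockIdx.x*blockDim.x + threadIdx.x;
        int size = C_WIDTH*C_HEIGHT;
        if (mindex >= size)
            return;  // threads past the end of the buffer do nothing
        // Square in float: mindex*mindex would overflow int for large indices.
        float m = (float)mindex;
        holo[mindex] = m*m;
    }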

Even if I comment out the write to global memory (the holo[mindex] assignment), the call still takes ~2 ms.

Does anybody know the origin of this lag? It would be great to somehow eliminate it.

Thanks,

Gaszton

PS: my MATLAB code for calling the kernel:

clear
import parallel.gpu.GPUArray
xsize=1024; ysize=768;
vectorsize=xsize*ysize;
threadpblock=1024;
k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize=[threadpblock,1,1];
k.GridSize=[ceil(vectorsize/threadpblock),1];
dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
tic
[dholo]=feval(k,dholo);
time=toc;
['ms time= ' num2str(time*1000)]
clear


Answer 1

Edric Ellis on 16 May 2011


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/7511-ptx-kernel-time-to-run#answer_10341


Firstly, can I suggest that, if possible, you upgrade to R2011a, as we made quite a few performance improvements in that release. Secondly, I think the main bottleneck in your code as written is that, outside a function, an important optimisation called "in-place optimisation" cannot take place. If you place your code inside a function, then "dholo" will not be copied on each call. For reference, I made a function like this:

function tmp
import parallel.gpu.GPUArray
xsize=1024; ysize=768;
vectorsize=xsize*ysize;
threadpblock=512; % I have a C1060
k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize=[threadpblock,1,1];
k.GridSize=[ceil(vectorsize/threadpblock),1];
dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
% Time 1000 calls; inside a function, dholo can be updated in place.
tic
for ii = 1:1000
    dholo=feval(k,dholo);
end
time=toc;
% Total seconds for 1000 calls equals the average milliseconds per call.
disp(['ms time= ' num2str(time)])

With this, the overhead per call on my C1060 was down to 0.05 ms.
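If it helps to see where the time goes, here is a minimal sketch (not part of the original answer) that times the first call on its own and then averages the later calls. The function name timeKernelSketch and the variable names are hypothetical. It assumes the same TimeTest.ptx and TimeTest.cu files as above, and it assumes that feval may return before the GPU has finished (CUDA kernel launches are asynchronous), so it calls gather before each toc to make sure the work is done.

```matlab
function [tFirst, tPerCall] = timeKernelSketch
% Hypothetical helper: separates the first-call time from the
% steady-state time per call of feval on a CUDAKernel.
xsize = 1024; ysize = 768;
vectorsize = xsize*ysize;
threadpblock = 512;   % assumption: adjust to what your GPU supports
k = parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize = [threadpblock,1,1];
k.GridSize = [ceil(vectorsize/threadpblock),1];
dholo = parallel.gpu.GPUArray.zeros(vectorsize,1,'single');

% First call, timed on its own because it may include one-time setup.
% Note that this figure also includes copying the array back to the host.
tic
dholo = feval(k,dholo);
gather(dholo);        % copying the result back waits for the kernel
tFirst = toc*1000;    % milliseconds

% Steady state: many calls, one synchronising copy at the end.
nCalls = 1000;
tic
for ii = 1:nCalls
    dholo = feval(k,dholo);
end
gather(dholo);
tPerCall = toc/nCalls*1000;   % average milliseconds per call

disp(['first call ms = ' num2str(tFirst)])
disp(['average ms per call = ' num2str(tPerCall)])
```

The single copy at the end of the loop is spread over 1000 calls, so it should add little to the average. If the first call is much slower than the average, the constant lag is more likely one-time setup or the array copy than the kernel itself.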


Gaszton on 16 May 2011

Thank you for your help!

I am a PhD student in Hungary, at the Biological Research Centre of the Hungarian Academy of Sciences.

We have a network licence (with a limited number of MATLAB instances that can run in parallel).

We usually buy a MATLAB update every 1-2 years, but I don't really have a say in that.

Thank you again,

Gaszton
