MATLAB Answers

Any GPU implementation of k-nearest neighbor search?

Asked by Hao Zhang on 13 Dec 2018
Latest activity: Commented on by Matt J on 14 Dec 2018
Hi, I am developing an SPH (smoothed particle hydrodynamics) code for solving fluid equations in MATLAB. I have successfully vectorized the code and implemented it on the GPU. The speed-up is astonishing, roughly 10 times faster than the CPU code. Along the way I received invaluable suggestions from Matt J and Joss Knight for speeding up the code; thanks to both of them for their wonderful suggestions.
In SPH, one has to find the neighboring particles within a given radius for every particle in the domain. I use the MATLAB function knnsearch for this purpose. The rest of the code (solving the actual fluid equations) is now so fast that the limiting part is knnsearch, which uses the kd-tree algorithm running on the CPU. It takes 85% of the runtime (see the profiler results below).
[Profiler screenshot: Gputime.png]
The function 'knnCPU_kdtree_func' calls the MATLAB built-in function knnsearch with the kd-tree algorithm running on the CPU. The other functions, which solve the actual fluid equations on the GPU, consume only 10% of the total time.
Is there any GPU implementation of k-nearest-neighbor search that I can download for free and use as a function call in my MATLAB code? Many thanks.
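For context, a minimal sketch of the kind of CPU-side query being described (N, K, and the support radius h below are illustrative values, not taken from my actual simulation):
N  = 1e5;                                      % example particle count
rx = rand(N,1);  ry = rand(N,1);               % example particle positions
h  = 0.01;                                     % example SPH support radius
Mdl = KDTreeSearcher([rx ry]);                 % kd-tree built on the CPU
[idx,d] = knnsearch(Mdl,[rx ry],'K',50);       % K nearest neighbors of every particle
inRadius = d <= h;                             % keep only neighbors inside the support radius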

  0 Comments


3 Answers

Answer by Matt J on 14 Dec 2018
Edited by Matt J on 14 Dec 2018
Accepted Answer

A quick Google search turned up this. They have MATLAB bindings in their legacy version.

  2 Comments

Many thanks! This seems to be a very relevant code!
I downloaded it and used nvcc to compile their knncuda.cu file into a PTX file. This step succeeded.
In the next step I create a CUDAKernel object (parallel.gpu.CUDAKernel) in MATLAB and try to run it on the GPU; for the entry point I choose '_Z17compute_distancesPfiiS_iiiS_'.
clear;close all;clc;
gd=gpuDevice();
reset(gd);
N=int32(10);
rx=rand(N,1);
ry=rand(N,1);
ref=[gpuArray(single(rx)) gpuArray(single(ry))];
ref_width=N;
ref_pitch=int32(2);
query=ref;
query_width=N;
query_pitch=int32(2);
height=int32(2);
dist=zeros(N,1,'single','gpuArray');
%%% create CUDAKernel object %%%
knncuda_kernel=parallel.gpu.CUDAKernel('knncuda.ptx','knncuda.cu','_Z17compute_distancesPfiiS_iiiS_');
%%% create CUDAKernel object %%%
%%% run CUDAKernel %%%
result=feval(knncuda_kernel,ref,ref_width,ref_pitch,query,query_width,query_pitch,height,dist)
%%% run kernel %%%
No error is reported; however, the function return (result) is always identical to one of the input arguments (ref). Why is that? I have very limited knowledge of CUDA C... Could you please help? Thanks!
It's all 3rd party black boxes to me, I'm afraid.
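One likely explanation (an educated guess, not confirmed in this thread): feval on a parallel.gpu.CUDAKernel returns one output per non-const pointer argument, in argument order, so capturing a single output simply returns the ref buffer rather than dist; in addition, the default launch configuration is a single thread, so the kernel writes almost nothing. A sketch of requesting the dist output, with illustrative launch sizes (the block/grid layout the kernel actually expects should be taken from knncuda.cu):
% compute_distances has three non-const float* arguments (ref, query, dist),
% so the kernel object has three outputs, returned in argument order.
knncuda_kernel.ThreadBlockSize = [16 16 1];              % illustrative values
knncuda_kernel.GridSize = ceil(double([N N])/16);        % illustrative values
[~,~,dist] = feval(knncuda_kernel,ref,ref_width,ref_pitch, ...
    query,query_width,query_pitch,height,dist);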



Answer by Matt J on 13 Dec 2018
Edited by Matt J on 13 Dec 2018

In SPH, one has to find the neighboring particles within a given radius for every particle in the domain. I use the MATLAB function knnsearch for this purpose.
Perhaps instead, you should use rangesearch with a pre-constructed KDTreeSearcher object. You could then use parfor to process different chunks of query data in parallel.
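For illustration, a minimal sketch of this approach, assuming rx and ry are N-by-1 coordinate vectors (the radius r and chunk count are example values):
Mdl = KDTreeSearcher([rx ry]);                       % build the kd-tree once
r = 0.05;                                            % example search radius
nChunks = 4;
edges = round(linspace(0,numel(rx),nChunks+1));
idxCell = cell(nChunks,1);
parfor c = 1:nChunks
    rows = edges(c)+1:edges(c+1);                    % this worker's chunk of query points
    idxCell{c} = rangesearch(Mdl,[rx(rows) ry(rows)],r);
end
idxNeighbor = vertcat(idxCell{:});                   % neighbor indices within r, per query point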

  1 Comment

Hi Matt, I found that using knnsearch followed by checking which neighbors are within the radius is faster than calling rangesearch directly. I just have to use a large enough K so that all neighbors within the radius are included among the K nearest neighbors.
I tried using parfor to process chunks of query data in parallel, but the execution time is even longer than calling knnsearch once on all query points. Here is a test code:
clear;clc;close all
n_parallel=4;
if isempty(gcp('nocreate'))
    pooljob=parpool('local',n_parallel);
end
N=1e5;
knn_K=50;
rx=rand(N,1);
ry=rand(N,1);
BucketSize_kdtree=50;
Mdl=KDTreeSearcher([rx ry],'Distance','euclidean','BucketSize',BucketSize_kdtree); %%% kdtree search object
idx_Neighbor_cell=cell(n_parallel,1);
d_Neighbor_cell=cell(n_parallel,1);
logic_beyor_cell=cell(n_parallel,1);
N_par=round(N/n_parallel);
tic;
[idx_Neighbor1,d_Neighbor1]=knnsearch(Mdl,[rx ry],'K',knn_K);
toc
tic;
rx_query_cell=mat2cell(rx,[N_par*ones(1,n_parallel-1) N-N_par*(n_parallel-1)],1);
ry_query_cell=mat2cell(ry,[N_par*ones(1,n_parallel-1) N-N_par*(n_parallel-1)],1);
parfor i=1:n_parallel
    [idx_Neighbor_cell{i},d_Neighbor_cell{i}]=knnsearch(Mdl,[rx_query_cell{i} ry_query_cell{i}],'K',knn_K);
end
idx_Neighbor=cell2mat(idx_Neighbor_cell);
d_Neighbor=cell2mat(d_Neighbor_cell);
toc;
The runtime is:
Elapsed time is 0.389322 seconds.
Elapsed time is 1.199775 seconds.
The parfor version takes much longer despite using 4 cores.
Does anyone know why this is and how to optimize it further? Thanks!



Answer by Joss Knight on 13 Dec 2018

knnsearch is supported on the GPU, so just use it!
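For illustration, a minimal sketch of that call on gpuArray inputs (the sizes and K are example values; with gpuArray data knnsearch performs an exhaustive search rather than a kd-tree search):
X = gpuArray(single(rand(2.5e4,2)));       % particle positions on the GPU
[idx,d] = knnsearch(X,X,'K',50);           % runs on the GPU; idx and d are gpuArrays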

  7 Comments

Thanks for the explanations. I tried pdist2 with gputimeit and got nearly the same results. The point about allocation of the large matrix is a good one; it is indeed not clear how much of the time is spent on the distance calculations themselves. But if I run this:
clear;clc;close all
N=2.5e4;
knn_K=50;
rx=rand(N,1);
ry=rand(N,1);
BucketSize_kdtree=50;
tic;
Mdl=KDTreeSearcher([rx ry],'Distance','euclidean','BucketSize',BucketSize_kdtree); %%% kdtree search object
[idx_Neighbor1,d_Neighbor1]=knnsearch(Mdl,[rx ry],'K',knn_K);
toc
gd=gpuDevice();
reset(gd);
rx=gpuArray(single(rx));
ry=gpuArray(single(ry));
for i=1:10
    tic;
    D=pdist2([rx ry],[rx ry],'euclidean');
    wait(gd);
    toc;
end
Elapsed time is 0.097022 seconds.
Elapsed time is 0.175050 seconds.
Elapsed time is 0.199612 seconds.
Elapsed time is 0.187805 seconds.
Elapsed time is 0.191508 seconds.
Elapsed time is 0.190413 seconds.
Elapsed time is 0.191164 seconds.
Elapsed time is 0.190247 seconds.
Elapsed time is 0.191383 seconds.
Elapsed time is 0.180551 seconds.
Elapsed time is 0.181040 seconds.
Does that mean not much time is spent on allocation, since D is already there? Or is the memory reallocated each time pdist2 is called? That seems a bit wasteful; is there any way to let MATLAB allocate the memory just once? Since I have to run the neighbor finding at every time step of my simulation, it is fine if only the first call takes longer. Thanks
The workspace of pdist2 has no awareness of the workspace that calls it. It cannot know that you have a pre-allocated variable D there ready to house the results. To do that, you would need a different implementation of pdist2 that lets you pass in D as an input argument. This can be done with a MEX file, and probably on the GPU as well with a CUDAKernel object, but you would have to be willing to get into C coding for that.
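For illustration only, a rough sketch of that CUDAKernel route (the pairdist.cu file and its entry point are hypothetical placeholders, not an existing library, and MATLAB may still copy the output buffer internally):
% Hypothetical kernel source (pairdist.cu):
%   __global__ void pairdist(const float *x, const float *y, int n, float *D)
N = 2.5e4;
rx = gpuArray(single(rand(N,1)));
ry = gpuArray(single(rand(N,1)));
k = parallel.gpu.CUDAKernel('pairdist.ptx','pairdist.cu','pairdist');
k.ThreadBlockSize = [256 1 1];
k.GridSize = [ceil(N/256) N 1];
D = zeros(N,N,'single','gpuArray');        % distance matrix created once, before the time loop
for step = 1:100                           % example number of time steps
    % ... advance particle positions rx, ry here ...
    D = feval(k,rx,ry,int32(N),D);         % kernel writes the pairwise distances into D
end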
OK, I am willing to write CUDA C code, but I am not a computer scientist, and I imagine CUDA C will be a steep learning curve for me...
Is there any standard CUDA library that I can download for free to do this?
