how can I output different array size than input by GPU arrayfun?

Hao Zhang (view profile)

on 8 Dec 2018
Latest activity Commented on by Hao Zhang

on 10 Dec 2018

Joss Knight (view profile)

I am doing particle simulations using matlab. So I have very large data sets which is the position of each particles, for example, [rx ry], where rx and ry are large column vecotrs containing the x and y coordinates of all the particles. So suppose there are 1e6 particles then the dimension of rx and ry both are 1e6 by 1. In the simulation I have to find the neighbor particles within a certain range R for each particles. Using CPU I wrote something like this using bruto force search
function [idx_Neighbor,d_Neighbor]=find_neighbors_Naive_func(rx,ry,N_par,r_neighbor,h_smo)
idx_Neighbor=cell(N_par,1);
d_Neighbor=cell(N_par,1);
for j=1:N_par
dj=(rx-rx(j)).^2+(ry-ry(j)).^2;
[idx_Neighbor_j]=find(dj<=r_neighbor^2);
d_Neighbor_j=((dj(idx_Neighbor_j)).^0.5)./h_smo;
idx_Neighbor{j}=idx_Neighbor_j;
d_Neighbor{j}=d_Neighbor_j;
end
This function output a N_par by 1 cell array idx_Neighbor whose individual elements are the indices of the neigbor particels for every particles in the domain. This works fine using CPU, but it is very slow so I want to write a GPU version of the code by using arrayfun. But as far as I know, the GPU arrayfun requires the output array the same size as the input array, and it does not support cell array. Is there any way to get around this or there is no simple way other than write a cuda c code? Thanks!
p.s. I realize the kd tree rangesearch function, but this function does not support GPU. And I want to avoid transfering data between GPU and CPU since I implement other part of the simulation code also on GPU.
some following up, through some experimental on Matlab GPU coder, I managed to translate the following matlab code to cuda knernal (mex):
function [Numpar_neighbor]=find_neighbors_Naive_func_test(rx,ry,r_neighbor)
Numpar_neighbor=zeros(size(rx),'single'); %%% number of neighbor particles for each particle j
coder.gpu.kernelfun;
for j=1:size(rx,1)
dj=(rx-rx(j)).^2+(ry-ry(j)).^2;
[idx_Neighbor_j]=find(dj<=r_neighbor^2);
Numpar_neighbor(j)=length(idx_Neighbor_j);
end
end
The cuda mex file runs fine, but it returns an ordinary array instead of gpuArray although the code runs on the GPU. So if I use the returned array Numpar_neighbor in the following part of the code I need to transfer it to the GPU, adding unnecessary overhead. Why is this and any way to force the mex file return gpuArray?
Thanks again!

Joss Knight (view profile)

on 8 Dec 2018

You can't output a variable sized output from GPU arrayfun, which would require atomic operations. You're going to have to compute all those distances or you're going to need to sort your data into bins so that you can reduce the search space.
Computing every distance is easy, just use pdist2, that's what it's for.
distances = pdist2([rx ry], [rx ry]);
Now threshold to get the entries within the range, and index (or call find) to get the list you're after
neighbours = distances < r_neighbour;
neighboursList = cellfun(@(i)find(neighbours(:,i)), 1:N_par, 'UniformOutput', false);
Now, if you can't fit the whole N_par-by-N_par array distances into memory, you can compute in batches by looping over one of the inputs to pist2 in chunks of a more reasonable size.
If this is still too slow then you're going to need to bin your data so that you can reduce the search space for each particle.
Another option might involve using knnsearch. You could find the K nearest neighbours of each point, and then threshold the distance output from that to find the points within range. K would need to be large enough to ensure you got all of the neighbours in range.
Others may have more efficient solutions.

Hao Zhang

Hao Zhang (view profile)

on 10 Dec 2018
Thank you for your answer! It is very helpful! I am going to test the idea and get back to here if I got positive speed up!