version 1.1.0.0 (105 KB) by
John D'Errico

An efficient and accurate Inter-Point Distance Matrix

There are several utilities around to compute inter-point distances, but none of them fully did what I thought was important. What were my goals?

1. Inter-point distances are sometimes computed within one set of points, or between two sets. So a tool must handle either case.

2. Efficiency is important, but a common method for inter-point (Euclidean) distances uses an algebraic trick (expanding the squared norm) that results in a loss of accuracy. The good thing is bsxfun allows us to compute distances both efficiently and accurately.
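To illustrate the accuracy point (a sketch of the idea, not ipdm's actual internals): the common trick expands |a-b|^2 = |a|^2 + |b|^2 - 2*a.b, which can suffer cancellation for nearly coincident points, whereas bsxfun lets us form the coordinate differences explicitly:

```matlab
% Pairwise Euclidean distances between the rows of A (n1-by-p) and B (n2-by-p),
% accumulating explicit squared coordinate differences (accurate).
A = [0 0; 1 1];
B = [1 0; 0 2];
d2 = zeros(size(A,1), size(B,1));
for k = 1:size(A,2)
    % bsxfun expands the singleton dimensions to form all n1*n2 differences
    d2 = d2 + bsxfun(@minus, A(:,k), B(:,k).').^2;
end
D = sqrt(d2);
% Compare with the fast but less accurate expansion:
% D2 = sqrt(bsxfun(@plus, sum(A.^2,2), sum(B.^2,2).') - 2*(A*B.'));
```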

3. Many times we wish to compute an inter-point distance matrix, but we only need some subset of the entire matrix. Then it might be nice to have a list of only the single nearest neighbor for each point in our set, or only the distances beyond some large or small limit.

4. Where appropriate, a sparse distance matrix might be useful.

5. Really large problems can sometimes be solved by breaking the problem into smaller chunks. IPDM does this where appropriate.

6. There are many special cases that can be solved efficiently. For example, finding the nearest neighbor in one-dimensional data is a simple thing, costing no more than a sort. One does not need to compute all distances if only the closest point is of interest. (There are several other special cases that can be sped up, perhaps using k-d trees or other algorithms. If there is interest, I'll try to provide them.)
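As a sketch of that 1-D special case (illustrative only; ipdm's internal code may differ): once the data are sorted, each point's nearest neighbor must be one of its two sorted neighbors, so a single sort plus a diff suffices:

```matlab
% Nearest neighbor for 1-D data via one sort (illustrative sketch).
x = [3 1 4 1.5 9];
[xs, tags] = sort(x(:));       % tags maps sorted order back to original indices
n = numel(xs);
dl = [inf; diff(xs)];          % gap to the previous sorted point
dr = [diff(xs); inf];          % gap to the next sorted point
left = dl <= dr;               % is the left sorted neighbor the closer one?
idx = (1:n)';
nnsorted = idx + 1;            % default: right sorted neighbor
nnsorted(left) = idx(left) - 1;
nnindex = zeros(n,1);          % nearest-neighbor index, in original order
nnindex(tags) = tags(nnsorted);
nndist = zeros(n,1);           % nearest-neighbor distance, in original order
nndist(tags) = min(dl, dr);
```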

All of these things and more are satisfied by IPDM. It uses a property/value pair interface to specify the options.

Find the nearest neighbors in 1-dimensional data:

A = randn(10000,1);
B = randn(15000,1);
tic
d = ipdm(A,B,'subset','nearest');
toc

Elapsed time is 0.151346 seconds.

A note to those who might be worried about absolute speed on small sets of data: I've now sped up the code considerably for simple calls, reducing the basic overhead by a factor of roughly 4.

See the demo file for many examples of use.

John D'Errico (2021). IPDM: Inter-Point Distance Matrix (https://www.mathworks.com/matlabcentral/fileexchange/18937-ipdm-inter-point-distance-matrix), MATLAB Central File Exchange. Retrieved .

Created with
R2007a

Compatible with any release

**Inspired by:**
distance.m, nearestpoint(x, y, m), nearestneighbour.m, Distance Matrix, Computing Pairwise Distances and Metrics

**Inspired:**
peterson-tim-j/HydroSight, Efficient K-Nearest Neighbor Search using JIT, Experimental (Semi-) Variogram, findclump


Josh Philipson: John - Thank you!

Maksim: For (squared) Euclidean distances I was able to get a 100x speedup over this implementation using linear algebra operations. My code is available at github.com/maksimt/matlab_funcs

Jiali: It is really efficient. I also like

ipdm([x y],'subset','nearestneighbor','result','struct');

because I can manipulate the structure for later use.

Chad Greene: For an array of points given by x and y, this function quite elegantly and efficiently returns the indices of each point's nearest neighbor like this:

ipdm([x y],'subset','nearestneighbor','result','struct');

Takes ~5 seconds to solve for 15,000 points on my laptop. That's quite fast.

Ben: Great code. Thanks for your kind share!

Ryan: Really useful, a very nice function that I use all the time.

John D'Errico: If you wish to see a histogram of the distances, then just call hist on the output.

But your question seems not to be about that. If you actually want to see the distance between two histograms, then you need to define what the word "distance" means in that case.

Farie: Excellent... but I wonder, can we compute this distance on histograms?

Matti Jukola: Thank you for this tool. I made a few small performance tweaks to the input parsing; these should make ipdm more efficient when running it multiple times with small datasets:

Near line 292: changed ~ismember... check to:

if ~any(strcmpi({'all','nearestneighbor','farthestneighbor'},params.Subset)) && ...

Near line 326: again changed ~ismember... to:

elseif (length(params.Metric)~=1) || ~any(params.Metric==[0 1 2 inf])

In function 'parse_pv_pairs': changed the last portion of the code to:

% there was at least one pv pair. process any supplied
propnames = fieldnames(params);
for i = 1:n
    p_i = pv_pairs{2*i-1};
    v_i = pv_pairs{2*i};
    ind = find(strcmpi(propnames, p_i));
    if isempty(ind)
        ind = find(strncmpi(p_i, propnames, length(p_i)));
        if isempty(ind)
            error(['No matching property found for: ', p_i])
        elseif length(ind) > 1
            error(['Ambiguous property name: ', p_i])
        end
    end
    % override the corresponding default in params
    % (dynamic field names; use setfield instead for compatibility with older releases)
    params.(propnames{ind}) = v_i;
end

Wolfgang Schwanghart: Thanks for this great contribution! Would it, however, be possible that ipdm (in case only data1 is provided) returns only the row and column indices and distances, arranged in the order distance(rowix,colix): (2,1), (3,1), ..., (n,1), (3,2), ..., (n,2), ..., (n,n-1)? That is the way pdist (part of the Statistics Toolbox) returns the distance matrix as a vector, to be more memory efficient. pdist, however, does not support distance subsets.

So far, I use this workaround in my code.

% remove distances where d.columnindex == d.rowindex
I = d.columnindex == d.rowindex;
d.columnindex(I) = [];
d.rowindex(I) = [];
d.distance(I) = [];

% remove double entries of pairs
[dtemp, m] = unique(sort([d.rowindex d.columnindex], 2), 'rows');
d.rowindex = dtemp(:,1);
d.columnindex = dtemp(:,2);
d.distance = d.distance(m);

This reduces the size of d by more than half. But it doesn't avoid the generation of the huge struct returned by ipdm, which is a real memory bottleneck for me. Would it be possible to produce this more memory-efficient output within ipdm? So far I don't see a straightforward way, but perhaps you do?

Best regards,

Wolfgang

John D'Errico: Yes, for small problems, the naive code can be more efficient. This tool has many options however, and just the task of parsing the possibilities takes a significant amount of time. So for a tiny problem, almost all the work is in the problem overhead. Note that for these small problems, the time required for ipdm is almost constant, roughly 0.0025 seconds.

For example, the following test shows the times required in a simple test. I computed the distances from each of 100 points to a set of N points (N is the first column) in 2 dimensions.

[N', ipdmtimes, looptimes]

ans =
          2   0.0025589    0.0006432
         10   0.0033113   0.00076637
         50   0.0051397    0.0038982
        100   0.0071307    0.0077328
        500     0.02467     0.037465
       1000     0.04776      0.07536
       2000    0.092433      0.15087

Since the minimum overhead on my CPU to run ipdm is roughly 0.0025 seconds, the simple loop will be faster for small problems. As you can see, the loop is slower for the larger problems, anything larger than 100x100 in this case, at least on my CPU. Other tests show that the break-even point was indeed roughly 10,000 total distances to compute for a simple, full distance matrix.

What ipdm provides is the ability to generate a wide variety of distance matrices, especially when you only need some limited subset of the complete set of distances. If you only want a subset of the complete set of distances, but that complete set of distances would be too large to compute in the memory that you have available, then you may have no choice but to use a tool like this.

As far as the issue of memory goes, I get no memory errors for the problem size indicated. This just means that I've got more memory available to my system. However, ipdm also allows you to control the size of the chunks it uses, so Jim could have avoided that problem. The 'ChunkSize' parameter is available as a (currently undocumented) option in ipdm. It was not listed in the help, as I thought it would be of little common utility. Use of the 'ChunkSize' property is currently shown in the examples in the demo file. I'll add a description of 'ChunkSize' to the help for those with limited memory.

Jim Stevens: I am very confused. This is supposed to be an efficient routine, so I replaced my own very naive (non-vectorized, loop-in-loop) distance calculation with an IPDM call, and my code slowed down substantially.

A little play with the tic-toc timer gave me surprising results. I calculated the distance from one point to several points. With only 10 points the naive approach was 7.5 times faster, with 100 points 2.5 times faster, and at around 500 points they were equal.

On the other hand, while calculating the whole distance matrix (distances from all points to all other points) of a point set, IPDM gives an out-of-memory error already with 1400 two-dimensional points. With the naive approach, the calculation for 1300 points takes approximately 10 times longer than with IPDM, but on the other hand the naive approach can handle far larger point sets.

So, as there seem to be some cases where this implementation is inferior to the naive approach, I don't understand the talk about efficiency.

Urs (us) Schwarz: since the cyclomatic complexity of this submission is already 106 (!!) i refrain from any more redundancy and simply add - good stuff, john: help/examples/computational engine

us


Duane Hanselman: Great submission. If only all submissions were of this quality in terms of documentation, features, and coding. If only submissions like this one were immune to 1-star ratings by vindictive reviewers.