# Fast method to find average pairwise distance of a very large matrix?

16 views (last 30 days)
Anish Potnis on 10 Jan 2020
Commented: Anish Potnis on 13 Jan 2020
So I have a matrix that is 330,000 observations = rows x 160 variables = columns. I'd like to compute the average distance between each observation in my matrix, but pdist() fails here, because apparently it would take 405.6 GB to store the distance matrix...
Is there a reasonable vectorized (or fast) way to do something like this?
Any thoughts appreciated.
Thanks.

John D'Errico on 10 Jan 2020
You cannot compute the 100 billion pairwise distances, because the matrix won't fit in memeory. (Ok, symmetry reduces it to be only 50 billion or so distances.)
But do you seriously, really, desperately need to compute ALL such pairwise distances? You want to know the mean distance. Do you desperately need to know the EXACT value of that mean distance?
The point is, just use a Monte Carlo sampling to get a very good estimate of the mean pairwise distance.
For example, I'll do this using a smaller matrix, just so we can get some result out in a reasonable time.
X = randn(10000,160);
mean(pdist(X))
ans =
17.868
Now, compute a simple random sample. Then compute just a very small subset of the total distances you would need for an exact solution.
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.864
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.868
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.856
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.872
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.874
So 5 simple random permutations gave 5 different estimates. All were correct to within about 0.1% error. For a matrix with 330000 rows, the error would be even smaller. And if even that much error bothers you, then just take the average or median of several such samples.
I would point out that even for an array of size 330000x160, the computation I describe takes about a half second on my computer, almost as much time as it took to generate the array in the first place.
X = randn(330000,160);
tic,mean(vecnorm(X - X(randperm(size(X,1)),:),2,2));toc
Elapsed time is 0.585973 seconds.

#### 1 Comment

Anish Potnis on 13 Jan 2020
I came to the same conclusion, thanks for your help!

Matt J on 10 Jan 2020
Edited: Matt J on 10 Jan 2020
I don't know about the mean distance, but the mean squared or root mean squared distance is pretty easy. Compare:
A=rand(330000/25,160); %The data
%% Without pdist2
tic;
D=mean(vecnorm(A,2,2).^2);
M=norm(mean(A,1))^2;
avgDist=2*(D-M);
toc %Elapsed time is 0.017454 seconds.
%% With pdist2
tic;
P=pdist2(A,A).^2;
avgPdist2=mean(P(:));
toc %Elapsed time is 1.716697 seconds.
Check:
>> avgDist,
avgDist =
26.6471
>> avgPdist2,
avgPdist2 =
26.6471