16 views (last 30 days)

So I have a matrix that is 330,000 observations = rows x 160 variables = columns. I'd like to compute the average distance between each observation in my matrix, but pdist() fails here, because apparently it would take 405.6 GB to store the distance matrix...

Is there a reasonable vectorized (or fast) way to do something like this?

Any thoughts appreciated.

Thanks.

John D'Errico
on 10 Jan 2020

You cannot compute the 100 billion pairwise distances, because the matrix won't fit in memeory. (Ok, symmetry reduces it to be only 50 billion or so distances.)

But do you seriously, really, desperately need to compute ALL such pairwise distances? You want to know the mean distance. Do you desperately need to know the EXACT value of that mean distance?

The point is, just use a Monte Carlo sampling to get a very good estimate of the mean pairwise distance.

For example, I'll do this using a smaller matrix, just so we can get some result out in a reasonable time.

X = randn(10000,160);

mean(pdist(X))

ans =

17.868

Now, compute a simple random sample. Then compute just a very small subset of the total distances you would need for an exact solution.

mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))

ans =

17.864

mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))

ans =

17.868

mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))

ans =

17.856

mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))

ans =

17.872

mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))

ans =

17.874

So 5 simple random permutations gave 5 different estimates. All were correct to within about 0.1% error. For a matrix with 330000 rows, the error would be even smaller. And if even that much error bothers you, then just take the average or median of several such samples.

I would point out that even for an array of size 330000x160, the computation I describe takes about a half second on my computer, almost as much time as it took to generate the array in the first place.

X = randn(330000,160);

tic,mean(vecnorm(X - X(randperm(size(X,1)),:),2,2));toc

Elapsed time is 0.585973 seconds.

Matt J
on 10 Jan 2020

Edited: Matt J
on 10 Jan 2020

I don't know about the mean distance, but the mean squared or root mean squared distance is pretty easy. Compare:

A=rand(330000/25,160); %The data

%% Without pdist2

tic;

D=mean(vecnorm(A,2,2).^2);

M=norm(mean(A,1))^2;

avgDist=2*(D-M);

toc %Elapsed time is 0.017454 seconds.

%% With pdist2

tic;

P=pdist2(A,A).^2;

avgPdist2=mean(P(:));

toc %Elapsed time is 1.716697 seconds.

Check:

>> avgDist,

avgDist =

26.6471

>> avgPdist2,

avgPdist2 =

26.6471

Sign in to comment.

Sign in to answer this question.

Opportunities for recent engineering grads.

Apply Today
## 0 Comments

Sign in to comment.