BUG (#2)? kmeans is sensitive to row (point) order

micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
Dear All,
I have noticed that kmeans gives different results when the same points are supplied in a different order!
In my opinion this does not make any sense.
I would expect the row order of the data matrix to have no impact on the centroid locations when the random number generator is set to a fixed seed.
Can anybody explain that?
clear; close all; clc;
nPoints = 100;
nDimensions = 2;
nClusters = 3;
data = rand(nPoints,nDimensions); % points from a uniform distribution
scatter(data(:,1), data(:,2), 'b')
rndGenSeed = 1;
%% cluster unshuffled data
rng(rndGenSeed) % set random generator's seed
[~, clusters] = kmeans(data, nClusters)
hold on
scatter(clusters(:,1), clusters(:,2), 'rv') % red triangles
hold off
%% cluster reordered data (sortrows changes only the row order)
rng(rndGenSeed) % set random generator's seed - same seed
data_sh = sortrows(data);
[~, clusters_sh] = kmeans(data_sh, nClusters)
hold on
scatter(data_sh(:,1), data_sh(:,2), 'k*') % control - plot reordered points - they should be in the same spots
scatter(clusters_sh(:,1), clusters_sh(:,2), 'gv') % these points should cover the red triangles
hold off
grid on
  1 Comment
micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
I think I have a clue, but it would be good if somebody from the MathWorks team could verify it.
My clue is this:
  1. kmeans needs to choose initial cluster positions. It can randomly select k INPUT POINTS to start from.
  2. If you call rng(seed) with a constant seed, you will always get the SAME row indices of the data matrix as the starting cluster positions.
  3. If you shuffle the input data (the point locations are the same, only their order in the matrix changes), then even with rng(seed) and a constant seed you will get the SAME row indices, BUT the points at those indices are DIFFERENT!
  4. That means kmeans can converge differently for shuffled input data.
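The four steps above can be checked without kmeans itself. The following is only a sketch: it uses randperm as a hypothetical stand-in for whatever index draw kmeans performs internally, and the variable names are made up for illustration.

```matlab
% Sketch: a fixed seed fixes the drawn row INDICES, not the points behind them.
% randperm is an assumed stand-in for the internal draw, not the actual kmeans code.
rng(1);                           % seed only to make the example data reproducible
data = rand(100, 2);              % 100 points in 2-D, uniform
dataSorted = sortrows(data);      % same points, different row order

k = 3;
rng(1);                                     % fixed seed before the first draw
idxA = randperm(size(data, 1), k);          % k row indices for the original order
rng(1);                                     % same seed before the second draw
idxB = randperm(size(dataSorted, 1), k);    % k row indices for the reordered data

sameIndices = isequal(idxA, idxB)                           % indices agree
samePoints  = isequal(data(idxA, :), dataSorted(idxB, :))   % points behind them generally do not
```

If sameIndices is true while samePoints is false, the two runs start from different initial centroids even though the seed is identical, which is exactly the situation described in steps 2 and 3.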
This would also explain my puzzle in another question: https://www.mathworks.com/matlabcentral/answers/448832-bug-evalclusters-is-sensitive-to-rows-points-order
What do you think, MathWorks experts? :) Does k-means select input data points as the starting centroid locations?
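As far as I can tell from the kmeans documentation, the 'Start' option defaults to 'plus' (k-means++ seeding, which draws its seeds from the observations), and 'sample' picks k observations at random, so both depend on which point sits in which row. One way to test the hypothesis is to take the random draw out of the picture by passing the starting centroids explicitly as a k-by-p matrix. The sketch below assumes the Statistics and Machine Learning Toolbox; startCentroids is an arbitrary illustrative choice, not a recommended initialization.

```matlab
% Sketch: with explicit starting centroids, row order should no longer matter.
rng(1);                               % seed only to make the example data reproducible
data = rand(100, 2);
dataSorted = sortrows(data);          % same points, different row order

k = 3;
startCentroids = data(1:k, :);        % assumed choice: any fixed k-by-2 matrix would do

% Both calls start from the same centroid coordinates, whatever the row order is.
[~, cA] = kmeans(data,       k, 'Start', startCentroids);
[~, cB] = kmeans(dataSorted, k, 'Start', startCentroids);

maxDiff = max(abs(cA(:) - cB(:)))     % largest coordinate difference between the two results
```

If the hypothesis is right, maxDiff should be at the level of floating-point rounding, because only the summation order inside the centroid means differs between the two calls.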

Answers (0)
