k-means clustering
idx = kmeans(X,k)
idx = kmeans(X,k,Name,Value)
[idx,C] = kmeans(___)
[idx,C,sumd] = kmeans(___)
[idx,C,sumd,D] = kmeans(___)
idx = kmeans(X,k) performs k-means clustering to partition the observations of the n-by-p data matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each observation. Rows of X correspond to points and columns correspond to variables.
By default, kmeans uses the squared Euclidean distance metric and the k-means++ algorithm for cluster center initialization.
idx = kmeans(X,k,Name,Value) returns the cluster indices with additional options specified by one or more Name,Value pair arguments.
For example, specify the cosine distance, the number of times to repeat the clustering using new initial values, or to use parallel computing.
Cluster data using k-means clustering, then plot the cluster regions.
Load Fisher's iris data set. Use the petal lengths and widths as predictors.
load fisheriris
X = meas(:,3:4);

figure;
plot(X(:,1),X(:,2),'k*','MarkerSize',5);
title 'Fisher''s Iris Data';
xlabel 'Petal Lengths (cm)';
ylabel 'Petal Widths (cm)';
The larger cluster seems to be split into a lower-variance region and a higher-variance region. This might indicate that the larger cluster is actually two overlapping clusters.
Cluster the data. Specify k = 3 clusters.
rng(1); % For reproducibility
[idx,C] = kmeans(X,3);
kmeans uses the k-means++ algorithm for centroid initialization and squared Euclidean distance by default. It is good practice to search for lower local minima by setting the 'Replicates' name-value pair argument.
idx is a vector of predicted cluster indices corresponding to the observations in X. C is a 3-by-2 matrix containing the final centroid locations.
Use kmeans to compute the distance from each centroid to points on a grid. To do this, pass the centroids (C) and points on a grid to kmeans, and implement one iteration of the algorithm.
x1 = min(X(:,1)):0.01:max(X(:,1));
x2 = min(X(:,2)):0.01:max(X(:,2));
[x1G,x2G] = meshgrid(x1,x2);
XGrid = [x1G(:),x2G(:)]; % Defines a fine grid on the plot

% Assigns each node in the grid to the closest centroid
idx2Region = kmeans(XGrid,3,'MaxIter',1,'Start',C);
Warning: Failed to converge in 1 iterations.
kmeans displays a warning stating that the algorithm did not converge, which you should expect since the software only implemented one iteration.
Plot the cluster regions.
figure;
gscatter(XGrid(:,1),XGrid(:,2),idx2Region,...
    [0,0.75,0.75;0.75,0,0.75;0.75,0.75,0],'..');
hold on;
plot(X(:,1),X(:,2),'k*','MarkerSize',5);
title 'Fisher''s Iris Data';
xlabel 'Petal Lengths (cm)';
ylabel 'Petal Widths (cm)';
legend('Region 1','Region 2','Region 3','Data','Location','SouthEast');
hold off;
Randomly generate the sample data.
rng default; % For reproducibility
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.5-ones(100,2)];

figure;
plot(X(:,1),X(:,2),'.');
title 'Randomly Generated Data';
There appear to be two clusters in the data.
Partition the data into two clusters, and choose the best arrangement out of five initializations. Display the final output.
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','cityblock',...
    'Replicates',5,'Options',opts);
Replicate 1, 3 iterations, total sum of distances = 201.533.
Replicate 2, 5 iterations, total sum of distances = 201.533.
Replicate 3, 3 iterations, total sum of distances = 201.533.
Replicate 4, 3 iterations, total sum of distances = 201.533.
Replicate 5, 2 iterations, total sum of distances = 201.533.
Best total sum of distances = 201.533
By default, the software initializes the replicates separately using kmeans++.
Plot the clusters and the cluster centroids.
figure;
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(C(:,1),C(:,2),'kx',...
    'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
    'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
You can determine how well separated the clusters are by passing idx to silhouette.
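As a minimal sketch of that check, the following regenerates the two-cluster sample data from this example and computes silhouette values instead of plotting them. The variable names s and meanSilhouette are illustrative, and passing 'cityblock' to silhouette is a choice made here so that the metric matches the one used for clustering.

```matlab
rng default;  % For reproducibility
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.5-ones(100,2)];
idx = kmeans(X,2,'Distance','cityblock','Replicates',5);

% With an output argument, silhouette returns one value per observation
% and does not draw the silhouette plot.
s = silhouette(X,idx,'cityblock');

% Values lie in [-1,1]; values near 1 suggest a well-separated point.
meanSilhouette = mean(s);
```

A mean close to 1 suggests that the clusters are well separated. Calling silhouette(X,idx,'cityblock') without an output argument draws the silhouette plot instead.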
Clustering large data sets might take time, particularly if you use online updates (see the 'OnlinePhase' name-value pair argument, which is 'off' by default). If you have a Parallel Computing Toolbox™ license and you invoke a pool of workers, then kmeans runs each clustering task (or replicate) in parallel. Therefore, if Replicates > 1, then parallel computing decreases the time to convergence.
Randomly generate a large data set from a Gaussian mixture model.
Mu = bsxfun(@times,ones(20,30),(1:20)'); % Gaussian mixture mean
rn30 = randn(30,30);
Sigma = rn30'*rn30; % Symmetric and positive-definite covariance
Mdl = gmdistribution(Mu,Sigma);

rng(1); % For reproducibility
X = random(Mdl,10000);
Mdl is a 30-dimensional gmdistribution model with 20 components. X is a 10000-by-30 matrix of data generated from Mdl.
Invoke a parallel pool of workers. Specify options for parallel computing.
pool = parpool;                      % Invokes workers
stream = RandStream('mlfg6331_64');  % Random number stream
options = statset('UseParallel',1,'UseSubstreams',1,...
    'Streams',stream);
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
The input argument 'mlfg6331_64' of RandStream specifies to use the multiplicative lagged Fibonacci generator algorithm. options is a structure array containing fields that specify options for controlling estimation.
The Command Window indicates that four workers are available. The number of workers might vary on your system.
Cluster the data using k-means clustering. Specify that there are k = 20 clusters in the data and increase the number of iterations. Typically, the objective function contains local minima. Specify 10 replicates to help find a lower local minimum.
tic; % Start stopwatch timer
[idx,C,sumd,D] = kmeans(X,20,'Options',options,'MaxIter',10000,...
    'Display','final','Replicates',10);
toc % Terminate stopwatch timer
Replicate 7, 44 iterations, total sum of distances = 7.55218e+06.
Replicate 4, 95 iterations, total sum of distances = 7.53848e+06.
Replicate 2, 104 iterations, total sum of distances = 7.54232e+06.
Replicate 6, 80 iterations, total sum of distances = 7.54237e+06.
Replicate 8, 111 iterations, total sum of distances = 7.54445e+06.
Replicate 1, 52 iterations, total sum of distances = 7.55817e+06.
Replicate 5, 70 iterations, total sum of distances = 7.55278e+06.
Replicate 3, 94 iterations, total sum of distances = 7.54858e+06.
Replicate 10, 56 iterations, total sum of distances = 7.54547e+06.
Replicate 9, 83 iterations, total sum of distances = 7.53701e+06.
Best total sum of distances = 7.53701e+06
Elapsed time is 3.239232 seconds.
The Command Window displays the number of iterations and the terminal objective function value for each replicate. The output arguments contain the results of replicate 9 because it has the lowest total sum of distances.
X — Data
Data, specified as a numeric matrix. The rows of X correspond to observations, and the columns correspond to variables.
If X is a numeric vector, then kmeans treats it as an n-by-1 data matrix, regardless of its orientation.
Data Types: single | double
k — Number of clusters
Number of clusters in the data, specified as a positive integer.
Data Types: single | double
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Distance','cosine','Replicates',10,'Options',statset('UseParallel',1) specifies the cosine distance, 10 replicate clusters at different starting values, and to use parallel computing.
'Display' — Level of output to display
'off' (default) | 'final' | 'iter'
Level of output to display in the Command Window, specified as the comma-separated pair consisting of 'Display' and one of the following options:
'final' — Displays results of the final iteration
'iter' — Displays results of each iteration
'off' — Displays nothing
Example: 'Display','final'
'Distance' — Distance metric
'sqeuclidean' (default) | 'cityblock' | 'cosine' | 'correlation' | 'hamming'
Distance metric, in p-dimensional space, used for minimization, specified as the comma-separated pair consisting of 'Distance' and 'sqeuclidean', 'cityblock', 'cosine', 'correlation', or 'hamming'.
kmeans computes cluster centroids differently for the different supported distance metrics. This table summarizes the available distance metrics. In the formulae, x is an observation (that is, a row of X) and c is a centroid (a row vector).
Distance Metric | Description | Formula
'sqeuclidean' | Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster. | $$d(x,c)=(x-c)(x-c)^{\prime}$$
'cityblock' | Sum of absolute differences, i.e., the L1 distance. Each centroid is the component-wise median of the points in that cluster. | $$d(x,c)=\sum_{j=1}^{p}\left|x_{j}-c_{j}\right|$$
'cosine' | One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length. | $$d(x,c)=1-\frac{xc^{\prime}}{\sqrt{\left(xx^{\prime}\right)\left(cc^{\prime}\right)}}$$
'correlation' | One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation. | $$d(x,c)=1-\frac{\left(x-\overrightarrow{\overline{x}}\right)\left(c-\overrightarrow{\overline{c}}\right)^{\prime}}{\sqrt{\left(x-\overrightarrow{\overline{x}}\right)\left(x-\overrightarrow{\overline{x}}\right)^{\prime}}\sqrt{\left(c-\overrightarrow{\overline{c}}\right)\left(c-\overrightarrow{\overline{c}}\right)^{\prime}}},$$ where $$\overrightarrow{\overline{x}}$$ and $$\overrightarrow{\overline{c}}$$ are row vectors whose p entries all equal the mean of the entries of x and of c, respectively.
'hamming' | This metric is only suitable for binary data. It is the proportion of bits that differ. Each centroid is the component-wise median of points in that cluster. | $$d(x,c)=\frac{1}{p}\sum_{j=1}^{p}I\left\{x_{j}\ne c_{j}\right\},$$ where I is the indicator function.
Example: 'Distance','cityblock'
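The formulas for the distance metrics can be checked by hand on a small case. The following is a minimal sketch using base MATLAB only; the vectors x, c, xb, and cb are hypothetical values chosen for illustration, and the code evaluates the definitions directly rather than calling kmeans.

```matlab
x = [1 2 3 4];   % hypothetical observation (a row of X)
c = [2 2 1 5];   % hypothetical centroid (a row vector)

dSqEuclidean = (x - c)*(x - c)';              % 1 + 0 + 4 + 1 = 6
dCityblock   = sum(abs(x - c));               % 1 + 0 + 2 + 1 = 4
dCosine      = 1 - (x*c')/sqrt((x*x')*(c*c'));

% Correlation distance: center each vector by its own mean first
xc = x - mean(x);
cc = c - mean(c);
dCorrelation = 1 - (xc*cc')/(sqrt(xc*xc')*sqrt(cc*cc'));

% Hamming distance is intended for binary data
xb = [1 0 1 1];
cb = [1 1 0 1];
dHamming = mean(xb ~= cb);                    % 2 of 4 bits differ = 0.5
```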
'EmptyAction' — Action to take if cluster loses all member observations
'singleton' (default) | 'error' | 'drop'
Action to take if a cluster loses all its member observations, specified as the comma-separated pair consisting of 'EmptyAction' and one of the following options.
Value | Description
'error' | Treat an empty cluster as an error.
'drop' | Remove any clusters that become empty.
'singleton' | Create a new cluster consisting of the one point furthest from its centroid (default).
Example: 'EmptyAction','error'
'MaxIter' — Maximum number of iterations
100 (default) | positive integer
Maximum number of iterations, specified as the comma-separated pair consisting of 'MaxIter' and a positive integer.
Example: 'MaxIter',1000
Data Types: double | single
'OnlinePhase' — Online update flag
'off' (default) | 'on'
Online update flag, specified as the comma-separated pair consisting of 'OnlinePhase' and 'off' or 'on'.
If OnlinePhase is on, then kmeans performs an online update phase in addition to a batch update phase. The online phase can be time consuming for large data sets, but guarantees a solution that is a local minimum of the distance criterion. In other words, the software finds a partition of the data in which moving any single point to a different cluster increases the total sum of distances.
Example: 'OnlinePhase','on'
'Options' — Options for controlling iterative algorithm for minimizing fitting criteria
[] (default) | structure array returned by statset
Options for controlling the iterative algorithm for minimizing the fitting criteria, specified as the comma-separated pair consisting of 'Options' and a structure array returned by statset. These options require Parallel Computing Toolbox™.
This table summarizes the available options.
Option | Description
'Streams' | A RandStream object or cell array of such objects. Use a single object unless a parallel pool is open, UseParallel is true, and UseSubstreams is false. In that case, use a cell array the same size as the parallel pool. If a parallel pool is not open, then Streams must supply a single random number stream.
'UseParallel' | If true and Replicates > 1, then kmeans runs the replicates in parallel. Default is false.
'UseSubstreams' | Set to true to compute in parallel in a reproducible fashion. Default is false. To compute reproducibly, set Streams to a type allowing substreams: 'mlfg6331_64' or 'mrg32k3a'.
To ensure more predictable results, use parpool and explicitly create a parallel pool before invoking kmeans and setting 'Options',statset('UseParallel',1).
Example: 'Options',statset('UseParallel',1)
Data Types: struct
'Replicates' — Number of times to repeat clustering using new initial cluster centroid positions
1 (default) | positive integer
Number of times to repeat clustering using new initial cluster centroid positions, specified as the comma-separated pair consisting of 'Replicates' and an integer. kmeans returns the solution with the lowest total sum of distances (the sum of the elements of sumd).
You can set 'Replicates' implicitly by supplying a 3-D array as the value for the 'Start' name-value pair argument.
Example: 'Replicates',5
Data Types: double | single
'Start' — Method for choosing initial cluster centroid positions
'plus' (default) | 'cluster' | 'sample' | 'uniform' | numeric matrix | numeric array
Method for choosing initial cluster centroid positions (or seeds), specified as the comma-separated pair consisting of 'Start' and 'cluster', 'plus', 'sample', 'uniform', a numeric matrix, or a numeric array. This table summarizes the available options for choosing seeds.
Value | Description
'cluster' | Perform a preliminary clustering phase on a random 10% subsample of X. This preliminary phase is itself initialized using 'sample'.
'plus' (default) | Select k seeds by implementing the k-means++ algorithm for cluster center initialization.
'sample' | Select k observations from X at random.
'uniform' | Select k points uniformly at random from the range of X. Not valid with the Hamming distance.
numeric matrix | k-by-p matrix of centroid starting locations. The rows of Start correspond to seeds. The software infers k from the first dimension of Start, so you can pass in [] for k.
numeric array | k-by-p-by-r array of centroid starting locations. The rows of each page correspond to seeds. The third dimension invokes replication of the clustering routine. Page j contains the set of seeds for replicate j. The software infers the number of replicates (specified by the 'Replicates' name-value pair argument) from the size of the third dimension.
Example: 'Start','sample'
Data Types: char | string | double | single
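To illustrate the numeric array form of 'Start', the following minimal sketch builds a k-by-p-by-r array of seeds by hand. The data X, the seed array S, and the choice of r = 3 replicates are hypothetical values for illustration only.

```matlab
rng(1);                                % For reproducibility
X = [randn(50,2)+2; randn(50,2)-2];    % hypothetical n-by-p data with p = 2
k = 2;                                 % number of clusters
r = 3;                                 % number of replicates

% Page j of S holds the k seeds (one per row) for replicate j.
S = zeros(k,size(X,2),r);
for j = 1:r
    S(:,:,j) = X(randperm(size(X,1),k),:);   % k distinct observations
end

% kmeans runs one replicate per page of S and returns the best solution.
[idx,C] = kmeans(X,k,'Start',S);
```

Because the third dimension of S has size 3, this call behaves as if 'Replicates',3 had been specified.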
The software treats NaNs as missing data, and removes any row of X containing at least one NaN. Removing rows of X reduces the sample size.
idx — Cluster indices
Cluster indices, returned as a numeric column vector. idx has as many rows as X, and each row indicates the cluster assignment of the corresponding observation.
C — Cluster centroid locations
Cluster centroid locations, returned as a numeric matrix. C is a k-by-p matrix, where row j is the centroid of cluster j.
sumd — Within-cluster sums of point-to-centroid distances
Within-cluster sums of point-to-centroid distances, returned as a numeric column vector. sumd is a k-by-1 vector, where element j is the sum of point-to-centroid distances within cluster j.
D — Distances from each point to every centroid
Distances from each point to every centroid, returned as a numeric matrix. D is an n-by-k matrix, where element (j,m) is the distance from observation j to centroid m.
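The four outputs are related, and the relationship can be checked on a small case. The following is a minimal sketch on hypothetical data; the names sumdCheck, nearest, and maxDiff are illustrative. It rebuilds sumd from D and idx, and finds the nearest centroid for each observation.

```matlab
rng(1);                                % For reproducibility
X = [randn(50,2)+2; randn(50,2)-2];    % hypothetical data
k = 2;
[idx,C,sumd,D] = kmeans(X,k);

% For well-separated data such as this, idx is expected to match the
% column of D that holds the smallest distance in each row.
[~,nearest] = min(D,[],2);

% sumd(j) sums column j of D over the members of cluster j.
sumdCheck = zeros(k,1);
for j = 1:k
    sumdCheck(j) = sum(D(idx == j,j));
end
maxDiff = max(abs(sumd - sumdCheck));  % expected to be near zero
```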
k-means clustering, or Lloyd's algorithm [2], is an iterative, data-partitioning algorithm that assigns each of n observations to exactly one of k clusters defined by centroids, where k is chosen before the algorithm starts.
The algorithm proceeds as follows:
1. Choose k initial cluster centers (centroids). For example, choose k observations at random (by using 'Start','sample') or use the k-means++ algorithm for cluster center initialization (the default).
2. Compute point-to-cluster-centroid distances of all observations to each centroid.
3. There are two ways to proceed (specified by OnlinePhase):
   Batch update — Assign each observation to the cluster with the closest centroid.
   Online update — Individually assign observations to a different centroid if the reassignment decreases the sum of the within-cluster, sum-of-squares point-to-cluster-centroid distances.
   For more details, see Algorithms.
4. Compute the average of the observations in each cluster to obtain k new centroid locations.
5. Repeat steps 2 through 4 until cluster assignments do not change, or the maximum number of iterations is reached.
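The steps above can be sketched in base MATLAB as follows. This is an illustrative sketch, not the implementation inside kmeans: it assumes squared Euclidean distance, random-sample seeding, and batch updates only, and it simply leaves the centroid of an empty cluster unchanged instead of applying an 'EmptyAction' rule. The data and variable names are hypothetical.

```matlab
rng(1);                                % For reproducibility
X = [randn(50,2)+2; randn(50,2)-2];    % hypothetical data
k = 2;
maxIter = 100;
n = size(X,1);

C = X(randperm(n,k),:);                % Step 1: k random observations as seeds
idx = zeros(n,1);
for iter = 1:maxIter
    % Step 2: squared Euclidean distance from each observation to each centroid
    D = zeros(n,k);
    for j = 1:k
        diffs = X - repmat(C(j,:),n,1);
        D(:,j) = sum(diffs.^2,2);
    end
    % Step 3 (batch update): assign each observation to its closest centroid
    [~,newIdx] = min(D,[],2);
    if isequal(newIdx,idx)
        break                          % Step 5: assignments did not change
    end
    idx = newIdx;
    % Step 4: recompute each centroid as the mean of its members
    for j = 1:k
        if any(idx == j)
            C(j,:) = mean(X(idx == j,:),1);
        end
    end
end
```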
The k-means++ algorithm uses a heuristic to find centroid seeds for k-means clustering. According to Arthur and Vassilvitskii [1], k-means++ improves the running time of Lloyd's algorithm, and the quality of the final solution.
The k-means++ algorithm chooses seeds as follows, assuming the number of clusters is k.
1. Select an observation uniformly at random from the data set, X. The chosen observation is the first centroid, and is denoted c_1.
2. Compute distances from each observation to c_1. Denote the distance between c_j and the observation m as $$d\left({x}_{m},{c}_{j}\right)$$.
3. Select the next centroid, c_2, at random from X with probability
$$\frac{{d}^{2}\left({x}_{m},{c}_{1}\right)}{{\displaystyle \sum}_{j=1}^{n}{d}^{2}\left({x}_{j},{c}_{1}\right)}.$$
4. To choose center j:
   a. Compute the distances from each observation to each centroid, and assign each observation to its closest centroid.
   b. For m = 1,...,n and p = 1,...,j – 1, select centroid j at random from X with probability
$$\frac{{d}^{2}\left({x}_{m},{c}_{p}\right)}{{\displaystyle \sum}_{\{h;{x}_{h}\in {C}_{p}\}}^{}{d}^{2}\left({x}_{h},{c}_{p}\right)},$$
   where C_p is the set of all observations closest to centroid c_p and x_m belongs to C_p.
   That is, select each subsequent center with a probability proportional to the squared distance from itself to the closest center that you already chose.
5. Repeat step 4 until k centroids are chosen.
Arthur and Vassilvitskii [1] demonstrate, using a simulation study for several cluster orientations, that k-means++ achieves faster convergence to a lower sum of within-cluster, sum-of-squares point-to-cluster-centroid distances than Lloyd's algorithm.
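The seeding procedure can be sketched in base MATLAB as follows. This sketch assumes the standard weighting described in [1], in which each observation is drawn with probability proportional to its squared Euclidean distance to the closest centroid chosen so far, normalized over all observations. It is not the internal code of kmeans, and the data and variable names are hypothetical.

```matlab
rng(1);                                % For reproducibility
X = [randn(50,2)+2; randn(50,2)-2];    % hypothetical data
k = 3;
n = size(X,1);

C = zeros(k,size(X,2));
C(1,:) = X(randi(n),:);                % first centroid: uniform at random
minD2 = inf(n,1);                      % squared distance to closest chosen centroid
for j = 2:k
    % Update the closest-centroid distances with the newest centroid
    d2 = sum((X - repmat(C(j-1,:),n,1)).^2,2);
    minD2 = min(minD2,d2);
    % Draw observation m with probability minD2(m)/sum(minD2)
    cs = cumsum(minD2);
    m = find(rand*cs(end) <= cs,1,'first');
    C(j,:) = X(m,:);
end
```

The resulting C could then be passed to kmeans as a numeric matrix through the 'Start' name-value pair argument.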
kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k clusters.
The first phase uses batch updates, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. This phase occasionally does not converge to a solution that is a local minimum, that is, a partition of the data where moving any single point to a different cluster increases the total sum of distances. This is more likely for small data sets. The batch phase is fast, but potentially only approximates a solution as a starting point for the second phase.
The second phase uses online updates, where points are individually reassigned if doing so reduces the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this phase consists of one pass through all the points. This phase converges to a local minimum, although there might be other local minima with lower total sum of distances. In general, finding the global minimum requires an exhaustive choice of starting points, but using several replicates with random starting points typically results in a solution that is a global minimum.
If Replicates = r > 1 and Start is plus (the default), then the software selects r possibly different sets of seeds according to the k-means++ algorithm.
If you enable the UseParallel option in Options and Replicates > 1, then each worker selects seeds and clusters in parallel.
[1] Arthur, David, and Sergei Vassilvitskii. “K-means++: The Advantages of Careful Seeding.” SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.
[2] Lloyd, Stuart P. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory. Vol. 28, 1982, pp. 129–137.
[3] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.
[4] Spath, H. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt. New York: Halsted Press, 1985.
This function supports tall arrays for out-of-memory data with some limitations.
Only random sample initialization is supported. Supported syntaxes:
idx = kmeans(X,k) performs classic k-means clustering.
[idx,C] = kmeans(X,k) also returns the k cluster centroid locations.
[idx,C,sumd] = kmeans(X,k) additionally returns the k within-cluster sums of point-to-centroid distances.
[___] = kmeans(___,Name,Value) specifies additional name-value pair options using any of the other syntaxes. Valid options are:
'Start' — Method used to choose the initial cluster centroid positions. Value can be:
    'plus' (default) — Select k observations from X using a variant of the k-means++ algorithm adapted for tall data.
    'sample' — Select k observations from X at random.
    Numeric matrix — A k-by-p matrix to explicitly specify starting locations.
'Options' — An options structure created using the statset function. For tall arrays, kmeans uses the fields listed here and ignores all other fields in the options structure:
    'Display' — Level of display. Choices are 'iter' (default), 'off', and 'final'.
    'MaxIter' — Maximum number of iterations. Default is 100.
    'TolFun' — Convergence tolerance for the within-cluster sums of point-to-centroid distances. Default is 1e-4. This option field only works with tall arrays.
For more information, see Tall Arrays (MATLAB).
Usage notes and limitations:
If the Start method uses random selections, the initial centroid cluster positions might not match MATLAB®.
If the number of rows in X is fixed, code generation does not remove rows of X that contain a NaN.
The cluster centroid locations in C can have a different order than in MATLAB. In this case, the cluster indices in idx have corresponding differences.
If you provide Display, its value must be 'off'.
If you provide Streams, it must be empty and UseSubstreams must be false.
When you set the UseParallel option to true:
    Some computations can execute in parallel even when Replicates is 1. For large data sets, when Replicates is 1, consider setting the UseParallel option to true.
    kmeans uses parfor to create loops that run in parallel on supported shared-memory multicore platforms. Loops that run in parallel can be faster than loops that run on a single thread. If your compiler does not support the Open Multiprocessing (OpenMP) application interface or you disable the OpenMP library, MATLAB Coder™ treats the parfor loops as for loops. To find supported compilers, see Supported Compilers.
clusterdata | gmdistribution | linkage | parpool | silhouette | statset