CalinskiHarabaszEvaluation

Calinski-Harabasz criterion clustering evaluation object

Description

CalinskiHarabaszEvaluation is an object consisting of sample data (X), clustering data (OptimalY), and Calinski-Harabasz criterion values (CriterionValues) used to evaluate the optimal number of clusters (OptimalK). The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). Well-defined clusters have a large between-cluster variance and a small within-cluster variance. The optimal number of clusters corresponds to the solution with the highest Calinski-Harabasz index value. For more information, see Calinski-Harabasz Criterion.

Creation

Create a Calinski-Harabasz criterion clustering evaluation object by using the evalclusters function and specifying the criterion as "CalinskiHarabasz".

You can then use compact to create a compact version of the Calinski-Harabasz criterion clustering evaluation object. The function removes the contents of the properties X, OptimalY, and Missing.
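
The following is a minimal sketch of the effect of compact, assuming the fisheriris data set used in the example on this page. The variable names evaluation and compactEvaluation are illustrative only.

```matlab
load fisheriris
rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",2:4);

% Create the compact version of the evaluation object.
compactEvaluation = compact(evaluation);

% The data-heavy properties are emptied in the compact object.
isempty(compactEvaluation.X)
isempty(compactEvaluation.OptimalY)
isempty(compactEvaluation.Missing)

% The criterion results are still available.
compactEvaluation.OptimalK
```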

Properties

Clustering Evaluation Properties

`ClusteringFunction` — Clustering algorithm
`'kmeans'` | `'linkage'` | `'gmdistribution'` | function handle | `[]`

This property is read-only.

Clustering algorithm used to cluster the sample data, returned as 'kmeans', 'linkage', 'gmdistribution', or a function handle. If you specify the clustering solutions as an input argument to evalclusters when you create the clustering evaluation object, then ClusteringFunction is empty.

Value	Description
`'kmeans'`	Cluster the data in `X` using the `kmeans` clustering algorithm, with `EmptyAction` set to `"singleton"` and `Replicates` set to `5`.
`'linkage'`	Cluster the data in `X` using the `clusterdata` agglomerative clustering algorithm, with `Linkage` set to `"ward"`.
`'gmdistribution'`	Cluster the data in `X` using the `gmdistribution` Gaussian mixture distribution algorithm, with `SharedCov` set to `true` and `Replicates` set to `5`.

Data Types: double | char | function_handle
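
As an illustration of the function handle case, the sketch below passes a custom clustering function to evalclusters. It assumes the handle accepts the data and a number of clusters, and returns a column vector of cluster indices. The handle name myCluster and its kmeans settings are hypothetical choices, not defaults of the software.

```matlab
load fisheriris
rng("default") % For reproducibility

% Hypothetical custom clustering function: k-means with more
% replicates than the built-in "kmeans" option uses.
myCluster = @(X,K) kmeans(X,K,"EmptyAction","singleton","Replicates",10);

evaluation = evalclusters(meas,myCluster,"CalinskiHarabasz","KList",2:5);

% ClusteringFunction stores the handle that produced the solutions.
class(evaluation.ClusteringFunction)
```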

`CriterionName` — Name of criterion
`'CalinskiHarabasz'`

This property is read-only.

Name of the criterion used for clustering evaluation, returned as 'CalinskiHarabasz'.

`CriterionValues` — Criterion values
numeric vector

This property is read-only.

Criterion values, returned as a numeric vector. Each value corresponds to a proposed number of clusters in InspectedK.

Data Types: double
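
The relationship between CriterionValues, InspectedK, and OptimalK can be sketched as follows, assuming the fisheriris data set. The variable names bestValue, idx, and bestK are illustrative.

```matlab
load fisheriris
rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",1:6);

% max skips NaN entries by default, such as the value for one cluster.
[bestValue,idx] = max(evaluation.CriterionValues);

% The matching entry of InspectedK is the proposed number of clusters
% with the highest criterion value.
bestK = evaluation.InspectedK(idx);

isequal(bestK,evaluation.OptimalK)
```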

`InspectedK` — List of number of proposed clusters
positive integer vector

This property is read-only.

Proposed numbers of clusters for which criterion values are computed, returned as a positive integer vector.

Data Types: double

`OptimalK` — Optimal number of clusters
positive integer scalar

This property is read-only.

Optimal number of clusters, returned as a positive integer scalar.

Data Types: double

`OptimalY` — Optimal clustering solution
positive integer column vector | `[]`

This property is read-only.

Optimal clustering solution corresponding to OptimalK, returned as a positive integer column vector. Each element of OptimalY is the cluster index of the corresponding observation (or row) in X. If you specify the clustering solutions as an input argument to evalclusters when you create the clustering evaluation object, or if the clustering evaluation object is compact (see compact), then OptimalY is empty.

Data Types: double
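
The sketch below shows the case where the clustering solutions are supplied as an input argument, so that OptimalY and ClusteringFunction are empty. It assumes the solutions are passed as a matrix with one column per solution. The names kValues and solutions are illustrative.

```matlab
load fisheriris
rng("default") % For reproducibility

% Column j holds the cluster indices for a j-cluster solution.
kValues = 1:5;
solutions = zeros(size(meas,1),numel(kValues));
for j = 1:numel(kValues)
    solutions(:,j) = kmeans(meas,kValues(j),"Replicates",5);
end

% Evaluate the precomputed solutions instead of naming an algorithm.
evaluation = evalclusters(meas,solutions,"CalinskiHarabasz");

isempty(evaluation.ClusteringFunction)
isempty(evaluation.OptimalY)
```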

Sample Data Properties

`Missing` — Excluded data
logical column vector | `[]`

This property is read-only.

Excluded data, returned as a logical column vector. If an element of Missing is true, then the corresponding observation (or row) in the data matrix X is not used in the clustering solutions. If the clustering evaluation object is compact (see compact), then Missing is empty.

Data Types: double | logical
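
To illustrate how Missing and NumObservations relate, the sketch below marks one measurement as NaN before evaluating. The copy X and the choice of row 10 are arbitrary assumptions made for this illustration.

```matlab
load fisheriris
rng("default") % For reproducibility

% Copy the data and mark one measurement as missing.
X = meas;
X(10,2) = NaN;

evaluation = evalclusters(X,"kmeans","CalinskiHarabasz","KList",2:4);

find(evaluation.Missing)     % Row index of the excluded observation
evaluation.NumObservations   % Counts only the observations that were used
```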

`NumObservations` — Number of observations
positive integer scalar

This property is read-only.

Number of observations in the data matrix X, ignoring observations with missing (NaN) values, returned as a positive integer scalar.

Data Types: double

`X` — Data used for clustering
numeric matrix | `[]`

This property is read-only.

Data used for clustering, returned as a numeric matrix. Rows correspond to observations, and columns correspond to variables. If the clustering evaluation object is compact (see compact), then X is empty.

Data Types: single | double

Object Functions

`addK`	Evaluate additional numbers of clusters
`compact`	Compact clustering evaluation object
`plot`	Plot clustering evaluation object criterion values
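
As a brief sketch of addK, assuming the fisheriris data set, you can evaluate a short list of cluster numbers first and then extend it:

```matlab
load fisheriris
rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",2:4);

% Evaluate additional proposed numbers of clusters.
evaluation = addK(evaluation,5:6);

evaluation.InspectedK
evaluation.CriterionValues
```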

Examples

Evaluate Clustering Solution Using Calinski-Harabasz Criterion

Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.

Load the fisheriris data set. The data contains length and width measurements from the sepals and petals of three species of iris flowers.

load fisheriris

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using kmeans.

rng("default") % For reproducibility
evaluation = evalclusters(meas,"kmeans","CalinskiHarabasz","KList",1:6)

evaluation = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3

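The OptimalK value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three. The criterion value for one cluster is NaN because the Calinski-Harabasz index divides by k − 1 and is therefore undefined for k = 1.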

Plot the Calinski-Harabasz criterion values for each number of clusters tested.

plot(evaluation)

[Figure: CalinskiHarabasz Values plotted against Number of Clusters.]

The plot shows that the highest Calinski-Harabasz value occurs at three clusters, suggesting that the optimal number of clusters is three.

Create a grouped scatter plot to examine the relationship between petal length and width. Group the data by suggested clusters.

PetalLength = meas(:,3);
PetalWidth = meas(:,4);
clusters = evaluation.OptimalY;
gscatter(PetalLength,PetalWidth,clusters,[],"xod");

[Figure: grouped scatter plot of PetalWidth against PetalLength, with a separate marker for each of clusters 1, 2, and 3.]

The plot shows cluster 3 in the lower-left corner, completely separated from the other two clusters. Cluster 3 contains flowers with the smallest petal widths and lengths. Cluster 1 is in the upper-right corner, and contains flowers with the largest petal widths and lengths. Cluster 2 is near the center of the plot, and contains flowers with measurements between these two extremes.

More About

Calinski-Harabasz Criterion

The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as

$\mathrm{VRC}_{k} = \frac{SS_{B}}{SS_{W}} \times \frac{N - k}{k - 1},$

where SS_B is the overall between-cluster variance, SS_W is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.

The overall between-cluster variance SS_B is defined as

$SS_{B} = \sum_{i = 1}^{k} n_{i} ‖ m_{i} - m ‖^{2},$

where k is the number of clusters, n_i is the number of observations in cluster i, m_i is the centroid of cluster i, m is the overall mean of the sample data, and $‖ m_{i} - m ‖$ is the L² norm (Euclidean distance) between the two vectors.

The overall within-cluster variance SS_W is defined as

$SS_{W} = \sum_{i = 1}^{k} \sum_{x \in c_{i}} ‖ x - m_{i} ‖^{2},$

where k is the number of clusters, x is a data point, c_i is the ith cluster, m_i is the centroid of cluster i, and $‖ x - m_{i} ‖$ is the L² norm (Euclidean distance) between the two vectors.

Well-defined clusters have a large between-cluster variance (SS_B) and a small within-cluster variance (SS_W). The larger the VRC_k ratio, the better the data partition. To determine the optimal number of clusters, maximize VRC_k with respect to k. The optimal number of clusters corresponds to the solution with the highest Calinski-Harabasz index value.

The Calinski-Harabasz criterion is best suited for k-means clustering solutions with squared Euclidean distances.
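
The definitions above can be checked numerically. The following sketch computes SS_B, SS_W, and VRC_k directly for a single three-cluster k-means solution of the fisheriris data, and then evaluates the same solution with evalclusters for comparison. All variable names are illustrative, and the loop is a plain transcription of the formulas rather than the implementation used by the software.

```matlab
load fisheriris
rng("default") % For reproducibility

k = 3;
idx = kmeans(meas,k,"Replicates",5); % Cluster indices for one solution
N = size(meas,1);                    % Number of observations
m = mean(meas,1);                    % Overall mean of the sample data

SSB = 0;
SSW = 0;
for i = 1:k
    Xi = meas(idx == i,:);           % Observations in cluster i
    ni = size(Xi,1);                 % Cluster size n_i
    mi = mean(Xi,1);                 % Centroid m_i
    SSB = SSB + ni*sum((mi - m).^2);     % n_i * ||m_i - m||^2
    SSW = SSW + sum(sum((Xi - mi).^2));  % Sum of ||x - m_i||^2 over cluster i
end

VRC = (SSB/SSW)*(N - k)/(k - 1);

% Evaluate the same clustering solution with evalclusters.
evaluation = evalclusters(meas,idx,"CalinskiHarabasz");
[VRC evaluation.CriterionValues]
```

The two displayed values should agree up to floating-point rounding.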

References

[1] Calinski, T., and J. Harabasz. “A dendrite method for cluster analysis.” Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.

Version History

Introduced in R2013b
