Cluster Data

Cluster data using k-means or hierarchical clustering in the Live Editor

Since R2021b

Description

The Cluster Data Live Editor Task enables you to interactively perform k-means or hierarchical clustering. The task generates MATLAB® code for your live script and returns the resulting cluster indices to the MATLAB workspace. If you perform k-means clustering, the task also returns the cluster centroid locations.

You can:

Specify the number of clusters manually. For hierarchical clustering, you can specify the cutoff for the underlying hierarchical cluster tree.
Determine the optimal number of clusters for your data automatically by specifying criteria such as gap values, silhouette values, Davies-Bouldin index values, and Calinski-Harabasz index values.
Customize the parameters for clustering your data, such as the distance metric to use.
Automatically visualize the clustered data.

For general information about Live Editor tasks, see Add Interactive Tasks to a Live Script.

Open the Task

To add the Cluster Data task to a live script:

On the Live Editor tab, select Task > Cluster Data.
In a code block in the live script, type a relevant keyword, such as clustering, kmeans, or hierarchical. Select Cluster Data from the suggested command completions.

Examples

Specify Number of Clusters for k-Means Clustering Using Live Editor Task

This example shows how to use the Cluster Data task to interactively perform k-means clustering for a specified number of clusters.

Load the sample data. The data contains length and width measurements from the sepals and petals of three species of iris flowers.

load fisheriris

Open the Cluster Data task. To open the task, begin typing the keyword clustering in a code block and select Cluster Data from the suggested command completions.

List of suggested command completions. The first suggestion in the list is for the Cluster Data task, and is selected.

In the task, select the k-Means Clustering algorithm. (since R2024a)

Cluster the data into two clusters.

Select the meas variable as the input data.
Set the number of clusters to 2, if necessary.
On the Live Editor tab, click the Run button to run the task.

MATLAB displays the clustered data and the cluster means in a scatter plot.

Cluster Data task showing the selected parameters and the resulting scatter plot with the sample data divided into two clusters

Increase the number of clusters to 3 and rerun the task. MATLAB displays the updated clustered data and the cluster means in a scatter plot.

Cluster Data task showing the selected parameters and the resulting scatter plot with the sample data divided into three clusters

The task generates code in your live script. The generated code reflects the parameters and options that you select, and includes code to generate the scatter plot. To see the generated code, click Show code at the bottom of the task parameter area. The task expands to display the generated code.

Generated code for the Cluster Data task. The code uses the kmeans function to cluster the data and the scatter function to display the results.

By default, the generated code uses clusterIndices and centroids as the names of the output variables returned to the MATLAB workspace. The clusterIndices vector is a numeric column vector containing the cluster indices. Each row in clusterIndices indicates the cluster assignment of the corresponding observation. The centroids matrix is a numeric matrix containing the cluster centroid locations. To specify different output variable names, enter new names in the summary line at the top of the task. For instance, change the two variable names to c_indices and c_locations.

First row of the Cluster Data task with the renamed output c_indices and c_locations

When the task runs, the generated code is updated to reflect the new variable names. The new variables c_indices and c_locations appear in the MATLAB workspace.
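
The workflow in this example can be approximated in a plain script. The following is a minimal sketch, not the code that the task generates: the fixed random seed, the use of gscatter for plotting, and the names centroidScores, coeff, score, and mu are assumptions made for illustration.

```matlab
% Sketch of a script that approximates the k-means workflow in this example.
load fisheriris                      % provides meas (150-by-4) and species

rng("default")                       % assumption: fix the seed so reruns match
numClusters = 3;

% Cluster the observations; return cluster indices and centroid locations.
[clusterIndices, centroids] = kmeans(meas, numClusters);

% Project the data and the centroids onto the first two principal components.
[coeff, score, ~, ~, ~, mu] = pca(meas);
centroidScores = (centroids - mu) * coeff;

% Plot the clustered data and mark the cluster means.
figure
gscatter(score(:,1), score(:,2), clusterIndices)
hold on
plot(centroidScores(:,1), centroidScores(:,2), "kx", ...
    "MarkerSize", 10, "LineWidth", 2, "DisplayName", "Cluster means")
hold off
xlabel("PC1")
ylabel("PC2")
```

Because kmeans starts from random initial centroids, cluster numbering and assignments can change between runs unless the seed is fixed.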

Evaluate Optimal Number of Clusters for k-Means Clustering Using Live Editor Task

This example shows how to use the Cluster Data task to interactively evaluate clustering solutions based on selected criteria.

Load the sample data. The data contains length and width measurements from the sepals and petals of three species of iris flowers.

load fisheriris

Open the Cluster Data task. To open the task, begin typing the keyword clustering in a code block and select Cluster Data from the suggested command completions.

List of suggested command completions. The first suggestion in the list is for the Cluster Data task, and is selected.

In the task, select the k-Means Clustering algorithm. (since R2024a)

Evaluate the optimal number of clusters.

Select the meas variable as the input data.
Set the number of clusters selection method to Optimal.
Set the range min and max to 2 and 6.
On the Live Editor tab, click the Run button to run the task.

Cluster Data task showing the selected parameters

MATLAB displays a bar chart with evaluation results, indicating that, based on the Calinski-Harabasz criterion, the optimal number of clusters is 3. A scatter plot shows the clustered data and the cluster means using the optimal number of clusters, 3. Your results might differ.

Cluster Data task showing two plots. The first plot is a bar chart displaying the evaluation results for each cluster number, and the second plot is a scatter plot with the sample data divided into three clusters.
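
The evaluation in this example can be sketched with the evalclusters function. This is an illustrative script rather than the generated code; the fixed seed and the variable names evaluation and optimalK are assumptions.

```matlab
% Sketch: evaluate k-means solutions for 2 to 6 clusters on the iris data.
load fisheriris
rng("default")                       % assumption: fix the seed for repeatability

% Evaluate each candidate number of clusters with the Calinski-Harabasz criterion.
evaluation = evalclusters(meas, "kmeans", "CalinskiHarabasz", "KList", 2:6);

optimalK = evaluation.OptimalK;      % number of clusters with the best criterion value
clusterIndices = evaluation.OptimalY; % cluster assignments for the optimal solution

% Show the criterion value for each number of clusters that was inspected.
figure
bar(evaluation.InspectedK, evaluation.CriterionValues)
xlabel("Number of clusters")
ylabel("Calinski-Harabasz value")
```

The OptimalK property holds the recommended number of clusters, and OptimalY holds the corresponding cluster index for each observation.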

Specify Threshold for Hierarchical Clustering Using Live Editor Task

Since R2024a

This example shows how to use the Cluster Data task to interactively perform hierarchical clustering for a specified cluster tree cutoff.

Load the sample data. The data contains length and width measurements from the sepals and petals of three species of iris flowers.

load fisheriris

Open the Cluster Data task. To open the task, begin typing the keyword clustering in a code block and select Cluster Data from the suggested command completions.

List of suggested command completions. The first suggestion in the list is for the Cluster Data task, and is selected.

In the task, select the Hierarchical Clustering algorithm.

Cluster the data using the default number of clusters.

Select the meas variable as the input data.
Set the maximum number of clusters to 2, if necessary.
On the Live Editor tab, click the Run button to run the task.

Cluster Data task showing the selected parameters

MATLAB displays the cluster tree in a dendrogram and the clustered data in a scatter plot.

Cluster Data task showing the resulting plots with the sample data divided into two clusters

Use a cutoff to split the data into three clusters and rerun the task.

Set the selection method for the number of clusters to Manual cutoff.
Set the threshold to 1.8 and the cluster criterion to Distance. The previous dendrogram shows that this cutoff value splits the hierarchical cluster tree into three clusters.
To see the three clusters in the dendrogram, set the color threshold to 45 percent.
On the Live Editor tab, click the Run button to run the task.

Cluster Data task showing the selected parameters

MATLAB displays the updated dendrogram and scatter plot.

Cluster Data task showing the resulting plots with the sample data divided into three clusters

Generated code for the Cluster Data task. The code uses the pdist, linkage, and cluster functions to cluster the data and the dendrogram function to display the results.

Generated code for the Cluster Data task. The code uses the gscatter function to display the results.

By default, the generated code uses clusterIndices as the name of the output variable returned to the MATLAB workspace. The clusterIndices vector is a numeric column vector containing the cluster indices. Each row in clusterIndices indicates the cluster assignment of the corresponding observation. To specify a different output variable name, enter a new name in the summary line at the top of the task. For instance, change the variable name to c_indices.

When the task runs, the generated code is updated to reflect the new variable name. The new variable c_indices appears in the MATLAB workspace.
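
The cutoff-based workflow in this example can be sketched with the pdist, linkage, cluster, and dendrogram functions. This is an approximation, not the generated code: the Euclidean distance metric, the average linkage method, and the choice of the first two columns for the scatter plot are assumptions, so the number of clusters you obtain might differ from the task output.

```matlab
% Sketch: hierarchical clustering with a manual distance cutoff.
load fisheriris

% Assumptions: Euclidean distance and average linkage; the task defaults may differ.
distances = pdist(meas, "euclidean");
tree = linkage(distances, "average");

% Cut the tree where the linkage height reaches the threshold.
threshold = 1.8;
clusterIndices = cluster(tree, "Cutoff", threshold, "Criterion", "distance");

% Color the dendrogram branches below 45 percent of the maximum linkage distance.
figure
dendrogram(tree, "ColorThreshold", 0.45 * max(tree(:,3)))

% Plot the first two input columns, grouped by cluster index.
figure
gscatter(meas(:,1), meas(:,2), clusterIndices)
xlabel("Column 1")
ylabel("Column 2")
```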

Parameters

`Input data` — Data to cluster
numeric matrix

Specify the data to cluster by selecting a variable from the available workspace variables. The variable must be a numeric matrix to appear in the list.

`Selection Method` — Cluster selection method
`Manual` | `Optimal` | `Manual num clusters` | `Manual cutoff` | `Optimal num clusters`

Specify the method for determining the number of clusters for your data. The available methods depend on the selected clustering algorithm.

k-Means Clustering Options

Manual (default) — Specify the number of clusters to group your data into manually.
Optimal — Use the evalclusters function to find the optimal number of clusters based on criteria such as gap values, silhouette values, Davies-Bouldin index values, and Calinski-Harabasz index values.

Hierarchical Clustering Options

Manual num clusters (default) — Specify the maximum number of clusters to group your data into manually.
Manual cutoff — Specify the threshold for cutting the hierarchical cluster tree and determining the number of clusters to group your data into manually. If you use the Inconsistency criterion, then the Cluster Data task groups clusters whose subclusters have inconsistency coefficients less than the threshold. If you use the Distance criterion, then the Cluster Data task groups clusters whose subclusters have a height less than the threshold.
Optimal num clusters — Use the evalclusters function to find the optimal number of clusters based on criteria such as gap values, silhouette values, Davies-Bouldin index values, and Calinski-Harabasz index values.
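
As a rough guide to how the manual hierarchical options relate to the cluster function, consider the following sketch. The average linkage method and the threshold values 1.8 and 1.2 are illustrative assumptions, not task defaults.

```matlab
% Sketch: manual selection methods expressed as calls to the cluster function.
load fisheriris
tree = linkage(meas, "average");     % assumption: average linkage, Euclidean distance

% Manual num clusters: specify the maximum number of clusters.
idxMaxClust = cluster(tree, "MaxClust", 2);

% Manual cutoff with the Distance criterion: threshold on linkage height.
idxDistance = cluster(tree, "Cutoff", 1.8, "Criterion", "distance");

% Manual cutoff with the Inconsistency criterion: threshold on the
% inconsistency coefficient (1.2 is a hypothetical value).
idxInconsistent = cluster(tree, "Cutoff", 1.2, "Criterion", "inconsistent");

% Compare how many clusters each method produces.
numClustersFound = [numel(unique(idxMaxClust)), ...
                    numel(unique(idxDistance)), ...
                    numel(unique(idxInconsistent))];
disp(numClustersFound)
```

Specifying a maximum number of clusters fixes an upper bound directly, whereas a cutoff lets the structure of the tree determine how many clusters result.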

`Range` — List of number of clusters to evaluate
min and max positive integer values

Specify the numbers of clusters to evaluate as a range defined by a min value and a max value. For example, if you specify a min value of 2 and a max value of 6, the task evaluates solutions with 2, 3, 4, 5, and 6 clusters to determine the optimal number.

For k-means clustering, the default range is 2:5. For hierarchical clustering, the default range is 2:3.
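
The range corresponds to a list of consecutive integers from the min value to the max value. The following sketch passes such a list to evalclusters; the variable names rangeMin, rangeMax, and kList, and the choice of the Davies-Bouldin criterion with linkage-based clustering, are assumptions for illustration.

```matlab
% Sketch: expand a min/max range into the list of cluster counts to evaluate.
load fisheriris

rangeMin = 2;
rangeMax = 6;
kList = rangeMin:rangeMax;           % evaluates 2, 3, 4, 5, and 6 clusters

% Evaluate hierarchical (linkage-based) solutions with the Davies-Bouldin criterion.
evaluation = evalclusters(meas, "linkage", "DaviesBouldin", "KList", kList);

% List the criterion value for each inspected number of clusters.
results = table(evaluation.InspectedK(:), evaluation.CriterionValues(:), ...
    'VariableNames', {'NumClusters', 'CriterionValue'});
disp(results)
```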

`Display results` — Plots of results
check boxes

To display the clustered data, select from the available options.

k-Means Clustering Options

Select 2D scatter plot (PCA) to display the principal components of the clustered data in a 2D scatter plot. The Cluster Data task uses the pca and gscatter functions to create the scatter plot.
Select Matrix of scatter plots to display the clustered data in a matrix of scatter plots. When you select Matrix of scatter plots, a list appears to the right of the check box. Each item in the list represents a column in the specified input data. Press the Ctrl key and select a maximum of four input data columns from the list. The Cluster Data task uses the gplotmatrix function to create the matrix of scatter plots from the selected columns.
The scatter plots in the matrix compare the selected input data columns across cluster indices. The diagonal plots in the matrix are histograms showing the distribution of the selected columns for each cluster index.

For both plots, you can choose whether to display the clustered data and the cluster means.
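
A matrix of scatter plots similar to the one described above can be sketched with gplotmatrix. The fixed seed, the number of clusters, and the variable selectedColumns are assumptions for illustration; the task can generate different code.

```matlab
% Sketch: matrix of scatter plots for selected columns, grouped by cluster index.
load fisheriris
rng("default")                       % assumption: fix the seed for repeatability
clusterIndices = kmeans(meas, 3);

% Choose up to four input data columns to compare.
selectedColumns = [1 2 3 4];

% Off-diagonal plots compare pairs of columns; diagonal plots show the
% per-cluster distribution of each column.
figure
gplotmatrix(meas(:, selectedColumns), [], clusterIndices)
```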

Hierarchical Clustering Options

Select Dendrogram to display the hierarchical cluster tree. When you select Dendrogram, three parameters appear to the right of the check box. The first parameter specifies the color threshold as a percentage of the maximum (linkage) distance in the tree. The second parameter controls the maximum number of leaf nodes to display in the tree. The third parameter changes the orientation of the tree to Top, Bottom, Left, or Right. The Cluster Data task uses the dendrogram function to create the plot. The dendrogram is not available when you use the Optimal num clusters selection method.
Select 2D scatter plot to display the clustered data in a 2D scatter plot. When you select 2D scatter plot, two lists appear to the right of the check box. The items in the lists represent columns in the specified input data. The first list determines the x-axis variable in the plot, and the second list determines the y-axis variable. The Cluster Data task uses the gscatter function to create the scatter plot.
Instead of selecting 2D scatter plot, you can select 3D scatter plot to display the clustered data in a 3D scatter plot. When you select 3D scatter plot, three lists appear to the right of the check box. The lists determine the x-axis, y-axis, and z-axis variables. The Cluster Data task uses the scatter3 function to create the scatter plot.
Select Matrix of scatter plots to display the clustered data in a matrix of scatter plots. When you select Matrix of scatter plots, a list appears to the right of the check box. Each item in the list represents a column in the specified input data. Press the Ctrl key and select a maximum of four input data columns from the list. The Cluster Data task uses the gplotmatrix function to create the matrix of scatter plots from the selected columns.
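
The hierarchical display options correspond to arguments of the dendrogram, gscatter, and scatter3 functions. The following sketch shows one possible combination; the average linkage method, the 45 percent color threshold, the leaf node limit of 30, and the chosen columns are assumptions for illustration.

```matlab
% Sketch: display options for hierarchical clustering results.
load fisheriris
tree = linkage(meas, "average");     % assumption: average linkage, Euclidean distance
clusterIndices = cluster(tree, "MaxClust", 3);

% Dendrogram: color threshold as a fraction of the maximum linkage distance,
% at most 30 leaf nodes, and the tree oriented to the left.
colorThreshold = 0.45 * max(tree(:,3));
figure
dendrogram(tree, 30, "ColorThreshold", colorThreshold, "Orientation", "left")

% 2D scatter plot: choose the x-axis and y-axis columns.
figure
gscatter(meas(:,3), meas(:,4), clusterIndices)

% 3D scatter plot: choose the x-axis, y-axis, and z-axis columns.
figure
scatter3(meas(:,1), meas(:,2), meas(:,3), 20, clusterIndices, "filled")
```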

Tips

By default, the Cluster Data task does not automatically run when you modify the task parameters. To have the task run automatically after any change, select the Autorun box at the top right of the task. If your data set is large, do not enable this option, because the task then reruns the clustering after every parameter change.

Version History

Introduced in R2021b

R2024a: Cluster data using hierarchical clustering

You can now use the Cluster Data Live Editor Task to interactively perform hierarchical clustering in a live script.

Select the maximum number of clusters, or specify an appropriate cutoff for the underlying hierarchical cluster tree (dendrogram). Optionally, specify the metric for computing the distance between observations and the method for computing the distance between clusters. The task plots the dendrogram, allowing you to interactively explore the effects of changing parameter values and options.
Alternatively, evaluate the optimal number of clusters. You can optionally specify the criterion for defining clusters in the hierarchical cluster tree. In this case, the task does not plot the dendrogram. Use scatter plots to visualize the clusters.

The task automatically generates code that becomes part of your live script.
