Matching the labels of a clustering with ground truth labels for performance analysis

Question

Samuel L. Polk on 9 Oct 2021

https://de.mathworks.com/matlabcentral/answers/1470216-matching-the-labels-of-a-clustering-with-ground-truth-labels-for-performance-analysis

Answered: Shubham on 1 Mar 2024

I'm clustering a dataset for which I have ground truth labels. I want to compute the confusion matrix between the predicted and ground truth labels, but the labels assigned by my clustering algorithm are not necessarily numerically the same as the ground truth labels.

For example, suppose that I assign labels cHat=[2,2,1,1] using some clustering algorithm, but the true labels are cTrue=[1,1,2,2]. I've achieved perfect performance (splitting the two clusters apart). However, if I construct a confusion matrix before any preprocessing steps, I will get a matrix of the form: C = [[0,2]; [2,0]], implying zero accuracy. How do I preprocess my cHat to better match the labels of cTrue?

I understand that there are other ways to evaluate clustering accuracy (such as the Rand Index, NMI, etc.). However, the metrics that I need to use (Overall Accuracy, Average Accuracy, and Kappa coefficient) rely on the construction of a confusion matrix.


Answer 1

Shubham on 1 Mar 2024

https://de.mathworks.com/matlabcentral/answers/1470216-matching-the-labels-of-a-clustering-with-ground-truth-labels-for-performance-analysis#answer_1419411

Hi Samuel,

When evaluating clustering results with a confusion matrix, the absolute values of the labels are not important; rather, it's the consistency of labeling within the clusters that matters. Since clustering algorithms like k-means arbitrarily assign numerical labels to clusters, these labels often do not match the ground truth labels. To compare the predicted labels with the ground truth, you need to align the labels first.

One common approach to align the labels is to use the Hungarian algorithm, also known as the Kuhn-Munkres algorithm, which can solve the assignment problem in polynomial time. The algorithm finds the best one-to-one mapping between two sets of labels that minimizes the total cost (or maximizes the total similarity).
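To make "best one-to-one mapping" concrete, here is a minimal sketch that finds it by exhaustive search over all label permutations. This is not the Hungarian algorithm, only an illustration of the objective it optimizes, and it is practical only for a handful of clusters. The example vectors, the variable names (N, bestMap, cAligned), and the assumption that both label vectors use integers 1..K are mine, not from the question.

```matlab
% Illustration only: brute-force search over permutations (K! candidates).
cTrue = [1 1 2 2 3 3];
cHat  = [2 2 3 3 1 1];
K = max([cTrue, cHat]);               % assumes integer labels 1..K

% N(i,j) = number of samples with predicted label i and true label j
N = accumarray([cHat(:), cTrue(:)], 1, [K K]);

P = perms(1:K);                       % row r renames predicted label i to P(r,i)
bestAgreement = -1;
bestMap = [];
for r = 1:size(P, 1)
    % Number of samples that agree with the truth under this renaming
    agreement = sum(N(sub2ind([K K], 1:K, P(r, :))));
    if agreement > bestAgreement
        bestAgreement = agreement;
        bestMap = P(r, :);
    end
end

cAligned = bestMap(cHat);             % predictions expressed in true-label names
```

The Hungarian algorithm reaches the same optimum without enumerating every permutation, which matters once the number of clusters grows.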

Here's a theoretical approach to preprocess cHat to better match the labels of cTrue:

1. Construct a contingency table (similar to a confusion matrix) where each cell (i, j) represents the number of samples in predicted cluster i and true cluster j.
2. Use the Hungarian algorithm to find the optimal one-to-one mapping between the predicted labels and the true labels based on the contingency table. The goal is to maximize the sum of the diagonal (which represents correctly clustered samples) in the confusion matrix.
3. According to the mapping obtained from the Hungarian algorithm, relabel the predicted clusters so that they correspond to the true clusters as closely as possible.
4. With the relabeled predicted clusters, construct the confusion matrix again. This time, the diagonal should reflect the actual number of correctly clustered samples.
5. Now that you have a properly aligned confusion matrix, you can calculate Overall Accuracy, Average Accuracy, and the Kappa coefficient as needed.
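For the last step, the three metrics can be computed directly from an aligned confusion matrix. The sketch below uses a hypothetical 3-class matrix made up purely for illustration, and it assumes that Average Accuracy means the mean of the per-class accuracies (per-class recall) and that Kappa means Cohen's kappa; check these against the definitions you are required to use.

```matlab
% Hypothetical aligned confusion matrix (rows: true class, columns: predicted).
C = [45  5  0;
      3 40  7;
      2  0 48];

n  = sum(C(:));                            % total number of samples
oa = trace(C) / n;                         % Overall Accuracy

perClass = diag(C) ./ sum(C, 2);           % fraction of each true class recovered
aa = mean(perClass);                       % Average Accuracy (assumed definition)

pe    = sum(sum(C, 1)' .* sum(C, 2)) / n^2;  % agreement expected by chance
kappa = (oa - pe) / (1 - pe);              % Cohen's kappa
```

Note that perClass divides by the row sums, so a true class with no samples would produce NaN and needs separate handling.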

In MATLAB, the matchpairs function solves this linear assignment problem. It ships with base MATLAB (R2019a and later), so the Optimization Toolbox is not required.
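A minimal sketch of the whole procedure with matchpairs might look like the following. It assumes both label vectors use integers 1..K with the same number of clusters and classes; the example data and variable names (N, Cost, map, cAligned) are illustrative. Because matchpairs minimizes cost by default, the counts are converted to costs first, and costUnmatched is chosen large enough that leaving a label unmatched is never cheaper than matching it.

```matlab
cTrue = [1 1 2 2 3 3 3];
cHat  = [2 2 3 3 1 1 3];             % last sample is genuinely misclustered
K = max([cTrue, cHat]);              % assumes integer labels 1..K in both vectors

% Contingency table: N(i,j) = # samples with predicted label i and true label j
N = accumarray([cHat(:), cTrue(:)], 1, [K K]);

% Convert counts to nonnegative costs: a large overlap means a low cost.
Cost = max(N(:)) - N;
costUnmatched = max(Cost(:)) + 1;    % forces every predicted label to be matched
M = matchpairs(Cost, costUnmatched); % M(:,1) = predicted label, M(:,2) = true label

% Relabel the predictions according to the matching
map = zeros(1, K);
map(M(:, 1)) = M(:, 2);
cAligned = map(cHat);

% Aligned confusion matrix (rows: true class, columns: predicted class)
C = accumarray([cTrue(:), cAligned(:)], 1, [K K]);
overallAccuracy = trace(C) / numel(cTrue);
```

If the Statistics and Machine Learning Toolbox is available, confusionmat(cTrue, cAligned) can replace the final accumarray call. When the clustering produces a different number of clusters than there are classes, the contingency table is no longer square and some labels stay unmatched, which this sketch does not handle.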
