How to Cluster Dataset and remove outlier in MATLAB

I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:

Determine the optimal number of clusters: The elbow method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking for the "elbow" point where the rate of decrease sharply changes. This point is often considered a good choice for the number of clusters.
Perform K-means clustering: After determining the optimal number of clusters, perform k-means clustering.
Removing outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points that are farthest from the centroid beyond a certain threshold.

Please refer to the below code snippet that illustrates the above workflow:

data = Dataset;
wcss = [];
for k = 1:10 % Test up to 10 clusters
    [idx, C, sumd] = kmeans(data, k, 'Replicates', 10);
    wcss(k) = sum(sumd);
end
plot(1:10, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = % the optimal number of clusters you determined
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);

Hope it helps!

2 Kommentare
Keine anzeigenKeine ausblenden

Med Future am 23 Apr. 2024

Bearbeitet: Walter Roberson am 24 Apr. 2024

In MATLAB Online öffnen

Question.mat

@Sai Pavan

I have implement the code you shared with my code. But still there is an error Arrays have incompatible sizes for this operation. I have attached the dataset and the code below. Please modified the code for that. As i know the ground truth there should be only 1 cluster the remaining are the noise. Based on the distance calculation

load Question
dataset1=data(:,[2 4]);
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
   clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);

Med Future am 24 Apr. 2024

@Image Analyst @Walter Roberson Can you please look it how to solve this issue?

Melden Sie sich an, um zu kommentieren.

Answer 2

Walter Roberson am 24 Apr. 2024

0
Verknüpfen

Direkter Link zu dieser Antwort

https://de.mathworks.com/matlabcentral/answers/1894680-how-to-cluster-dataset-and-remove-outlier-in-matlab#answer_1447371

Verschoben: Walter Roberson am 24 Apr. 2024

In MATLAB Online öffnen

Question.mat

load Question
dataset1=data(:,[2 4]);

dataset1 is created from 2 columns of data

% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);

C is created from dataset1 so it has two columns

dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
   clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);

data has 6 columns, so clusterPoints has 6 columns

centroid = C(i, :);

centroid is created from C so it has two columns

    whos clusterPoints centroid
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));

You are trying to subtract something with 2 columns from something with 6 columns, which is an error

end
  Name                 Size            Bytes  Class     Attributes

  centroid             1x2                16  double              
  clusterPoints      177x6              8496  double              
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Med Future am 25 Apr. 2024

@Walter Roberson Thank you for explaining it that much. Basically the problem is to reassign the clusters which are already made by K-means. means i want to remove the outliers. as you see the solution the each distance of each centroid from the clusterpoints are recalculated by facing the error. can you please help me to solve this problem.

Melden Sie sich an, um zu kommentieren.

How to Cluster Dataset and remove outlier in MATLAB

2 Kommentare
Keine anzeigenKeine ausblenden

Antworten (2)

2 Kommentare
Keine anzeigenKeine ausblenden

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Siehe auch

Kategorien

Tags

Produkte

Version

Community Treasure Hunt

How to Cluster Dataset and remove outlier in MATLAB

2 Kommentare Keine anzeigenKeine ausblenden

Antworten (2)

2 Kommentare Keine anzeigenKeine ausblenden

1 Kommentar -1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden

Siehe auch

Kategorien

Tags

Produkte

Version

Community Treasure Hunt

2 Kommentare
Keine anzeigenKeine ausblenden

2 Kommentare
Keine anzeigenKeine ausblenden

1 Kommentar
-1 ältere Kommentare anzeigen-1 ältere Kommentare ausblenden