Question about k-means clustering
How can we cluster a data set with k-means (k = 2) using all of its columns? The data set is here: https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/
7 Comments
KALYAN ACHARJYA
on 3 Jan 2021
Edited: KALYAN ACHARJYA
on 3 Jan 2021
What is the problem? Issue with dataset or k-means?
Note: if you want help, then you need to make it easy to be helped.
Eeengineer
on 3 Jan 2021
Image Analyst
on 3 Jan 2021
I saved "hepatitis.data" from that web site, but this code didn't work with it:
load('hepatitis.data')
X=hepatitis(:,16:17);
figure;
plot(X,'k*');
title 'Hepatitis Data';
hold on;
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','sqeuclidean',...
'Replicates',5,'Options',opts);
Please post the actual data file and code that actually works with it.
Image Analyst
on 3 Jan 2021
Doesn't run. load() doesn't work. You're not making it easy for us, are you? I'll try to fix it. In the meantime, edit your post and format your code as code by highlighting it and clicking the code icon.
Image Analyst
on 3 Jan 2021
Come on Eeengineer. Please don't waste my time when I try to help you. I used xlsread() instead of load() and that got the data in, but there is no 17th column. Please fix or post your actual code. I'm going to do other stuff now and I'll check back later.
clear all;
close all;
clc;
format long g;
format compact;
fontSize = 15;
fprintf('Beginning to run %s.m ...\n', mfilename);
hepatitis = xlsread('hepatitis.xlsx')
X = hepatitis(:,16:17)
plot(X,'k*');
title 'Hepatitis Data';
hold on;
idx=kmeans(X,2);
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','sqeuclidean',...
'Replicates',5,'Options',opts);
figure;
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(C(:,1),C(:,2),'kx',...
'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
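As a side note on why load() fails: the raw file on the UCI page appears to be comma-separated text with a ? wherever a value is missing. That layout is an assumption here, so check it against your download. A minimal sketch of parsing such a file with readmatrix (R2019a or later) follows. The two rows written to the demo file are made up purely to show the parsing; point the call at 'hepatitis.data' to read the real file.

```matlab
% Sketch only. Assumed layout: comma-delimited text, '?' = missing value.
% The two rows below are made-up sample rows, not real records.
demoFile = fullfile(tempdir, 'hepatitis_demo.data');
fid = fopen(demoFile, 'w');
fprintf(fid, '2,30,2,1.00,?\n1,45,1,?,4.0\n');
fclose(fid);

% The .data extension is not recognized, so state the file type explicitly.
M = readmatrix(demoFile, ...
    'FileType', 'text', ...
    'Delimiter', ',', ...
    'TreatAsMissing', '?');

% Missing entries come back as NaN; count them per column.
numMissing = sum(isnan(M), 1);
fprintf('Read %d rows and %d columns.\n', size(M, 1), size(M, 2));
disp(numMissing);
```

Reading the text file directly avoids the detour through an .xlsx copy, which may not keep every column of the original.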
Image Analyst
on 3 Jan 2021
Only columns 2 and 15 look like there is any real data in them. The rest of the columns just have 1, 2, or NaN in them. Which columns do you want to take as "observations"? Are all of them observations, or just columns 2 and 15?
If I make a 3-D scatter plot of columns 1, 2, and 15, I see this:
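One way to check this kind of thing without eyeballing the table is to count the distinct non-NaN values in each column. This is a small sketch on a made-up matrix (not the real data), where a column with more than two distinct values is treated as "real" numeric data. That threshold is an assumption for illustration.

```matlab
% Made-up stand-in for the imported matrix: columns 1 and 3 are 1/2 codes,
% columns 2 and 4 are continuous, with a couple of NaNs.
data = [2 30 1   1.0;
        1 45 2   NaN;
        2 52 2   3.9;
        2 38 1   0.7;
        1 61 NaN 2.2];

% Count distinct values per column, ignoring NaN.
nUnique = zeros(1, size(data, 2));
for k = 1:size(data, 2)
    col = data(:, k);
    nUnique(k) = numel(unique(col(~isnan(col))));
end

% Columns with more than two distinct values are not simple 1/2 codes.
continuousCols = find(nUnique > 2);
disp(nUnique);
disp(continuousCols);
```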
hepatitis = xlsread('hepatitis.xlsx')
x = hepatitis(:,1);
y = hepatitis(:, 2);
z = hepatitis(:, 15);
scatter3(x, y, z, 'Filled');
title('Hepatitis Data', 'FontSize', 20);
xlabel('Column 1', 'FontSize', 20);
ylabel('Column 2', 'FontSize', 20);
zlabel('Column 15', 'FontSize', 20);

So where are the clusters? If you're going to include columns 1 and 3-14, and 16 in the observations, then the clusters might be dominated by what's in those columns since they're very discrete - either 1 or 2. Looking at just columns 2 and 15, it doesn't look like there are any meaningful clusters.
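If you do want to use every column with k = 2, here is a rough sketch of one way to set it up. It runs on a made-up stand-in matrix, not the real data, and it makes two choices that are my assumptions rather than anything from the original post: NaNs are filled with the column median, and every column is standardized with zscore so that no column dominates the distance just because of its scale. Standardizing does not make the 1/2-coded columns any less discrete, so it does not guarantee that the resulting clusters mean anything.

```matlab
rng(0);                                  % reproducible demo
n = 100;
% Made-up stand-in: one age-like column, three 1/2-coded columns,
% and one continuous column with some missing values.
data = [randi([20 70], n, 1), randi([1 2], n, 3), 0.5 + 4*rand(n, 1)];
data(randperm(n, 10), 5) = NaN;

% kmeans drops rows containing NaN (their idx comes back NaN), so either
% remove those rows or impute. Here: fill NaN with the column median.
colMedian = median(data, 1, 'omitnan');
filled = data;
for k = 1:size(data, 2)
    filled(isnan(filled(:, k)), k) = colMedian(k);
end

% Standardize each column to zero mean and unit standard deviation.
Z = zscore(filled);

opts = statset('Display', 'final');
[idx, C] = kmeans(Z, 2, 'Distance', 'sqeuclidean', ...
    'Replicates', 5, 'Options', opts);

% Number of observations assigned to each cluster.
counts = accumarray(idx, 1);
disp(counts);
```

Note that the centroids in C are in standardized units, so they need to be converted back with the column means and standard deviations before comparing them to the raw values.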
Eeengineer
on 3 Jan 2021
Answers (2)
Eeengineer
on 3 Jan 2021
0 votes
Eeengineer
on 3 Jan 2021
0 votes