How to split a column's elements into two vectors based on labels?

I attached part of the lung dataset (32x57). Its last column holds the labels (1 or 2). I want to split each column into two vectors based on the labels:
F(i).normal, a vector holding the column's elements with label 1,
F(i).tumor, a vector holding the column's elements with label 2.
I attached my MATLAB code.
The code that adds each column's elements to a vector does not seem to be correct. I would be very grateful for your opinion.
close all;
clc
load lung.mat
F=lung;
[n,m]=size(F);
for i=1: m
s1=0; s2=0;
for j=1: n
if (F(j,m)==1)
for z=1:s1
F(i).normal(z)=F(j,i);
s1=s1+1;
end
else
for x=1:s2
F(i).tumor(x)=F(j,i);
s2=s2+1;
end
end
end
end

Accepted Answer

You didn't attach lung.mat. But is this what you want:
% Create sample data.
data = randi(9, 32, 57); % Random integers in the range 1-9.
data(:, end) = randi(2, 32, 1); % Last column is 1 or 2 ONLY.
% Find out what rows are labeled 1 and 2
% by looking in the last column.
rowsLabeled1 = data(:, end) == 1;
rowsLabeled2 = data(:, end) == 2;
% Extract rows labeled 1 and 2 into their own matrices.
data1 = data(rowsLabeled1, :);
data2 = data(rowsLabeled2, :);
% You can get vectors from each column by extracting it into a new variable
% e.g. to get 2 vectors for column 5, do
col51 = data1(:, 5); % Get col 5 with label 1.
col52 = data2(:, 5); % Get col 5 with label 2.
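If the struct layout from the question (one normal and one tumor vector per feature column) is still wanted, logical indexing can build it directly. This is only a sketch on made-up data: the struct name S and the small 6-by-4 matrix are assumptions for illustration, with label 1 taken as normal and label 2 as tumor, as in the question.

```matlab
% Made-up data: 6 samples, 3 feature columns, labels in the last column.
data = [10 20 30 1;
        11 21 31 2;
        12 22 32 1;
        13 23 33 2;
        14 24 34 2;
        15 25 35 1];
labels = data(:, end);
numFeatures = size(data, 2) - 1;   % Exclude the label column.
% Preallocate a 1-by-numFeatures struct array (S is a hypothetical name).
S = repmat(struct('normal', [], 'tumor', []), 1, numFeatures);
for col = 1 : numFeatures
    S(col).normal = data(labels == 1, col);  % Rows labeled 1.
    S(col).tumor  = data(labels == 2, col);  % Rows labeled 2.
end
```

A name other than F is used here because in the question F already holds the numeric matrix, and a numeric array cannot be given fields such as .normal. Note also that in the question's code the inner loops run over 1:s1 and 1:s2 with s1 and s2 starting at 0, so their bodies never execute.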

14 Comments

Thank you very much for your guidance.
I attached the lung dataset.
I ran this program:
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
rowslabled1=data(:,m)==1;
rowslabled2=data(:,m)==2;
data1=data(rowslabled1,:);
data2=data(rowslabled2,:);
col561=data1(:,56);
col562=data2(:,56);
for i=1: m
col1(i)=data1(:,i);
col2(i)=data2(:,i);
end
col51 and col52 are the two column variables for the fifth column. I used a loop to generalize these variables to all columns of the dataset (columns 1 to 56, excluding the 57th column, which holds the labels), but it showed this error:
Subscripted assignment dimension mismatch.
The variable 'col1' appears to change size on every loop iteration (within a script). Consider preallocating for speed.
Still not attached. There is a 5 MB limit, so if your dataset is bigger than that, consider posting just a subset.
That for loop is totally not the right thing to do.
I suggest you delete the last 6 lines of your program and just use data1 and data2. You can do everything you need just by leaving them in those two matrices.
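For context on the error: col1(i) refers to a single element, while data1(:,i) is a whole column, so the assignment fails with a dimension mismatch. If separate per-column storage is really wanted, a cell array can hold one vector per column. A small sketch on made-up data (the names col1 and col2 follow the program above; the data itself is an assumption):

```matlab
data = [5 7 1; 6 8 2; 9 4 1; 3 2 2; 1 1 2];  % Last column holds the labels.
data1 = data(data(:, end) == 1, :);
data2 = data(data(:, end) == 2, :);
numFeatures = size(data, 2) - 1;             % Skip the label column.
col1 = cell(1, numFeatures);   % Preallocating also avoids the size-change warning.
col2 = cell(1, numFeatures);
for i = 1 : numFeatures
    col1{i} = data1(:, i);     % Curly braces store the whole vector in cell i.
    col2{i} = data2(:, i);
end
```

Even so, indexing data1 and data2 directly usually keeps the code simpler than maintaining the extra cell arrays.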
Thank you very much
Is it right to use the column vectors data1(:,i) and data2(:,i) directly in the loop, instead of the variables col1(i) and col2(i)?
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
rowslabled1=data(:,m)==1;
rowslabled2=data(:,m)==2;
data1=data(rowslabled1,:);
data2=data(rowslabled2,:);
for i=1: m
data1(:,i);
data2(:,i);
end
Yes, that's the way to do it. Just use data1(:, columnNumber) or data2(:, columnNumber) whenever you need to get the column with just 1 or 2 labels. No need to compute new variables for every single column. If you want to you can just store them and overwrite them for each iteration when you're processing a new column.
for col = 1 : m
thisCol1 = data1(:, col);
thisCol2 = data2(:, col);
% Now use thisCol1 and thisCol2 in whatever way you want.
end
Thank you very much
To compute the Jaccard distance measure between thisCol1 and thisCol2 for all of the columns, I used the pdist2 function.
To check the result, I typed d at the command line.
It returned distances at first, but now it shows this error:
Undefined function or variable 'd'.
Is it correct to use just pdist2, or should I write the distance formula myself?
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
rowslabled1=data(:,m)==1;
rowslabled2=data(:,m)==2;
data1=data(rowslabled1,:);
data2=data(rowslabled2,:);
for i=1: m
thisCol1=data1(:,i);
thisCol2=data2(:,i);
d=pdist2(thisCol1,thisCol2,'jaccard');
end
This is how it SHOULD go, but your attached lung1.mat does not have any data labeled 1.
close all;
clc
s = load('lung1.mat')
data = s.lung1;
[rows, columns]=size(data)
rowslabled1 = data(:, end) == 1;
rowslabled2 = data(:, end) == 2;
data1 = data(rowslabled1,:);
data2 = data(rowslabled2,:);
data1 = data2; % Stand-in so the loop has data to process, since the attached file has no rows labeled 1.
for col = 1: columns
thisCol1 = data1(:,col);
if isempty(thisCol1)
fprintf('No data labeled 1 in column %d. Skipping column %d.\n', col, col);
continue;
end
thisCol2 = data2(:,col);
if isempty(thisCol2)
fprintf('No data labeled 2 in column %d. Skipping column %d.\n', col, col);
continue;
end
distances = pdist2(thisCol1, thisCol2, 'jaccard'); % A 2-D array
end
but I'm not sure what you want to do with the distances matrix.
I attached the figure of plot(distances).
My aim is to compute the distances between the elements of a column that have label 1 and the elements of the same column that have label 2 (thisCol1 and thisCol2).
That means the final number of distance values is m (one for each column), and I then want to sort the columns in descending order based on these distances.
Do you think there might be a problem with the plotted figure?
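A pairwise call such as pdist2(thisCol1, thisCol2) returns an n1-by-n2 matrix rather than one number, so getting a single value per column needs some summary of that matrix. The sketch below assumes the mean pairwise Euclidean distance as that summary, which is only one possible choice and not something settled in this thread. The data is made up, and pdist2 requires the Statistics and Machine Learning Toolbox.

```matlab
% Made-up data: 3 feature columns plus a label column (1 or 2).
data = [1 10 5 1;
        2 10 6 1;
        9 10 5 2;
        8 10 7 2;
        7 10 6 2];
data1 = data(data(:, end) == 1, 1:end-1);   % Features of rows labeled 1.
data2 = data(data(:, end) == 2, 1:end-1);   % Features of rows labeled 2.
numFeatures = size(data1, 2);
score = zeros(1, numFeatures);
for col = 1 : numFeatures
    % n1-by-n2 matrix of pairwise distances (Euclidean by default).
    d = pdist2(data1(:, col), data2(:, col));
    score(col) = mean(d(:));   % Assumed summary: mean pairwise distance.
end
% Rank the feature columns from largest to smallest score.
[sortedScores, featureOrder] = sort(score, 'descend');
```

As far as I can tell, with a single value per observation the 'jaccard' option can only report whether two values differ, so a metric that uses the size of the difference may be more informative for this kind of ranking.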
What do the columns represent? They are a single number, so they are not (x,y) locations. Are they features? Like mean gray level or something? Are different columns different things, like col 1 is mean gray level, col2 is std dev, and col 3 is entropy?
The columns represent the results of medical tests, such as diabetes and cholesterol, for each sample, and the label 1 or 2 shows whether the sample is normal or ill.
Well that's different than what I was assuming. I thought you had a bunch of columns and the last column said whether it was class 1 or 2. Now it sounds like you have a label column for each of the test results columns. Is that correct?
So, let's say for the cholesterol column, you will have a companion column for normal or abnormal, say below or above 200. And you're looking to find the distances from each abnormal cholesterol value to each normal cholesterol value? So if you had 4 normal values and 3 abnormal values, you'd have 4 * 3 = 12 distance measurements? I think that makes little sense, but is that what you want to do? If so, why?
phdcomputer Eng on 29 Dec 2018
Edited: phdcomputer Eng on 29 Dec 2018
Yes, that's right.
As you said, let's consider that the 3 normal values make up a vector thisCol1 and the 4 abnormal values make up another vector thisCol2.
Now I want to compute the distance between these two vectors for every column (feature) of the data.
I want to plot the distances to see the values and maybe find a threshold for them.
You already know how to use pdist2, and you can plot all those distances, and even get a histogram of them. If you want to split into two zones, you can use graythresh(), imbinarize() or kmeans(), though like before I think that makes little to no sense. You still haven't explained why. Anyway, you should use a fixed threshold for consistency. Using an automatic threshold that varies depending on how many points are class 1 or class 2 is not good for comparing data sets. What if the distances were normally distributed? What does that mean? The numbers are uniformly distributed??? What if the distances had two clusters? What does that mean? That the measurements were in two tight clusters? It seems that by having the data for that measurement already labeled that someone has already somehow thresholded something, and it's probably the values themselves rather than the distance between them. But go ahead and do it and show us the values and the histograms, and the distance values and the distance value histogram and we can see if the distance histogram gives any additional insight.
It would be easy for you to make up data sets that range from clustered to uniformly distributed and compute the distances in each case. For example, in my K Nearest Neighbor demo, I create two classes, each with a spread, and a separation between the two classes. Though it's in 2-D for 2 variables. You could actually just make two classes in 1-D simply by using rand() and randn() and setting the mean and spread for each class.
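As a rough illustration of that last suggestion, two 1-D classes with a chosen mean and spread can be drawn with randn() and compared pairwise. The means, standard deviations, and sample counts below are made-up values, and the absolute difference is used as the 1-D distance so that no toolbox is needed.

```matlab
rng(0);                     % Fixed seed so the run is repeatable.
numClass1 = 120;
numClass2 = 80;
mean1 = 25;  sigma1 = 5;    % Assumed mean and standard deviation for class 1.
mean2 = 75;  sigma2 = 5;    % Assumed mean and standard deviation for class 2.
class1Values = mean1 + sigma1 * randn(numClass1, 1);
class2Values = mean2 + sigma2 * randn(numClass2, 1);
% All pairwise absolute differences (120-by-80), using implicit expansion (R2016b or later).
distances = abs(class1Values - class2Values.');
histogram(distances(:));
xlabel('Distance');
ylabel('Count');
```

Changing the two means or sigmas and rerunning shows how the overlap between the classes changes the shape of the distance histogram.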
OK, I programmed up a simple Monte Carlo simulation for you with uniform, non-overlapping distributions for two classes. It is attached. You can see the measurement values, the distance values, and the histogram of the distance values. I think you can do a lot of your experimentation and discovery of insights just by trying different distributions in a Monte Carlo fashion. For example, maybe the distribution of distances is the convolution of the two measurement class distributions. What do you think?
% Program to do a Monte Carlo simulation of measurements between two classes of patients.
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
imtool close all; % Close all imtool figures if you have the Image Processing Toolbox.
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 16;
% Specify parameters.
numClass1 = 120; % Number of measurements in class 1.
numClass2 = 80; % Number of measurements in class 2.
meanClass1 = 25;
meanClass2 = 75;
spread1 = 25;
spread2 = 25;
% Generate measurements
% Uniform values centered on each class mean, with total width equal to the spread.
class1Values = meanClass1 + spread1 * (rand(numClass1, 1) - 0.5);
class2Values = meanClass2 + spread2 * (rand(numClass2, 1) - 0.5);
% Plot measurements
subplot(2, 2, 1);
plot(class1Values, 'b*', 'MarkerSize', 10, 'LineWidth', 2);
hold on;
plot(class2Values, 'r*', 'MarkerSize', 10, 'LineWidth', 2);
xlabel('Measurement Number', 'FontSize', fontSize);
ylabel('Measurement Value', 'FontSize', fontSize);
title('Measurement Value for Every Patient', 'FontSize', fontSize);
grid on;
legend1 = sprintf('%d in Class 1', numClass1);
legend2 = sprintf('%d in Class 2', numClass2);
legend(legend1, legend2, 'location', 'east');
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0, 0.04, 1, 0.96]);
drawnow;
% Compute the distance from every class 1 point to every class 2 point.
set1 = [zeros(length(class1Values), 1), class1Values];
set2 = [zeros(length(class2Values), 1), class2Values];
distances = pdist2(set1, set2);
subplot(2, 2, 2);
bar(distances);
grid on;
title('Distances between Class 1 Points and Class 2 Points', 'FontSize', fontSize);
xlabel('Pair Number', 'FontSize', fontSize);
ylabel('Distance between pair', 'FontSize', fontSize);
% Show histogram of distances.
subplot(2, 2, 3:4);
histogram(distances);
grid on;
caption = sprintf('Histogram of %d Distances between Class 1 Points and Class 2 Points', numel(distances));
title(caption, 'FontSize', fontSize);
xlabel('Distance', 'FontSize', fontSize);
ylabel('Count', 'FontSize', fontSize);


More Answers (1)

Cris LaPierre on 27 Dec 2018


Your data is not attached, so there is nothing to test, but have you looked into using a table and the functions findgroups and splitapply? See some examples here.
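A minimal sketch of that idea, assuming a made-up table with one test-result column and a label column (the variable names and values are illustrative only): findgroups turns the labels into group numbers, and splitapply applies a function to each group's values.

```matlab
% Made-up table: one test result plus a label (1 = normal, 2 = ill).
cholesterol = [180; 240; 190; 260; 250];
label = [1; 2; 1; 2; 2];
T = table(cholesterol, label);
% G numbers the groups 1, 2, ...; groupLabels lists the label behind each group number.
[G, groupLabels] = findgroups(T.label);
% Apply a function to the values of each group; here the group mean.
groupMeans = splitapply(@mean, T.cholesterol, G);
% Collect each group's values into a cell, since the group sizes differ.
groupValues = splitapply(@(x) {x}, T.cholesterol, G);
```

Here groupValues{1} and groupValues{2} play the role of the label 1 and label 2 vectors for that column.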

1 Comment

Thank you very much.
I attached part of the data (lung1.mat).
In the following code,
I used the pdist2 function to compute the distance between two column vectors using the Jaccard measure.
I typed this at the command line to see the distance result:
pdist2(data(:,2),data(:,2),'jaccard');
but there is an error:
Undefined function or variable 'data'.
I would be grateful for your opinion.
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
rowslabled1=data(:,m)==1;
rowslabled2=data(:,m)==2;
data1=data(rowslabled1,:);
data2=data(rowslabled2,:);
for i=1: m
data1(:,i);
data2(:,i);
d=pdist2(data(:,i),data(:,i),'jaccard');
end
