Trying to average values from specific cells in a similarity matrix

I have a group of 10 vectors representing 10 unique items that I've compared with each other to assess their similarity. Items have been assigned to the same category when their similarity exceeds a threshold. What I have from this process is an upper-triangle similarity matrix that looks something like this, where the top row and left column are the category labels:
      10    20    20    20    20     7     7     7     7    12
10   NaN     0     0     0     0  51.3  50.5  50.4  50.5  76.5
20   NaN   NaN  99.7  99.6  99.3  85.3  86.0  85.9  85.9     0
20   NaN   NaN   NaN  99.5  99.3  85.2  85.8  85.8  85.8     0
20   NaN   NaN   NaN   NaN  99.5  85.4  86.0  86.0  86.0     0
20   NaN   NaN   NaN   NaN   NaN  85.3  85.9  85.9  85.9     0
 7   NaN   NaN   NaN   NaN   NaN   NaN  99.2  99.0  99.2     0
 7   NaN   NaN   NaN   NaN   NaN   NaN   NaN  99.8  99.7     0
 7   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  99.7     0
 7   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN     0
12   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
For my next step, I want to find the average similarity for items that have been placed into a category together, as compared with their similarity to items that do not share their category. That is, I want to average the similarity of the Cat20s (99.7, 99.6, 99.3, 99.5, 99.3, and 99.5) and the Cat7s (99.2, 99.0, 99.2, 99.8, 99.7, and 99.7) so that I can compare it to the similarity values of out-of-category items (0, 0, 0, 0, 51.3, 50.5, 50.4, 50.5, 76.5, 85.3, 86.0, 85.9, 85.9, 0, etc.). What I'm trying to do is assess the effectiveness of the categorization scheme.
I have tried to think through this, but I can't find an approach that I think will work. (I'm pretty new at this, so maybe there is something obvious I haven't thought of.)
Many thanks in advance!

Accepted Answer

Wendi Fellner on 10 Sep 2022
Edited: Wendi Fellner on 10 Sep 2022
I went back to the drawing board and figured out a way to do it. :-) Here's what I came up with. (Thank you dpb for all the time and effort and patience in working on this. I may not have communicated clearly what I was trying to do.)
% Create a logical index for values that are within-category and another
% for those that are between categories
n = size(label_matrix,1);      % label_matrix is square; headers sit in row 1 and column 1
idxwithin = false(n);          % marks cells whose row and column headers match
idxbetween = false(n);         % marks cells whose row and column headers differ
for column = 2:n               % loop across each column header
    for row = 2:n              % loop down each row header
        if label_matrix(1,column) == label_matrix(row,1) % column header equals row header
            idxwithin(row,column) = true;
        else
            idxbetween(row,column) = true;
        end
    end
end
% Find the means of within- and between-category values. Logical indexing
% already returns a vector, so the 'all' option is not needed here.
withinCatMean = mean(label_matrix(idxwithin),'omitnan')   % mean of the within-category values, excluding NaNs
betweenCatMean = mean(label_matrix(idxbetween),'omitnan') % mean of the between-category values, excluding NaNs
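For comparison, the same split can be written without loops. This is only a sketch under stated assumptions: cats and S are hypothetical names for the label vector and the upper-triangle matrix without its header row and column, and the 4x4 values are made up from a corner of the matrix in the question. It uses bsxfun and avoids 'all', so it should not depend on newer syntax.

```matlab
% Hypothetical small example: 4 items in categories 20, 20, 7, 7
cats = [20 20 7 7];
S = [NaN 99.7 85.3 86.0;
     NaN NaN  85.2 85.8;
     NaN NaN  NaN  99.2;
     NaN NaN  NaN  NaN];

% same(i,j) is true where item i and item j carry the same label
same = bsxfun(@eq, cats(:), cats(:).');

% only the upper-triangle cells hold data; everything else is NaN
valid = ~isnan(S);

withinVals  = S(same & valid);    % 99.7 and 99.2
betweenVals = S(~same & valid);   % 85.3, 85.2, 86.0, 85.8

withinCatMean  = mean(withinVals)    % 99.45
betweenCatMean = mean(betweenVals)   % 85.575
```

Filtering with valid up front replaces the 'omitnan' option, at the cost of one extra logical array.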

More Answers (1)

dpb on 3 Sep 2022
Edited: dpb on 5 Sep 2022
Not too bad ... use logical addressing to find the locations, then take the mean with the 'omitnan' argument over the values returned.
Generically, you can write something like the following (augment the array with a NaN in the (1,1) position, or build the CATS array independently as here, depending on how you have the data originally):
CATS=[10 20 20 20 20 7 7 7 7 12].'; % the categories in respective position in array
C=unique(CATS); % the unique categories over which to iterate
%A=A(2:end,2:end); % or A if you don't include the extraneous row/column to begin with
M=zeros(size(C)); % how many means there are possible -- one/category
for i=1:numel(M)
  ixcat=(CATS==C(i)); % get the index into the array column/row -- same since symmetric
  M(i)=mean(A(logical(ixcat.*ixcat.')),'all','omitnan'); % expand vector to logical array, select, compute
end
results in
>> disp([C M])
    7.0000   99.4333
   10.0000       NaN
   12.0000       NaN
   20.0000   99.4833
In this case only two categories have any finite elements, but the above will work in general regardless of the size or the number of rows/columns per category. You can always retain only the finite results at the end.
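To check those numbers against the matrix in the question, a self-contained version might look like the sketch below. The way A is filled row by row is just one convenient choice, and the mask is built with the & operator, which assumes a release with implicit expansion.

```matlab
CATS = [10 20 20 20 20 7 7 7 7 12].';   % category of each row/column

% upper-triangle similarity matrix from the question, NaN elsewhere
A = NaN(10);
A(1,2:10) = [0 0 0 0 51.3 50.5 50.4 50.5 76.5];
A(2,3:10) = [99.7 99.6 99.3 85.3 86.0 85.9 85.9 0];
A(3,4:10) = [99.5 99.3 85.2 85.8 85.8 85.8 0];
A(4,5:10) = [99.5 85.4 86.0 86.0 86.0 0];
A(5,6:10) = [85.3 85.9 85.9 85.9 0];
A(6,7:10) = [99.2 99.0 99.2 0];
A(7,8:10) = [99.8 99.7 0];
A(8,9:10) = [99.7 0];
A(9,10)   = 0;

C = unique(CATS);          % sorted unique categories: 7, 10, 12, 20
M = zeros(size(C));        % one within-category mean per category
for i = 1:numel(C)
    ixcat = (CATS == C(i));            % 10x1 logical
    mask  = ixcat & ixcat.';           % 10x10 logical, true where both labels match
    M(i)  = mean(A(mask), 'omitnan');  % A(mask) is a vector, so 'all' is not needed
end
disp([C M])
```

Categories 10 and 12 each have a single member, so their only masked cell is the NaN on the diagonal and the mean comes out as NaN.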

9 Comments

Thank you, dpb! This looks like it will work. I'm playing with it now and will report back soon!
Thanks, dpb. I appreciate your help on this! I've been trying to make this work, but still having some trouble. I can see that the ixcat line is creating an index for each category based on the for loop, but I need the intersection of when the categories match in both the column and row headers. That is, I don't want to average the entire column. Isn't that what this would do? I also don't think I understand multiplying the index by the index inside the logical. I'm pretty new at this, so maybe it does it correctly, but I'm just not getting it.
I'm also using version 2016b to maintain compatibility with other parts of the program. Is there something similar to the 'all' syntax that I could use?
dpb on 7 Sep 2022
Edited: dpb on 7 Sep 2022
Compare the output of the expression
logical(ixcat.*ixcat.')
to the array and you'll see it is precisely the selection at the intersection of the same values in both directions -- the only presumption is that the categories are the same in both directions, since only the one vector is used for both. The selection is NOT the whole row/column; it's the product, a square logical addressing array the size of the data array with TRUE elements at the specific intersections.
ADDENDUM
Oh. I don't recall when the automatic array expansion was introduced -- the above is the same as matrix multiplication to return a matrix product with recent releases of MATLAB. You MAY need to write the above as
logical(ixcat*ixcat.')
instead to get the matrix multiplication in earlier releases.
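A tiny illustration of what that mask contains, using a made-up four-element CATS vector. The explicit double conversion and the bsxfun variant are assumptions chosen to avoid relying on implicit expansion; both should give the same square mask.

```matlab
CATS  = [10 20 20 7].';        % made-up labels, column vector
ixcat = (CATS == 20);          % 4x1 logical: [0; 1; 1; 0]

% outer product of the index vector with itself, converted back to logical
mask1 = logical(double(ixcat) * double(ixcat).');

% same mask built element-wise with bsxfun
mask2 = bsxfun(@and, ixcat, ixcat.');

disp(mask1)                    % true only at (2,2), (2,3), (3,2), (3,3)
isequal(mask1, mask2)          % the two constructions agree
```

Only the 2x2 block where both the row label and the column label equal 20 is selected, not the whole rows or columns.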
I don't know when the 'all' syntax was introduced; the early MATLAB idiom would be (:) which returns the whole array as a vector and serves thus the same purpose as 'all'. To apply the colon reference, however, requires having a temporary variable; MATLAB doesn't support the syntax to dereference a function return. So, another idiom one will often see, particularly in older code, is the somewhat peculiar-looking
mean(mean(x))
which serves the same purpose since mean is vectorized to return column means from a 2D array, the first call returns a vector; the second then averages the elements of the columns for the overall array average. The above is for 2D array, one has to continue to add terms as the dimensionality of the array increases, of course, which is why the alternate syntax was introduced.
However, if the 'all' syntax isn't supported, the 'omitnan' argument may not be either -- I don't recall (and am too lazy to go back through the release notes to look it up) whether they were introduced at the same time or not. If this is an issue, then there's a (now deprecated) family of special-purpose functions nanXXX for the various statistics, where XXX is mean, std, var, min, max, ..., that older releases can still use.
All these little warts and improvements and that R2016 is now pretty old (as releases go) makes me suggest you should look into seeing if you could update your version to something closer to current.
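A small sketch of the idioms mentioned above, using a made-up 2x3 array x. It is illustrative only: whether 'omitnan' is accepted depends on the release, so a variant that filters NaNs by hand is shown as well. It also shows that nesting mean calls with 'omitnan' need not equal the overall mean when columns contain different numbers of NaNs.

```matlab
x = [1 NaN 3; 4 5 NaN];              % made-up data with NaNs

v = x(:);                            % colon indexing needs a variable
mManual = mean(v(~isnan(v)));        % filter NaNs by hand: (1+3+4+5)/4 = 3.25
mOmit   = mean(x(:), 'omitnan');     % same value, if 'omitnan' is available

% nested means average the column means 2.5, 5 and 3 instead
mNested = mean(mean(x, 'omitnan'));  % 3.5, not 3.25

% without NaNs, and with equally sized columns, the nested form does match
y = magic(3);
abs(mean(mean(y)) - mean(y(:))) < 1e-12
```

For NaN-padded data such as an upper-triangle matrix, flattening to a vector first is therefore the safer of the two idioms.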
I have tried the 2020b version and everything in the script seems to be working until it gets to the M(i) line. I've tried with ixcat.*ixcat and also ixcat*ixcat. I'll post my code below. Perhaps I've not incorporated your code correctly. 's_matrix' is the full similarity matrix where both upper and lower triangles are included and there are no labels along the top row or left column, so the first part of my script creates the 'label_matrix' matrix that removes the lower triangle and adds the category names. Then I use your code to try to extract and average the within-category values. (I'll also need to extract and average the between-category values at some point, but would like to solve this part first and then maybe I'll understand how to do the between-category values.) The code and then the error messages are below. Can you see where I've gone wrong?
% modify the s_matrix to remove the lower triangle and diagonal values to
% eliminate repeats
idx = ones(size(s_matrix)); %generate a matrix of ones the same size as the similarity matrix
idx = logical(triu(idx,1)); %keep only upper triangle and make into 'logical'
s_uptri_matrix = NaN(numSamples); %create a new matrix filled with NaN
s_uptri_matrix(idx) = s_matrix(idx); %create 'upper triangle' matrix with only the upper triangle values from s_matrix
% add DATA.category values to the s_uptri-matrix as row and column headers
cats = [DATA.category];
l_cats = [NaN(1); cats'];
label_matrix = [cats; s_uptri_matrix]; %add row of category numbers from ARTwarp's DATA struct
label_matrix = [l_cats, label_matrix]; %add column of category numbers transposed from ARTwarp's DATA struct
% Identify within-category values
C = unique(cats); %create vector of unique category names
M = zeros(size(C)); %create matrix of 0s that is the same size as C
for i=1:numel(M)
ixcat = (cats == C(i)); %create an index of where the category names equal the 'for loop' counter?
M(i) = mean(label_matrix(logical(ixcat.*ixcat.'), 'all', 'omitnan'));
end
Error when I include the period:
The logical indices in position 1 contain a true value outside of the
array bounds.
Error in sim_matrix_wf (line 42)
M(i) = mean(label_matrix(logical(ixcat.*ixcat.'), 'all',
'omitnan'));
Error when I don't include the period:
Index in position 2 exceeds array bounds (must not exceed 81).
Error in sim_matrix_wf (line 42)
M(i) = mean(label_matrix(logical(ixcat*ixcat.'), 'all', 'omitnan'));
Thanks for your help!
The code will work as written given the assumptions made... I can't see anything more to do without the data to go with it, though.
It's all dependent upon the CATS array matching up to the data array sizes, though -- if they're consistent there can't be an array index out of bounds because the indexing logical vector can't be longer than the size of the array. Again, of course, it also has to be square.
Sorry -- here is the data. Does it work for you?
>> sim_matrix_wf
Error using load
Unable to read file 'ARTwarp095_0.mat'. No such file or directory.
Error in sim_matrix_wf (line 6)
load ARTwarp095_0.mat; %load the .mat file that was generated in the ARTwarp run
>>
So, no...but it also very belligerently clear'ed my workspace....that was rude!
>> whos -file s_matrix.mat
  Name          Size     Bytes  Class     Attributes
  s_matrix      80x80    51200  double
>>
Clearly from the above your CATS array must be wrong -- the data array is 80x80 but you're generating a reference to position 81. Ergo, it must be one element too long to match.
I'm sorry about that!
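Another possible reading of the two error messages, offered as a hedged sketch rather than a confirmed diagnosis: in the M(i) line posted earlier in the comments, 'all' and 'omitnan' sit inside the indexing parentheses of label_matrix, so MATLAB would treat them as subscripts. The character codes of 'all' are 97, 108 and 108, which exceed the 81 rows and columns of a labelled 80x80 matrix and would fit the "must not exceed 81" message. The sketch below moves the closing parenthesis and indexes the unlabelled upper-triangle matrix, so that the mask and the data have the same size. The small cats and s_uptri_matrix values are made up for illustration.

```matlab
cats = [20 20 7 7];                          % made-up row vector of labels
s_uptri_matrix = [NaN 99.7 85.3 86.0;
                  NaN NaN  85.2 85.8;
                  NaN NaN  NaN  99.2;
                  NaN NaN  NaN  NaN];         % made-up upper-triangle data

codes = double('all')                        % 97 108 108: what 'all' means as a subscript

C = unique(cats);                            % 7 and 20
M = zeros(size(C));
for i = 1:numel(C)
    ixcat = (cats == C(i));                  % 1xN logical row
    mask  = bsxfun(@and, ixcat.', ixcat);    % NxN mask, same size as the data
    M(i)  = mean(s_uptri_matrix(mask), 'omitnan');  % options go outside the indexing
end
disp([C.' M.'])
```

Because cats is a row vector here, ixcat*ixcat.' would collapse to a scalar, which is why the mask is built from ixcat.' and ixcat explicitly.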


Version

R2016b
