Using kmedoids with custom distance function with several input variables

I have a matrix X = [XCat XNum] where:
XCat is a matrix made of dummy variables resulting from encoding categorical variables
XNum is a matrix of continuous variables.
I want to apply a clustering algorithm that takes into account the categorical nature of some of the features in X. So I created a custom distance function that uses the Hamming distance for the encoded categorical variables (dummies) and L1 (cityblock) for the continuous variables. This is the function:
function D = MixDistance(XCat,XNum)
% Mixed categorical/numerical distance
% INPUT:
% XCat = matrix nObs x nCat of dummy-encoded categorical features
% XNum = matrix nObs x nNum of numerical features
% OUTPUT:
% D = matrix of distances nObs x nObs
% Number of categorical and numerical features
nCat = size(XCat,2);
nNum = size(XNum,2);
% Compute distances, separately
DCat = pdist2(XCat, XCat, 'hamming');
DNum = pdist2(XNum, XNum, 'cityblock');
% Compute relative weight based on the number of categorical variables
wCat = nCat/(nCat + nNum);
D = wCat*DCat + (1 - wCat)*DNum;
Now, one might be tempted to call kmedoids like this:
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'distance', @MixDistance,'replicates',3);
but of course it doesn't work, since the function MixDistance needs XCat and XNum as inputs, not just X.
Also, because of the way function handles work, this doesn't work either:
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'distance', MixDistance(XCat, XNum),'replicates',3);
Any ideas?
Or alternatively, any idea on clustering when data are mixed, that is BOTH categorical AND continuous?
2 Comments
the cyclist on 5 Feb 2021
Can you upload a sample of the X data in a MAT file, to make it easier for folks to investigate?
Raffaele Zenti on 5 Feb 2021
Yes, sure - already uploaded a sample of this matrix. The first 16 columns are dummies (i.e., XCat), the others are continuous (i.e., XNum).


Accepted Answer

the cyclist on 6 Feb 2021
I think you need to do something like this. First define your MixDistance function as
function D = MixDistance(X,Y)
% Mixed categorical/numerical distance
% INPUT:
% X, Y = matrices of observations (one per row); the first nCat columns
%        are dummy-encoded categorical features, the rest are numerical
% OUTPUT:
% D = matrix of distances size(X,1) x size(Y,1)
% Number of categorical and numerical features
nCat = 16;
nNum = 12;
% Compute distances, separately
DCat = pdist2(X(:,1:nCat), Y(:,1:nCat), 'hamming');
DNum = pdist2(X(:,nCat+1:end), Y(:,nCat+1:end), 'cityblock');
% Compute relative weight based on the number of categorical variables
wCat = nCat/(nCat + nNum);
D = wCat*DCat + (1 - wCat)*DNum;
end
I did two things here. First, I changed it to accept two arguments, as a distance function needs to.
Second, I explicitly defined the number of categorical and numerical columns inside the function. If you don't know those ahead of the function call, you could write some code to figure it out, based on which columns contain only the (0,1) dummy values.
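As a minimal sketch of that column detection (not part of the original answer): the snippet below assumes the dummy columns form one contiguous block at the start of X, as in the uploaded sample. The example matrix is made up for illustration. Note that a continuous column that happens to contain only 0s and 1s would be misclassified as a dummy.

```matlab
% Hypothetical example data: 3 dummy columns followed by 2 continuous columns
X = [1 0 0  0.5  2.3;
     0 1 0  1.7 -0.4;
     0 0 1  3.2  0.0;
     1 0 0 -1.1  4.8];

% A column is treated as a dummy if every entry is exactly 0 or 1
isDummy = all(X == 0 | X == 1, 1);

% Assumption: the dummy columns come first, so the first non-dummy
% column marks the start of the numerical block
nCat = find(~isDummy, 1) - 1;
if isempty(nCat)
    nCat = size(X, 2);   % no non-dummy column found: every column is a dummy
end
nNum = size(X, 2) - nCat;
```

The resulting nCat and nNum could then replace the hard-coded values, provided they are made available to the distance function.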
This function will work when called as
[IDX, C, SUMD, D, MIDX, INFO] = kmedoids(X,3,'Distance', @MixDistance,'Replicates',3);
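If hard-coding nCat inside the function is not desirable, one possible alternative is an anonymous function that captures nCat from the workspace. The following is only a sketch under stated assumptions: the data are random and purely illustrative, and the names mixDist, idx and wCat are made up here. As I read the kmedoids documentation, a custom distance function receives a single observation ZI (1-by-n) and a matrix ZJ (m-by-n) and should return an m-by-1 vector, which is why pdist2 is called with ZJ first.

```matlab
% Hypothetical data: 4 dummy columns followed by 3 continuous columns
rng(0);                                   % reproducible example data
nCat = 4;
X = [double(rand(30, nCat) > 0.5), randn(30, 3)];
nNum = size(X, 2) - nCat;

% Same weighting as in MixDistance
wCat = nCat / (nCat + nNum);

% ZI is one observation (1-by-n), ZJ holds several (m-by-n).
% pdist2(ZJ, ZI, ...) returns an m-by-1 column of distances.
% nCat and wCat are captured when the handle is created.
mixDist = @(ZI, ZJ) ...
    wCat       * pdist2(ZJ(:, 1:nCat),     ZI(:, 1:nCat),     'hamming') + ...
    (1 - wCat) * pdist2(ZJ(:, nCat+1:end), ZI(:, nCat+1:end), 'cityblock');

[idx, C] = kmedoids(X, 3, 'Distance', mixDist, 'Replicates', 3);
```

Because the handle stores the values of nCat and wCat at creation time, it needs to be recreated if the column layout of X changes.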

More Answers (0)
