Algorithms for imbalanced multi-class classification in MATLAB?

Hi,
I have been browsing for quite a while, both through the state of the art and through the statistical packages available, and I am having some difficulty finding usable algorithms. I notice some implementations for the imbalanced problem have already been posted for MATLAB, but they focus on the imbalanced two-class case. For my situation the field is drier still: most, if not all, of the approaches I came across in academia did not release their implementations.
My data has two rare classes and three other classes that can be considered majority classes.
Thank you,
Carlos

Accepted Answer

Ilya
Ilya on 13 Oct 2012

4 votes

I described approaches for learning on imbalanced data here: http://www.mathworks.com/matlabcentral/answers/11549-leraning-classification-with-most-training-samples-in-one-category This advice applies to any number of classes.
If you have the Statistics Toolbox in R2012b, I recommend the RUSBoost algorithm, available through the fitensemble function. It is described here http://www.mathworks.com/help/stats/ensemble-methods.html#btfwpd3 and an example is shown here http://www.mathworks.com/help/stats/ensemble-methods.html#btgw1m1
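For orientation, a minimal sketch of such a call in R2012b might look like this. The data and every tuning value below are placeholders invented for illustration, not taken from this thread:

```matlab
% Synthetic stand-in data: 5 classes, two of them rare (placeholder values).
rng(1);
X = randn(500, 4);
Y = randsample(5, 500, true, [0.35 0.3 0.25 0.05 0.05]);

% RUSBoost over 300 boosted decision trees (R2012b-era template syntax).
t = ClassificationTree.template('MinLeaf', 5);
ens = fitensemble(X, Y, 'RUSBoost', 300, t, 'LearnRate', 0.1);
labels = predict(ens, X);   % predicted class for each row of X
```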

5 comments

Carlos Paradis
Carlos Paradis on 13 Oct 2012
Edited: Carlos Paradis on 13 Oct 2012
Dear Ilya,
Thank you. It really brings me relief and joy to finally come across a well-documented solution with an example. Praise MathWorks for having such good documentation writers; that is one of the main reasons it was love at first sight for me.
I saw your earlier answer on that post, but I am having some difficulty seeing how to generalize the approach to the multi-class problem. I understand that RUSBoost is an undersampling solution, that is, it removes points from the majority class. I also noticed that in your post you mentioned this paper on RUSBoost:
Seiffert, C., Khoshgoftaar, T., Hulse, J.V., and Napolitano, A. (2008) Rusboost: Improving classification performance when training data is skewed, in International Conference on Pattern Recognition, pp. 1–4
Is that the paper that describes the RUSBoost algorithm in MATLAB in detail? I saw the page you pointed me to, but it seems either generic about boosting or very specific in the formulas.
I also have three other questions if you don't mind so I can use the approach knowing what I am doing:
The example you pointed me to shows more than one value that is considerably underrepresented (4 to 7), although 4 is more so. I didn't notice any parameter in the tutorial referring to the value 4. Does that mean the ensemble methods take the class distributions into account to better predict them? If so, then the tutorial addresses the multi-class problem for me as well!
For the third question: both the tutorial and your post suggest, as an alternative, using the class probabilities when fitting the model, if I understood correctly. Would assigning a cost to one of my two rare classes mean I am adding the assumption that my data is not representative of the real world? Or does that effect only occur when I change the probabilities in the multi-class situation? I do have two rare classes, and both are representative; it is just that one is very expensive when it occurs, while for the other no one cares. Still, it is there, and I wouldn't want my classifier to treat either as noise or as unrepresentative of the real world. They are both representative; the cost of one is simply higher than the other.
Lastly, I noticed this approach uses training sets. Since I do not have that many data points, I saw that MATLAB has stratified k-fold classes for imbalanced data. Is it possible to combine this with the newly released boosting algorithm? If so, is there a tutorial or doc function describing its usage?
I do not own R2012b, but I don't mind acquiring it to solve the problem. My university currently has R2012a.
Thank you!
I'll make sure to pass your joy to the doc writer who worked on that page.
RUSBoost undersamples the majority class(es) for every weak learner in the ensemble (most usually a decision tree). For example, if the majority class has 10 times as many observations as the minority class, it is undersampled at a 1/10 rate. If the ensemble has, say, 100 trees, every observation in the majority class is used 100/10 = 10 times by the ensemble on average. Every observation in the minority class is used 100 times, once for every tree.
The MATLAB implementation follows the paper by Seiffert et al. If you are not certain about a specific detail, post your question to Answers or call our Tech Support.
Take a look at the doc for the fitensemble function: http://www.mathworks.com/help/stats/fitensemble.html If you scroll down the somewhat lengthy list of input arguments, you will come to the description of the RatioToSmallest parameter. By default, fitensemble counts the number of observations in the smallest class and samples that many observations from every class.
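As a sketch of how that parameter is passed (the data and all values here are illustrative placeholders, not from the thread):

```matlab
% Placeholder data: 5 classes; the two rare classes are the smallest.
rng(1);
X = randn(500, 4);
Y = randsample(5, 500, true, [0.35 0.3 0.25 0.05 0.05]);

% Default behavior samples as many observations as the smallest class has.
% A scalar RatioToSmallest of 2 samples twice that many from every class:
ens = fitensemble(X, Y, 'RUSBoost', 300, ClassificationTree.template, ...
    'RatioToSmallest', 2);

% A vector gives one ratio per class, ordered as in ens.ClassNames:
ens2 = fitensemble(X, Y, 'RUSBoost', 300, ClassificationTree.template, ...
    'RatioToSmallest', [2 2 2 1 1]);
```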
Assigning a large misclassification cost to a class tells fitensemble that misclassifying this class is penalized more heavily than misclassifying other classes, nothing less and nothing more. This shifts the decision boundaries away from this class toward the other classes, so fewer observations of this class and more observations of the other classes are misclassified.
It's OK to skew your data, making it unrepresentative of the real world, if that gives you a better confusion matrix for the classes you care to classify correctly. If you assign a uniform prior, the accuracy for the rare classes will likely improve and the accuracy for the popular classes will likely go down.
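Both knobs discussed above are arguments of fitensemble. A sketch, with placeholder data and an arbitrary penalty factor chosen only for illustration:

```matlab
% Placeholder data: class 4 plays the role of the rare, "expensive" class.
rng(1);
X = randn(500, 4);
Y = randsample(5, 500, true, [0.35 0.3 0.25 0.05 0.05]);

% Cost(i,j) = cost of predicting class j when the true class is i.
C = ones(5) - eye(5);        % default: every misclassification costs 1
C(4, :) = 10;  C(4, 4) = 0;  % errors on class 4 cost 10 times more
ensCost = fitensemble(X, Y, 'RUSBoost', 300, ...
    ClassificationTree.template, 'cost', C);

% Alternatively, a uniform prior upweights the rare classes:
ensPrior = fitensemble(X, Y, 'RUSBoost', 300, ...
    ClassificationTree.template, 'prior', 'uniform');
```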
The page for fitensemble describes cross-validation parameters you can pass to this function. In addition, every object returned by fitensemble has crossval method. For classification, cross-validation is stratified by default.
I typed 'cross-validate ensemble' in the online doc search box, and the 2nd hit was this page http://www.mathworks.com/help/stats/classificationensemble.crossval.html There is a short example at the bottom. Does this suffice?
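The two routes mentioned above (the crossval method and the cross-validation arguments of fitensemble) can be sketched as follows, again with placeholder data:

```matlab
% Placeholder data and ensemble, purely for illustration.
rng(1);
X = randn(500, 4);
Y = randsample(5, 500, true, [0.35 0.3 0.25 0.05 0.05]);
ens = fitensemble(X, Y, 'RUSBoost', 100, ClassificationTree.template);

% Cross-validate; for classification the partition is stratified by default.
cvens = crossval(ens, 'kfold', 5);
err = kfoldLoss(cvens);      % cross-validated misclassification rate

% Equivalently, pass the cross-validation option to fitensemble directly:
cvens2 = fitensemble(X, Y, 'RUSBoost', 100, ...
    ClassificationTree.template, 'kfold', 5);
```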
Carlos Paradis
Carlos Paradis on 14 Oct 2012
Edited: Carlos Paradis on 14 Oct 2012
Hi Ilya,
Yes, those were all great answers; thanks for covering all my questions. Please forward the compliment, it is well deserved. I learn a lot from these answers about things that are sometimes overcomplicated in my textbooks.
I did not know cross-validation in classification was stratified by default; this just makes me happier :-)
I have one more question and one last concern.
The last question concerns the paper you pointed me to. The paper's abstract refers to binary imbalanced problems. To my understanding, you and the documentation of the method suggest it can be used with two classes or more. From your comment, I understood that this is what some authors have been calling a one-versus-all approach. By one versus all, I mean that the algorithm creates a weak classifier that only sees two classes: one class is treated as positive, and all the remaining classes are grouped together and treated as negative. So we would have k classifiers, where k is the number of classes in the dataset I want to predict, and the final class label would be judged based on the agreement of all the weak classifiers of the ensemble. Is that so? I just want to make sure I am following how MATLAB extended the binary classification to multi-class classification, and whether it is an approach I have already seen (but only in theory).
I have one last concern, with respect to licensing; since it relates to this problem, I will post it here. I hope that is not a problem:
I have a package associated with my institution (Stevens Institute of Technology), which is currently on MATLAB R2012a. What is my best option to obtain this algorithm? Is it possible to buy only a toolbox and plug it into my institution's MATLAB R2012a, or, since this is a student version, must I buy a completely separate R2012b license? In any case, what minimum licenses would I need to run RUSBoost: MATLAB R2012b plus the Statistics Toolbox? And lastly, is it possible to run a trial version of this algorithm to see how it behaves with our datasets, if requested by an institution or a professor from academia?
Thank you,
Carlos
RUSBoost uses the AdaBoost.M2 algorithm underneath. This is a multiclass algorithm proposed by Freund and Schapire. It is not reducible to a one-vs-all strategy. I don't remember a published reference off the top of my head, but a Google search finds this http://users.eecs.northwestern.edu/~yingwu/teaching/EECS510/Reading/Freund_ICML96.pdf. An observation is assigned to the class with the largest score.
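The last point can be illustrated at prediction time: the ensemble's predict method returns one score column per class, and the returned label corresponds to the largest score. A sketch with placeholder data:

```matlab
% Placeholder data: 3 classes, one rare.
rng(1);
X = randn(300, 4);
Y = randsample(3, 300, true, [0.6 0.3 0.1]);
ens = fitensemble(X, Y, 'RUSBoost', 100, ClassificationTree.template);

[label, score] = predict(ens, X);   % score is n-by-K, one column per class
[~, idx] = max(score, [], 2);       % column index of the largest score
% idx indexes into ens.ClassNames, matching the returned label.
```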
You need Statistics Tlbx R2012b. For licensing and trial questions, please call our customer support.
Hello Sir, assume you have three different classes (1, 2, 3). The first class contains two samples, the second contains one, and the third contains one. From each sample you extract two values (average and median) of the color, for example. That gives you: class 1: (15, 20); class 1: (16, 21); class 2: (18, 22); class 3: (22, 24). In MATLAB, we build a training matrix with two columns and four rows containing (15, 20; 16, 21; 18, 22; 22, 24), and a label matrix of a single column containing (1; 1; 2; 3). We then run SVM training with svmtrain from LibSVM. The parameters I gave as an example correspond to the RBF kernel; the gamma value and c vary (c between 10 and 100,000). Please, can you help me execute this scenario in MATLAB using LibSVM? Thanks
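A rough sketch of the scenario described in this comment, assuming the LibSVM MATLAB interface (svmtrain/svmpredict) is compiled and on the path; the -c and -g values are placeholders to be tuned, not recommendations:

```matlab
% Training matrix: one row per sample, columns = (average, median) of color.
trainX = [15 20; 16 21; 18 22; 22 24];
trainY = [1; 1; 2; 3];               % label matrix, one column

% -t 2 selects the RBF kernel; -c and -g are placeholder values
% (the comment suggests varying c between 10 and 100,000).
model = svmtrain(trainY, trainX, '-t 2 -c 100 -g 0.5');

% Predict back on the training data as a sanity check.
predicted = svmpredict(trainY, trainX, model);
```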


More Answers (1)

Walter Roberson
Walter Roberson on 13 Oct 2012

0 votes

Usually multi-class problems are handled by doing pairwise discrimination: class 1 vs. everything else, to pull out class 1; then take the "everything else" and run it against class 2 to get class 2 and a new "everything else"; and so on.
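A rough sketch of that peel-off scheme, where any binary classifier could play the discriminator role (a classification tree is used here purely as a stand-in, and the data are placeholders):

```matlab
% Placeholder data with numeric class labels 1..K.
rng(1);
X = randn(400, 3);
Y = randsample(3, 400, true, [0.5 0.3 0.2]);

classes = unique(Y);
models = cell(numel(classes) - 1, 1);
remaining = true(size(Y));
for k = 1:numel(classes) - 1
    % Discriminate class k against everything still remaining.
    yk = double(Y(remaining) == classes(k));
    models{k} = ClassificationTree.fit(X(remaining, :), yk);
    % Peel off class k and repeat on the rest.
    remaining = remaining & (Y ~= classes(k));
end
% At prediction time, apply the models in order; the first that claims an
% observation assigns its class, and leftovers get the final class.
```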
You can find algorithms for true multi-class SVM (for example), but the papers warn that they are computationally very expensive even for just 3 classes.

1 comment

Hi Walter,
Thank you for your reply. By pairwise, are you referring to what they call the one-versus-all approach? I found some papers on it, especially in combination with AdaBoost and ensemble methods, but I only found one implementation, in R. That implementation requires splitting the data, while I find MATLAB's stratified k-fold more appropriate for validation in such a case. Could you point out any MATLAB implementation that already takes the ensemble method into account? The ones I have found so far do not address it as a multi-class problem.
Thank you,
Carlos


Asked:

on 13 Oct 2012

Commented:

on 11 Jan 2017
