Decision tree and pruning optimization with imbalanced data

Kai Doenges on 5 Mar 2020
Answered: Ayush Aniket on 4 Jun 2025
Hi,
I am trying to build a decision tree for a data set with imbalanced class probabilities. Furthermore, my data contains both discrete and continuous predictor variables. To account for both, I have opted for the following settings in the fitctree function.
treeIni = fitctree(tr_Input, tr_Output, 'Prior', 'uniform', 'PredictorSelection', 'interaction-curvature');
After building the tree with the training set data (70%), I prune the tree by evaluating the loss as a function of the pruning level for both the test (ts) and the training (tr) set.
%PRUNE THE INITIAL TREE
%Explore training and test error rates varying the number of nodes
[lTR,~,~] = loss(treeIni,tr, c_VariableAExplicar{:,:}, 'Subtrees', 'all', 'LossFun','classiferror');
[lTS,~,~] = loss(treeIni,ts, c_VariableAExplicar{:,:}, 'Subtrees', 'all', 'LossFun','classiferror');
To find the best pruning level, I plot the error rates:
f=figure(i_fig);
hold on;
plot(0:max(treeIni.PruneList), lTR,'.-');
plot(0:max(treeIni.PruneList), lTS,'.-');
set(gca,'Xdir','reverse');
xlabel('Pruning level (1 node - full tree)'); grid on;
legend('TR', 'TS'); ylabel('Error rate');
The following graphs show pruning level 13 to be reasonable for both the training and the test set, yielding an error rate of ~6.7%.
After pruning the tree
treeOpt = prune(treeIni,'Level', optPrunLevel);
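As a side note, the pruning level does not have to be read off the plot. The sketch below is only an illustration on synthetic data (the data, the variable names Xtr/Xts, and the class sizes are assumptions, not the data from this post). It uses the fourth output of loss together with 'TreeSize','se', which returns the highest pruning level whose loss lies within one standard error of the minimum.

```matlab
% Synthetic, imbalanced two-class data (illustrative assumption only).
rng(0);                                    % reproducible random numbers
nNeg = 900; nPos = 100;
X = [randn(nNeg,2); randn(nPos,2) + 1.5];  % two continuous predictors
Y = [zeros(nNeg,1); ones(nPos,1)];         % 0 = negative, 1 = positive

% Stratified 70/30 hold-out split.
cv  = cvpartition(Y, 'HoldOut', 0.3);
Xtr = X(training(cv),:);  Ytr = Y(training(cv));
Xts = X(test(cv),:);      Yts = Y(test(cv));

% Grow the full tree; the pruning sequence is computed by default.
treeIni = fitctree(Xtr, Ytr, 'Prior', 'uniform');

% Loss for every pruning level, plus the level suggested by the 1-SE rule.
[lTS, seTS, nLeaf, bestLevel] = loss(treeIni, Xts, Yts, ...
    'Subtrees', 'all', 'LossFun', 'classiferror', 'TreeSize', 'se');

% Prune to the suggested level (entry k+1 of nLeaf belongs to level k).
treeOpt = prune(treeIni, 'Level', bestLevel);
fprintf('Suggested pruning level: %d (%d leaves)\n', ...
    bestLevel, nLeaf(bestLevel + 1));
```

Passing 'TreeSize','min' instead would return the level with the smallest loss, which usually gives a larger tree.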
I plot the confusion matrix for the training and the test set.
As you can see, I am getting good results in identifying the few occurrences of the positive class (1); however, the tree performs very poorly on the negative class (0). This is not the case as a percentage of all negative occurrences, but it is relative to the few positive occurrences (see red circles).
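To make this distinction concrete, here is a small worked example with purely hypothetical counts (they are not taken from the confusion matrices in this post). It shows how a low false positive rate can still mean that most predicted positives are wrong when the positive class is rare.

```matlab
% Hypothetical confusion matrix, rows = true class, columns = predicted
% class, class order [0 1]. The counts are made up for illustration.
C = [900 100;    % true 0: 900 predicted as 0, 100 predicted as 1
       5  45];   % true 1:   5 predicted as 0,  45 predicted as 1

recall1    = C(2,2) / sum(C(2,:));  % 45/50    share of true positives found
fpr        = C(1,2) / sum(C(1,:));  % 100/1000 share of negatives misclassified
precision1 = C(2,2) / sum(C(:,2));  % 45/145   share of predicted positives that are correct

fprintf('Recall (class 1):    %.2f\n', recall1);     % prints 0.90
fprintf('False positive rate: %.2f\n', fpr);         % prints 0.10
fprintf('Precision (class 1): %.2f\n', precision1);  % prints 0.31
```

With these made-up numbers, only 10% of the negatives are misclassified, yet the false positives outnumber the true positives by more than two to one.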
What am I doing wrong?
I have tried defining different misclassification costs, which does balance the results, but obviously I then get lower accuracy in identifying the positive class, which is not desired at all. Additionally, I have tried oversampling, but it doesn't improve the tree either.
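For reference, the trade-off described here can be explored by sweeping the cost of a missed positive. This is a minimal sketch on synthetic data; the data, the cost values, and the variable names are assumptions chosen for illustration. In the 'Cost' matrix, entry (i,j) is the cost of predicting class j when the true class is i, in the order given by 'ClassNames'.

```matlab
% Synthetic, imbalanced two-class data (illustrative assumption only).
rng(2);
nNeg = 900; nPos = 100;
X = [randn(nNeg,2); randn(nPos,2) + 1.5];
Y = [zeros(nNeg,1); ones(nPos,1)];

cv  = cvpartition(Y, 'HoldOut', 0.3);      % stratified 70/30 split
Xtr = X(training(cv),:);  Ytr = Y(training(cv));
Xts = X(test(cv),:);      Yts = Y(test(cv));

% Sweep the (hypothetical) cost of a false negative; false positives cost 1.
for costFN = [1 2 5 10]
    costMat = [0 1; costFN 0];             % rows: true class, columns: predicted
    tree = fitctree(Xtr, Ytr, 'ClassNames', [0 1], 'Cost', costMat);

    C = confusionmat(Yts, predict(tree, Xts), 'Order', [0 1]);
    recall    = C(2,2) / sum(C(2,:));          % positives that were found
    precision = C(2,2) / max(sum(C(:,2)), 1);  % guard against 0 predicted positives

    fprintf('costFN = %2d: recall = %.2f, precision = %.2f\n', ...
        costFN, recall, precision);
end
```

Typically, recall of the positive class rises and its precision falls as costFN grows, so the cost ratio acts as a dial between the two kinds of error rather than removing either.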
Do I need to change any setting in the tree function or choose a different loss function to improve pruning?
Thanks in advance for any help!

Answers (1)

Ayush Aniket on 4 Jun 2025
Since your dataset is imbalanced, the tree is biased toward the majority class. Even though you tried oversampling, decision trees can still struggle with minority classes if the features don’t provide strong separability. Here are a few suggestions that you could try:
1. Instead of uniform priors ('Prior', 'uniform'), try setting class-specific priors to emphasize the minority class. By default, fitctree uses 'empirical', which determines the class probabilities from the class frequencies in the response variable. Refer to the following documentation section to read more about this argument: https://www.mathworks.com/help/stats/fitctree.html#bt6cr7t_sep_shared-Prior
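2. You are using 'classiferror' as the loss function, which considers only the misclassification rate. When evaluating the subtrees, try a cost-aware loss such as 'mincost' (the default for loss), so that any misclassification costs you define are reflected in the pruning curve. Note that Gini's diversity index ('gdi') and cross-entropy ('deviance') are not 'LossFun' values; they are options of the 'SplitCriterion' argument of fitctree and control how the tree is grown.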
3. If class 0 and class 1 have overlapping feature distributions, the tree might struggle to separate them. Try adding interaction terms or transforming features to improve separability. Refer to the Classification Learner documentation for this process: https://www.mathworks.com/help/stats/feature-selection-and-feature-transformation.html
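4. A single decision tree might not be the best choice for imbalanced data. Consider a bagged ensemble of trees (a random forest, via TreeBagger or fitcensemble with 'Method','Bag') or boosted trees (fitcensemble with a boosting method such as 'AdaBoostM1' or 'RUSBoost', the latter being designed for imbalanced classes) for better performance.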
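The suggestions above could be tried roughly as follows. This is only a sketch on synthetic data: the data, the prior vector [0.7 0.3], the tree depth, and the ensemble settings are assumptions chosen for illustration, not tuned recommendations.

```matlab
% Synthetic, imbalanced two-class data (illustrative assumption only).
rng(1);
nNeg = 950; nPos = 50;
X = [randn(nNeg,2); randn(nPos,2) + 1.5];
Y = [zeros(nNeg,1); ones(nPos,1)];

cv  = cvpartition(Y, 'HoldOut', 0.3);      % stratified 70/30 split
Xtr = X(training(cv),:);  Ytr = Y(training(cv));
Xts = X(test(cv),:);      Yts = Y(test(cv));

% Single tree with explicit priors (between empirical and uniform) and
% the cross-entropy split criterion. Priors follow the ClassNames order.
treePrior = fitctree(Xtr, Ytr, 'ClassNames', [0 1], ...
    'Prior', [0.7 0.3], 'SplitCriterion', 'deviance');

% Boosted ensemble with random undersampling of the majority class.
t   = templateTree('MaxNumSplits', 20);    % shallow weak learners
ens = fitcensemble(Xtr, Ytr, 'Method', 'RUSBoost', ...
    'NumLearningCycles', 100, 'Learners', t, 'LearnRate', 0.1);

% Compare test-set confusion matrices (rows: true, columns: predicted).
Ctree = confusionmat(Yts, predict(treePrior, Xts), 'Order', [0 1]);
Cens  = confusionmat(Yts, predict(ens, Xts),       'Order', [0 1]);
disp('Single tree with custom priors:');  disp(Ctree);
disp('RUSBoost ensemble:');               disp(Cens);
```

Comparing the two matrices on a held-out set shows whether the ensemble reduces the false positives without giving up too many true positives on your own data.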
