When growing a classification tree, finding an optimal binary split for a categorical predictor with many levels is significantly more computationally challenging than finding a split for a continuous predictor. For a continuous predictor, a tree can split halfway between any two adjacent unique values of this predictor.
In contrast, to find an exact optimal binary split for a categorical predictor with L levels, a classification tree needs to consider 2L–1–1 splits. To obtain this formula, observe that you can assign L distinct values to the left and right nodes in 2L ways. Two out of these 2L configurations leave either the left or right node empty, and therefore should be discarded. Now, divide by 2 because left and right can be swapped.
For regression and binary classification problems, with K = 2
response classes, there is a computational shortcut . The tree can order the categories by mean response (for regression) or class
probability for one of the classes (for classification). Then, the optimal split is
one of the L – 1 splits for the ordered list. When
K = 2,
fitctree always uses an exact search.
Therefore, computational challenges really only arise when growing classification
trees for data with K ≥ 3 classes. To reduce computation, there
are several heuristic algorithms for finding a good split. When using
fitctree to grow a classification tree, you can choose an algorithm
for splitting categorical predictors using the
AlgorithmForCategorical name-value pair argument. You can also set
this algorithm when creating a classification
If you do not specify an algorithm,
fitctree splits categorical predictors using the exact search
algorithm, provided the predictor has at most
levels (the default is 10 levels, and, depending on your platform, you cannot
perform an exact search on categorical predictors with more than 32 or 64 levels).
fitctree chooses a good inexact search
algorithm based on the number of classes and levels.
The available heuristic algorithms are: pull left by purity, a principal component-based partitioning, and one versus all by class.
This algorithm starts with all L categorical levels on the right branch. Inspect the K categories that have the largest class probabilities for each class. Move the category with the maximum value of the split criterion to the left branch. Continue moving categories from right to left, recording the split criterion at each move, until the right child has only one category remaining. Out of this sequence, the chosen split is the one that maximizes the split criterion.
Select this pull left by purity algorithm by using the
'AlgorithmForCategorial','PullLeft' name-value pair in
This algorithm was developed by Coppersmith, Hong, and Hosking . It finds a close-to-optimal binary partition of the L predictor levels by searching for a separating hyperplane that is perpendicular to the first principal component of the weighted covariance matrix of the centered class probability matrix.
The algorithm assigns a score to each of the L categories, computed as the inner product between the found principal component and the vector of class probabilities for that category. Then, the chosen split is the one of the L – 1 splits of the scores that maximizes the split criterion.
Select this principal component-based partitioning by using the
'AlgorithmForCategorical','PCA' name-value pair in
This algorithm starts with all L categorical levels on the right branch. For each of the K classes, order the categories based on their probability for that class.
For the first class, move each category to the left branch in order, recording the split criterion at each move. Repeat for the remaining classes. Out of this sequence, the chosen split is the one that maximizes the split criterion.
Select this one versus all by class algorithm by using the
'AlgorithmForCategorial','OVAbyClass' name-value pair in
 Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, Boca Raton, 1984.
 Coppersmith, D., S. J. Hong, and J. R. M. Hosking. “Partitioning Nominal Attributes in Decision Trees.” Data Mining and Knowledge Discovery, Vol. 3, 1999, pp. 197–217.