Improving Discriminant Analysis Models

Deal with Singular Data

Discriminant analysis needs enough data to fit Gaussian models with invertible covariance matrices. If your data cannot determine such a model uniquely, fitcdiscr fails. This section shows methods for handling these failures.

Tip

To obtain a discriminant analysis classifier without failure, set the DiscrimType name-value pair to 'pseudoLinear' or 'pseudoQuadratic' in fitcdiscr.

“Pseudo” discriminants never fail, because they use the pseudoinverse of the covariance matrix Σk (see pinv).
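As a rough illustration of why the pseudoinverse avoids the failure, the following sketch applies pinv to a small singular matrix. The matrix S is made up for this illustration and does not come from any data set on this page.

```matlab
% A singular 2-by-2 "covariance" matrix: the second variable has zero variance.
S = [4 0; 0 0];

% The ordinary inverse does not exist for S, but the pseudoinverse is
% well defined. It inverts the nonzero part and leaves the zero part at zero.
P = pinv(S);
disp(P)

% Defining property of the pseudoinverse: S*P*S reproduces S.
disp(norm(S*P*S - S))
```

Because pinv returns a finite matrix even when S has no inverse, a classifier that relies on it can always be constructed.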

Example: Singular Covariance Matrix

When the covariance matrix of the fitted classifier is singular, fitcdiscr can fail:

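load popcorn              % sample data set; defines the matrix popcorn
X = popcorn(:,[1 2]);     % first two columns as predictors
X(:,3) = 0;               % add a zero-variance column
Y = popcorn(:,3);         % third column as the class labels
ppcrn = fitcdiscr(X,Y);   % fails with the error shown next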

Error using ClassificationDiscriminant (line 635)
Predictor x3 has zero variance. Either exclude this predictor or set 'discrimType' to
'pseudoLinear' or 'diagLinear'.

Error in classreg.learning.FitTemplate/fit (line 243)
obj = this.MakeFitObject(X,Y,W,this.ModelParameters,fitArgs{:});

Error in fitcdiscr (line 296)
this = fit(temp,X,Y);

To proceed with linear discriminant analysis, use a pseudoLinear or diagLinear discriminant type:

ppcrn = fitcdiscr(X,Y,...
'discrimType','pseudoLinear');
meanpredict = predict(ppcrn,mean(X))

meanpredict =
3.5000
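The error message also suggests excluding the offending predictor. The following is a minimal sketch of that alternative. It assumes the Fisher iris data with an artificial constant column appended, and the variable names keep and mdl are chosen for this illustration only.

```matlab
load fisheriris                        % defines meas (150-by-4) and species
X = [meas, zeros(size(meas,1),1)];     % append a zero-variance column for illustration

keep = var(X) > 0;                     % logical mask of predictors that vary at all
mdl = fitcdiscr(X(:,keep),species);    % fit on the remaining predictors only

disp(find(~keep))                      % index of the excluded predictor
```

This check looks only at the overall variance of each column. A predictor that varies overall but is constant within every class can still produce a singular covariance estimate, so the pseudo or diagonal types remain the more general remedy.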

Choose a Discriminant Type

There are six types of discriminant analysis classifiers: linear and quadratic, with diagonal and pseudo variants of each type.

Tip

To see if your covariance matrix is singular, set discrimType to 'linear' or 'quadratic'. If the matrix is singular, fitcdiscr fails for 'quadratic', and the Gamma property of the fitted classifier is nonzero for 'linear'.
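A minimal sketch of the 'linear' check described in this tip, assuming the Fisher iris data (whose covariance estimate is not expected to be singular); the variable name mdl is illustrative:

```matlab
load fisheriris                                         % defines meas and species
mdl = fitcdiscr(meas,species,'DiscrimType','linear');   % linear fit

% A nonzero Gamma indicates that the covariance estimate had to be
% regularized to make it invertible.
if mdl.Gamma > 0
    disp('Gamma is nonzero: the covariance matrix is likely singular.')
else
    disp('Gamma is zero: no regularization was needed.')
end
```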

Choose a classifier type by setting the discrimType name-value pair to one of:

• 'linear' (default) — Estimate one covariance matrix for all classes.

• 'quadratic' — Estimate one covariance matrix for each class.

• 'diagLinear' — Use the diagonal of the 'linear' covariance matrix, and use its pseudoinverse if necessary.

• 'diagQuadratic' — Use the diagonals of the 'quadratic' covariance matrices, and use their pseudoinverses if necessary.

• 'pseudoLinear' — Use the pseudoinverse of the 'linear' covariance matrix if necessary.

• 'pseudoQuadratic' — Use the pseudoinverses of the 'quadratic' covariance matrices if necessary.

fitcdiscr can fail for the 'linear' and 'quadratic' classifiers. When it fails, it returns an explanation, as shown in Deal with Singular Data.

fitcdiscr always succeeds with the diagonal and pseudo variants. For information about pseudoinverses, see pinv.

You can set the discriminant type using dot notation after constructing a classifier, where 'discrimType' stands for one of the six type names listed above:

obj.DiscrimType = 'discrimType'

You can change between linear types or between quadratic types, but cannot change between a linear and a quadratic type.
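For instance, the following sketch switches a default linear classifier to another linear type. It assumes the Fisher iris data, and the variable name mdl is illustrative.

```matlab
load fisheriris                    % defines meas and species
mdl = fitcdiscr(meas,species);     % default type is 'linear'

mdl.DiscrimType = 'diagLinear';    % linear to another linear type
disp(mdl.DiscrimType)
```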

Examine the Resubstitution Error and Confusion Matrix

The resubstitution error measures how much the predictions that the classifier makes on the training inputs differ from the training responses. In the examples below, it is the fraction of training observations that the classifier misclassifies. If the resubstitution error is high, you cannot expect the predictions of the classifier to be good. However, having low resubstitution error does not guarantee good predictions for new data. Resubstitution error is often an overly optimistic estimate of the predictive error on new data.

The confusion matrix shows how many errors, and which types, arise in resubstitution. When there are K classes, the confusion matrix R is a K-by-K matrix with

R(i,j) = the number of observations of class i that the classifier predicts to be of class j.

Example: Resubstitution Error of a Discriminant Analysis Classifier

Examine the resubstitution error of the default discriminant analysis classifier for the Fisher iris data:

load fisheriris                  % defines meas (predictors) and species (labels)
obj = fitcdiscr(meas,species);
resuberror = resubLoss(obj)

resuberror =
0.0200

The resubstitution error is very low, meaning obj classifies nearly all the Fisher iris data correctly. The total number of misclassifications is:

resuberror * obj.NumObservations

ans =
3.0000

To see the details of the three misclassifications, examine the confusion matrix:

R = confusionmat(obj.Y,resubPredict(obj))

R =
50     0     0
0    48     2
0     1    49

obj.ClassNames

ans =
'setosa'
'versicolor'
'virginica'

• R(1,:) = [50 0 0] means obj classifies all 50 setosa irises correctly.

• R(2,:) = [0 48 2] means obj classifies 48 versicolor irises correctly, and misclassifies two versicolor irises as virginica.

• R(3,:) = [0 1 49] means obj classifies 49 virginica irises correctly, and misclassifies one virginica iris as versicolor.
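The link between the confusion matrix and the resubstitution error can be checked by hand. The following sketch types in the matrix R shown above rather than recomputing it; the names numWrong and errRate are illustrative.

```matlab
% Confusion matrix from the example above (rows: true class, columns: predicted).
R = [50  0  0;
      0 48  2;
      0  1 49];

numWrong = sum(R(:)) - trace(R);    % off-diagonal entries are the misclassifications
errRate  = numWrong / sum(R(:));    % fraction of all observations misclassified

fprintf('%d misclassified, error rate %.4f\n', numWrong, errRate)
```

The three off-diagonal observations out of 150 give the same 0.02 that resubLoss reports.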

Cross Validation

Typically, discriminant analysis classifiers are robust and do not exhibit overtraining when the number of predictors is much less than the number of observations. Nevertheless, it is good practice to cross validate your classifier to ensure its stability.

Cross Validating a Discriminant Analysis Classifier

This example shows how to perform five-fold cross validation of a quadratic discriminant analysis classifier.

Create a quadratic discriminant analysis classifier for the data.

Find the resubstitution error of the classifier.

qerror = 0.0200
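The commands for these two steps do not appear above. One plausible reconstruction follows. It assumes the Fisher iris data used elsewhere on this page, and the variable name quadisc is hypothetical.

```matlab
load fisheriris                                               % defines meas and species
quadisc = fitcdiscr(meas,species,'DiscrimType','quadratic');  % quadratic classifier
qerror = resubLoss(quadisc)                                   % resubstitution error
```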

The classifier does an excellent job. Nevertheless, resubstitution error can be an optimistic estimate of the error when classifying new data. So proceed to cross validation.

Create a cross-validation model.

Find the cross-validation loss for the model, meaning the error of the out-of-fold observations.

cverror = kfoldLoss(cvmodel)
cverror = 0.0200

The cross-validated loss is as low as the original resubstitution loss. Therefore, you can have confidence that the classifier is reasonably accurate.
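The cross-validation steps above can be sketched as follows. This is an illustration under assumptions: it uses the Fisher iris data, a hypothetical variable name quadisc, and an arbitrary random seed so that the fold assignment is repeatable. The exact loss depends on how the observations are split into folds.

```matlab
load fisheriris                           % defines meas and species
rng(1)                                    % arbitrary seed; fixes the fold assignment

quadisc = fitcdiscr(meas,species,'DiscrimType','quadratic');
cvmodel = crossval(quadisc,'KFold',5);    % five-fold partitioned model

cverror    = kfoldLoss(cvmodel);                      % average out-of-fold loss
foldErrors = kfoldLoss(cvmodel,'Mode','individual');  % one loss value per fold

disp(cverror)
disp(foldErrors')
```

Looking at the per-fold losses as well as their average gives a sense of how much the estimate varies from one split to another.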

Change Costs and Priors

Sometimes you want to avoid certain misclassification errors more than others. For example, it might be better to have oversensitive cancer detection instead of undersensitive cancer detection. Oversensitive detection gives more false positives (unnecessary testing or treatment). Undersensitive detection gives more false negatives (preventable illnesses or deaths). The consequences of underdetection can be high. Therefore, you might want to set costs to reflect the consequences.

Similarly, the training data Y can have a distribution of classes that does not represent their true frequency. If you have a better estimate of the true frequency, you can include this knowledge in the Prior property of the classifier.
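Costs and priors both enter the prediction rule. As a sketch, assuming the usual minimum-expected-cost rule for discriminant analysis, let P(k) be the prior probability of class k, P(x|k) the fitted Gaussian density of class k, and C(y|k) the cost of predicting class y when the true class is k (the entry Cost(k,y)). The predicted class for an observation x is then:

```latex
\hat{y} = \arg\min_{y \in \{1,\dots,K\}} \sum_{k=1}^{K} \hat{P}(k \mid x)\, C(y \mid k),
\qquad
\hat{P}(k \mid x) = \frac{P(x \mid k)\, P(k)}{\sum_{j=1}^{K} P(x \mid j)\, P(j)}
```

Raising C(y|k) makes the prediction y less attractive for observations that are likely to belong to class k, and raising P(k) increases the posterior probability of class k for every observation.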

Example: Setting Custom Misclassification Costs

Consider the Fisher iris data. Suppose that the cost of classifying a versicolor iris as virginica is 10 times as large as the cost of any other classification error. Create a classifier from the data, incorporate this cost, and then view the resulting classifier.

1. Load the Fisher iris data and create a default (linear) classifier as in Example: Resubstitution Error of a Discriminant Analysis Classifier:

load fisheriris                  % defines meas (predictors) and species (labels)
obj = fitcdiscr(meas,species);
resuberror = resubLoss(obj)

resuberror =
0.0200

R = confusionmat(obj.Y,resubPredict(obj))

R =
50     0     0
0    48     2
0     1    49

obj.ClassNames

ans =
'setosa'
'versicolor'
'virginica'

R(2,:) = [0 48 2] means obj classifies 48 versicolor irises correctly, and misclassifies two versicolor irises as virginica.

2. Change the cost matrix to make fewer mistakes in classifying versicolor irises as virginica:

obj.Cost(2,3) = 10;
R2 = confusionmat(obj.Y,resubPredict(obj))

R2 =
50     0     0
0    50     0
0     7    43

obj now classifies all versicolor irises correctly, at the expense of increasing the number of misclassifications of virginica irises from 1 to 7.
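You can also supply the cost matrix when you create the classifier, instead of changing it afterward. The following is a minimal sketch using the 'Cost' and 'ClassNames' name-value pairs of fitcdiscr. The variable names C, classNames, and mdl are illustrative, and no particular confusion matrix is claimed for the result.

```matlab
load fisheriris                                     % defines meas and species
classNames = {'setosa','versicolor','virginica'};   % fixes the row and column order of C

C = ones(3) - eye(3);     % cost 0 for correct predictions, 1 for every error
C(2,3) = 10;              % true versicolor (row 2) predicted as virginica (column 3)

mdl = fitcdiscr(meas,species,'ClassNames',classNames,'Cost',C);
R = confusionmat(mdl.Y,resubPredict(mdl))
```

Passing 'ClassNames' explicitly makes it clear which class each row and column of the cost matrix refers to.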

Example: Setting Alternative Priors

Consider the Fisher iris data. There are 50 irises of each kind in the data. Suppose that, in a particular region, you have historical data that shows virginica are five times as prevalent as the other kinds. Create a classifier that incorporates this information.

1. Load the Fisher iris data and make a default (linear) classifier as in Example: Resubstitution Error of a Discriminant Analysis Classifier:

load fisheriris                  % defines meas (predictors) and species (labels)
obj = fitcdiscr(meas,species);
resuberror = resubLoss(obj)

resuberror =
0.0200

R = confusionmat(obj.Y,resubPredict(obj))

R =
50     0     0
0    48     2
0     1    49

obj.ClassNames

ans =
'setosa'
'versicolor'
'virginica'

R(3,:) = [0 1 49] means obj classifies 49 virginica irises correctly, and misclassifies one virginica iris as versicolor.

2. Change the prior to match your historical data, and examine the confusion matrix of the new classifier:

obj.Prior = [1 1 5];
R2 = confusionmat(obj.Y,resubPredict(obj))

R2 =
50     0     0
0    46     4
0     0    50

The new classifier classifies all virginica irises correctly, at the expense of increasing the number of misclassifications of versicolor irises from 2 to 4.
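The vector [1 1 5] gives relative frequencies rather than probabilities. The following sketch, with an illustrative variable name mdl, displays the value the classifier stores after the assignment. It assumes that the software rescales the vector to sum to 1, which would give [1 1 5]/7.

```matlab
load fisheriris                  % defines meas and species
mdl = fitcdiscr(meas,species);   % default (linear) classifier

mdl.Prior = [1 1 5];             % relative class frequencies; need not sum to 1
disp(mdl.Prior)                  % stored prior, expected to sum to 1
disp([1 1 5]/sum([1 1 5]))       % the same vector normalized by hand, for comparison
```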