crossval

Cross-validate machine learning model

    Description

    CVMdl = crossval(Mdl) returns a cross-validated (partitioned) machine learning model (CVMdl) from a trained model (Mdl). By default, crossval uses 10-fold cross-validation on the training data.

    CVMdl = crossval(Mdl,Name,Value) sets an additional cross-validation option. You can specify only one name-value argument. For example, you can specify the number of folds or a holdout sample proportion.

    Examples

    Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

    load ionosphere
    rng(1); % For reproducibility

    Train a support vector machine (SVM) classifier. Standardize the predictor data and specify the order of the classes.

    SVMModel = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});

    SVMModel is a trained ClassificationSVM classifier. 'b' is the negative class and 'g' is the positive class.

    Cross-validate the classifier using 10-fold cross-validation.

    CVSVMModel = crossval(SVMModel)
    CVSVMModel = 
      ClassificationPartitionedModel
        CrossValidatedModel: 'SVM'
             PredictorNames: {'x1'  'x2'  'x3'  'x4'  'x5'  'x6'  'x7'  'x8'  'x9'  'x10'  'x11'  'x12'  'x13'  'x14'  'x15'  'x16'  'x17'  'x18'  'x19'  'x20'  'x21'  'x22'  'x23'  'x24'  'x25'  'x26'  'x27'  'x28'  'x29'  'x30'  'x31'  'x32'  'x33'  'x34'}
               ResponseName: 'Y'
            NumObservations: 351
                      KFold: 10
                  Partition: [1x1 cvpartition]
                 ClassNames: {'b'  'g'}
             ScoreTransform: 'none'
    
    
      Properties, Methods
    
    

    CVSVMModel is a ClassificationPartitionedModel cross-validated classifier. During cross-validation, the software completes these steps:

    1. Randomly partition the data into 10 sets of roughly equal size.

    2. Train an SVM classifier on nine of the sets.

    3. Repeat step 2 k = 10 times. The software leaves out a different set each time and trains on the other nine sets.

    4. Combine generalization statistics for each fold.
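
    The steps above can be approximated by hand with cvpartition. This is a minimal sketch under stated assumptions, not the internal implementation of crossval: it assumes a stratified 10-fold partition, and the variable names cvp, foldErr, and avgErr are hypothetical. Because the random partition differs, the averaged error is not expected to match the value that kfoldLoss reports for CVSVMModel exactly.

```matlab
load ionosphere
rng(1); % For reproducibility

% Step 1: partition the 351 observations into 10 folds (stratified by class).
cvp = cvpartition(Y,'KFold',10);

foldErr = zeros(cvp.NumTestSets,1);
for k = 1:cvp.NumTestSets
    trainIdx = training(cvp,k); % logical index of the nine training folds
    testIdx  = test(cvp,k);     % logical index of the held-out fold

    % Steps 2 and 3: train on nine folds, leaving fold k out.
    mdl = fitcsvm(X(trainIdx,:),Y(trainIdx), ...
        'Standardize',true,'ClassNames',{'b','g'});

    % Misclassification rate on the held-out fold.
    foldErr(k) = loss(mdl,X(testIdx,:),Y(testIdx));
end

% Step 4: combine the per-fold statistics (simple unweighted mean here).
avgErr = mean(foldErr);
```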

    Display the first model in CVSVMModel.Trained.

    FirstModel = CVSVMModel.Trained{1}
    FirstModel = 
      CompactClassificationSVM
                 ResponseName: 'Y'
        CategoricalPredictors: []
                   ClassNames: {'b'  'g'}
               ScoreTransform: 'none'
                        Alpha: [78x1 double]
                         Bias: -0.2209
             KernelParameters: [1x1 struct]
                           Mu: [0.8888 0 0.6320 0.0406 0.5931 0.1205 0.5361 0.1286 0.5083 0.1879 0.4779 0.1567 0.3924 0.0875 0.3360 0.0789 0.3839 9.6066e-05 0.3562 -0.0308 0.3398 -0.0073 0.3590 -0.0628 0.4064 -0.0664 0.5535 -0.0749 0.3835 -0.0295 ... ]
                        Sigma: [0.3149 0 0.5033 0.4441 0.5255 0.4663 0.4987 0.5205 0.5040 0.4780 0.5649 0.4896 0.6293 0.4924 0.6606 0.4535 0.6133 0.4878 0.6250 0.5140 0.6075 0.5150 0.6068 0.5222 0.5729 0.5103 0.5061 0.5478 0.5712 0.5032 0.5639 0.5062 ... ]
               SupportVectors: [78x34 double]
          SupportVectorLabels: [78x1 double]
    
    
      Properties, Methods
    
    

    FirstModel is the first of the 10 trained classifiers. It is a CompactClassificationSVM classifier.

    You can estimate the generalization error by passing CVSVMModel to kfoldLoss.
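
    As a brief illustration of that call, the following sketch retrains the same SVM classifier and requests both the averaged loss and the per-fold losses. The variable names avgErr and foldErr are hypothetical, and the exact values depend on the random partition, so none are shown.

```matlab
load ionosphere
rng(1); % For reproducibility
SVMModel = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});
CVSVMModel = crossval(SVMModel);

% Average misclassification rate over the 10 validation folds.
avgErr = kfoldLoss(CVSVMModel);

% One misclassification rate per fold, returned as a 10-by-1 vector.
foldErr = kfoldLoss(CVSVMModel,'Mode','individual');
```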

    Specify a holdout sample proportion for cross-validation. By default, crossval uses 10-fold cross-validation to cross-validate a naive Bayes classifier. However, you have several other options for cross-validation. For example, you can specify a different number of folds or a holdout sample proportion.

    Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

    load ionosphere

    Remove the first two predictors for stability.

    X = X(:,3:end);
    rng('default'); % For reproducibility

    Train a naive Bayes classifier using the predictors X and class labels Y. A recommended practice is to specify the class names. 'b' is the negative class and 'g' is the positive class. By default, fitcnb assumes that each predictor is normally distributed within each class.

    Mdl = fitcnb(X,Y,'ClassNames',{'b','g'});

    Mdl is a trained ClassificationNaiveBayes classifier.

    Cross-validate the classifier by specifying a 30% holdout sample.

    CVMdl = crossval(Mdl,'Holdout',0.3)
    CVMdl = 
      ClassificationPartitionedModel
        CrossValidatedModel: 'NaiveBayes'
             PredictorNames: {'x1'  'x2'  'x3'  'x4'  'x5'  'x6'  'x7'  'x8'  'x9'  'x10'  'x11'  'x12'  'x13'  'x14'  'x15'  'x16'  'x17'  'x18'  'x19'  'x20'  'x21'  'x22'  'x23'  'x24'  'x25'  'x26'  'x27'  'x28'  'x29'  'x30'  'x31'  'x32'}
               ResponseName: 'Y'
            NumObservations: 351
                      KFold: 1
                  Partition: [1x1 cvpartition]
                 ClassNames: {'b'  'g'}
             ScoreTransform: 'none'
    
    
      Properties, Methods
    
    

    CVMdl is a ClassificationPartitionedModel cross-validated, naive Bayes classifier.

    Display the properties of the classifier trained using 70% of the data.

    TrainedModel = CVMdl.Trained{1}
    TrainedModel = 
      CompactClassificationNaiveBayes
                  ResponseName: 'Y'
         CategoricalPredictors: []
                    ClassNames: {'b'  'g'}
                ScoreTransform: 'none'
             DistributionNames: {1x32 cell}
        DistributionParameters: {2x32 cell}
    
    
      Properties, Methods
    
    

    TrainedModel is a CompactClassificationNaiveBayes classifier.

    Estimate the generalization error by passing CVMdl to kfoldLoss.

    kfoldLoss(CVMdl)
    ans = 0.2095
    

    The out-of-sample misclassification error is approximately 21%.

    Reduce the generalization error by choosing the five most important predictors.

    idx = fscmrmr(X,Y);
    Xnew = X(:,idx(1:5));

    Train a naive Bayes classifier using the new predictors.

    Mdlnew = fitcnb(Xnew,Y,'ClassNames',{'b','g'});

    Cross-validate the new classifier by specifying a 30% holdout sample, and estimate the generalization error.

    CVMdlnew = crossval(Mdlnew,'Holdout',0.3);
    kfoldLoss(CVMdlnew)
    ans = 0.1429
    

    The out-of-sample misclassification error is reduced from approximately 21% to approximately 14%.

    Train a regression generalized additive model (GAM) by using fitrgam, and create a cross-validated GAM by using crossval and the holdout option. Then, use kfoldPredict to predict responses for validation-fold observations using a model trained on training-fold observations.

    Load the patients data set.

    load patients

    Create a table that contains the predictor variables (Age, Diastolic, Smoker, Weight, Gender, SelfAssessedHealthStatus) and the response variable (Systolic).

    tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);

    Train a GAM that contains linear terms for predictors.

    Mdl = fitrgam(tbl,'Systolic');

    Mdl is a RegressionGAM model object.

    Cross-validate the model by specifying a 30% holdout sample.

    rng('default') % For reproducibility
    CVMdl = crossval(Mdl,'Holdout',0.3)
    CVMdl = 
      RegressionPartitionedGAM
           CrossValidatedModel: 'GAM'
                PredictorNames: {'Age'  'Diastolic'  'Smoker'  'Weight'  'Gender'  'SelfAssessedHealthStatus'}
         CategoricalPredictors: [3 5 6]
                  ResponseName: 'Systolic'
               NumObservations: 100
                         KFold: 1
                     Partition: [1x1 cvpartition]
             NumTrainedPerFold: [1x1 struct]
             ResponseTransform: 'none'
        IsStandardDeviationFit: 0
    
    
      Properties, Methods
    
    

    The crossval function creates a RegressionPartitionedGAM model object CVMdl with the holdout option. During cross-validation, the software completes these steps:

    1. Randomly select and reserve 30% of the data as validation data, and train the model using the rest of the data.

    2. Store the compact, trained model in the Trained property of the cross-validated model object RegressionPartitionedGAM.

    You can choose a different cross-validation setting by using the 'CVPartition', 'KFold', or 'Leaveout' name-value argument instead of 'Holdout'.

    Predict responses for the validation-fold observations by using kfoldPredict. The function predicts responses for the validation-fold observations by using the model trained on the training-fold observations. The function assigns NaN to the training-fold observations.

    yFit = kfoldPredict(CVMdl);

    Find the validation-fold observation indexes, and create a table containing the observation index, observed response values, and predicted response values. Display the first eight rows of the table.

    idx = find(~isnan(yFit));
    t = table(idx,tbl.Systolic(idx),yFit(idx), ...
        'VariableNames',{'Observation Index','Observed Value','Predicted Value'});
    head(t)
        Observation Index    Observed Value    Predicted Value
        _________________    ______________    _______________
    
                1                 124              130.22     
                6                 121              124.38     
                7                 130              125.26     
               12                 115              117.05     
               20                 125              121.82     
               22                 123              116.99     
               23                 114                 107     
               24                 128              122.52     
    

    Compute the regression error (mean squared error) for the validation-fold observations.

    L = kfoldLoss(CVMdl)
    L = 43.8715
    

    Input Arguments

    Mdl: Machine learning model, specified as a full regression or classification model object, as given in the following tables of supported models.

    Regression Model Object

    Model                                    Full Regression Model Object
    Gaussian process regression (GPR) model  RegressionGP (If you supply a custom 'ActiveSet' in the call to fitrgp, then you cannot cross-validate the GPR model.)
    Generalized additive model (GAM)         RegressionGAM
    Neural network model                     RegressionNeuralNetwork

    Classification Model Object

    Model                                                            Full Classification Model Object
    Generalized additive model                                       ClassificationGAM
    k-nearest neighbor model                                         ClassificationKNN
    Naive Bayes model                                                ClassificationNaiveBayes
    Neural network model                                             ClassificationNeuralNetwork
    Support vector machine for one-class and binary classification  ClassificationSVM

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: crossval(Mdl,'KFold',3) specifies using three folds in a cross-validated model.

    CVPartition: Cross-validation partition, specified as a cvpartition object created by the cvpartition function. The partition object specifies the type of cross-validation and the indexing for the training and validation sets.

    You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

    Example: Suppose you create a random partition for 5-fold cross-validation on 500 observations by using cvp = cvpartition(500,'KFold',5). Then, you can specify the cross-validated model by using 'CVPartition',cvp.
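
    A short sketch of the 'CVPartition' argument in use follows. It assumes a k-nearest neighbor classifier trained on the ionosphere data purely for illustration; the variable names cvp and err are hypothetical. The partition must be created for the same number of observations as the model.

```matlab
load ionosphere
rng('default') % For reproducibility

% Train a full model (any supported model object works here).
Mdl = fitcknn(X,Y,'NumNeighbors',5);

% Create a nonstratified 5-fold partition over the training observations.
cvp = cvpartition(Mdl.NumObservations,'KFold',5);

% Cross-validate using the explicit partition.
CVMdl = crossval(Mdl,'CVPartition',cvp);
err = kfoldLoss(CVMdl);
```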

    Holdout: Fraction of the data used for holdout validation, specified as a scalar value in the range (0,1). If you specify 'Holdout',p, then the software completes these steps:

    1. Randomly select and reserve p*100% of the data as validation data, and train the model using the rest of the data.

    2. Store the compact, trained model in the Trained property of the cross-validated model. If Mdl does not have a corresponding compact object, then Trained contains a full object.

    You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

    Example: 'Holdout',0.1

    Data Types: double | single
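
    To see the effect of 'Holdout',p on the partition, the following sketch inspects the training and validation set sizes stored in the Partition property. It assumes a naive Bayes classifier on the ionosphere data; nTrain, nValid, and numModels are hypothetical names.

```matlab
load ionosphere
rng('default') % For reproducibility
Mdl = fitcnb(X(:,3:end),Y,'ClassNames',{'b','g'});

% Reserve about 10% of the observations as validation data.
CVMdl = crossval(Mdl,'Holdout',0.1);

nTrain = CVMdl.Partition.TrainSize; % observations used for training
nValid = CVMdl.Partition.TestSize;  % observations reserved for validation
numModels = numel(CVMdl.Trained);   % a holdout partition trains one model
```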

    KFold: Number of folds to use in a cross-validated model, specified as a positive integer value greater than 1. If you specify 'KFold',k, then the software completes these steps:

    1. Randomly partition the data into k sets.

    2. For each set, reserve the set as validation data, and train the model using the other k – 1 sets.

    3. Store the k compact, trained models in a k-by-1 cell vector in the Trained property of the cross-validated model. If Mdl does not have a corresponding compact object, then Trained contains a full object.

    You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

    Example: 'KFold',5

    Data Types: single | double
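
    The following sketch shows how the 'KFold' value is reflected in the cross-validated model. It assumes a k-nearest neighbor classifier on the ionosphere data; trainedSize and foldSizes are hypothetical names.

```matlab
load ionosphere
rng('default') % For reproducibility
Mdl = fitcknn(X,Y,'NumNeighbors',5);

% Use 5 folds instead of the default 10.
CVMdl = crossval(Mdl,'KFold',5);

trainedSize = size(CVMdl.Trained);      % k-by-1 cell vector of models
foldSizes = CVMdl.Partition.TestSize;   % validation-set size of each fold
```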

    Leaveout: Leave-one-out cross-validation flag, specified as 'on' or 'off'. If you specify 'Leaveout','on', then for each of the n observations (where n is the number of observations, excluding missing observations, specified in the NumObservations property of the model), the software completes these steps:

    1. Reserve the one observation as validation data, and train the model using the other n – 1 observations.

    2. Store the n compact, trained models in an n-by-1 cell vector in the Trained property of the cross-validated model. If Mdl does not have a corresponding compact object, then Trained contains a full object.

    You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

    Example: 'Leaveout','on'
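
    A minimal sketch of leave-one-out cross-validation follows, assuming a k-nearest neighbor classifier on the ionosphere data because it is cheap to retrain n times. The names numModels and looErr are hypothetical. For larger data sets or slower learners, this option can take considerably longer than k-fold cross-validation.

```matlab
load ionosphere
Mdl = fitcknn(X,Y,'NumNeighbors',5);

% Train one model per observation, each leaving that observation out.
CVMdl = crossval(Mdl,'Leaveout','on');

numModels = numel(CVMdl.Trained); % one trained model per observation
looErr = kfoldLoss(CVMdl);        % leave-one-out misclassification rate
```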

    Output Arguments

    CVMdl: Cross-validated machine learning model, returned as one of the cross-validated (partitioned) model objects in the following tables, depending on the input model Mdl.

    Regression Model Object

    Model                              Regression Model (Mdl)   Cross-Validated Model (CVMdl)
    Gaussian process regression model  RegressionGP             RegressionPartitionedGP
    Generalized additive model         RegressionGAM            RegressionPartitionedGAM
    Neural network model               RegressionNeuralNetwork  RegressionPartitionedModel

    Classification Model Object

    Model                                                            Classification Model (Mdl)   Cross-Validated Model (CVMdl)
    Generalized additive model                                       ClassificationGAM            ClassificationPartitionedGAM
    k-nearest neighbor model                                         ClassificationKNN            ClassificationPartitionedModel
    Naive Bayes model                                                ClassificationNaiveBayes     ClassificationPartitionedModel
    Neural network model                                             ClassificationNeuralNetwork  ClassificationPartitionedModel
    Support vector machine for one-class and binary classification  ClassificationSVM            ClassificationPartitionedModel

    Tips

    • Assess the predictive performance of Mdl on cross-validated data by using the kfold functions and properties of CVMdl, such as kfoldPredict, kfoldLoss, kfoldMargin, and kfoldEdge for classification and kfoldPredict and kfoldLoss for regression.

    • Return a partitioned classifier with stratified partitioning by using the name-value argument 'KFold' or 'Holdout'.

    • Create a cvpartition object cvp using cvp = cvpartition(n,'KFold',k). Return a partitioned classifier with nonstratified partitioning by using the name-value argument 'CVPartition',cvp.
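
    As an illustration of the first tip, the following sketch calls the kfold functions on a cross-validated SVM classifier. It assumes the ionosphere data and the default 10-fold partition; labels, scores, margins, edgeVal, err, and manualErr are hypothetical names. The manually computed error rate should be close to the kfoldLoss value, although the two are not guaranteed to be identical in general because kfoldLoss applies observation weights.

```matlab
load ionosphere
rng(1); % For reproducibility
Mdl = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});
CVMdl = crossval(Mdl);

% Out-of-fold predictions: each observation is scored by the model
% that did not see it during training.
[labels,scores] = kfoldPredict(CVMdl);

margins = kfoldMargin(CVMdl); % per-observation classification margins
edgeVal = kfoldEdge(CVMdl);   % average margin across observations
err = kfoldLoss(CVMdl);       % cross-validated misclassification rate

% Error rate computed directly from the out-of-fold labels.
manualErr = mean(~strcmp(labels,Y));
```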

    Alternative Functionality

    Instead of training a model and then cross-validating it, you can create a cross-validated model directly by using a fitting function and specifying one of these name-value arguments: 'CrossVal', 'CVPartition', 'Holdout', 'Leaveout', or 'KFold'.
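
    For example, the following sketch passes 'KFold' directly to fitcsvm, so the fitting function returns a cross-validated model without an intermediate full model. It assumes the ionosphere data; err is a hypothetical name.

```matlab
load ionosphere
rng(1); % For reproducibility

% The fitting function returns a partitioned model instead of a full one.
CVMdl = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'},'KFold',5);

err = kfoldLoss(CVMdl);
```

    This route avoids training on the full data set when only the cross-validated estimate is needed. Use crossval instead when you also want to keep the full model Mdl.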

    Extended Capabilities

    Version History

    Introduced in R2012a

    See Also