To get the RMSE results on validation data, a set of k-fold cross-validation models is needed. In the example provided, 50-fold cross-validation was used in Regression Learner. When running this model training in Regression Learner, 51 models were trained: 1 model for each cross-validation fold, plus a final model trained on all of the training data. When a model is exported from Regression Learner in R2021b, only the final model is exported. This is highlighted in a note at the top of this page: https://www.mathworks.com/help/stats/export-regression-model-to-predict-new-data.html At a high level, there are two approaches:
(1) Use the "Export Model" option from the Regression Learner, then write code to calculate the validation RMSE
(2) Use the "Generate Function" option of Regression Learner. This generates a MATLAB function which trains the final model and calculates the validation RMSE.
(1) Use the "Export Model" option from the Regression Learner, then write code to calculate the validation RMSE
For approach (1): After exporting the final model from the Regression Learner app as "trainedModel", one can get the validation RMSE with the code shown below.
% Cross-validate the exported final model: this trains 50 new fold models
CVMdl = crossval(trainedModel.RegressionGP, 'KFold', 50);
% Out-of-fold predictions for every observation in the training table
Y_validation = kfoldPredict(CVMdl);
% Validation RMSE: compare out-of-fold predictions to the true response
rmse_on_validation_data = sqrt(mean((Y_validation - tbl_training.Y).^2));
Note that the "crossval" function will do 50-fold cross-validation, since we specified 'KFold' as 50. This means that 50 models will be trained and stored in the resulting data structure. The "crossval" function randomly partitions the training data into 50 parts, then trains 50 models, one per fold. For example, the first model could be trained on folds 2-50, so it can be tested on fold 1; the second model could be trained on folds 1 and 3-50, so it can be tested on fold 2, and so on. The crossval function accesses the original training data stored inside the trainedModel.RegressionGP data structure. For more info, see https://www.mathworks.com/help/stats/classreg.learning.partition.regressionpartitionedmodel-class.html
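You can verify this structure by inspecting the cross-validation object at the command line. A minimal sketch (Trained and Partition are documented properties of the RegressionPartitionedModel class linked above):
numel(CVMdl.Trained)                        % 50 compact models, one per fold
CVMdl.Partition                             % cvpartition object describing the folds
fold1_test_idx = test(CVMdl.Partition, 1);  % observations held out from fold model 1
sum(fold1_test_idx)                         % number of observations tested by fold model 1
Here is some code to plot the validation predictions versus the true response: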
% Scatter plot of validation predictions vs true response
% (the marker color is passed as the fourth argument; scatter objects do not have a 'Color' property)
scatter(tbl_training.Y, Y_validation, 15, [0 0.4470 0.7410], 'filled');
% Reference line where prediction equals truth
line([-1.75, 1.75], [-1.75, 1.75], 'Color', 'k');
axis([-1.75 1.75 -1.75 1.75]);
xlabel('True response'); ylabel('Predicted response using kfold validation models');
title('On validation data, Predicted response vs True response');
subtitle(sprintf('RMSE of kfold validation models on validation data: %0.5f', rmse_on_validation_data));
legend("Observations", "Perfect prediction", "Location", "southeast");
This leads to the following figure:
So, the above plot is what you are looking for. A few notes:
(1) The RMSE on validation data (0.29623) is slightly different from what you see in Regression Learner (0.29645) because the data was randomly re-partitioned into 50 folds at the command line by the crossval function, so the 50 cross-validation models are slightly different from those used inside Regression Learner (see the short sketch below).
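If you want the re-partition in note (1) to be repeatable from run to run, you can fix the random seed before calling crossval. A minimal sketch (the seed value is arbitrary, and the result still will not exactly match the partition Regression Learner used internally):
rng(0);  % fix the global random stream so the partition is repeatable
CVMdl = crossval(trainedModel.RegressionGP, 'KFold', 50);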
(2) The RMSE on the training data is much lower (0.18386), because testing the final model on the training data is "cheating": the model has already seen the data it is being asked to predict. That is, in this case, the same data is being used for both training and testing. A similar calculation and plot can be done using the final model on the training data:
% Predict on the training data with the final (exported) model
Y_training = trainedModel.predictFcn(tbl_training);
rmse_on_training_data = sqrt(mean((Y_training - tbl_training.Y).^2));
% Scatter plot of training predictions vs true response
scatter(tbl_training.Y, Y_training, 15, [0 0.4470 0.7410], 'filled');
line([-1.75, 1.75], [-1.75, 1.75], 'Color', 'k');
axis([-1.75 1.75 -1.75 1.75]);
xlabel('True response'); ylabel('Predicted response using final model');
title('On training data, Predicted response vs True response');
subtitle(sprintf('RMSE of final model on training data: %0.5f', rmse_on_training_data));
legend("Observations", "Perfect prediction", "Location", "southeast");
This leads to the following plot for the training data:
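As an aside, the honest counterpart to the training-data number is an evaluation on data the model has never seen. A minimal sketch, assuming a hypothetical held-out table tbl_test with the same variables (X1-X6 and Y) that was never given to Regression Learner:
% Evaluate the final model on genuinely unseen data (tbl_test is hypothetical)
Y_test = trainedModel.predictFcn(tbl_test);
rmse_on_test_data = sqrt(mean((Y_test - tbl_test.Y).^2));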
(2) Use the "Generate Function" option of Regression Learner. This generates a MATLAB function which trains the final model and calculates the validation RMSE.
Another way to reproduce the validation RMSE result is to use the "Generate Function" option from the Regression Learner app. The data tip indicates that this option will "Generate MATLAB code for training the currently selected model in the Models pane, including validation predictions."
So, just select the "Generate Function" option in the export area:
This outputs the following code. Notice that the last three statements of the generated function calculate the validationRMSE in a way similar to that provided in the first part of this answer. For more info, see https://www.mathworks.com/help/stats/export-regression-model-to-predict-new-data.html#bvi2d8a-49. (Note: if you use PCA or feature selection in the Regression Learner app, then the generated code for calculating the validation RMSE will be much longer, so in that case it is especially helpful to have this code auto-generated by the Regression Learner app.)
function [trainedModel, validationRMSE] = trainRegressionModel(trainingData)
% Extract predictors and response
inputTable = trainingData;
predictorNames = {'X1', 'X2', 'X3', 'X4', 'X5', 'X6'};
predictors = inputTable(:, predictorNames);
response = inputTable.Y;
isCategoricalPredictor = [false, false, false, false, false, false];

% Train a regression model
% (note: the function generated by the app may pass additional name-value
% options to fitrgp beyond the two shown in this excerpt)
regressionGP = fitrgp(...
    predictors, ...
    response, ...
    'BasisFunction', 'constant', ...
    'KernelFunction', 'exponential');
% Create the result struct with a predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
gpPredictFcn = @(x) predict(regressionGP, x);
trainedModel.predictFcn = @(x) gpPredictFcn(predictorExtractionFcn(x));
trainedModel.RequiredVariables = {'X1', 'X2', 'X3', 'X4', 'X5', 'X6'};
trainedModel.RegressionGP = regressionGP;
trainedModel.About = 'This struct is a trained model exported from Regression Learner R2021b.';
trainedModel.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appregression_exportmodeltoworkspace'')">How to predict using an exported model</a>.');

% Extract predictors and response again for the cross-validation step
inputTable = trainingData;
predictorNames = {'X1', 'X2', 'X3', 'X4', 'X5', 'X6'};
predictors = inputTable(:, predictorNames);
isCategoricalPredictor = [false, false, false, false, false, false];
% Perform cross-validation: 50 fold models, matching the app's setting
partitionedModel = crossval(trainedModel.RegressionGP, 'KFold', 50);
% Compute out-of-fold validation predictions
validationPredictions = kfoldPredict(partitionedModel);
% Compute validation RMSE from the mean squared error across folds
validationRMSE = sqrt(kfoldLoss(partitionedModel, 'LossFun', 'mse'));
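Once the generated function is saved (e.g., as trainRegressionModel.m), it can be called from the command line to retrain the model and reproduce the validation RMSE. A short usage sketch, assuming the original training table is named tbl_training as above:
[trainedModel, validationRMSE] = trainRegressionModel(tbl_training);
fprintf('Validation RMSE: %0.5f\n', validationRMSE);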