Why is the calculated Rsquare different between the embedded fit function and the EzyFit function (from File Exchange)?
John D'Errico on 22 May 2023
Edited: John D'Errico on 22 May 2023
Do you understand that R^2 is not valid, when computed for a model with no constant term? Instead, I recall there are variations of R^2 that are more valid for models with no constant term.
Did you notice that R^2 for one of those computations was negative? That clearly suggests a problem. That your model has NO constant term also suggests why there is a problem.
That your data has one outlier in it, one that would heavily influence the fit, is another problem. I won't get into that at all.
Simple one number measures like R^2 are a problem. They are a far larger problem if you don't understand what they are telling you. And if you are worried about R^2 at all, on a problem with no constant term, then you don't understand R^2.
Let me spend some time writing and explaining...
x1 = [
y1 = [
Essentially, R^2 is a very simple measure that compares the variability left over after your fit to the variability in the data itself, IF we had essentially no model at all. We can get the latter from the variance.
Note that the variance SUBTRACTS OFF THE MEAN OF THE DATA. It implicitly assumes the model for this process is a constant model, with Gaussian noise added. So the implicit model in that variance computation is just
y = a0 + noise
And we can recover the best least squares estimate of a0 from the mean.
a0 = mean(y1)
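To see why the mean is the least squares estimate for the constant model, here is a minimal sketch. The vector y is made-up data for illustration only, not the data from the question. It fits y = a0 + noise with backslash, using a column of ones as the design matrix, and also checks that the sum of squares about the mean is just (n-1) times the variance.

```matlab
% Hypothetical data, for illustration only (NOT the data from the question)
y = [10; 11; 9; 10.5; 10];
n = numel(y);

% Least squares fit of the constant model y = a0 + noise.
% The design matrix is a single column of ones.
a0_ls = ones(n,1) \ y;

% The same estimate, taken directly from the mean
a0_mean = mean(y);

% Sum of squares about the mean, computed two ways
SSbase    = sum((y - a0_mean).^2);
SSfromvar = (n - 1)*var(y);

fprintf('a0 (backslash) = %g, a0 (mean) = %g\n', a0_ls, a0_mean);
fprintf('SSbase = %g, (n-1)*var(y) = %g\n', SSbase, SSfromvar);
```

The two estimates of a0 agree to within floating point error, as do the two sums of squares.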
But what did you fit? You tried to fit a model that lacks a constant term.
y = a1*x
We can get that directly from backslash, or we can use fit. Since we will do these computations essentially by hand, I'll use backslash.
format long g
a1 = x1\y1
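As a small sketch of what backslash is doing for this one-parameter model, again on made-up data (x and y below are hypothetical, not the vectors from the question): for column vectors, x\y returns the least squares slope, which for a single coefficient reduces to sum(x.*y)/sum(x.^2).

```matlab
% Hypothetical data, for illustration only (NOT the data from the question)
x = [1; 2; 3; 4; 5];
y = [10; 11; 9; 10.5; 10];

% Least squares slope for the no-constant model y = a1*x, via backslash
a1_backslash = x \ y;

% The same slope from the normal equations for a single coefficient
a1_closed = sum(x.*y) / sum(x.^2);

fprintf('a1 (backslash) = %.15g\n', a1_backslash);
fprintf('a1 (closed form) = %.15g\n', a1_closed);
```

Both lines should print the same slope, up to floating point error.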
I can also use fit though, just to convince you that backslash was correct.
mdl = fittype('a1*x1','indep','x1');
[fittedmdl,G] = fit(x1,y1,mdl)
So the same value. fit should actually be a little less accurate here, because fit uses an iterative procedure, so any slop lies in the convergence tolerance.
You will notice that fit returns a NEGATIVE R^2. Again, that should be a hint. NO CONSTANT TERM.
Now, what does R^2 tell us? R^2 compares how well the current model does in terms of reducing the variability in the data, compared to no model more complex than assuming the process is simply a constant one, plus noise.
SSbase = sum((y1 - mean(y1)).^2)
SSmdl = sum((y1 - a1*x1).^2)
As you can see, the base sum of squares, where I subtracted off only the mean, is SMALLER than the sum of squares when I subtracted off the estimate from this model. The R^2 computation is now a simple one.
R2bad = 1 - SSmdl/SSbase
So again, a negative R^2 tells us that your model does not fit the data better than if you had just used a constant approximation for the process.
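The same effect is easy to reproduce. The sketch below uses made-up data (hypothetical, not the data from the question) that hovers around a constant level with no real trend. A line forced through the origin then does worse than the plain mean, and the classical R^2 goes negative.

```matlab
% Hypothetical data, for illustration only (NOT the data from the question).
% y sits near 10 regardless of x, so a line through the origin fits poorly.
x = [1; 2; 3; 4; 5];
y = [10; 11; 9; 10.5; 10];

% Fit the no-constant model y = a1*x
a1 = x \ y;

% Sum of squares about the mean (the constant model)
SSbase = sum((y - mean(y)).^2);

% Sum of squares of the residuals from the no-constant model
SSmdl = sum((y - a1*x).^2);

% Classical R^2: negative whenever SSmdl exceeds SSbase
R2 = 1 - SSmdl/SSbase;

fprintf('SSbase = %g, SSmdl = %g, R^2 = %g\n', SSbase, SSmdl, R2);
```

Nothing in the no-constant fit forces SSmdl to be smaller than SSbase, which is why this version of R^2 has no lower bound of zero for such a model.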
Finally, we might consider if a better measure (for THIS process) is how well the fit reduces the simple sum of squares of your data, had we not subtracted off the mean.
R2nocon = 1 - SSmdl/sum(y1.^2)
This assumes that the default model for the process is
y = noise
with a presumed mean of zero. And that is probably what ezyfit computed, since it apparently knew the model has no constant term.
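A minimal sketch of that uncentered measure, on made-up data (hypothetical, not the data from the question, and not a claim about how ezyfit is implemented). For a least squares fit of y = a1*x, the residuals are orthogonal to x, so sum(y.^2) splits exactly into the residual sum of squares plus the sum of squares of the fitted values. That is why this version always lands between 0 and 1 for a no-constant model.

```matlab
% Hypothetical data, for illustration only (NOT the data from the question)
x = [1; 2; 3; 4; 5];
y = [10; 11; 9; 10.5; 10];

% Fit the no-constant model y = a1*x
a1 = x \ y;

% Residual sum of squares for the fitted model
SSmdl = sum((y - a1*x).^2);

% Baseline for the "y = noise" model: raw sum of squares, no mean removed
SSzero = sum(y.^2);

% Sum of squares explained by the fitted values
SSfit = sum((a1*x).^2);

% Uncentered R^2
R2nocon = 1 - SSmdl/SSzero;

fprintf('SSzero = %g, SSmdl + SSfit = %g\n', SSzero, SSmdl + SSfit);
fprintf('Uncentered R^2 = %g\n', R2nocon);
```

On these numbers the uncentered R^2 comes out at roughly 0.8, even though the classical R^2 for the same fit is negative. Same fit, same residuals, only the baseline changed.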
Finally, I'll plot the various models.
The blue horizontal line is a model of the process where there is no signal at all, just random noise. In fact, that actually fits the data better in terms of explaining the sum of squares of errors, compared to the linear fit with no constant term.
In the end, mono-numerosis is a BAD thing. NEVER RELY ON A SIMPLE NUMBER TO TELL YOU IF YOUR MODEL IS ANY GOOD, certainly not if you don't understand the number in the first place. I would even go further, to tell you to rely on your eyes and your brain, NOT on any number. If the fit looks right, it is right. At the very least, think about what you are doing. Is the fit adequate for what you need?