Finding R squared in a loop

6 Ansichten (letzte 30 Tage)
Madlab
Madlab am 30 Okt. 2018
Kommentiert: Rik am 1 Nov. 2018
I am trying to extract out the various R squared values for a data set, with the actual data and the modelled data. However, I am having trouble finding R squared values with the various period value. I am getting negative R squared values, which is definitely not right. Specifically, I am not sure how to get the matrix right.
I get this warning "Warning: Matrix is singular, close to singular or badly scaled. Results may be inaccurate. RCOND = NaN. "
I have attached my code and a sample file.
for period = 0:0.5:1
% Assigning the data provided, sealevel to "d"
d = sealevel{1};
t = timesea{1};
% Finding the size of the data d
[Rdata , Cdata] = size(d);
% Creating figure for data plot
figure
% Plotting the observed data d against time
plot(t,d,'gX')
% holding the current plot and all axis properties
hold on
% Creating third variable for equation
thirdco = cos((2*pi/period) * t);
% Creating fourth variable for equation
fourthco = sin((2*pi/period) * t);
% Creating the design matrix, "G"
% G contains the variables in the equation
same = ones(Rdata,1);
G = [t same thirdco fourthco];
% Create the matrix "Parameters"
% Parameters is a vector containing the model's parameters
% Equation for finding Parameters is derived from [d] = [G] * [m]
Parameters = inv(G'*G) * G'* d;
% Assigning "v","constant","inphase" and "inquad" from calculated Parameters
v = Parameters(1);
constant = Parameters(2);
inphase = Parameters(3);
inquad = Parameters(4);
%%Finding the periodicdata output
% Since the equation is known, can assign "periodicdata", which is the output [d] as the parameters have been
% determined. Can use matrix multiplication of "G * Parameters" since
% they correspond to the coefficients and variables.
[periodicdata] = G * Parameters;
% Calculate amplitude
Amp = sqrt(inphase^2 + inquad^2);
% Calculate phase shift
phaseshf = atan(inquad/inphase);
% Finding R squared "coeff_deterloop"
Rsquared = 1 - sum((periodicdata - d).^2) / sum((mean(periodicdata) - periodicdata).^2)
end

Akzeptierte Antwort

John D'Errico
John D'Errico am 1 Nov. 2018
Bearbeitet: John D'Errico am 1 Nov. 2018
There are several issues here.
1. Using this form is a bad idea:
Parameters = inv(G'*G) * G'* d;
Sigh. It gets used over, and over again. And people still teach others to do it. Instead, use this:
Parameters = G\d;
MATLAB has the ability to solve a least squares problem with a non-square matrix, and do so efficiently. You need not turn it into a square matrix so you can use inv on it. That is a BAD idea numerically.
When you use the form that you did, it squares the condition number of the problem. So even if your problem is only moderately poorly conditioned to start with, squaring that now makes it impossible to solve. You get numerical garbage.
2. There are at least two circumstances where you can get a negative R^2. One is if you have no constant term in the model. As I recall, one way to describe what R^2 tells you, is how well your model performs, when compared to a model that has only a constant term. That is, if the default model was just to take the mean of your data and use that for prediction, then does your chosen model work better?
https://en.wikipedia.org/wiki/Coefficient_of_determination
So, as long as your model includes a constant term, then all such models must in theory be better predictors than a model that uses only the mean of your data, because that is in effect a constant term.
If you read the link I show above, you will find a form for R^2 as:
R^2 = 1 - ss_res/ss_tot
That is, ss_tot is the sum of squares that results from effectively treating your model as the mean, a constant term. ss_res is the sum of squares that results from your model. So if ss_tot is LESS than ss_res, that result will be NEGATIVE.
I'll give an example where it will be clear that you can easily get a negative R^2.
n = 100;
x = rand(n,1);
y = rand(n,1);
mu_y = mean(y); % a model that assumes a constant result.
% a model with a constant term has a column of 1's in it.
my_model = [x,x.^2]\y;
ss_tot = sum((y - mu_y).^2);
ss_res = sum(([x,x.^2]*my_model - y).^2);
R2 = 1 - ss_res/ss_tot
R2 =
-0.44063
So, I VERY carefully chose data for which the only good predictor is the constant y=0.5. No polynomial signal should exist at all.
However, my model was:
y = a1*x + a2*x^2
Clearly that will never fit better than the model that just presumed a constant term. As well, it has no constant term in it. Therefore? we get a negative R^2.
Having said all that, that it is easy to get a negative R^2, your model has a constant term in it.
That tells me, along with your comment that the model yields a singular matrix, that the second reason why it came out negative, was because your model is numerical GARBAGE. The matrix is singular. So you cannot possibly expect reasonable results from it. The computation will be pure garbage, therefore? Negative R^2 can result.
3. Next, what are you doing? Why is your model producing garbage in the first place? Lets look at what you did:
for period = 0:0.5:1
% Assigning the data provided, sealevel to "d"
d = sealevel{1};
t = timesea{1};
% Finding the size of the data d
[Rdata , Cdata] = size(d);
...
% Creating third variable for equation
thirdco = cos((2*pi/period) * t);
What happens when period == 0??????
Garbage. What happens when you divide by zero? inf results. Your matrix will be singular. Your model will be worthless. Your R^2 will probably be negative. Lots of bad things will happen.
I won't go much further on this, because we now know that bad things are happening.

Weitere Antworten (2)

Rik
Rik am 30 Okt. 2018
Just a quick note: a negative R-squared is actually possible. It just means your fit is so bad it is even worse than using the mean as your fit.
As for your actual question: you should not explicitly use the inv function to invert your matrix (so the line with Parameters = inv(G'*G) * G'* d;). This is mentioned in the doc of inv as well: link.
You should use mldivide instead (or \)
  1 Kommentar
Madlab
Madlab am 1 Nov. 2018
I am trying to use the methods as recommended in "inv" to dismiss the warning sign, but I am having trouble doing so. I can't seem to find a way to inverse G'*G using backslash

Melden Sie sich an, um zu kommentieren.


Madlab
Madlab am 31 Okt. 2018
Rik, thank you for your input. However, I did not use mldivide as I am not solving for a system of linear equation.
Further, I am having trouble saving all the R squared values. Since
period = 0:0.5:1,
saved_R = zeros(length(0:1/12:2),1);
saved_R (period,1) = Rsquared
However, this cannot work as period is not an integer. Is there any way for the period to vary in the loop? From the saved R values, I would use find to get the highest R squared value. Then I need to find the period which corresponds to that R squared value.
  7 Kommentare
Rik
Rik am 1 Nov. 2018
You should have used that index variable in the for declaration. Then you can use that to index both index the period vector and the output.
Rik
Rik am 1 Nov. 2018
So your code should become this:
period = 0:1/12:0.5;
calcper=cell(1,numel(time));
Rvalue=cell(1,numel(time));
% Find best period for different places
for k=1:numel(time)
saved_R =zeros(size(period));
% Set range of period
for periodindex =1:numel(period)
[Rsquared] = functionRsq(time{k},sealevel{k},period(periodindex),alltimesea{k});
saved_R(periodindex)=Rsquared;
end
% Finding max R squared value
[~,indexmaxR] = max(saved_R);
calcper{k} = period(indexmaxR);
Rvalue{k} = saved_R(indexmaxR);
end
Your code would have returned several errors. A nice thing to use only during debugging is clearvars. That way you can make sure you actually only use variables that you intended, instead of having old variable names hanging around.

Melden Sie sich an, um zu kommentieren.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by