fitlm returns p-values equal to NaN without z-scoring

107 views (last 30 days)
Manuela
Manuela on 31 Dec 2025 at 10:08
Edited: dpb on 2 Jan 2026 at 17:31
I cannot understand why fitlm returns p-values equal to NaN when I use the variables without z-scoring, whereas this does not happen with the z-scored variables. Below is the code I used:
medians_var1_scored = nanzscore(medians_var1);
medians_var2_scored = nanzscore(medians_var2);
medians_var1_scored_tr = medians_var1_scored';
medians_var2_scored_tr = medians_var2_scored';
tbl=table(medians_var1_scored_tr,medians_var2_scored_tr,'VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
which gives this result:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
_________ ________ ________ _______
(Intercept) 0.028674 0.094595 0.30312 0.76234
var2 -0.072919 0.094998 -0.76758 0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
If instead I use the original variables, I do:
tbl=table(medians_var1',medians_var2','VariableNames', ...
{'var1','var2'});
%build your model
mdl=fitlm(tbl,'var1 ~ var2','RobustOpts','on')
and obtain:
mdl =
Linear regression model (robust fit):
var1 ~ 1 + var2
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
(Intercept) 0 0 NaN NaN
var2 8.6386e-13 2.1087e-14 40.966 7.8796e-71
Number of observations: 118, Error degrees of freedom: 117
Root Mean Squared Error: 31.4
R-squared: 0.263, Adjusted R-Squared: 0.263
F-statistic vs. constant model: Inf, p-value = NaN
I can't understand why this happens. You can find both the original var1 and var2 variables and the z-scored ones attached. If I plot both of them, I obtain the same plot (just rescaled), so there shouldn't be a problem in the nanzscore function, which is hand-written.
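For reference, a minimal sketch of what such a hand-written nanzscore might look like (an assumption; the actual function was not posted):
nanzscore = @(x) (x - mean(x,'omitnan')) ./ std(x,0,'omitnan');  % hypothetical z-score that ignores NaN entries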

Accepted Answer

dpb
dpb on 31 Dec 2025 at 13:51
Edited: dpb on 1 Jan 2026 at 13:36
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
disp(d(i).name) % figure out which is var1, var2??? show which file
whos('-file',d(i).name) % contains which variable...
end
medians_var1.mat
  Name                  Size     Bytes  Class     Attributes
  medians_energy_vec    1x118    944    double

medians_var1_scored_tr.mat
  Name                        Size     Bytes  Class     Attributes
  medians_energy_scored_tr    118x1    944    double

medians_var2.mat
  Name                           Size     Bytes  Class     Attributes
  medians_micro_parameter_vec    1x118    944    double

medians_var2_scored_tr.mat
  Name                     Size     Bytes  Class     Attributes
  medians_dwi_scored_tr    118x1    944    double

whos
  Name                           Size     Bytes  Class     Attributes
  ans                            1x40     80     char
  d                              4x1      3555   struct
  i                              1x1      8      double
  medians_dwi_scored_tr          118x1    944    double
  medians_energy_scored_tr       118x1    944    double
  medians_energy_vec             1x118    944    double
  medians_micro_parameter_vec    1x118    944    double
es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
[all(isfinite(es)), all(isfinite(ds))] % is there a NaN lurking, maybe?
ans = 1×2 logical array
1 1
[all(isfinite(e)), all(isfinite(d))]
ans = 1×2 logical array
1 1
[min(es) max(es) min(ds) max(ds)] % see what the magnitude of variables are...
ans = 1×4
-2.8784 2.2034 -2.8010 1.8310
[min(d) max(d)], [min(e) max(e)]
ans = 1×2
1.0e+14 * 0.7538 1.7465
ans = 1×2
47.1018 175.4894
Oh...there's likely the problem; d (medians_micro_parameter_vec) is huge in magnitude, so the unscaled calculation loses precision when computing the sum terms in X'X. The z-scoring brings everything down to roughly the ±3 range, as expected if the data are roughly normal.
Always examine data before blindly fitting...
plot(ds,es,'x')
figure
plot(d,e,'x')
These appear to be just randomly generated values; why such a large magnitude was used is anybody's guess here, but there's the problem.
We'll just fit the same data, still large but on the order of 1e4 instead of 1e14...
fitlm(d/1E10,e,'RobustOpts','on')
ans =
Linear regression model (robust fit):
    y ~ 1 + x1
Estimated Coefficients:
                    Estimate         SE        tStat       pValue
                   __________    _________    ________    __________
    (Intercept)        132.19       15.351      8.6108    4.1607e-14
    x1             -0.0008596    0.0011199    -0.76758       0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 26
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
The problem is simply that the magnitude of the independent variable is so large that the sums during the computation lose precision, leading to an apparently singular design matrix. As the above shows, simply reducing the absolute magnitude of the x variable is sufficient; there isn't necessarily any "magic" in z-scoring.
First we checked to be sure there weren't any NaN values causing the issue that the nanzscore() function was silently taking care of for us...
If we then adjust by the mean and scale, but only by a power of 10 instead of by std(), we get essentially the same fit: the intercept simply shifts because x is now centered, the slope is the same only scaled by the additional factor of 1e4 we introduced, and the pValue is identical.
fitlm((d-mean(d))/1E14,e,'RobustOpts','on')
ans =
Linear regression model (robust fit):
    y ~ 1 + x1
Estimated Coefficients:
                   Estimate      SE       tStat       pValue
                   ________    ______    ________    __________
    (Intercept)      120.55    2.3898      50.441    9.3494e-81
    x1               -8.596    11.199    -0.76758       0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 26
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
And just for good measure, we'll use standardized variables...
fitlm(zscore(d),zscore(e),'RobustOpts','on')
ans =
Linear regression model (robust fit):
    y ~ 1 + x1
Estimated Coefficients:
                   Estimate       SE        tStat      pValue
                   _________    ________    ________    _______
    (Intercept)     0.028674    0.094595     0.30312    0.76234
    x1             -0.072919    0.094998    -0.76758    0.44429
Number of observations: 118, Error degrees of freedom: 116
Root Mean Squared Error: 1.03
R-squared: 0.00584, Adjusted R-Squared: -0.00273
F-statistic vs. constant model: 0.681, p-value = 0.411
Keeping the intermediate calculations of the design matrix roughly on the same order as the constant column of ones is one of the prime uses of z-scoring in regression, besides putting variables of differing magnitudes on the same scale so their sensitivities can be compared. When the magnitude of one or more columns becomes excessively large in comparison, the resulting design matrix looks more and more ill-conditioned until the numerics finally fail.
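A quick way to see this effect (not part of the original answer, assuming d and e as loaded above) is to compare the condition number of the normal-equations matrix built from the raw predictor with one built from a rescaled copy:
% Illustrative sketch: conditioning with raw vs. rescaled predictor
X_raw    = [ones(numel(d),1) d];        % design matrix with raw d (order 1e14)
X_scaled = [ones(numel(d),1) d/1e10];   % same data, magnitude reduced
cond(X_raw.'*X_raw)                     % enormous -> effectively singular in doubles
cond(X_scaled.'*X_scaled)               % far smaller -> fitlm behaves normally
Once the condition number approaches 1/eps (about 4.5e15 for doubles), the normal equations can no longer be solved reliably, which is what shows up here as zero coefficients and NaN statistics.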
  15 Comments
Manuela
Manuela on 2 Jan 2026 at 17:13
Don't worry :) thanks for the very exhaustive answer!
dpb
dpb on 2 Jan 2026 at 17:16
Edited: dpb on 2 Jan 2026 at 17:31
In that case I'd suggest corr is the more appropriate choice. Use the two-output form to get the estimated p-value; this avoids the matrix inversion of the normal equations in the regression calculation entirely.
d=dir('median*.mat'); % load the files w/o having to type a zillion characters...
for i=1:numel(d)
load(d(i).name)
end
%es=medians_energy_scored_tr; ds=medians_dwi_scored_tr; % shorten names
e=medians_energy_vec.'; d=medians_micro_parameter_vec.';
[c,p]=corr([d,e],'Type','Kendall');
[c(2,1) p(2,1)] % corr returns the full correlation matrix, not just the 2-column comparison
ans = 1×2
-0.0204 0.7447
I used Kendall's tau instead of the default (Pearson) because the earlier histogram indicated the data are not normally distributed, owing to what appeared to be multiple peaks in the distribution. You'll undoubtedly want to do some distribution testing first to see just what sort of assumptions can be justified.
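One possible way to do such a check (an illustrative sketch, not from the original thread; lillietest requires the Statistics and Machine Learning Toolbox):
% Visual and formal normality checks before picking a correlation type
histogram(e, 20)             % look for multiple peaks in the distribution
[hE,pE] = lillietest(e);     % Lilliefors test; hE = 1 rejects normality at 5%
[hD,pD] = lillietest(d);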


More Answers (0)
