Regression with tall array (Using datastore, CSV) - Error

Hi

5 Comments

Ive J
Ive J on 12 Jul 2021
Edited: Ive J on 12 Jul 2021
Do you mean this?
result = fitglm(x, y, 'Distribution', 'binomial', 'Link', 'logit');
I ask because you have an extra ) there (though I'm sure the error is complaining about something else).
Can you confirm you have tall arrays (for x and y)?
istall(x)
ans =
logical
1
Also, are you trying to set the formula? The error message suggests so, but your call to fitglm doesn't show it.
K.P.
K.P. on 12 Jul 2021
Yes, your fitglm line is the one I have; the ) was a copy-paste error.
And yes, x and y are both tall arrays.
No, I am not calling a special formula.
Can you share the output of your dependent/independent variables?
x
y
x is a 1000x500 (tall) table. These are the first entries:
7 6 12 12 15 13 12 30 71 6
3 4 4 0 0 1 10 2 6 1
1 0 0 0 0 0 2 0 0 0
1 0 4 0 0 0 0 0 4 0
6 3 5 2 0 0 10 0 3 0
3 26 10 3 0 2 15 7 24 1
17 85 5 4 0 0 29 0 6 0
1 0 1 0 0 2 1 0 0 0
2 0 3 0 0 0 9 0 4 0
5 18 11 2 0 1 6 0 3 0
3 1 0 0 0 2 4 0 0 0
2 0 0 0 0 0 0 0 0 0
2 0 10 0 0 0 0 0 0 0
2 0 1 1 0 3 0 0 3 0
2 16 3 0 0 0 3 2 36 1
y is a 1000x1 (tall) table and the first entries are:
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
I just tried to see whether the problem was the combination of tall arrays and fitglm:
>> X=[1:1000].'; X=tall(X);
>> Y=randn(size(X)); % this is interesting sidelight on the way...
Error using randn
Size inputs must be numeric.
>> size(X)
ans =
1×2 tall double row vector
1000 1
>> Y=randn(1000,1); Y=tall(Y); % OK, have to brute-force it
>> fitglm(X,Y,'Distribution',"normal")
Iteration [1]: 0% completed
Iteration [1]: 50% completed
Iteration [1]: 100% completed
Iteration [2]: 0% completed
Iteration [2]: 50% completed
Iteration [2]: 100% completed
Iteration [3]: 0% completed
Iteration [3]: 100% completed
ans =
Compact generalized linear regression model:
y ~ 1 + x1
Distribution = Normal
Estimated Coefficients:
                    Estimate         SE         tStat      pValue 
                   __________    __________    ________    _______
    (Intercept)     0.0015036      0.064429    0.023338    0.98139
    x1             1.6177e-05    0.00011151     0.14507    0.88468
1000 observations, 998 error degrees of freedom
Estimated Dispersion: 1.04
F-statistic vs. constant model: 0.021, p-value = 0.885
>>
So fitglm does accept tall arrays; the problem must lie elsewhere in the syntax, it would seem...
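On the randn sidelight in the session above: size(X) of a tall array is itself a tall array, which is why randn rejects it. A minimal sketch of one way around this, assuming it is acceptable to gather the (tiny) size vector into memory first; the variable name sz is illustrative only:

```matlab
X  = tall((1:1000).');   % tall column vector, as in the session above
sz = gather(size(X));    % bring the small size vector into memory
Y  = tall(randn(sz));    % randn now receives an ordinary numeric size
```

This avoids hard-coding the dimensions, at the cost of one gather call.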


Accepted Answer

Ive J
Ive J on 13 Jul 2021
Edited: Ive J on 13 Jul 2021
Well, your data is a tall table, and that is what MATLAB complains about: since your first argument is a table, MATLAB treats the second argument (y) as modelspec. You have two options:
% Option 1: pass numeric matrices to fitglm
mdl = fitglm(x{:, :}, y{:, :}, 'Link', 'logit', 'Distribution', 'binomial');
% Option 2: merge x and y into a single table
data = [x, y]; % last column is the dependent variable by default
mdl = fitglm(data, 'Link', 'logit', 'Distribution', 'binomial');
Btw, your data is fairly small and (I assume) fits in memory. Tall arrays should be avoided for such small datasets.
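To see that the two options give the same fit, here is a small sketch on ordinary in-memory tables rather than tall ones. The synthetic data and the variable names (a, b, c, label) are made up for illustration and are not from the original post:

```matlab
rng(0);                                   % reproducible synthetic data
xTbl = array2table(randi([0 20], 200, 3), 'VariableNames', {'a', 'b', 'c'});
yTbl = table(double(rand(200, 1) > 0.7), 'VariableNames', {'label'});

% Option 1: numeric matrices, so the second argument is read as the response.
mdlMat = fitglm(xTbl{:, :}, yTbl{:, :}, 'Distribution', 'binomial', 'Link', 'logit');

% Option 2: one merged table; the last variable is the response by default.
mdlTbl = fitglm([xTbl, yTbl], 'Distribution', 'binomial', 'Link', 'logit');

% Largest difference between the two sets of coefficient estimates.
coefDiff = max(abs(mdlMat.Coefficients.Estimate - mdlTbl.Coefficients.Estimate));
disp(coefDiff);
```

Calling fitglm(xTbl, yTbl, ...) instead would pass a table in the position where fitglm expects a model specification, which is the situation described above.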

2 Comments

K.P.
K.P. on 13 Jul 2021
Hi Ive,
I merged the x and y tables and converted the new table before building the tall array with:
ds = transform(ds,@table2array);
Now it works. Thanks for your help!
PS: the file here was only a smaller sample. The "real" one is 320000x30000.
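For readers following along, a rough sketch of what such a pipeline might look like end to end. It assumes an all-numeric CSV with the response in the last column; the file name, the column count, and the synthetic data are hypothetical, and the original post does not show the exact code used:

```matlab
rng(0);
n = 200;
data = [randi([0 20], n, 3), double(rand(n, 1) > 0.7)];   % 3 predictors + 0/1 response
csvFile = fullfile(tempdir, 'tall_glm_demo.csv');         % hypothetical demo file
writematrix(data, csvFile);

ds = tabularTextDatastore(csvFile, 'ReadVariableNames', false);
ds = transform(ds, @table2array);   % each chunk is converted from table to numeric matrix
t  = tall(ds);                      % tall numeric matrix instead of a tall table

X = t(:, 1:3);                      % predictor columns
y = t(:, 4);                        % response column
mdl = fitglm(X, y, 'Distribution', 'binomial', 'Link', 'logit');
```

Because X and y are now tall numeric arrays rather than tall tables, fitglm reads the second argument as the response instead of a model specification.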
Ive J
Ive J on 13 Jul 2021
If I were you, I would also test with arrays. In my experience, processing tables is almost always slower than processing arrays.
Good luck!


More Answers (0)


Asked: 12 Jul 2021

Edited: 1 Aug 2021
