Why would csvread read all data into a single column

I am trying to read a CSV file into MATLAB. It has 1 million columns and 2 rows, but when I use csvread it reads the file in as a 2-million-row, 1-column matrix. Why would it do this?
  1 Comment
dpb
dpb on 12 May 2017
Edited: dpb on 12 May 2017
Dunno... it would seem there's either:
  1. something in the file that's confusing textscan's row count, or
  2. an actual bug/limitation inside textscan.
csvread is just a wrapper to dlmread which in turn simply parses the inputs and calls textscan. For a .csv file, the call boils down to
delimiter = sprintf(delimiter);
whitespace = setdiff(sprintf(' \b\t'),delimiter);
result = textscan(fid,'',nrows, ...
    'delimiter',delimiter,'whitespace',whitespace, ...
    'headerlines',r,'headercolumns',c, ...
    'returnonerror',0,'emptyvalue',0,'CollectOutput',true);
where, of course, delimiter is ','.
The "magic" occurs inside textscan: as you notice, there is no explicit format string, only an empty-string placeholder. This is the cue used internally that instructs textscan to return the array in the shape the record structure appears externally, without the user having to count fields and build a format string.
Since we can't see inside textscan, this is as far as we can go.
You could try building a test file and parsing it to see whether you can replicate the problem at a specific record length, or perhaps determine along the way that such a file works correctly and the fault lies in this particular data file.


Answers (2)

Matthew Eicholtz
Matthew Eicholtz on 12 May 2017
I think dpb's comment addresses potential csvread issues well, so I'll just add an alternative option that may work for you: readtable.
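A minimal sketch of the readtable route, assuming the file has no header row (hence 'ReadVariableNames',false):

```matlab
% Read the CSV as a table, then convert to a numeric matrix.
% 'ReadVariableNames',false assumes the file has no header row.
t = readtable('test.csv','ReadVariableNames',false);
d = table2array(t);   % numeric matrix, one table variable per column
```

Whether readtable copes with a million columns any better is worth testing on a small case first; it builds one table variable per column, so it may be slow at this width.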
  3 Comments
dpb
dpb on 12 May 2017
Actually, for such data file sizes it would seem far better to use .mat files or a stream or somesuch... there's certainly no looking at them usefully by hand, it would seem.
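For example, once the data are in memory they could be kept in a .mat file, which round-trips the full matrix with no record-length parsing at all (a sketch, assuming the matrix is in a variable d):

```matlab
% One-time conversion: store the matrix in a .mat file...
save('test.mat','d','-v7.3')   % -v7.3 format supports very large variables
% ...and thereafter load it back directly, no CSV parsing involved.
s = load('test.mat');
d = s.d;
```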



dpb
dpb on 12 May 2017
Edited: dpb on 12 May 2017
Expanding upon the above comments, I did a test that looked like--
N=1E6;                              % the long row length
csvwrite('test.csv',randi(127,2,N)) % write a 2-row file with N columns (@)
d=csvread('test.csv');
while isvector(d)                   % halve N until csvread returns a 2-D array
    N=N/2;
    csvwrite('test.csv',randi(127,2,N))
    d=csvread('test.csv');
end
disp(N)
The result was N=62500, which seems to prove there's an internal limit in textscan; probably some sort of buffer limit, one would guess, when the format string isn't provided.
I didn't try to refine the result to find where between 62,500 and 125,000 it breaks, but that definitely seems to be the cause of the issue.
I tried the venerable textread; it never completed the 1E6 case before I gave up, so that's not a workaround.
While it would be butt-ugly as a solution, I also tried an explicit format string:
>> d=textscan(fid,fmt,'delimiter',',','collectoutput',1);
Out of memory. Type HELP MEMORY for your options.
>>
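For reference, the explicit format string could be built along these lines (a hypothetical reconstruction; the post doesn't show how fmt was created, and as the transcript above shows, this route ran out of memory at N=1E6):

```matlab
% Build an explicit format string of one %f conversion per column
% (hypothetical reconstruction -- the original post omits this step).
N = 1E6;
fmt = repmat('%f',1,N);   % '%f%f%f...' -- one conversion per field
fid = fopen('test.csv','r');
% With 'collectoutput',1, d is a cell array whose first element holds
% the collected numeric matrix (one row per record read).
d = textscan(fid,fmt,'delimiter',',','collectoutput',1);
fclose(fid);
```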
Looks like a support request to TMW to see whether they can find a workaround or put it on the enhancement list to resolve. It certainly seems as though MATLAB should be able to read any file, in whatever form it is on disk, as long as it can actually fit in memory, without gyrations by the user.
(@) Just to be sure, I did scan the long-record file by reading it as a stream of characters and confirmed csvwrite wrote the linefeeds where it should have, so that the file was, in fact, two records on disk.
ADDENDUM
I hate it when I get fixated on something... :(
But, I did a couple of additional tests and confirmed there's a hard limit apparently buried inside the textscan code at 100000--
>> N=100000;
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
1
>> N=N-1
N =
99999
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
0
>>
It fails beginning at 100,000 elements per record; 99,999 is OK. You're just not supposed to have a file with records any longer than that, it seems.
  1 Comment
dpb
dpb on 12 May 2017
Edited: dpb on 13 May 2017
Well, here's one way to make it work, albeit slowly:
>> fid=fopen('test.csv','r');
>> dd=str2num(fread(fid,'*char').');
>> whos dd
  Name      Size                Bytes  Class     Attributes
  dd        2x1000000        16000000  double
>> fid=fclose(fid);
If you know the size a priori, it would be better to just read and reshape. If the size isn't known, the two-step solution below is probably still significantly faster, as str2num uses eval internally. But it is interesting that the interpreter can deal with that long an internal input record while textscan can't handle that long an external one.
fid=fopen('test.csv','r');
n=length(find(fread(fid,'*char')==10)); % count linefeeds = number of records
fid=fclose(fid);
% csvread returns the collapsed values in file (row-by-row) order, while
% reshape fills column-major, so reshape to n columns and transpose:
d=reshape(csvread('test.csv'),[],n).';

