Why would csvread read all data into a single column

I am trying to read a CSV file into MATLAB. It has 1 million columns and 2 rows, but when I use csvread it reads the file in as a 2-million-row, 1-column matrix. Why would it do this?
  1 Comment
dpb
dpb on 12 May 2017
Edited: dpb on 12 May 2017
Dunno... it would seem there's either:
  1. something in the file that's confusing textscan's row count, or
  2. an actual bug/limitation inside textscan.
csvread is just a wrapper to dlmread which in turn simply parses the inputs and calls textscan. For a .csv file, the call boils down to
delimiter = sprintf(delimiter);
whitespace = setdiff(sprintf(' \b\t'),delimiter);
result = textscan(fid,'',nrows, ...
    'delimiter',delimiter,'whitespace',whitespace, ...
    'headerlines',r,'headercolumns',c, ...
    'returnonerror',0,'emptyvalue',0,'CollectOutput',true);
where, of course, delimiter is ','.
The "magic" occurs inside textscan: as you notice, there is no explicit format string, only an empty-string placeholder. This is the cue used internally that instructs textscan to return the array in the shape the record structure appears externally, without the user having to count fields and build a format string.
Since we can't see inside textscan, this is as far as we can go.
You could try building a test file and parsing it to see whether you can replicate the problem at a specific record length, or perhaps determine along the way that such a file works correctly and the fault lies in this particular data file.


Answers (2)

Matthew Eicholtz
Matthew Eicholtz on 12 May 2017
I think dpb's comment addresses potential csvread issues well, so I'll just add an alternative option that may work for you: readtable.
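A minimal sketch of the readtable route, assuming the file has no header row (hence 'ReadVariableNames',false):

```matlab
% Read the CSV as a table, then convert to a numeric matrix.
% 'ReadVariableNames',false assumes the file has no header row.
t = readtable('test.csv','ReadVariableNames',false);
d = table2array(t);   % numeric matrix, one table variable per column
```

Whether readtable copes with a million columns any better is worth testing on a small case first; it builds one table variable per column, so it may be slow at this width.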
  3 Comments
dpb
dpb on 12 May 2017
Actually, for such data file sizes it would seem far better to use .mat files or a stream or somesuch... there's certainly no looking at them usefully by hand, it would seem.
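For example, once the data are in memory they could be kept in a .mat file, which round-trips the full matrix with no record-length parsing at all (a sketch, assuming the matrix is in a variable d):

```matlab
% One-time conversion: store the matrix in a .mat file...
save('test.mat','d','-v7.3')   % -v7.3 format supports very large variables
% ...and thereafter load it back directly, no CSV parsing involved.
s = load('test.mat');
d = s.d;
```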



dpb
dpb on 12 May 2017
Edited: dpb on 12 May 2017
Expanding upon the above comments, I did a test that looked like--
N=1E6;                              % the long row length
csvwrite('test.csv',randi(127,2,N)) % write a 2-row file with N columns (@)
d=csvread('test.csv');
while isvector(d)                   % halve N until csvread returns a 2-D array
    N=N/2;
    csvwrite('test.csv',randi(127,2,N))
    d=csvread('test.csv');
end
disp(N)
The result was N=62500, which seems to prove there's an internal limit in textscan; probably some sort of buffer limit, one would guess, when the format string isn't provided.
I didn't try to refine the result to find where between 62,500 and 125,000 it breaks, but that definitely seems to be the cause of the issue.
I tried the venerable textread; it never completed the 1E6 case before I gave up, so that's not a workaround.
While it would be butt-ugly as a solution, I also tried an explicit format string:
>> d=textscan(fid,fmt,'delimiter',',','collectoutput',1);
Out of memory. Type HELP MEMORY for your options.
>>
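For reference, the explicit format string could be built along these lines (a hypothetical reconstruction; the post doesn't show how fmt was created, and as the transcript above shows, this route ran out of memory at N=1E6):

```matlab
% Build an explicit format string of one %f conversion per column
% (hypothetical reconstruction -- the original post omits this step).
N = 1E6;
fmt = repmat('%f',1,N);   % '%f%f%f...' -- one conversion per field
fid = fopen('test.csv','r');
% With 'collectoutput',1, d is a cell array whose first element holds
% the collected numeric matrix (one row per record read).
d = textscan(fid,fmt,'delimiter',',','collectoutput',1);
fclose(fid);
```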
Looks like a support request to TMW to see whether they can find a workaround or put it on the enhancement list to resolve. It certainly seems as though MATLAB should be able to read any file, in whatever form it is on disk, as long as it can actually fit in memory, without gyrations by the user.
(@) Just to be sure, I did scan the long-record file by reading it as a stream of characters and confirmed csvwrite wrote the linefeeds where it should have, so that the file was, in fact, two records on disk.
ADDENDUM
I hate it when I get fixated on something... :(
But, I did a couple of additional tests and confirmed there's a hard limit apparently buried inside the textscan code at 100000--
>> N=100000;
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
1
>> N=N-1
N =
99999
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
0
>>
It fails beginning at 100,000 elements per record; 99,999 is OK. You're just not supposed to have a file with records any longer than that, it seems.
  1 Comment
dpb
dpb on 12 May 2017
Edited: dpb on 13 May 2017
Well, here's one way to make it work, albeit slowly:
>> fid=fopen('test.csv','r');
>> dd=str2num(fread(fid,'*char').');
>> whos dd
  Name      Size                Bytes  Class     Attributes
  dd        2x1000000        16000000  double
>> fid=fclose(fid);
If you know the size a priori, it would be better to just read and reshape. If the size isn't known, the two-step solution below is probably still significantly faster, as str2num uses eval internally. But it is interesting that the interpreter can deal with that long an internal input record while textscan can't handle that long an external one.
fid=fopen('test.csv','r');
n=length(find(fread(fid,'*char')==10)); % count linefeeds = number of records
fid=fclose(fid);
% csvread returns the collapsed values in file (row-by-row) order, while
% reshape fills column-major, so reshape to n columns and transpose:
d=reshape(csvread('test.csv'),[],n).';

