How to increase bufsize for importdata

I am using the importdata function to import data from tab-separated and comma-separated text files. This works fine for files up to at least 10 MB, but fails on files with an identical format in the 70 MB range with the following error:
Caused by: Error using ==> textscan Buffer overflow (bufsize = 1000005) while reading string from file (row 1, field 1). Use 'bufsize' option. See HELP TEXTSCAN.
Is there an easy way to increase bufsize directly in the importdata call, without mucking around in the textscan function? I understand that as an alternative I could rewrite my code using textscan directly, but my current M-file works with importdata for smaller imports, and I am looking for the simplest solution that allows importing larger data sets.

Accepted Answer

Oleg Komarov
Oleg Komarov on 12 Mar 2011

1 vote

You can try editing line 319 of importdata:
bufsize = min(1000000, max(numel(fileString),100)) + 5;
Set the minimum threshold of 1000000 to something higher.
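If patching importdata is not desirable, textscan also accepted a 'BufSize' option directly in this MATLAB generation, which sidesteps importdata entirely. A minimal sketch (the file name and column count here are placeholders, not taken from the thread):

```matlab
% Hedged sketch: read a tab-separated file with an enlarged buffer,
% bypassing importdata. 'myfile.tsv' and the 8 numeric columns are
% assumptions for illustration only.
fid = fopen('myfile.tsv');
fmt = ['%s', repmat('%f',1,8)];             % row label + 8 numeric columns
data = textscan(fid, fmt, 'BufSize', 2000000); % raise per-field buffer well above the default
fclose(fid);
```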
EDIT 14 March 02:14 GMT
You have to specify that "NA" should be treated as empty:
fid = fopen('C:\Users\Oleg\Desktop\ancestry-probs-par2.tsv');
% Column headers
colHead = fgetl(fid);
colHead = textscan(colHead,'%s');
colHead = colHead{1};
% get # data columns
numH = length(colHead);
% make fmt
fmt = ['%s', repmat('%f',1,numH)];
  • Import the file in bulk (if there is enough memory)
% Import file
data = textscan(fid,fmt,'HeaderLines',1,'TreatAsEmpty','NA');
fid = fclose(fid);
  • Import line by line (26 seconds on my PC; preallocation doesn't give a boost since there are just 191 lines...)
% Import file
data = cell(0,2);
while ~feof(fid)
    data = [data; textscan(fid,fmt,1,'HeaderLines',1,'TreatAsEmpty','NA','CollectOutput',1)];
end
fid = fclose(fid);
rowHead = cat(1,data{:,1});
data = cat(1,data{:,2});
Oleg

9 comments

David
David on 12 Mar 2011
This did not work. I set
bufsize = 10000000
I left off the semicolon to confirm the setting.
This led to the following error:
Caused by:
Error using ==> textscan
Buffer overflow (bufsize = 4095) while reading characters from
file (row 1, field 1). Use 'bufsize' option. See HELP TEXTSCAN.
bufsize seems to default to 4095.
(I also tried replacing 1000000 with 10000000 in line 319, and this gave the same error.)
I think importdata is trying to import the whole file in one gulp.
I think I will write up a textscan solution that scans line by line.
Oleg Komarov
Oleg Komarov on 13 Mar 2011
I also noticed that inside importdata, textscan is not always called with the bufsize argument... I don't know if it's meant to be like that, or whether you can try adding "'bufsize',bufsize".
Your case is probably one of those calls to textscan without the bufsize argument.
If it were me, I wouldn't go further with importdata but would include a call to textscan instead.
You can post 3 or more lines from your text file and we can help you process the import with textscan.
David
David on 13 Mar 2011
Sounds like a great offer! Thanks.
Here's my function. I read one line at a time, because I know that works! Then I save each line to a cell array of strings. I have managed to get the row and column labels, but I can't quite figure out how to extract the numeric data. Ideally, I would like the numeric data in an array. I have tried cell2mat, but I can't quite get it to work. At the bottom, I paste a test file.
function [data,row_labels,column_labels] = readprobdata_fgetl(filename,dir)
file = [dir,'/',filename];
% Count the lines in the file
[status, result] = system(['wc -l ', file]);
numlines = textscan(result,'%f');
numlines = cell2mat(numlines);
%% Use fgetl
fid = fopen(file);
% Get marker names:
% read the entire file into a cell array, one line at a time
for row = 1:numlines
    row_str = fgetl(fid);
    row_str_cell(row,1) = textscan(row_str,'%s');
end
% Get column labels
column_labels = row_str_cell{1,1};
% Get row labels and data
for row = 2:numlines
    row_labels{row-1,1} = row_str_cell{row,1}{1,1};
    temp_col = row_str_cell{row,1}';
    % This line doesn't work
    data{row-1,1} = temp_col(2:end,1);
end
% Convert cell strings to a numeric array
Here is the test file.
test.tsv
2:18372 2:19109 2:19683 2:19696 3:19697 4:20084 X:20117 X:20330
indivA10_GAAGTG .95 1 1 1 1 1 1 1
indivA11_AAAGCG 0 0 0 0 0 .01 .02 .03
indivA12_AATAAG 1 1 1 1 1 1 1 1
indivA1_AAATAG .5 .5 0 0 0 0 0 0
indivA2_TAATTG 1 1 1 1 1 1 1 1
Oleg Komarov
Oleg Komarov on 13 Mar 2011
I would use the following approach to read in data:
fid = fopen('C:\Users\Oleg\Desktop\test.tsv');
colHead = textscan(fid, '%s%s%s%s%s%s%s%s',1);
data = textscan(fid, '%s%f%f%f%f%f%f%f%f','HeaderLines',1);
fid = fclose(fid);
% Then you can store a matrix with results
colHead = [colHead{:}];
rowHead = data{1};
data = [data{2:end}];
David
David on 13 Mar 2011
Yes, I understand this would work for this number of columns, but I need a solution for a variable and large number of columns. I have extended your approach below to get the number of columns and rows from the file and calculate the needed buffer size. Unfortunately, this still works fine for small and medium-sized files, but not for large files: I get the column and row labels, but an empty array for the data.
function [data,rowHead,colHead] = get_tsv(filename,dir)
file = [dir,'/',filename];
fid = fopen(file);
t = fgetl(fid);
colHead = textscan(t,'%s');
% Get # of data columns (note: this variable shadows the built-in length function)
length = size(colHead{1,1},1);
% Make a format string for that many data columns
format = ['%s',repmat('%f',1,length)];
% Get parameters for the buffer size:
% number of characters in the first line...
[status, num_columns] = system(['head -n 1 ', file, '| wc -m']);
num_columns = str2num(num_columns);
% ...and number of rows
[status, num_rows] = system(['wc -l ', file]);
num_rows = textscan(num_rows,'%f');
num_rows = cell2mat(num_rows);
bufsize = num_columns * num_rows;
data = textscan(fid,format,'BufSize',bufsize);
% Store data
colHead = [colHead{:}];
rowHead = data{1};
data = [data{2:end}];
end
Oleg Komarov
Oleg Komarov on 13 Mar 2011
Upload your file to Megaupload; I don't understand why it's not working. You can send me the link by mail.
David
David on 14 Mar 2011
Perfect. This works with a slight modification.
data = cat(2,data{:,2:end});
instead of
data = cat(1,data{:,2});
Oleg Komarov
Oleg Komarov on 14 Mar 2011
I forgot to put "'CollectOutput',1" in the bulk import with textscan.
Michael S
Michael S on 15 Jun 2011
Thanks Oleg, this was very helpful. To others: if you have a CSV file, do not forget that whitespace is the default delimiter, so you need to add 'Delimiter',',' to the textscan arguments, i.e. textscan(fid,fmt,'HeaderLines',1,'Delimiter',',','CollectOutput',1)
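Putting that together with the pattern from the accepted answer, a hedged sketch for CSV input ('myfile.csv' and the column count numH are placeholder assumptions, not from the thread):

```matlab
% Sketch: bulk CSV import following the thread's recipe, with the
% comma delimiter added as Michael S notes. File name and numH are
% assumptions for illustration.
fid  = fopen('myfile.csv');
numH = 8;                              % number of numeric data columns (assumed)
fmt  = ['%s', repmat('%f',1,numH)];    % row label + numeric columns
data = textscan(fid, fmt, 'HeaderLines',1, 'Delimiter',',', ...
                'TreatAsEmpty','NA', 'CollectOutput',1);
fclose(fid);
rowHead = data{1};                     % cell array of row labels
vals    = data{2};                     % numeric matrix of the data
```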


More Answers (1)

Walter Roberson
Walter Roberson on 12 Mar 2011

0 votes

It looks to me as if it is thinking that the first line is more than 1000000 characters.
How long is the first line?

1 comment

David
David on 13 Mar 2011
head -n 1 test_file.tsv | wc -m
1612061
So, I tried bufsize = 1612061 + 100.
Same error.

