MATLAB Answers

0

Reading a very large text file of an almost regular data with empty value

Hello everybody,
I am trying to import an almost regular matrix into matlab. I used textscan with EmptyValue option to do it.
But it always give a error message 'badly formated string'. I do not understand why. Could you please give me a hand.
Below is the data file. The problem with the text file is:
first, there is empty in it. It would be better if I can get a NaN or 0 to replace the empty at the end
Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult.
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
The matrix I wanna obtain is
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00 NaN
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01 NaN
4.417E-03 1.000E+00 2.200E+05 11135 2.543452E+00 NaN
4.417E-03 1.000E+00 2.200E+05 11138 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11141 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+05 11351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2350 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2356 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2522 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2711 NaN NaN
Ps. the file is very large. Data contains thousands of rows and columns. Should I do some optimisation for reading files? Thanks in advance to help me. Thank you very much.
[EDITED]
Format='%*10E %10E %9f %5d %E %E'
opt = {'EmptyValue',NaN,'CollectOutput',1};
tmp = textscan(fid,Format,'Delimiter','','Whitespace','',opt{:});

  0 Comments

Sign in to comment.

4 Answers

Answer by Jacob Palczynski on 25 Aug 2011
 Accepted Answer

Since your matrix is just almost regular, you will not be able to work with textscan easily. I guess the problem with EmptyValue is that there are no delimiters for the empty fields in your file.
I would suggest to use fgetl, and extract digits from the resulting char array with regexp. Then you get a cell array of char arrays, which you can convert to doubles.
For example:
clear
fid = fopen('test.txt','rt');
% maximum number of columns
maxlength = 6;
% preallocation
step = 2;
tm = nan(step,maxlength);
% reading line by line
k = 1;
while ~feof(fid)
thisline = fgetl(fid);
thisline = regexp(thisline,'\d*','match');
thisline = cellfun(@str2num,thisline);
tm(k,1:length(thisline)) = thisline;
r = size(tm,1);
% need to preallocate more?
if k == r
tm((r+1):(r+step), 1:maxlength) = nan;
end
k = k + 1;
end
fclose(fid);
For large files it makes sense to increase step.

  8 Comments

Hi thanks very much. It works, but I don't understand why your code works also with the joined-columns text file. Is it beacause +([E](-|+)\d\d) this expression to recongonize the E+00 ?
Otherwise, do you know how we can do the optimization for the time to read the text? Since the file has 500*10000 data to be read...
When I run the script in the profiler with a 10000*6-data-file, most time (~20 seconds) is spend in the cellfun line. Only 7 seconds are needed to read the file.
Hence, the conversion which we need due to the data format takes most time.

Sign in to comment.


Answer by Oleg Komarov on 25 Aug 2011

The txt I created is (with a newline character at the end, otherwise won't read properly):
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
.
fid = fopen('test.txt');
% Import 3rd and 4th column as fixed char, the other directly as doubles
out = textscan(fid,'%f%f%9c%5c%f%f','EmptyValue',NaN);
% Line of blank to pad char fields
pad = repmat(' ',1,numel(out{1}));
% Read in the fixed width char fields as double
out{3} = cell2mat(textscan([pad; out{3}.'],'%f'));
out{4} = cell2mat(textscan([pad; out{4}.'],'%f'));
fclose(fid);
cell2mat(out)

  4 Comments

Show 1 older comment
If you copy paste the data as I posted it and add a newline at the end on R2011a Win32 Vista it imports correctly.
Works also on the data posted by gringoire except I have to add a newline at its end.
what about this, can you solve?? this data is 9 column
02:38:00 R- .026 065.457 01** 4862 0097 0074 +19
02:39:00 NaN .000 065.457 01** 4862 0101 0074 +19
02:40:00 NaN .000 065.457 01** 4862 0099 0074 +19
02:41:00 NaN .000 065.457 01** 4862 0097 0074 +19
02:42:00 R- .129 065.459 01** 4862 0111 0074 +19
02:43:00 R- .051 065.460 01** 4862 0099 0074 +19
note: NaN is empty or 2space(" ")
thanks b4

Sign in to comment.


Answer by Jan
on 25 Aug 2011

Please post the command, which fails. It is impossible to guess exactly, what you have done.
This is a serious problem: "Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult." It is not trivial to split the string "2.200E+0511135". Without additional restirctions, it is even impossible. I assume, the exponent is limited to 2 digits - can you confirm this? Actually values > 1.0e100 are valid, but then it I'd consider the data file as damaged.
The file uses 4 significant digits per number in the leading columns. This is rather inprecise. I'd never draw any conclusions about the regularity of a large matrix with millions of elements, if the data are represented with such a low accuracy. Do you see any possibilities to obtain the values with SINGLE or better DOUBLE precision?
It is impossible to use TEXTSCAN due to the touching numbers. Therefore you have to insert a space after the 29th character at first and create a new file:
fidIN = fopen(FileName1, 'r');
if fidIN < 0, error('Cannot open file %s', FileName1); end
fidOUT = fopen(FileName2, 'w'); % EDITED: 'r'->'w'
if fidOUT < 0, error('Cannot open file %s', FileName2); end
while 1
S = fgetl(fidIN);
if ~ischar(S)
break;
end
% Split '4.417E-03 1.000E+00 2.200E+0511135'
S = [S(1:29), ' ', S(30:end)];
fwrite(fidOUT, S, 'uchar');
fwrite(fidOUT, 10, 'uchar'); % Or [13,10] for DOS line breaks
end
fclose(fidIN);
fclose(fidOUT);
Then TEXTSCAN has a chance to read the data. You need the 'MultipleDelimsAsOne' and the 'EmptyValue' flags.
PS. Please tell the programer who has created the file, that the format is trashy. It is near to be unusable. The strategy to format ASCII files are easy: Let the file be readable with the minimum number of rules. But you have repeated delimiters, missing data, touching columns and a large data set with low precision.

  5 Comments

Yeah sorry for the poor english...
Otherwise, could you please tell me why I have always the error message:
??? Error using ==> textscan Badly formed format string.
Which part I made a mistake..
@gringoire: Talking in English about MATLAB is not easy.
Look at "help textscan" -> Supported FORMAT specifiers: http://www.mathworks.de/help/techdoc/ref/textscan.html
There is no "%E" specifier. You can define the width as in "%10E" in newer MATLAB versions only - e,g, not in 2009a.
Your format specifier belong to FSCANF - which is btw. also a good idea to read the file line by line.
Ah ok.. I thought %10E works for every function-

Sign in to comment.


Answer by Antti
on 25 Aug 2011

Hi,
I think Jan's example code is good, but I suppose 3rd line should be:
fidOUT = fopen(FileName2, 'w'); % 'w' like write

  1 Comment

Yes. Fixed now.

Sign in to comment.