Efficiently Read in Text file with Headers Throughout File

Question

Christopher Saltonstall on 1 Feb 2017


Direct link to this question

https://de.mathworks.com/matlabcentral/answers/322910-efficiently-read-in-text-file-with-headers-throughout-file

Edited: Jan on 1 Feb 2017

Accepted Answer: Christopher Saltonstall

GULPInputTest (2).txt


Hello,

I am trying to read dispersion outputs from the GULP lattice dynamics program. The format of the output file is shown in the attached file. Basically, it has three header lines followed by two columns of data, and this pattern repeats roughly 300 times (depending on the simulation parameters). I want to read this output efficiently, without testing every line. My current code, shown below, works fine, but it takes 30 seconds for a small output file, and I want to optimize how the data is read in preparation for much larger files. Any suggestions?

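fid = fopen(file);
idx = 0;    % number of header lines stored so far
idx2 = 1;   % row index of the next data line
tic
while ~feof(fid) % Check every line in the file
    dummy = fgetl(fid);
    test = dummy(1);
    if ~strcmp(test,'#')
        % Data line: split at spaces and convert columns 2 and 3 to numbers
        dummy2 = strsplit(dummy,' ');
        C(idx2,1) = str2double(cell2mat(dummy2(1,2)));
        C(idx2,2) = str2double(cell2mat(dummy2(1,3)));
        idx2 = idx2 + 1;
    else
        % Header line: store the text as-is
        Header(idx+1,1) = {dummy};
        idx = idx + 1;
    end
end
fclose(fid);
toc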


Answer 1

Christopher Saltonstall on 1 Feb 2017


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/322910-efficiently-read-in-text-file-with-headers-throughout-file#answer_252949


This solution works great if you know the length of the data between headers. I had to write code to determine this from the GULP input file.

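tic
fid = fopen(file, 'r');   % 'file' is the path to the GULP output, as in the question
N = nBranch*n1Max;        % length of data between headers
formatSpec = '%f %f';
C = [];
while ~feof(fid)
  % Read one block of N rows; lines starting with '#' are skipped as comments
  s = textscan(fid,formatSpec,N,'CommentStyle','#','Delimiter','\t');
  C = [C; s{1,1}, s{1,2}]; %#ok<AGROW>
end
fclose(fid);
toc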
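
If the blocks should stay separate (for example, one matrix per header block), the same loop can collect them in a cell array and concatenate once at the end. The following is only a sketch under stated assumptions: the sample file it writes is hypothetical and merely imitates the layout described in the question (three '#' header lines, then N data rows), and N is assumed to be known.

```matlab
% Hypothetical sample file: 2 blocks, each with 3 '#' header lines and 3 data rows
file = [tempname, '.txt'];
fid  = fopen(file, 'w');
for k = 1:2
    fprintf(fid, '# header A\n# header B\n# header C\n');
    fprintf(fid, ' %g %g\n', [1 2 3; 10 20 30] + k);   % one row per column of the matrix
end
fclose(fid);

N = 3;                               % rows between headers (assumed known)
blocks = {};                         % one N-by-2 matrix per block
fid = fopen(file, 'r');
while ~feof(fid)
    s = textscan(fid, '%f %f', N, 'CommentStyle', '#', 'CollectOutput', true);
    if isempty(s{1}), break; end     % only trailing whitespace was left
    blocks{end+1} = s{1};            %#ok<AGROW> grows a small cell, not the data
end
fclose(fid);
delete(file);

C = vertcat(blocks{:});              % single concatenation at the end
```

Growing the cell array only moves a list of references per block, so the numeric data is copied once, in the final vertcat.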


Jan on 1 Feb 2017

Edited: Jan on 1 Feb 2017

+1: Nice and compact.

If you do not need the comment lines, one textscan is enough without a loop (see my 2nd approach):

dataC = textscan(fid, '%f %f', 'CommentStyle', '#');
Data  = cat(2, dataC{1}, dataC{2});
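
To try the single-call variant without the original attachment, here is a minimal self-contained sketch. The sample file contents are made up for illustration, and 'CollectOutput' is used as an optional shortcut so that textscan returns one two-column matrix instead of two separate columns.

```matlab
% Hypothetical sample file with '#' header lines between the data rows
file = [tempname, '.txt'];
fid  = fopen(file, 'w');
fprintf(fid, '# block 1\n 0.0 1.5\n 0.1 2.5\n# block 2\n 0.2 3.5\n');
fclose(fid);

fid = fopen(file, 'r');
if fid == -1
    error('Cannot open file: %s', file);
end
% Lines starting with '#' are skipped; both columns land in one matrix
dataC = textscan(fid, '%f %f', 'CommentStyle', '#', 'CollectOutput', true);
fclose(fid);
delete(file);

Data = dataC{1};    % all data rows, header lines ignored
```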


Answer 2

Stephen23 on 1 Feb 2017


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/322910-efficiently-read-in-text-file-with-headers-throughout-file#answer_252950

Edited: Stephen23 on 1 Feb 2017

This imports both the headers and the numeric data in a single pass, without pre-reading the file, and uses textscan to keep it nice and fast:

A = cell(0,1);
B = cell(0,2);
C = cell(0,1);
opt = {'Delimiter','=','WhiteSpace','','CollectOutput',true};
fid = fopen('GULPInputTest (2).txt','rt');
str = fgetl(fid);
while ischar(str)
    A{end+1,1} = str; %#ok<SAGROW>
    B(end+1,:) = textscan(fid,'#%s%f%f%f',opt{:}); %#ok<SAGROW>
    C(end+1,:) = textscan(fid,'%f%f','CollectOutput',true); %#ok<SAGROW>
    str = fgetl(fid);
end
fclose(fid);

It is also quite fast to run:

Elapsed time is 0.231249 seconds.

Jan on 1 Feb 2017

Edited: Jan on 1 Feb 2017

+1: Why is this so fast compared to the original version or to my first suggestion? Can you explain this? The missing pre-allocation should slow it down, but it doesn't. At least not for the provided input file.


Answer 3

Jan on 1 Feb 2017


Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/322910-efficiently-read-in-text-file-with-headers-throughout-file#answer_252945

Edited: Jan on 1 Feb 2017

Here are several ideas, for teaching purposes. You are right that the loops are not efficient here. Only textscan reaches the attainable speed:

tic
fid = fopen(file, 'r');
if fid == -1  % Seen too many codes failing...
  error('Cannot open file: %s', file);
end
iData   = 0;
iHeader = 0;
Header  = cell(100, 1);
Data    = [];
while ~feof(fid)
    s = fgetl(fid);
    if strncmp(s, '#', 1)
      iHeader         = iHeader + 1;
      Header{iHeader} = s;
    else
      iData = iData + 1;
      Data(iData, 1:2) = sscanf(s, ' %g %g');  % Missing pre-allocation!
    end   
end
fclose(fid);
Header = Header(1:iHeader);
toc

Start with this simplified version. The chain str2double(cell2mat(strsplit(...))) wastes some time. On my Matlab R2009a in a virtual machine this takes 33 sec instead of the 44 sec of the original (with strsplit replaced by regexp('split') in the old Matlab version).

It still suffers from a missing pre-allocation for the data. The iterative growing wastes a lot of resources, because each enlargement may force Matlab to allocate a larger array and copy the existing rows. Search the net for "Schlemiel the Painter".
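
The cost of iterative growing can be measured in isolation. The following sketch is not taken from the GULP import; it only fills an n-by-2 array once by growing it row by row and once after pre-allocation. The value of n is arbitrary, and the measured times depend on the Matlab version and the machine, so no specific numbers are claimed here. In the worst case, where every enlargement copies all existing rows, growing to n rows copies about n*(n-1)/2 rows in total.

```matlab
n = 2e4;                         % number of rows (arbitrary, for illustration)

tic
grown = [];
for k = 1:n
    grown(k, 1:2) = [k, 2*k];    %#ok<SAGROW> array is enlarged in every iteration
end
tGrow = toc;

tic
pre = zeros(n, 2);               % allocate the full array once
for k = 1:n
    pre(k, :) = [k, 2*k];        % fill in place, no re-allocation
end
tPre = toc;

fprintf('growing: %.3f s, pre-allocated: %.3f s\n', tGrow, tPre);
```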

Another approach:

tic
fid = fopen(file, 'r');
if fid == -1  % Seen too many codes failing...
  error('Cannot open file: %s', file);
end
dataC = textscan(fid, '%f %f', 'CommentStyle', '#');
Data = cat(2, dataC{1}, dataC{2});
fclose(fid);
toc

But this does not import the comment lines. Nevertheless, it is quite fast: 0.17 sec. Sounds perfect.

Now try to solve the pre-allocation problem:

tic
fid = fopen(file, 'r');
if fid == -1  % Seen too many codes failing...
  error('Cannot open file: %s', file);
end
Block   = cell(1, 10000);  % Not expensive, better too large
dataLen = 1000;            % Arbitrary limit
aData   = zeros(dataLen, 2);
iBlock  = 0;
iData   = 0;
iHeader = 0;
Header  = cell(100, 1);    % Not expensive, better too large
while ~feof(fid)
    s = fgetl(fid);
    if strncmp(s, '#', 1)    % Header line:
      iHeader         = iHeader + 1;
      Header{iHeader} = s;
    else                     % Data line
      iData = iData + 1;
      aData(iData, :) = sscanf(s, ' %g %g');
      if iData == dataLen
        iBlock        = iBlock + 1;
        Block{iBlock} = aData;
        iData         = 0;
      end
    end
end
fclose(fid);
Header = Header(1:iHeader);
% Care for last block:
iBlock        = iBlock + 1;
Block{iBlock} = aData(1:iData, :);
% Join the imported data blocks:
Data = cat(1, Block{1:iBlock});
toc

Christopher, this looks ugly. Sorry. It is so ugly. Phew. There is too much clutter here, which is prone to errors. It needs 4.4 sec: 10 times faster than the original, but slow compared to textscan. The speed will degrade further if the pre-allocated arrays are too small, while arrays that are too large cost only microseconds.

I assume that another textscan call might be much nicer:

tic
fid = fopen(file, 'r');
if fid == -1  % Seen too many codes failing...
  error('Cannot open file: %s', file);
end
dataC = textscan(fid, ' %f %f ', 'CommentStyle', '#');
Data  = cat(2, dataC{1}, dataC{2});
fseek(fid, 0, -1);  % Rewind to the beginning of the file for the second pass
HeaderC = textscan(fid, '%s', 'CommentStyle', ' ', 'WhiteSpace', '\n');
Header  = HeaderC{1};
fclose(fid);
toc

Now data and header lines are imported separately. It takes 0.18 sec on my slow computer. I'm not happy with using the space as comment style to ignore the data lines. There might be a better filter:

HeaderC = textscan(fid, '%s', 'WhiteSpace', '\n');  % Import all lines
Header  = HeaderC{1};
Header  = Header(strncmp(Header, '#', 1));          % Select the # lines

This lets the complete code run in 0.27 sec.

Conclusion: The loops can be much faster with a pre-allocation, but cannot compete with textscan.
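
As a further variation that avoids both the line-by-line loop and the second pass over the file, the whole file can be read into memory once and then split. This is a different technique from the textscan calls above, shown only as a sketch: the sample file is hypothetical, no timing is claimed for it, and it assumes that every non-header line holds exactly two numbers.

```matlab
% Hypothetical sample file imitating the layout: '#' headers between data rows
file = [tempname, '.txt'];
fid  = fopen(file, 'w');
fprintf(fid, '# h1\n# h2\n 1 2\n 3 4\n# h3\n 5 6\n');
fclose(fid);

txt    = fileread(file);                      % whole file as one char vector
lines  = regexp(txt, '\r?\n', 'split');       % 1-by-n cell, one entry per line
lines  = lines(~cellfun('isempty', lines));   % drop empty lines (e.g. after last \n)
isHdr  = strncmp(lines, '#', 1);              % true for header lines
Header = lines(isHdr).';                      % header lines as a column cell
% Join all data lines and parse them in one sscanf call; [2, Inf] fills
% column-wise, so the transpose yields one row per data line.
Data   = sscanf(strjoin(lines(~isHdr), ' '), '%f', [2, Inf]).';
delete(file);
```

Because the logical index isHdr is kept, the position of each header relative to the data rows can still be recovered afterwards, for example with find(isHdr).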


Christopher Saltonstall on 1 Feb 2017

Edited: Christopher Saltonstall on 1 Feb 2017

Additionally, this code throws an error.

Conversion to cell from double is not possible.
Error in test (line 20)
    C(iData, 1:2) = sscanf(s, ' %g %g');

Jan on 1 Feb 2017

Edited: Jan on 1 Feb 2017

@Christopher: Sorry, this submission was under construction. I typed it first and have now run the different codes.

Ready now.
