how to extract data from the fixed-width-field format using fscanf or textscan

Question

Dmitrii Semikin on 16 Jun. 2019

Direct link to this question

https://de.mathworks.com/matlabcentral/answers/467361-how-to-extract-data-from-the-fixed-width-field-format-using-fscanf-or-textscan

Commented: Dmitrii Semikin on 18 Jun. 2019

I am trying to parse a large file that contains, among other things, numerical data in a fixed-width-field format, without whitespace (or any other separators) between the fields. Consider the example:

 1223344
55 6 788
1122 3 4

In this text each line contains four numbers, each two characters wide. The expected result after parsing this file is:

   1    22    33    44
  55     6     7    88
  11    22     3     4
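The expected result can be reproduced by plain column slicing, which makes the intended semantics explicit. This is only a minimal sketch, not a fast parser: the sample text is hard-coded and patterned on the example above, it assumes that every line has the same length, and the names txt, width and vals are illustrative.

```matlab
% Sample lines patterned on the example above (assumed: all lines have the
% same length, so the text can be stored as a char matrix).
txt = [' 1223344'; '55 6 788'; '1122 3 4'];
width = 2;                               % every field is two characters wide
nFields = size(txt, 2) / width;
vals = zeros(size(txt, 1), nFields);
for k = 1:nFields
    cols = (k - 1) * width + (1:width);  % character columns of field k
    % str2double accepts the leading blank inside a field such as ' 6'.
    vals(:, k) = str2double(cellstr(txt(:, cols)));
end
disp(vals)
```

Slicing by position is what a fixed-width reader has to do; the difficulty discussed in this thread is getting fscanf or textscan to do the same at speed.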

Performance is important to me, as the files I read may be hundreds of MB in size. From my experiments I found that fgetl() is dramatically slower than fscanf. (I did not measure textscan, but I expect it to be significantly faster than fgetl as well.)

The problem is that I cannot find a way to parse data like this with fscanf or textscan. Can anyone tell me whether it is possible at all? If not, is there another way to parse such a text file with good performance?

P.S. The format of the actual lines is, of course, somewhat more complicated. An example line is:

     1101332.18685714711829.064533733772535.874264373485       0       0

This line should be parsed into numbers with the following field widths: 8, 16, 16, 16, 8, 8, which in this particular case gives the following numbers:

110

1332.18685714711

829.064533733772

535.874264373485

0

0

P.S.2: The performance difference between fgetl and fscanf: on my laptop, for a file containing 35597 lines, the following code using fscanf completes in 0.253614 seconds:

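function [data] = test_fscanf_nodes_only_01()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    % Close the file automatically when the function exits.
    cleanup_obj = onCleanup(@() fclose(file_id));
    % Six fields per line with widths 8, 16, 16, 16, 8, 8.
    data = fscanf(file_id, '%8d%16f%16f%16f%8f%8f', [6, Inf]);
end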

while the following code using fgetl needs 6.343209 seconds to complete, even though it does much less:

function [data] = test_fscanf_nodes_only_02()
    file_name = 'myfile.txt';
    file_id = fopen (file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));
    lines_count = 0;
    while ~feof(file_id)
        current_line = fgetl(file_id);
        lines_count = lines_count + 1;
    end
    data = 1;
    fprintf('Lines count: %d', lines_count);
end

The main problem with the first snippet is that it returns the wrong result: fscanf counts the field width from the first non-whitespace character it finds, not from the current position of the file pointer, so leading whitespace is not counted as part of the field width.
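The effect can be seen on a single in-memory line with sscanf, which follows the same format rules as fscanf. This is a minimal sketch, assuming the line ' 1223344' with four two-character fields; the variable names are illustrative.

```matlab
line_txt = ' 1223344';                   % intended fields: ' 1', '22', '33', '44'
intended = [1 22 33 44];
% sscanf skips the leading blank first and only then starts counting the
% two characters of the first %2d field, so the later fields are shifted.
parsed = sscanf(line_txt, '%2d%2d%2d%2d').';
fprintf('parsed:   %s\n', mat2str(parsed));
fprintf('intended: %s\n', mat2str(intended));
```

Comparing the two printed vectors shows that the field boundaries no longer line up once a field starts with a blank.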

EDIT: The dimension in fscanf is changed from [5, Inf] to [6, Inf] according to Jan's comment.

Comments

Dmitrii Semikin on 17 Jun. 2019

@Jan:

The difference between questions:

In this question I ask how to (quickly) parse a string (or file) of the given form and, more specifically, whether it is possible to do so with fscanf or textscan. If not, I ask whether there are alternatives with comparable performance. I.e. this question is about parsing strings of this particular form.
In the other question (What is the efficient way to parse file without fscanf or testscan) I ask whether there are efficient alternatives to fscanf and textscan, and why fgetl is so slow. I.e. that question is about efficient parsing of files in general.

''' The code with fgetl is not useful for your problem, or do you want to count the lines? '''

fgetl seemed too slow, so I was looking for alternatives. After your answer to my other question (the link above) I see that fgetl (or fgets) can probably fulfil my performance requirements after all. Thank you for this.
My intention is not just to count lines. The line-counting example was provided only to illustrate how slow fgetl is.

''' Do you mean [6, inf]? '''

Yes. Sorry, it was a typo. Thank you for noticing it. I fixed it in the text of the question.

''' Is this fscanf command the answer to your question already? '''

No. As Walter Roberson mentioned (and as I mentioned in the question description), the problem with fscanf is that it does not count whitespace when calculating the width of a field. This is exactly the point of my question: is there some trick, fscanf option or other means to get around the fact that fscanf does not count whitespace as part of the field width?

Additional note: even though your answer to my other question is not an answer to this question, it will probably make this question less relevant for my implementation, because with fgetl (used the way you suggest) I will probably be able to fulfil my performance requirements.

Nevertheless, the educational value of an answer to this question is still high for me.

Dmitrii Semikin on 17 Jun. 2019

@Walter Roberson:

''' No that fscanf will not work for the reason that the user noted: fscanf skips leading whitespace before it starts the count. '''

Do I understand correctly that this means the answer to my question is: "NO, it is impossible to parse this kind of string with fscanf or textscan (in the release I have)"?

Walter Roberson on 17 Jun. 2019

textscan() is not the same as fscanf(): with textscan it is possible to change the Whitespace and Delimiter properties in a way that does not skip leading whitespace. However, because of the way that numeric fields are parsed, leading whitespace in a numeric field in such an arrangement would be treated as an invalid character, so this turns out to be useful only for fixed-width character fields.

Answer 1

Walter Roberson on 17 Jun. 2019

Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/467361-how-to-extract-data-from-the-fixed-width-field-format-using-fscanf-or-textscan#answer_379437

Use fileread to read the file into a character vector, regexprep to insert a space before every field, and then textscan on the character vector.
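The three steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the answerer's own code: the field widths 8, 16, 16, 16, 8, 8 are taken from the question, the sample text S is built in memory instead of being read with fileread, and all variable names and the second sample row are hypothetical.

```matlab
widths = [8 16 16 16 8 8];               % field widths taken from the question
% Two hypothetical sample lines; in practice S = fileread('myfile.txt').
row1 = sprintf('%8d%16.11f%16.12f%16.12f%8d%8d', ...
    110, 1332.18685714711, 829.064533733772, 535.874264373485, 0, 0);
row2 = sprintf('%8d%16.12f%16.12f%16.12f%8d%8d', 111, -2.5, 10, 0.125, 1, 2);
S = [row1 char(10) row2];

% Build '^(.{8})(.{16})...' and '$1 $2 ... $6' from the widths.
pattern = ['^' sprintf('(.{%d})', widths)];
replacement = strtrim(sprintf('$%d ', 1:numel(widths)));
% 'lineanchors' makes ^ match at the start of every line, not only of S.
spaced = regexprep(S, pattern, replacement, 'lineanchors');

% After the rewrite the fields are whitespace-separated, so %f is enough.
C = textscan(spaced, repmat('%f', 1, numel(widths)), 'CollectOutput', true);
data = C{1};                             % one row per line, one column per field
```

Generating the pattern from the widths vector keeps the regular expression and the number of %f conversions in sync if the layout changes.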

Comments

Walter Roberson on 17 Jun. 2019

newS = regexprep(S, '^(.{8})(.{16})(.{16})(.{16})(.{8})(.{8})', '$1 $2 $3 $4 $5 $6', 'lineanchors');

Dmitrii Semikin on 18 Jun. 2019

@Walter Roberson: Thank you for the solution.

In the end I did it slightly differently:

I still read the whole block that contains the data into memory, but then I do two transformations using regular expressions:

Move the minus sign "-" to the beginning of the whitespace area
Fill the whitespace area with zeros

Then it is possible to read the data with fscanf using fixed-width formatting.
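The two transformations could look roughly like this. The actual regular expressions are not shown in this thread, so the patterns below are an assumption. The sketch also assumes right-aligned fields padded with blanks, uses a hypothetical line with three 8-character fields, and calls sscanf on an in-memory string in place of fscanf on a file.

```matlab
S = sprintf('%8d%8.1f%8.1f', -12, 3.5, -0.2);   % '     -12     3.5    -0.2'
% Step 1: move each minus sign to the start of the blank run before it.
S1 = regexprep(S, '( +)-', '-$1');
% Step 2: replace the remaining blanks with zeros (newlines are untouched).
S2 = strrep(S1, ' ', '0');
% Every field now fills its full width, so the widths line up again.
vals = sscanf(S2, '%8f%8f%8f').';
```

Because no field starts with whitespace any more, the width in each conversion is counted from the true start of the field.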

Now I realize that your solution is most likely better. I think it should be faster, because you don't need two passes. Initially I thought that I would need to take care of line endings, but then I realized that with "^" at the beginning of the regexp it should all happen automatically.

So I really think this is the most helpful answer (well, the answer from Stephen Cobeldick is an implementation of the same idea).

Thank you.

Answer 2

Stephen23 on 17 Jun. 2019

Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/467361-how-to-extract-data-from-the-fixed-width-field-format-using-fscanf-or-textscan#answer_379486

temp2.txt

A simple solution based on regexprep and sscanf. Before the file-reading loop:

vec = [8,16,16,16,8,8];
rgx = sprintf('(.{%d})',vec);
rpl = sprintf('$%d,',1:numel(vec));

In the file-reading loop:

str = fileread('temp2.txt');
str = regexprep(str,rgx,rpl,'dotexceptnewline');
mat = sscanf(str,'%f,',[numel(vec),Inf]).'

Giving (the test file is attached):

mat =
00000   1332.18686    829.06453    535.87426      0.00000      0.00000
00000   2332.28686    829.06453    535.87426      0.00000      0.00000
00000   3332.38686    829.06453    535.87426      0.00000      0.00000
00000   4332.48686    829.06453    535.87426      0.00000      0.00000
00000   5332.58686    829.06453    535.87426      0.00000      0.00000

Comments

Dmitrii Semikin on 18 Jun. 2019

Thank you for the solution. But as the idea was proposed by Walter Roberson, he gets the score :).
