MATLAB Answers


how to extract data from the fixed-width-field format using fscanf or textscan

Asked by Dmitrii Semikin on 16 Jun 2019
Latest activity Commented on by Dmitrii Semikin on 18 Jun 2019
I am trying to parse a large file that contains, among other things, numerical data in a fixed-width-field format, without whitespace (or any other separator) between the fields. Consider the example:
 1223344
55 6 788
1122 3 4
In this text each line contains four numbers, each two digits wide. The expected result after parsing this file is:
1 22 33 44
55 6 7 88
11 22 3 4
Performance is important to me, as the files I read may be hundreds of MB in size. From my experiments I discovered that fgetl() is dramatically slower than fscanf (I did not check the performance of textscan, but I expect that it should also be significantly faster than fgetl).
The problem is that I cannot find a way to parse data like this with fscanf or textscan. Could someone tell me whether it is possible at all? If not, is there any other way to parse such a text file with good performance?
P.S. The format of the actual strings is, of course, somewhat more complicated. An example string is:
1101332.18685714711829.064533733772535.874264373485 0 0
This string should be parsed into numbers with the following field widths: 8, 16, 16, 16, 8, 8, which in this particular case would result in the following numbers: 110, 1332.18685714711, 829.064533733772, 535.874264373485, 0, 0
P.S.2: Difference between the performance of fgetl and fscanf: on my laptop, for a file that contains 35597 lines, the following code with fscanf completes in 0.253614 seconds:
function [data] = test_fscanf_nodes_only_01()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));  % close the file on exit
    % Six fields per record with widths 8, 16, 16, 16, 8, 8
    data = fscanf(file_id, '%8d%16f%16f%16f%8f%8f', [6, Inf]);
end
while the following code with fgetl needs 6.343209 seconds to complete, even though it does much less:
function [data] = test_fscanf_nodes_only_02()
    file_name = 'myfile.txt';
    file_id = fopen(file_name, 'rt');
    cleanup_obj = onCleanup(@() fclose(file_id));  % close the file on exit
    lines_count = 0;
    while ~feof(file_id)
        current_line = fgetl(file_id);  % read one line and discard it
        lines_count = lines_count + 1;
    end
    data = 1;  % dummy return value
    fprintf('Lines count: %d\n', lines_count);
end
The main problem with the first snippet is that it returns the wrong result: for fscanf, the width of a field is counted from the first digit it finds, not from the current position of the file pointer, which means that leading whitespace is not counted as part of the field width.
EDIT: The dimension in fscanf was changed from [5, Inf] to [6, Inf] following Jan's comment.
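The whitespace behaviour described in the question can be tried on a single line from the first example. The snippet below is only a minimal sketch: it uses sscanf on a character vector instead of fscanf on a file, and the variable names (txt, naive, fields, exact) are made up for illustration. The second half slices the line by column position with reshape, which is a different technique from anything proposed in this thread and is shown only to make the intended field boundaries explicit.

```matlab
% One line from the first example: four fields, each two characters wide.
txt = ' 1223344';                        % intended values: 1 22 33 44

% Width-limited conversion: sscanf skips the leading blank before it
% starts counting the two characters of the first field.
naive = sscanf(txt, '%2d').';

% Slicing by column position keeps the blank inside its own field.
fields = reshape(txt, 2, []).';          % one two-character field per row
exact  = str2double(cellstr(fields)).';  % numeric row vector

disp(naive);
disp(exact);
```

On this input exact holds the intended values, while naive is expected to differ because the skipped blank shifts every field boundary.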


@dpb: Thank you for mentioning this option. Indeed, as Walter Roberson noticed, it is not available in my release, but it is still good to know that this option exists in later releases.
@Walter Roberson:
''' No that fscanf will not work for the reason that the user noted: fscanf skips leading whitespace before it starts the count. '''
Do I understand correctly that this means the answer to my question is: "NO, it is impossible to parse this kind of string with fscanf or textscan (in the release I have)"?
textscan() is not the same as fscanf(): with textscan it is possible to change the Whitespace and Delimiter properties in a way that does not skip leading whitespace. However, because of the way that numeric fields parse numbers, a leading whitespace character in a numeric field in such an arrangement would be counted as an invalid character, so this turns out to be useful only for fixed-width character fields.






2 Answers

Answer by Walter Roberson
on 17 Jun 2019
 Accepted Answer

Use fileread to read the file into a character vector, regexprep to insert a space before every field, and then textscan to parse the character vector.


@Walter Roberson:
Sorry for a somewhat childish question, but what should the call be (with what regex) to insert whitespace at the given locations in the string?
Thank you in advance.
newS = regexprep(S, '^(.{8})(.{16})(.{16})(.{16})(.{8})(.{8})', '$1 $2 $3 $4 $5 $6', 'lineanchors');
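Put together, the three steps of this answer (fileread, regexprep, textscan) might look like the sketch below. It is not taken from the thread: the two synthetic records, the textscan format string, and the names fmt, rec1, rec2, C and mat are assumptions for illustration, and a character vector built with sprintf stands in for the output of fileread.

```matlab
% Two synthetic 72-character records; field widths are 8,16,16,16,8,8.
fmt  = '%8d%16.11f%16.12f%16.12f%8d%8d';
rec1 = sprintf(fmt, 110, 1332.18685714711, 829.064533733772, 535.874264373485, 0, 0);
rec2 = sprintf(fmt, 220, 2332.28685714711, 829.064533733772, 535.874264373485, 0, 0);
S    = sprintf('%s\n%s\n', rec1, rec2);   % stands in for S = fileread(...)

% Insert a space between the fields of every line.
newS = regexprep(S, '^(.{8})(.{16})(.{16})(.{16})(.{8})(.{8})', ...
                 '$1 $2 $3 $4 $5 $6', 'lineanchors');

% Parse the now whitespace-separated text, one cell per column.
C   = textscan(newS, '%f %f %f %f %f %f');
mat = [C{:}];                             % one row per record
```

Note that in MATLAB regular expressions the dot also matches newline characters by default, so this pattern relies on every line being at least 72 characters long.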
@Walter Roberson: Thank you for the solution.
In the end I did it slightly differently:
I still read the whole block that contains the data into memory. But then I do two transformations using regex:
  1. Move the minus sign "-" to the beginning of the whitespace area
  2. Fill the whitespace area with zeros
Then it is possible to read the data with fscanf using fixed-width formatting.
Now I realize that your solution is most likely better. I think it should be faster, because it does not need two passes. Initially I thought that I would need to take care of line endings, but then I realized that with "^" at the beginning of the regexp it should all happen automatically.
So, I really think, this is the most helpful answer (well, the answer from Stephen Cobeldick is the implementation of the same idea).
Thank you.
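The two-pass variant described in this comment could be sketched roughly as follows. The actual expressions used by the poster are not given in the thread, so the pattern '( +)-', the use of strrep for the second pass, the two-field record layout, and all variable names are assumptions made purely for illustration.

```matlab
% Two synthetic records with field widths 8 and 16 (illustrative only).
rec1 = sprintf('%8d%16.11f', 110, 1332.18685714711);  % fields touch
rec2 = sprintf('%8d%16.4f',  5,   -829.0645);         % padded, negative
txt  = sprintf('%s\n%s\n', rec1, rec2);

% Pass 1: move each minus sign to the start of the blank run before it.
step1 = regexprep(txt, '( +)-', '-$1');

% Pass 2: fill the remaining blanks with zeros (newlines are left alone).
step2 = strrep(step1, ' ', '0');

% Every field now occupies its full width, so width-limited sscanf works.
data = sscanf(step2, '%8d%16f', [2, Inf]).';
```

This sketch assumes that no field is completely blank; a blank field followed by a negative number would let the minus sign move across a field boundary.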


Answer by Stephen Cobeldick on 17 Jun 2019

A simple solution based on regexprep and sscanf. Before the file-reading loop:
vec = [8,16,16,16,8,8];
rgx = sprintf('(.{%d})',vec);
rpl = sprintf('$%d,',1:numel(vec));
In the file-reading loop:
str = fileread('temp2.txt');
str = regexprep(str,rgx,rpl,'dotexceptnewline');
mat = sscanf(str,'%f,',[numel(vec),Inf]).'
Giving (the test file is attached):
mat =
110.00000 1332.18686 829.06453 535.87426 0.00000 0.00000
220.00000 2332.28686 829.06453 535.87426 0.00000 0.00000
330.00000 3332.38686 829.06453 535.87426 0.00000 0.00000
440.00000 4332.48686 829.06453 535.87426 0.00000 0.00000
550.00000 5332.58686 829.06453 535.87426 0.00000 0.00000
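To make it easier to see what the two sprintf calls generate, the sketch below repeats the steps of this answer on synthetic data. The attached test file is not reproduced here, so the two records built with sprintf (and the names fmt, rec1, rec2) are assumptions that merely stand in for fileread('temp2.txt').

```matlab
vec = [8,16,16,16,8,8];
rgx = sprintf('(.{%d})', vec);        % '(.{8})(.{16})(.{16})(.{16})(.{8})(.{8})'
rpl = sprintf('$%d,', 1:numel(vec));  % '$1,$2,$3,$4,$5,$6,'

% Synthetic 72-character records in place of the attached file.
fmt  = '%8d%16.11f%16.12f%16.12f%8d%8d';
rec1 = sprintf(fmt, 110, 1332.18685714711, 829.064533733772, 535.874264373485, 0, 0);
rec2 = sprintf(fmt, 220, 2332.28685714711, 829.064533733772, 535.874264373485, 0, 0);
str  = sprintf('%s\n%s\n', rec1, rec2);

% Append a comma to every field, then read comma-terminated numbers.
str = regexprep(str, rgx, rpl, 'dotexceptnewline');
mat = sscanf(str, '%f,', [numel(vec), Inf]).';
```

The 'dotexceptnewline' option keeps a match from running across a line break, which is why this pattern needs no "^" anchor.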

  1 Comment

Thank you for the solution. But as the idea was proposed by Walter Roberson, he gets the score :).
