Regular expressions: extracting data after certain keywords

Hello everyone,
I'm currently working on a task of extracting some data from a large .txt file. The file consists of certain keywords that are followed by arrays of data enclosed by "[...]" (please refer to the attached example file). I read the file into MATLAB via "fileread" and now I would like to perform the following two operations on the file content:
1) Extracting the data that comes after the first keyword "Excitation energy:". The result should be a 1D array (double).
2) Extracting the data that comes after the "Pixel number ..." keyword, e.g. after "Pixel number 0: 100656.007213". And this should be done for every "Pixel number ..." keyword in the file. The result should then be in the form of a 2D array (double) (one column/row for each Pixel number basically).
I started looking into how to solve this problem using regexp, but I'm struggling to obtain the desired parts of the text file.
For example, I tried the following expression to obtain the text enclosed by "[...]" after "Excitation energy:":
content = fileread("example_file.txt");
expr = '((?<=Excitation energy:\s\s\[)).+(?=\])';
energy_text = regexp(content,expr,'match');
But the result is basically the complete content of the text file as a char array, whereas the match should stop before the first closing bracket "]". So I must be doing something wrong (I'm not very familiar with regexp). Does anyone have an idea of how to extract the data arrays mentioned above? As a side note, the number of values within the data arrays can vary from one text file to another, so the expressions for regexp should only rely on finding the data enclosed by "[...]" after the corresponding keyword.
Maybe there is also another solution to this problem without using regexp ...
Thank you very much in advance.
  2 Comments
Sindar on 10 Oct 2020
What version are you using? Strings have come a long way in recent years
Something like this might work:
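exc_str = string(extractBetween(content,"Excitation energy: [","]"));
% sscanf parses the whitespace-separated list; str2double would return NaN here
exc_data = sscanf(exc_str(1),'%f').';
pixel_sets = split(content,"Pixel number");
pixel_sets(1) = [];
pixel_sets = extractBefore(pixel_sets,"]");
pixel_sets = extractAfter(pixel_sets,"[");
% one row per pixel (assumes every block holds the same number of values)
pixel_rows = cellfun(@(s) sscanf(s,'%f').',pixel_sets,'UniformOutput',false);
pixel_data = vertcat(pixel_rows{:});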
Jens Oppliger on 10 Oct 2020
Hello Sindar,
thank you very much for your quick answer. I'm using MATLAB R2020a. Just a few minutes ago I also came across the functions you mentioned, especially "extractBetween", which I have now used to perform the desired tasks.
Thanks again.

Accepted Answer

Stephen23 on 12 Oct 2020
For such a large file I would use textscan to import the numeric data directly. With a few simple file commands you can also automatically adjust the format string to the number of columns, as shown below. Note that for R2020a you will probably need to change 'EndOfLine' to 'LineEnding'.
opt = {'CollectOutput',true, 'EndOfLine',']', 'HeaderLines',1,...
'MultipleDelimsAsOne',true, 'Whitespace',' \b\n\r\t'};
[fid,msg] = fopen('example_file.txt','rt');
assert(fid>=3,msg)
% Read "Excitation energy" block:
fscanf(fid,'Excitation energy:%*[^[][');
exe = fscanf(fid,'%f',[1,Inf]);
pos = ftell(fid);
% Read first "Pixel number" block:
fscanf(fid,'%*[^[]');
fscanf(fid,'[');
tmp = fscanf(fid,'%f',[1,Inf]);
% Create TEXTSCAN format string:
fmt = repmat('%f',1,numel(tmp));
fmt = ['Pixel number%f:%f%*[^0123456789]',fmt];
% Read all "Pixel number" blocks at once:
fseek(fid,pos,'bof');
out = textscan(fid,fmt,opt{:});
fclose(fid);
out = out{1};
Giving:
>> exe
exe =
0.6 0.599925979 0.599703936 .. 0.501484417 0.496248345 0.49088983
>> out
out =
0 100656.007213 2.08147902 .. -2.25828678 -2.35062627
1 100656.116929 2.08050533 .. -2.26393975 -2.34891882
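As a small illustration of the format-string step in the code above, the sketch below builds the same kind of textscan format for a hypothetical column count. The value of nCols is an assumption chosen for the example; in the answer it comes from numel(tmp), i.e. the number of values read from the first "Pixel number" block.

```matlab
% Hypothetical column count; the answer derives it from numel(tmp).
nCols = 3;

% One '%f' per data column inside the brackets.
fmt = repmat('%f', 1, nCols);

% Prefix: literal keyword, pixel index, the value after the colon,
% then a skipped run of non-digit characters before the bracketed data.
fmt = ['Pixel number%f:%f%*[^0123456789]', fmt];

disp(fmt)
```

Because the number of '%f' fields follows the first block, the same code should adapt to files whose arrays have a different length, provided all pixel blocks within one file have the same length.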
  1 Comment
Jens Oppliger on 12 Oct 2020
Hello Stephen,
your proposed code does exactly what I wanted. Thanks again for your help!

More Answers (1)

Walter Roberson on 10 Oct 2020
expr = '((?<=Excitation energy:\s\s\[)).+?(?=\])';
or
expr = '((?<=Excitation energy:\s\s\[))[^]]+)';
Remember that the + and * operators immediately extend as far as possible into the string, and then the focus point is moved backwards only as needed to match anything later in the same pattern. So in your case
expr = '((?<=Excitation energy:\s\s\[)).+(?=\])';
then the .+ would first match right to the very end of the string, and then the (?=\]) would force the focus point to move back to just before the ] that is closest before that point.
The *? and +? operators, on the other hand, are minimal operators, moving the focus point forward as little as possible to match what follows in the pattern. Or, as I showed, you can just tell it to move forward past all non-] characters, which is even less work for the parser. The main difference is that the [^]]+ by itself does not promise that the next character is ] the way the other possibilities do. For example if the file ended in
Excitation energy: [1 2 3 4
The (?=\]) pattern would not match unless what followed was ], whereas the [^]]+ would match to the end of the string, since everything after the [ is something that is not ].
You could make the two equivalent by adding a (?=\]) after the [^]]+
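The difference between the greedy, lazy, and negated-class variants can be tried on a short string. This is a minimal sketch: the sample text is made up for illustration, and the closing bracket inside the character class is written as \] as a conservative choice. Note that in MATLAB the dot also matches newline characters by default, which is why the greedy version runs across lines.

```matlab
% Hypothetical sample text standing in for the real file content.
txt = sprintf('Excitation energy:  [0.6 0.5 0.4]\nPixel number 0: 1.5 [2.0 -2.1]\n');

% Greedy: .+ runs to the end of the text, then backs up to the LAST ']'.
greedy = regexp(txt, '(?<=Excitation energy:\s\s\[).+(?=\])', 'match', 'once');

% Lazy: .+? stops just before the FIRST ']'.
lazy = regexp(txt, '(?<=Excitation energy:\s\s\[).+?(?=\])', 'match', 'once');

% Negated class plus lookahead: consumes only non-']' characters.
negated = regexp(txt, '(?<=Excitation energy:\s\s\[)[^\]]+(?=\])', 'match', 'once');

% Convert the matched text into a numeric row vector.
energy = sscanf(lazy, '%f').';
```

Here lazy and negated both hold the text '0.6 0.5 0.4', while greedy continues into the "Pixel number" line and only ends before the last closing bracket.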
  1 Comment
Jens Oppliger on 11 Oct 2020
Hello Walter,
thank you for the detailed explanation. I tried both of your suggestions, but only the first one worked. Regexp doesn't return any match for the second expression where you included the [^]].
Nevertheless your answer gave me quite some additional insight into how the different quantifiers work in a regular expression.
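One possible reason the second expression returned no match is that, as posted, it contains one more closing parenthesis than opening ones. The sketch below assumes that this is the cause: it drops the outer grouping and writes the bracket inside the class as \]. The sample text and the token-based "Pixel number" pattern are illustrative assumptions, not part of the original answer.

```matlab
% Hypothetical sample text standing in for the real file content.
txt = sprintf(['Excitation energy:  [0.6 0.5 0.4]\n', ...
               'Pixel number 0: 100.5 [2.0 -2.1 3.3]\n', ...
               'Pixel number 1: 101.5 [2.5 -2.6 3.8]\n']);

% Negated-class pattern with balanced parentheses.
expr = '(?<=Excitation energy:\s\s\[)[^\]]+';
energy = sscanf(regexp(txt, expr, 'match', 'once'), '%f').';

% Same idea for every "Pixel number" block: capture the text between [ and ].
tok = regexp(txt, 'Pixel number[^\[]*\[([^\]]+)\]', 'tokens');

% Each token is a 1x1 cell holding the bracket contents; parse it into a row.
rows = cellfun(@(t) sscanf(t{1}, '%f').', tok, 'UniformOutput', false);

% Stack the rows into a 2D array (assumes equal length in every block).
pixels = vertcat(rows{:});
```

With this sample text, energy is a 1x3 row vector and pixels is a 2x3 matrix with one row per "Pixel number" block.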